Using Discovered, Polyphonic Patterns to Filter Computer-generated Music

Tom Collins, Robin Laney, Alistair Willis, and Paul Garthwaite
The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK
t.e.collins@open.ac.uk

Abstract. A metric for evaluating the creativity of a music-generating system is presented, the objective being to generate mazurka-style music that inherits salient patterns from an original excerpt by Frédéric Chopin. The metric acts as a filter within our overall system, causing rejection of generated passages that do not inherit salient patterns, until a generated passage survives. Over fifty iterations, the mean number of generations required until survival was 12.7, with standard deviation 13.2. In the interests of clarity and replicability, the system is described with reference to specific excerpts of music. Four concepts (Markov modelling for generation, pattern discovery, pattern quantification, and statistical testing) are presented quite distinctly, so that the reader might adopt (or ignore) each concept as they wish.

1 Aim and Motivation

A stylistic composition (or pastiche) is a work similar in style to that of another composer or period. Examples exist in 'classical' music (Sergey Prokofiev's Symphony No. 1 is in the style of Joseph Haydn) as well as in music for film and television, and in educational establishments, where stylistic composition is taught 'as a means of furthering students' historical and analytical understanding' (Cochrane 2009). If a computational system produces successful stylistic compositions ('successful' in the sense of the 'indistinguishability test' of Pearce and Wiggins (2001), for instance), then it is capable of a task that, in the human sphere, is labelled creative. The creativity metric presented below is intended as preliminary to (not a replacement of) an 'indistinguishability test'.

This paper relates ongoing research on a computational system with the aim of modelling a musical style (see Pearce et al. (2002) on how computational modelling of musical styles constitutes one motivation for automating the compositional process). The motivation for the system is as follows. Cope (2005, pp. 87-95) describes a data-driven model that can be used to generate passages of music in the style of Johann Sebastian Bach's chorale harmonisations. His model can be cast as a first-order Markov chain and we have replicated this aspect of his model, with some modifications (see Sect. 2 for details). Our method is applied to a database consisting of Frédéric Chopin's mazurkas. This choice of database is refreshing (Bach chorales have become the standard choice) and explores the range of music in which Markov chain models can be applied.

A passage generated by a first-order Markov model 'often wanders with uncharacteristic phrase lengths and with no real musical logic existing beyond the beat-to-beat syntax' (Cope 2005, p. 91) and Cope discusses strategies for addressing this problem. One such strategy, incorporating musical 'allusions' into generated passages, has been criticised for having an implementation that does not use robust, efficient algorithms from the literature (Wiggins 2008, pp. 112-113). Our system is motivated by a desire to investigate the above 'allusion' strategy, armed with more robust algorithms (or their underlying concepts), and is illustrated in Fig. 1. Subsequent sections describe various parts of this schematic, as indicated by the dotted boxes, but an overview here may be helpful.
By an 'allusion' we mean that an excerpt is chosen from one of 49 Chopin mazurkas (bottom left of Fig. 1), with the objective of generating mazurka-style music that inherits salient patterns from the chosen excerpt (we thank Craig Stuart Sapp for creating the kern scores and MIDI files used, hosted at http://kern.humdrum.net; the database used in this paper consists of opuses 6, 7, 17, 24, 30, 33, 41, 50, 56, 59, 63, 67, and 68). To this end salient patterns are handpicked from the chosen excerpt, using the concept of maximal translatable pattern (Meredith et al. 2002). This is the meaning of the box labelled 'pattern discovery' in Fig. 1. The discovered patterns are then stored as a 'template'. Meanwhile the dotted box for Sect. 2 in Fig. 1 indicates that a passage (of approximately the same length as the chosen excerpt) can be generated on demand. Illustrated by the diamond box in Fig. 1, the same type of patterns that were discovered and stored as a template are now sought algorithmically in the computer-generated passage. For a computer-generated passage to survive the filtering process, we ask that it exhibits the same type of patterns as were discovered in the chosen excerpt, occurring the same number of times and in similar locations relative to the duration of the passage as a whole. In Sect. 4 the concept of ontime percentage and the Wilcoxon two-sample test are employed to quantify and compare instances of the same type of pattern occurring in different passages of music.

Fig. 1. A schematic of the system to be described. [Schematic diagram omitted.]

2 Generation of Computer Music Using a Markov Model

Since initial attempts to marry music theory and Markov chains (Hiller and Isaacson 1959), their application to music analysis and composition has received considerable attention (see Loy 2005 for an accessible introduction and Norris 1997 for the supporting mathematics). The generation of a so-called 'Markovian composition' is outlined now.

Suppose that we join the Markovian composition partway through, and that the chord D4-C6 indicated by an arrow in bar 2 of Fig. 2 has just been composed, but nothing further ('middle C' is taken to be C4, and look out for clef changes). The 49 mazurkas by Chopin having been analysed, it is known, say, that in this corpus Chopin uses the D4-C6 chord (or any of its transpositions) on 24 occasions. The computer generates a random number uniformly between 1 and 24, say 17. The 17th instance of this chord (actually one of its transpositions, A2-G4) occurs in bar 38 of the Mazurka in B Major, op. 41/3. The chord that follows in this mazurka, A2-D♯4, is transposed appropriately (to D4-G♯5) and becomes the next chord of the Markovian composition. To continue composing, one can repeat the last few sentences with D4-G♯5 as the current chord. The passages shown in Figs. 2 and 3 were generated in this way.
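As a concrete illustration of this sampling scheme, here is a minimal Python sketch, assuming chords are reduced to transposition-invariant spacing tuples and continuations are drawn uniformly from a toy corpus. The function names and corpus format are ours, and the fuller state used by the paper (holding information, plus the bar and opus context used to recover durations, both described below) is omitted.

```python
import random

def spacing(chord):
    """Reduce a chord (sorted MIDI note numbers) to its interval spacing."""
    return tuple(b - a for a, b in zip(chord, chord[1:]))

def build_transitions(corpus):
    """Map each spacing to every (piece, index, chord, next chord) continuation."""
    transitions = {}
    for piece_id, chords in corpus.items():
        for i in range(len(chords) - 1):
            key = spacing(chords[i])
            transitions.setdefault(key, []).append(
                (piece_id, i, chords[i], chords[i + 1]))
    return transitions

def next_chord(current, transitions):
    """Sample a continuation of `current` uniformly, as in the D4-C6 example."""
    piece_id, index, source, following = random.choice(
        transitions[spacing(current)])
    shift = current[0] - source[0]        # transpose the source onto `current`
    return [n + shift for n in following], (piece_id, index)

# Toy corpus: D4-C6 is MIDI (62, 84); A2-G4 = (45, 67) shares its spacing of
# 22 semitones and is followed by A2-D#4 = (45, 63), as in op. 41/3 above.
corpus = {"op41n3": [[45, 67], [45, 63], [47, 66]],
          "op67n3": [[62, 84], [60, 82]]}
transitions = build_transitions(corpus)
chord, provenance = next_chord([62, 84], transitions)
print(chord, provenance)   # e.g. [62, 80] = D4-G#5, drawn from op41n3
```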
Fig. 2. A passage from the computer-music generator that does not survive filtering (contrast with Fig. 3). The italicised numbers refer to opus and bar numbers of particular fragments; for instance, '67/3, 0' means opus 67, number 3, bar 0 (anacrusis). The durations in this system are dotted to improve the legibility of the first bar. [Notation omitted; fragment labels: 67/3, 0; 56/1, 119; 56/1, 116; 67/4, 18; 7/5, 5; 63/3, 18; 63/3, 26; 63/3, 58; 41/3, 38; 41/4, 60; 63/1, 77; 67/1, 49.]

The above explanation is simplified slightly, for what happens if one or more already-articulated notes are held while others begin? This is discussed with reference to the chord C3-F♯4-D5 indicated by an arrow in bar 6 of Fig. 3. In this chord, the bottom C3 is held over to the next chord, C3-E♯4-C♯5. We observe that a note in a given chord can be held in one of four ways:
1. Not held beyond the given chord
2. Held over to the next chord
3. Held from the previous chord
4. Held both from the previous chord and to the next chord
Thus in our model, the chord C3-F♯4-D5 indicated by an arrow in Fig. 3 would be represented by the pair of vectors (18, 8), (2, 1, 1). The first vector is the chord spacing (18 semitones from C3 to F♯4, and 8 from F♯4 to D5) and the second vector encodes how each of the three notes is held (or not), according to the list just given.

Fig. 3. A passage from the computer-music generator that does survive filtering. The darker noteheads in the left hand are referred to as pattern P*, and indexed (in order of increasing ontime and pitch height) to help with understanding Table 1. [Notation omitted; fragment labels: 67/4, 0; 33/3, 17; 56/1, 108; 63/1, 33; 24/3, 19; 59/1, 91; 67/3, 23-28; 67/1, 32; 24/3, 26; 56/1, 18; indexed noteheads: 22, 27, 31, 33, 38, 61.]

How is an actual chord of certain duration produced in our model, if it only uses chord spacing and holding information? Above, an example was given in which the 17th instance of a chord spacing was chosen from 24 options. By retaining the bar and opus numbers of the original mazurka to which this choice corresponds, we are able to utilise any contextual information (that is not already implied by the chord spacing and holding information) to produce an actual chord of certain duration. This Markov model alone is no match for the complex human compositional process, but it is capable of generating a large amount of material that can be analysed for the presence of certain patterns.

3 Discovering Patterns in Polyphonic Music

Meredith et al. (2002) give an elegant method for intra-opus analysis of polyphonic music. In 'intra-opus analysis' (Conklin and Bergeron 2008, p. 67), a single piece of music or excerpt thereof is analysed with the aim of discovering instances of self-reference. The human music analyst performs this task almost as a prerequisite, for it is arguable that music 'becomes intelligible to a great extent through self-reference' (Cambouropoulos 2006, p. 249). Listening to the passage in Fig. 4, for instance, the human music analyst would notice:
1. The repetition of bars 11-12 at bars 13-14
2. The repetition of the rhythms in bar 12 at bar 14 (implied by the above remark), at bar 15, and again at bar 16 (here except the offbeat minim B4)
No doubt there are other matters of musical interest (the tonicization of the dominant at bar 16, the crossing of hands in bars 17-18), as well as 'lesser' instances of self-reference, but here attention is restricted to remarks 1 and 2.
It should be emphasised that these remarks are 'human discoveries', not the output of an algorithm. However, the key concepts of Meredith et al. (2002) can be used to automate the discovery of these types of pattern, so that they can be sought algorithmically in a computer-generated passage (further automation of this part of the system is a future aim).

Fig. 4. Bars 11-18 of the Mazurka in E Major, op. 6/3 by Frédéric Chopin. As in Fig. 3, some noteheads are darker and indexed to help with understanding Table 1. [Notation omitted; tempo marking 'Vivace, dotted minim = 60', dynamic f; indexed noteheads: 1, 7, 8, 9, 15, 17, 23, 25, 31, 33, 39, 52, 60, 90, 95, 104, 107.]

The formal representation of music as points in multidimensional space can be traced back at least as far as Lewin (1987). Each note in Fig. 4 can be represented as a point in multidimensional space, a 'datapoint' d = (x, y, z), consisting of an ontime x, a MIDI note number y and a duration z (a crotchet is set equal to 1). The set of all datapoints for the passage in Fig. 4 is denoted D, for 'dataset'. For any given vector v, the maximal translatable pattern (MTP) of the vector v in a dataset D is defined by Meredith et al. (2002) as the set of all datapoints in the dataset that, when translated by v, arrive at a coordinate corresponding to another datapoint in D:

MTP(v, D) = {d ∈ D : d + v ∈ D}.  (1)

For instance,

P = MTP(w, D), where w = (6, 0, 0),  (2)

is indicated by the darker noteheads in Fig. 4. The vector w = (6, 0, 0) identifies notes that recur after 6 crotchet beats, transposed 0 semitones and with unaltered durations (due to the final 0 in w). This is closely related to remark 1 (in Sect. 3), which observes the repetition of bars 11-12 at 13-14, that is, after 13 − 11 = 2 bars (or 6 crotchet beats). It can be seen from Fig. 4, however, that P contains two notes each in bars 13, 15 and 16, which are repeated in bars 15, 17, 18 respectively. This is less closely related to remark 1, showing that the human and computational analytic results do not align exactly. One highlights an idiosyncrasy, perhaps even a shortcoming, of the other, depending on your point of view.

As well as the definition of a maximal translatable pattern, the other key concept in Meredith et al. (2002) is the translational equivalence class (TEC). Musically, the translational equivalence class of a pattern consists of the pattern itself and all other instances of the pattern occurring in the passage. Mathematically, the translational equivalence class of a pattern P in a dataset D is

TEC(P, D) = {Q ⊆ D : P ≡ Q},  (3)

where P ≡ Q means that P and Q contain the same number of datapoints and there exists one vector u that translates each point in P to a point in Q. We return to the specific example of the dataset D containing the datapoints for the passage in Fig. 4, and suppose P is defined by (2). It can be verified that the translational equivalence class of P in D is

TEC(P, D) = {P, τ(P, w)},  (4)

where τ(P, w) denotes the set of all vectors p + w, with p a datapoint in P. Equation (2) helps to identify notes whose durations and MIDI note numbers recur after 6 crotchet beats. The set in (4) contains the pattern P and τ(P, w), the only other instance of the pattern in the excerpt. Together, the equations suggest how to automate discovery of the type of pattern described in remark 1.
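Definitions (1), (3) and (4) translate almost directly into code. The following is a minimal sketch, assuming datapoints are (ontime, MIDI note number, duration) tuples held in a set; the function names are ours, and the brute-force TEC search ignores the efficiency concerns that Meredith et al. (2002) address.

```python
def mtp(v, dataset):
    """MTP(v, D) = {d in D : d + v in D}, as in equation (1)."""
    return {d for d in dataset if tuple(x + y for x, y in zip(d, v)) in dataset}

def translate(pattern, u):
    """tau(P, u): translate every datapoint of the pattern by u."""
    return {tuple(x + y for x, y in zip(p, u)) for p in pattern}

def tec(pattern, dataset):
    """All translations of the pattern lying wholly inside the dataset."""
    anchor = min(pattern)
    occurrences = []
    for d in sorted(dataset):              # candidate shift: anchor -> d
        u = tuple(x - y for x, y in zip(d, anchor))
        if translate(pattern, u) <= dataset:
            occurrences.append(translate(pattern, u))
    return occurrences

# Toy data (not Fig. 4): three notes repeated 6 crotchets later, as in remark 1.
D = {(0, 64, 1), (1, 66, 1), (2, 68, 1), (6, 64, 1), (7, 66, 1), (8, 68, 1)}
P = mtp((6, 0, 0), D)                      # the first three notes
print(sorted(P), len(tec(P, D)))           # TEC(P, D) = {P, tau(P, w)}: 2 members
```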
What of remark 2, the repetition of the rhythms in bar 12 at bar 14 (after 6 beats), bar 15 (after 9 beats) and bar 16 (after 12 beats)? As this is a rhythmic pattern, it is useful to work with a 'rhythmic projection' D′ of the dataset D. If d = (x, y, z) is a member of D then d′ = (x, z), consisting of an ontime and duration, is a member of the projected dataset D′. It should be noted that two distinct datapoints d, e ∈ D can have a coincident projection, that is d′ = e′, just as two objects placed side by side might cast coincident shadows. The repetition observed in remark 2 occurs at 6, 9 and 12 beats after the original, so let

S = MTP(u′, D′) ∩ MTP(v′, D′) ∩ MTP(w′, D′),  (5)

where u′ = (6, 0), v′ = (9, 0), and w′ = (12, 0). The set MTP(u′, D′) in (5) corresponds to notes whose durations recur after 6 crotchet beats. The second set MTP(v′, D′) corresponds to notes whose durations recur after 9 beats, and the third set MTP(w′, D′) to notes whose durations recur after 12 beats. Taking their intersection enables the identification of notes whose durations recur after 6, 9 and 12 beats, which is closely related to remark 2. It can be verified that

TEC(S, D′) = {S, τ(S, u′), τ(S, v′), τ(S, w′)}.  (6)

As with pattern P, the human and computational analytic results for pattern S do not align exactly. All of the notes in bar 12 of Fig. 4 are identified as belonging to pattern S, but so are a considerable number of left-hand notes from surrounding bars. While it is not the purpose of this paper to give a fully fledged critique of Meredith et al. (2002), Sect. 3 indicates the current state of progress toward a satisfactory pattern discovery algorithm for intra-opus analysis.

4 Filtering Process

4.1 Quantifying an Instance of a Musical Pattern

When an instance of an arbitrary pattern P has been discovered within some dataset D, as in the previous section, how can the position of the pattern be quantified, relative to the duration of the excerpt as a whole? Here the straightforward concept of ontime percentage is used. For a note having ontime t, appearing in an excerpt with total duration T, the note has ontime percentage 100t/T. For instance, the excerpt in Fig. 4 has total duration 24 (= 8 bars × 3 beats). Therefore, taking the F♯ at the top of the first chord in bar 13, with ontime 6, this note has ontime percentage 100t/T = 100 × 6/24 = 25%.

When calculating the ontime percentage of each datapoint p in a pattern P, a decision must be made whether to include repeated values in the output. For instance, the six notes in the first chord in bar 13 will have the same ontime percentage, so repeated ontime percentages indicate a thicker texture. The inclusion of repeated values does not affect the appropriateness of the statistical test described in Sect. 4.2, but it may affect the result: two otherwise similar lists of ontime percentages might be distinguishable statistically due to a high proportion of repeated values in one collection but not the other. Here the decision is taken not to include repeated values. Two lists of ontime percentages are shown in columns 2 and 5 of Table 1. The bottom half of column 5 is derived from the darker notes in Fig. 3, referred to as pattern P*. Column 2 and the top half of column 5 are derived from the darker notes in Fig. 4, pattern P.
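Bringing together the rhythmic projection of Sect. 3 and the ontime percentages just defined, a sketch along the following lines might be used; mtp() is the function sketched in Sect. 3 (repeated here for completeness), and the shift vectors and example values are illustrative.

```python
def mtp(v, dataset):
    """MTP(v, D) = {d in D : d + v in D}."""
    return {d for d in dataset if tuple(x + y for x, y in zip(d, v)) in dataset}

def project(dataset):
    """D': keep (ontime, duration); distinct notes may project coincidently."""
    return {(x, z) for (x, y, z) in dataset}

def rhythmic_pattern(dataset, shifts=((6, 0), (9, 0), (12, 0))):
    """S: the intersection of the three MTPs of equation (5)."""
    d_prime = project(dataset)
    s = d_prime
    for v in shifts:
        s = s & mtp(v, d_prime)
    return s

def ontime_percentages(pattern, total_duration):
    """Unique ontime percentages 100t/T, smallest first, repeats excluded."""
    return sorted({round(100 * t / total_duration, 1) for t, *_ in pattern})

# A note with ontime 6 in a 24-crotchet excerpt sits at 25%, as above.
print(ontime_percentages({(6, 78, 1)}, 24))   # -> [25.0]
```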
4.2 Applying Wilcoxon's Two-sample Test in Musical Scenarios

Let us suppose we have two random samples, one consisting of m observations x1, x2, ..., xm, and the other consisting of n observations y1, y2, ..., yn. Columns 2 and 5 of Table 1 serve as an example, with m = 17 and n = 6. It should be pointed out that the random-sample supposition almost never applies in musical scenarios. Increasingly however, the assumption is being made in order to utilise definitions such as the likelihood of seeing a pattern (Conklin and Bergeron 2008). Wilcoxon's two-sample test helps to determine whether two sets of observations have the same underlying distribution. The calculation of the test statistic will be demonstrated, with the theoretical details available elsewhere (Neave and Worthington 1988). The test statistic W is calculated by assigning ranks R1, R2, ..., Rn to the set of observations y1, y2, ..., yn, as though they appear in a combined sample with the other set. This has been done in column 6 of Table 1 (see also column 3). Then W = R1 + R2 + · · · + Rn is a random variable, and from Table 1, a value of w = 99 has been observed. Either the exact distribution of W or, for large sample sizes, a normal approximation can be used to calculate P(W ≤ w). Using a significance threshold of α = 0.05 and with m = 17, n = 6, a value of W outside of the interval [43, 101] needs to be observed in order to reject the null hypothesis that the two sets of observations have the same underlying distribution. As we have observed w = 99, the null hypothesis cannot be rejected.

Table 1. The note indices, ontime percentages and combined-sample ranks of two patterns are shown: P, indicated by the darker noteheads in Fig. 4, and P*, from Fig. 3.

Pattern P
Note index   Ontime %   Rank
  1            0.0        1
  7            1.4        2
  8            2.8        3
  9            4.2        4
 15            6.3        5
 17            8.3        6
 23           10.4        7
 25           12.5        8
 31           15.6        9
 33           16.7       10
 39           20.8       11
 52           29.2       13
 60           33.3       15
 90           54.2       19
 95           58.3       20
104           66.7       21
107           70.8       23

Pattern P*
Note index   Ontime %   Rank
 22           28.0       12
 27           32.0       14
 31           36.0       16
 33           40.0       17
 38           48.0       18
 61           68.0       22
P* rank total: 99

What does the above result mean in the context of musical patterns? We have taken P and P*, two instances of the same type of pattern occurring in different passages of music, and compared their ontime percentages. Not being able to reject the null hypothesis of 'same underlying distribution' is taken to mean that a computer-generated passage survives this element of the filtering process. We are notionally content that the relative positions of P and P* are not too dissimilar. There are five further elements to the filtering process here, with the Wilcoxon two-sample test being applied to the ontime percentages of:
1. τ(P, w) and τ(P*, w), where w = (6, 0, 0)
2. S and S*, where S is given in (5), and S* denotes the corresponding pattern for the computer-generated passage in Fig. 3
3. τ(S, u′) and τ(S*, u′), where u′ = (6, 0)
4. τ(S, v′) and τ(S*, v′), where v′ = (9, 0)
5. τ(S, w′) and τ(S*, w′), where w′ = (12, 0)
At a significance threshold of α = 0.05, the passage in Fig. 3 survives each element of the filtering process, whereas the passage in Fig. 2 does not.
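The rank-sum calculation can be checked directly from the ontime percentages of Table 1; a minimal sketch follows, in which only those percentages are taken from the paper, and the commented scipy call is one standard way of obtaining the corresponding p-value under the normal approximation.

```python
p_star = [28.0, 32.0, 36.0, 40.0, 48.0, 68.0]            # P*, n = 6 (Fig. 3)
p = [0.0, 1.4, 2.8, 4.2, 6.3, 8.3, 10.4, 12.5, 15.6,     # P, m = 17 (Fig. 4)
     16.7, 20.8, 29.2, 33.3, 54.2, 58.3, 66.7, 70.8]

combined = sorted(p + p_star)
rank = {value: i + 1 for i, value in enumerate(combined)}  # no ties here
w = sum(rank[y] for y in p_star)
print(w)   # -> 99: inside [43, 101], so the null hypothesis stands

# Equivalent decision via scipy's rank-sum test (normal approximation):
# from scipy.stats import ranksums
# statistic, p_value = ranksums(p_star, p)   # survives if p_value >= 0.05
```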
5 Discussion

This paper has presented a metric for evaluating the creativity of a music-generating system. Until further evaluation has been conducted (by human listeners rather than just by the creativity metric), we are cautious about labelling our overall system as creative. The objective in the introduction was to generate mazurka-style music that inherits salient patterns from a chosen excerpt. It is encouraging that Fig. 3, which arguably sounds and looks more like a mazurka than Fig. 2, survives the filtering process, whereas Fig. 2 does not.

In terms of meeting the aim of pattern inheritance, there is considerable room for improvement: Figs. 3 and 4 do not sound or look much alike, and a human music analyst would be hard-pressed to show how the filtered output (Fig. 3) inherits any salient patterns from the chosen excerpt (Fig. 4). One solution would be to include more filters. Another solution would be to raise the significance threshold, α. By making the null hypothesis of 'same underlying distribution' easier to reject, it becomes harder for a generated passage to survive filtering. Over fifty iterations, the mean number of generations required until survival was 12.7, with standard deviation 13.2. Raising α may increase pattern inheritance, but it may also have a non-linear impact on these statistics.

The verbatim quotation in Fig. 3 (of bars 23-28 from the Mazurka in C Major, op. 67/3) raises several issues that relate to further work. First, we will consider including in the system a mechanism for avoidance of verbatim quotation. Second, the quotation contains a prominent sequential pattern that is different in nature to the intended inheritance (the two patterns observed in remarks 1 and 2 in Sect. 3). Using the concept of morphetic pitch defined in Meredith et al. (2002) it is possible to identify such sequential patterns, so the sequence itself is not a problem, only that its presence was unintended. Measures exist for the prominence (Cambouropoulos 2006) or interest (Conklin and Bergeron 2008) of a pattern relative to others in a passage of music. The adaptation of these measures to polyphonic music would constitute a worthwhile addition, both to Meredith et al. (2002) and to the use of discovered, polyphonic patterns in filtering computer-generated music.

We have more general concerns about the extent to which the first-order Markov model generalises from Bach chorales to Chopin mazurkas. From a musical point of view the mazurkas may be too rich. The verbatim quotation mentioned above is indicative of a sparse transition matrix, which might be made more dense by including more mazurkas or other suitable compositions.

There are several ways in which the system described could be fine-tuned. First, computational time could be saved by filtering incrementally, discarding generated passages before they reach the prescribed length if for some reason they are already bound not to survive filtering. Second, both Lewin (1987) and Meredith et al. (2002) propose (differing) methods for ordering notes. These could be used instead of or as well as ontime percentages, to investigate the effect on the output of the system described. Third, if a region of original music is spanned entirely by a pattern (so that there are no non-pattern notes in this region) and this is also true of its recurrence(s), then this ought to be stored in the template (see the definition of compactness in Meredith 2006). Again this would save computational time that is currently wasted in our system. Finally, sometimes the occurrence of a certain type of pattern implies the occurrence of another type of pattern. For example, bars 11-12 of Fig. 4 (approximately pattern P) recur at bars 13-14, implying that the rhythms of bar 12 (approximately pattern S) will recur in bar 14. This may seem obvious for only two discovered patterns, P and S, but when more patterns are discovered, the way in which these might be arranged into a hierarchy is worthy of further investigation.

6 Acknowledgements

This paper benefited from a helpful discussion with David Meredith.
We would also like to thank the three anonymous reviewers for their comments.

Towards Analogy-Based Story Generation

Jichen Zhu and Santiago Ontañón
Department of Digital Media, University of Central Florida, 12461 Research Parkway Suite 500, Orlando, FL, USA 32826-3241; jzh@mail.ucf.edu
IIIA, Artificial Intelligence Research Institute, CSIC, Spanish Council for Scientific Research, Campus UAB, 08193 Bellaterra, Spain; santi@iiia.csic.es

Abstract. Narrative is one of the oldest creative forms, capable of depicting a wide spectrum of human conditions. However, many existing stories generated by planning-based computational narrative systems are confined to goal-driven, problem-solving aesthetics. This paper focuses on analogy-based story generation. Informed by narratology and computational analogy, we present an analytical framework to survey this area in order to identify trends and areas that have not received sufficient attention. Finally, we introduce the new developments of the Riu project as a case study for possible new narrative aesthetics supported by analogy.

1 Introduction

Computational narrative explores the age-old creative form of storytelling by algorithmically analyzing, understanding, and, most importantly, generating stories. Despite the progress in the area, current computer-generated stories are still aesthetically limited compared to traditional narratives. In both plot-centric and character-centric approaches to story generation, the widely used planning paradigm has a strong impact on these stories' goal-driven, problem-solving aesthetics. In order to broaden the range of computer-generated narratives, this paper analyzes the relatively under-explored area of story generation using computational analogy. Recent developments in cognitive science demonstrate the importance of analogy as a powerful cognitive faculty to make sense of the world [5, 10] as well as an effective literary tool to enhance such understandings through narratives [30]. Compared to the large body of planning-based work, significantly fewer endeavors have been spent on analogy. We argue that analogy is a promising direction towards novel narrative forms and aesthetics that planning-based approaches cannot provide. More broadly, our focus on analogy is aligned with Gelernter's account of computational creativity, in which analogy functions as the crucial link between "high focus" analytical cognitive activities and "low focus" ones connected through shared emotions [9].

Drawing from narratology and computational analogy, we propose an analytical framework to identify different key aspects of analogy-based story generation and systematically classify existing systems accordingly. We will also discuss the impact of these aspects (e.g., representation formalism) on the aesthetics of potential analogy-generated stories. The purpose of our overview is to recognize existing trends and unexplored areas in this relatively new area of research. In this paper, we adopt a broad definition of analogy to include not only classic computational analogy techniques, but also other related areas, such as case-based reasoning (CBR) [1], conceptual blending theory [5, 6], and metaphor theory [18]. Finally, we will introduce the primary results of our Riu project as a case study for analogy-based computational narrative systems.

In the remainder of this paper, Section 2 describes our motivation from the vantage point of aesthetics and narratology.
Section 3 presents a brief introduction to computational analogy. Based on Chatman's narratology, Section 4 presents our framework of three dimensions of analogy-based story generation and classifies existing systems. Section 5 illustrates our approach through a case study of the new developments of the Riu project. Finally, Section 6 summarizes the paper and future research directions.

2 Aesthetics and Computer-Generated Narrative

Interactive narratives carry the prospect of a fully fledged medium with similar levels of breadth and depth as traditional media of storytelling [24, 27]. However, the current state of computer-generated stories, a crucial component of interactive narrative, is still far from this goal. In spite of the accomplishments of planning-based approaches, the stories they generate often fall into a very small range of narrative aesthetics. We are not simply referring to how polished the final writing style is. Instead, our primary concern is the built-in narrative affordances and constraints of specific architectures in relation to the type of stories they generate. On the one hand, planning's ability to specify the desired final state gives authors a tremendous amount of control over the story. On the other hand, its intrinsic goal-driven, problem-solving operations place an unmistakable stamp on the generated stories. One of the most salient examples of such planning-based aesthetics is Meehan's 1976 system Tale-Spin, whose style is still influential among many recent systems. Below is an excerpt of a story generated by Tale-Spin:

Joe Bear was hungry. He asked Irving Bird where some honey was. Irving refused to tell him, so Joe offered to bring him a worm if he'd tell him where some honey was. Irving agreed. But Joe didn't know where any worms were, so he asked Irving, who refused to say. So Joe offered to bring him a worm if he'd tell him where a worm was... [22, p. 129]

Certainly, the stories generated by modern planning-based systems have become much more complex, and other non-planning approaches have been developed, some of which will be discussed in Section 4.2. For example, the Visual Daydreamer system explores very different non-verbal narrative aesthetics using animated abstract visual symbols whose actions are emotionally connected [25]. However, our intention here is to systematically survey this relatively unexplored area of analogy-based story generation and identify promising new directions that may broaden the aesthetic range of computational narratives. As one such direction, our Riu system explores sequencing narrative elements by their associations with similar events, a literary technique famously experimented with in stream of consciousness literature to depict human subjectivity [17].

3 Computational Analogy

Computational models of analogy operate by identifying similarities and transferring knowledge between a source domain S and a target domain T. This process is divided by Hall [12] into four stages: 1) recognition of a candidate analogous source, S; 2) elaboration of an analogical mapping between source domain S and target domain T; 3) evaluation of the mapping and inferences; and 4) consolidation of the outcome of the analogy for other contexts (i.e., learning). The intuitive assumption behind analogy is that if two domains are similar in certain key aspects, they are likely to be similar in other aspects. Existing analogy systems can be classified into three classes based on their underlying architecture [8].
Symbolic models (e.g., ANALOGY [3] and the Structure Mapping Engine [4]) heavily rely on the concepts of symbols, logics, planning, search, means-ends analysis, etc., from the "symbolic AI paradigm." Connectionist models (e.g., ACME [15], LISA [16], and CAB [19]), on the other hand, adopt the connectionist framework of nodes, weights, spreading activations, etc. Finally, hybrid models (e.g., COPYCAT [23], TABLETOP [7] and LETTER-SPIRIT [21]) blend elements from the previous two classes.

4 Analogy in Story Generation

Although several analogy-based systems have been developed to generate stories, there has not been any serious attempt to thoroughly and systematically identify different possibilities of analogy-based story generation. In order to better understand the area, this section presents a new analytical framework to classify different systems, with the goal of presenting a clear picture of the current state and identifying the areas that have not received sufficient attention.

4.1 Analytical Framework

In this section, we propose three dimensions to classify the landscape of analogy-based story generation: 1) the scope of analogy, 2) the specific technique of computational analogy, and 3) the story representation formalism.

The first dimension uses narratology theory to identify the scope of analogy, that is, the level at which analogy is used in a narrative. In the widely accepted theory of Chatman [2] (Figure 1), a narrative can be divided into two parts: the story and the discourse (some authors also use the terms fabula and sjuzet to represent a similar division). A story is composed of events and existents, each of which can be further divided into actions and happenings, and characters and settings, respectively. Finally, discourse is the way in which a story is narrated (in Chatman's terminology, what we conventionally call story generation is actually narrative generation, as it includes both story and discourse).

Fig. 1. Chatman's taxonomy of narrative components [2, p. 19]. [Diagram omitted: a narrative comprises story and discourse; story comprises events (actions, happenings) and existents (characters, setting).]

As the narrative progresses, these different elements may affect one another. For instance, events can affect existents. Based on Chatman's taxonomy, we are able to locate the level at which analogy is performed, i.e., its scope. Our description below is organized from the local to the global scale:

Events: analogy can be used to map individual events (including character actions and happenings) from S to T. Analogy at this level focuses on transferring only events and/or the structure of multiple events, without taking existents into account.

Existents: existents can be fairly complicated structures. For instance, a character may have background, personality, and relations. If we partially specify a character in T, the rest of the character traits may be automatically defined by drawing analogy from another character in S, given that a strong analogical mapping can be found between them.

Story: analogy at the story level takes both events and existents into consideration as a whole. For instance, analogy at this level can map one complete scene (including existents and sequences of events) in S to another in T.

Discourse: analogy at the discourse level focuses on mapping discursive strategies, regardless of the story content.

Narrative: analogy at the complete narrative level considers story and discourse as a whole. Analogy at this global level is useful to identify global structural similarities, such as "explaining past experiences using flash-backs," which can only be captured when considering story and discourse together.
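To make the five scopes concrete, the sketch below renders Chatman's levels as Python data types; it is purely illustrative, and every type and field name is our assumption rather than part of any surveyed system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Existent:                 # characters and settings
    name: str
    traits: dict = field(default_factory=dict)

@dataclass
class Event:                    # actions and happenings
    description: str
    participants: List[str] = field(default_factory=list)

@dataclass
class Story:                    # events + existents
    events: List[Event]
    existents: List[Existent]

@dataclass
class Narrative:                # story + discourse
    story: Story
    discourse: List[str]        # telling strategies, ordering, etc.

# The scope of analogy is then the level the mapping function receives:
# an Event pair, an Existent pair, two Story objects, two discourse
# strategy lists, or two whole Narratives.
```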
It is worthwhile to stress that certain analogies at a more global narrative level cannot be achieved by performing analogy at its child levels separately. For instance, analogies at the entire story level may not be found at either the events or existents levels alone.

The second dimension is the computational analogy method used by a system. As mentioned before, we also include related areas such as CBR, conceptual blending and metaphor theory in our definition of analogy. We further differentiate the purpose of analogy as identification versus generation. Identification involves only generating an analogical mapping for identifying similarities between two domains. Such similarities can be exploited later for story generation by other techniques, as shown in Section 4.2. By contrast, generation involves transferring inferences (knowledge) from S to T after completing the analogical mapping, i.e., analogy itself is used for story generation. From our survey of existing systems, most CBR-based systems only create mappings between S and T to assess similarity, and use other techniques for story generation. Hence, they fall into the identification category. This is because traditionally CBR techniques separate case retrieval (where similarity is used) from case reuse (where solutions are generated).

The third dimension is the system's story representation formalism. Different story representation formalisms afford analogical transfers at different levels, and hence allow different computational analogy methods to be applied. Magerko [20] distinguishes three types of approaches to represent stories: planning languages (which emphasize causality and structure), modular languages (which emphasize the content of the story without focusing on the temporal relations among the elements), and finally hybrid languages. If someone is interested in analogy-based generation at the story level, then modular languages (such as plot points or beats) will not be adequate, since those languages do not specify a story structure. On the contrary, such languages represent the useful information to work at the individual events and existents levels.

4.2 Classification of Existing Systems

The above framework can help us to classify existing analogy-based story generation systems and, more importantly, identify trends and unexplored areas. Table 1 shows the analysis of various systems using the above three dimensions. The first column is the name of the system, if any; the second column shows the level at which analogy is made (notice that this might be different from the level at which the system generates narrative components); next is the particular analogy technique used; finally, the last column shows the particular story representation formalism used for analogy.

Table 1. Classification of existing analogy-based story generation systems.

System                     Scope       Technique                   Representation
Riedl & León's [26]        story       CAB (generation)            planning-based
PRINCE [14]                existents   analogy (identification)    WordNet
GRIOT [13] / MRM [35]      existents   conceptual blending         logical clauses
Minstrel [31]              story       CBR                         Rhapsody
ProtoPropp [11]            story       CBR                         OWL
Virtual Storyteller [28]   story       CBR                         planning-based
MEXICA [33]                existents   engagement/reflection       relationship graphs
Among the systems that adopt classic computational analogy, Riedl and León's system [26] combines analogy and planning. It uses analogy as the main generative method and uses planning to fill in the gaps in the analogy-generated content. The system performs analogy at the story level using the CAB algorithm [19] and uses a representation consisting of planning operators. The PRINCE system [14] uses analogy to generate metaphors and enrich the story by explaining a story existent in the domain T using its equivalent in S. In this case, analogy is used for identification, and a secondary method for generating local metaphors for the overall narration. GRIOT [13] and the Memory, Reverie Machine (MRM) system [35], the latter built on GRIOT, use the ALLOY conceptual blending algorithm to generate affective blends in the generated output: poetry in the case of GRIOT and narrative text in the case of MRM.

Several systems use a case-based reasoning (CBR) approach, including Minstrel [31], ProtoPropp [11] and the Virtual Storyteller [28]. All three systems perform mappings at the story level for story generation. These CBR systems possess a case base of previously authored stories. When a system needs to generate a story satisfying certain constraints, one of the stories in the case base satisfying the maximum number of such constraints is retrieved, and later adapted if necessary through some adaptation mechanism. Reminiscent of CBR, MEXICA [33] performs mapping at the existents level in order to generate stories using an engagement/reflection cycle (also used in the Visual Daydreamer [25]). In particular, MEXICA represents the current state of the story as a graph, where each node is a character and each link represents their relation (e.g., "love" and "hate"). MEXICA maps the current state to the states in the pre-authored memories and retrieves the most similar one for the next action.

Based on Table 1, we can see that despite their different analogy methods and story representations, all systems perform analogy at the story or existents level. No attempts to date have been made to perform analogy solely at the events level, solely at the discourse level, or at the complete narrative level. These are some promising lines of future research, even though analogy at the narrative level may require considerably large structures to represent it. Moreover, the systems discussed in this section extend the range of aesthetic possibilities by generating stories beyond what is achievable by planning approaches.

5 A Case Study: Riu

Riu is a text-based interactive system that explores the same story-world as Memory, Reverie Machine (MRM) [34, 35]. Compared to MRM, which was developed on the framework of Harrell's conceptual-blending-based GRIOT system [13], Riu uses computational analogy to influence the narratives being generated. The goal of the Riu system is to recreate the intricate interplay between the subjective inner life of the main character and the material world through computational narrative. The system produces stories about a robot character, Ales, who initially lost his memories, and who constantly oscillates between his gradually recovering memory world and reality (we hereinafter use reality to refer to the main story world, in contrast with the memory world). Compared to planning-based systems, Riu generates narratives without a strong sense of an end goal. Instead, the events and existents in the memory world and reality trigger and influence one another. This theme of Riu, inspired by stream of consciousness literature such as Mrs. Dalloway [32], requires novel uses of analogy and is difficult to achieve by planning.
The representation formalism of both the story and the memory episodes is influenced by Talmy's force dynamics model [29]. It is composed of a sequence of phases, each of which is specified in a frame-based representation for every particular point in time containing all the existents. The protagonist Ales starts without any memories of the past, and gradually recollects them during the story through a two-staged analogical identification process: surface similarity and structural similarity. Triggers in the real world may cause the system to retrieve memories from Riu's pre-authored library of memories based on surface similarities. For instance, an opening door may cause the retrieval of a memory of the oil change tests because they are both tagged as producing squeaky noises. Among the set of memories retrieved by surface similarity, the one(s) sharing deep structural similarities with real-world events and existents will be recalled. An example of structural similarity is between Ales playing with a cat and him playing with a pet bird, because both share the structure (play Ales X), (animal X). Such structural similarity is identified by using SME [4] as part of the Riu system.

The Riu system also uses analogy for generation by bringing knowledge from the memories to the real world and vice versa. For example, when given multiple choices for action, Ales will "imagine" the consequence of each action A. First, a clause representing A is incorporated into the current state of the story, forming phase T0. Then, the system tries to find analogical mappings with the recollected memories. In particular, the system maps T0 to the first phase of each of the recollected memories. If, for a memory M composed of a sequence of phases S0, ..., Sn, a strong enough mapping is found between T0 and S0, then the system generates a collection of phases T1, ..., Tn by drawing analogy from M (i.e., what Ales "imagines" as the consequence of action A).

Figure 2 shows a sample interaction with Riu. The story starts when Ales finds a cat in the street. This encounter triggers one of his memories of a past pet bird. Three choices are presented at this point: the user can decide whether Ales will "play" with, "feed," or "ignore" the cat. The user first chooses to "play" with the cat. However, the strong analogy between "playing with the cat" and "playing with his bird" leads to the inference (generated by analogy) that "if Ales plays with the cat, the cat will die and he will be very sad."

Ales was walking on the street when he saw a cat in front of him.
When he was young, Ales used to have a bird. Ales was so fond of it that he played with it day after day. One day the bird died, leaving Ales very sad.
Ales hesitated for what to do with the cat.
(FEED IGNORE PLAY)
> play
No, I do not want the cat to die..., Ales thought.
(FEED IGNORE)
> feed
Ales took some food from his bag and gave it to the cat.

Fig. 2. An excerpt of user interaction with Riu.

In this case, the "cat" in T0 is mapped to "bird" in S0, and all appearances of "bird" are substituted by "cat" in the generation of T1 from S1. Such mappings are applied not only to individual existents such as "cat," but also to relations and actions. In the resulting T1, the cat is dead and Ales is sad, and hence Ales refuses to play with the cat. The story then continues after the user selects "feed" for the second time. This simplistic imagination of Ales would be hard to generate using a rational planning approach.
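A rough sketch of this imagination step is given below; the clause matching is a toy stand-in for the SME-based structural mapping that Riu actually uses, and all names and clause formats are illustrative.

```python
def structural_mapping(t0, s0):
    """Pair entities occupying the same slot in same-named clauses."""
    mapping = {}
    for (pred_t, *args_t) in t0:
        for (pred_s, *args_s) in s0:
            if pred_t == pred_s and len(args_t) == len(args_s):
                mapping.update(dict(zip(args_s, args_t)))
    return mapping

def imagine(t0, memory):
    """Generate T1..Tn from memory phases S1..Sn via the S0 -> T0 mapping."""
    mapping = structural_mapping(t0, memory[0])
    def substitute(clause):
        return tuple(mapping.get(term, term) for term in clause)
    return [{substitute(clause) for clause in phase} for phase in memory[1:]]

# T0: Ales considers playing with the cat; memory M: the pet-bird episode.
t0 = {("play", "ales", "cat"), ("animal", "cat")}
memory = [{("play", "ales", "bird"), ("animal", "bird")},   # S0
          {("dead", "bird"), ("sad", "ales")}]              # S1
print(imagine(t0, memory))   # -> [{("dead", "cat"), ("sad", "ales")}]
```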
6 Conclusions and Future Directions

In this paper, we have systematically explored the idea of generating stories using computational analogy. Although planning-based techniques have proven fruitful, analogy offers new narrative possibilities as a complement to the aesthetically goal-driven stories generated by planning. Drawing from narratology and computational analogy, we have presented an analytical framework consisting of three dimensions (narrative scope, analogy technique, and story representation) and used it to classify existing systems of analogy-based story generation. As a result, we have identified the trends of existing work and, importantly, areas requiring more attention. For instance, analogy solely at the discourse level, solely at the events level, and at the complete narrative level has not been explored. We have also seen that although the story representation formalism used plays a key role in enabling certain types of analogies, little effort has been put into theorizing its effects on different story generation systems. Additionally, we have presented a case study of the Riu system. The project not only explores new techniques of integrating analogy, but also demonstrates the potential of a new kind of narrative aesthetics.

Based on our analysis, we propose several interesting future lines of research. First, most work on analogy has focused on the story and existents levels. We believe that the reason is that these two narrative elements are relatively easy to represent using planning-based or frame-based representations. In addition, analogy may be applied to the unexplored scopes. One potential theoretical problem is how to represent discourse in ways which are amenable to analogy. Second, the impact of story representation formalisms (e.g., plot-point based, beat based, and planning-based) on analogy, and essentially on story aesthetics, needs to be further studied. Different representations afford different uses of analogy, and imbue certain narrative aesthetics. Third, analogy has been used for both generation purposes and identification purposes. Using analogy for identification purposes is interesting since it enables the development of hybrid story generation systems, which can combine analogy with planning or with other generative techniques. An exploration of the possibilities to create such hybrids, and of how such hybridizations affect the generative possibilities and the resulting aesthetics, is also a promising future research line. Finally, the goal-driven aesthetics of planning-generated stories is well known. Similarly, what is the complete range of aesthetic affordances of analogy? We believe the exploration of such questions may help us identify new generative techniques beyond planning and analogy.

Story Generation Driven by System-Modified Evaluation Validated by Human Judges

Pablo Gervás (Instituto de Tecnología del Conocimiento; pgervas@sip.ucm.es) and Carlos León (Departamento de Ingeniería del Software e Inteligencia Artificial; cleon@fdi.ucm.es)

(Research funded by the MICINN (GALANTE: TIN2006-14433-C02-01), UCM, Comunidad de Madrid (IVERNAO: CCG08-UCM/TIC-4300) and by BSCH-UCM.)

Abstract. Building systems which can transform their own generation processes can lead to the creation of novel high-quality artefacts. In this paper a solution based on evaluation is proposed. The generation process is driven by evaluation rules which can be modified by the system.
A panel of human evaluators provides feedback on the quality of the artifacts resulting after each modification. The system keeps track of which rules have been applied in the selection of each artifact, and learns indirectly from the human judges which modifications to retain, based on the relative ratings of the artifacts. Relevant details and difficulties of this approach are discussed.

1 Introduction

Societies of human creators are driven by two basic activities: creation of new artifacts (as performed by artists) and evaluation of newly created artifacts (as performed by artists and/or critics). Most of the efforts at modelling human creativity in computational terms in the past have focused on the task of creating artifacts. There are two strong arguments in favour of shifting the focus towards evaluation. First, developing models or algorithms for producing artifacts of a given kind tends to produce good/recognisable/typical artifacts of that type, rather than creative new ones. Innovation requires both departure from established procedures and the means for identifying when new results are good. Second, generate-and-test approaches constitute a simple computational way of rephrasing the task of creating artifacts in terms of the task of evaluating them. Very simple enumerative procedures for traversing a search space may yield surprisingly good results if driven by an appropriate evaluation function. If such a shift is taken to an extreme, the enumeration of the valid alternatives would not need to be altered in search for new artifacts; it would be enough to modify the evaluation function to obtain new candidate elements. Under this approach, the task of modifying creative procedures to obtain new artifacts would take the form of modifying the evaluation function.

In societies of human creators the development of an evaluation function (usually understood as artistic sensibility or equivalent abilities) is recognised as a fundamental requirement in the learning process of creative individuals. This learning process almost always takes the form of having instances of good artifacts pointed out.

This paper describes a system that outputs new artifacts obtained by exploring a restricted conceptual space under the guidance of a set of evaluation rules. The conceptual space to explore is that of sequences of events that may be understood as stories. The exploration procedure is exhaustive enumeration of the search space. The system starts off from an initial set of evaluation rules for selecting new artifacts as the conceptual space is explored. A method for actively modifying the set of evaluation rules is provided. Modifications of the evaluation rules lead to new artifacts. The system learns which of the modified rules to retain from the responses of a panel of human evaluators that act as audience for its production of new artifacts.

2 Previous Work

Boden [1, 2] divides creativity into exploratory creativity (exploring the common possibilities for creating artefacts) and transformational creativity (changing these common rules to find really new and valid objects). Jennings [3] hypothesizes that societies create the evaluation criteria of creativity in the individual's mind, thus leveraging the concept of creativity to a place beyond pure inner processes.
As such, creativity is learned and taught between individuals, and their relationships and the opinions that each one has about another have a strong influence on the ideas about the quality or novelty of artifacts. Autonomous creativity is the ability to change one's own standards without explicit direction from the outside. According to Jennings, autonomous creativity in humans is achieved through social interaction.

Ritchie [4] identifies the role of humans in Computational Creativity as still very necessary, given the current state of the art. In his model, it is stated that the role of humans must be clearly established before putting them in the generation loop. It is also hypothesized that human actions in the system should never be directly related to the generative objective of the system.

Wiggins [5, 6] defines a formalization of Computational Creativity processes in terms of their relation with classic Artificial Intelligence and the characteristics that separate pure exploration processes from those typically and only present in Computational Creativity. In his formalization, several sets are identified: U, the universe of concepts, containing the whole set of artefacts, and the conceptual spaces C0 · · · Cn, which are strict subsets of U, among others. Three functions are also important to mention: R, which establishes the constraints that define the conceptual space of valid results; T, which is the function that traverses this conceptual space and sets an order on the identification of artefacts in the Ci set constrained by R; and E, the function for evaluating artifacts.

3 Story Generation Based on Evaluation

The domain of story generation has been chosen to illustrate the ideas in this paper because it deals with artifacts that are easy to represent symbolically, are linear in nature, and, at a certain level of abstraction, have a complete conceptual space that may be specified by definition in terms of combinations of their constituent elements. Some of these points are sketched briefly below.

In terms of Wiggins' model, the simplest approach for a generation system that explicitly performs evaluation on the stories it generates could be the definition of the E function (Eg at this level) and a basic generative strategy which would generate all possible stories in the conceptual space. The generative strategy, corresponding to Wiggins' T function (Tg at this level), could be carried out by a simple backtracking generation in which each step adds a new event to the story (which therefore, after several steps in that branch, creates a whole story) and then backtracks to test another generative branch. Given a certain set of terminals like verbs, character names, places and valid time values, for instance, events in the form subject-verb-arguments can be easily generated. The stories can be considered to be sequences of events in the form {e1, e2, · · · , en} (where events would be conceptual statements corresponding to sentences like "Robert went to the park"). The evaluation function (Eg) would output a real value in the interval [−1, 1], −1 being a "very bad" story and 1 being a wonderful one. A value of 0 would represent a plain, normal story, acceptable but not "good". Thus we could obtain a total order for stories in which any threshold in the [−1, 1] range could be used to differentiate interesting stories (those falling above the threshold) from uninteresting ones.
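A minimal sketch of this exhaustive Tg strategy might look as follows: depth-first backtracking over subject-verb-argument events, yielding every story of a given length. The terminal sets and story length are illustrative, not from the paper.

```python
from itertools import product

SUBJECTS = ["Robert", "Laura"]
VERBS = ["went_to", "left"]
ARGUMENTS = ["the_park", "the_house"]

def events():
    """All subject-verb-argument events over the terminal sets."""
    return [(s, v, a) for s, v, a in product(SUBJECTS, VERBS, ARGUMENTS)]

def stories(length, prefix=()):
    """Depth-first enumeration of every event sequence of the given length."""
    if length == 0:
        yield prefix
        return
    for event in events():
        yield from stories(length - 1, prefix + (event,))  # extend, backtrack

# Filtering the enumeration with an evaluation function Eg and a threshold:
# good = (s for s in stories(3) if Eg(s) > 0.5)
print(sum(1 for _ in stories(2)))   # 8 possible events -> 64 two-event stories
```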
The Eg function could be composed of rules whose structure comprises a set of preconditions, considering the current partial evaluation and the current state of the story, and a set of effects that the application of those rules has on the final evaluation. A very simple evaluator would then process the story events iteratively, checking the preconditions and applying the effects, in such a way that the state of the evaluator (the partial set of variables that form the evaluation) is progressively updated for each processed event.

3.1 Evaluation-Driven Story Generation

The original definition of the T function given by Wiggins modelled the operation of identifying the next element in the conceptual space to be considered. Under a certain interpretation, this could be understood to refer to the actual construction process followed by the creative system to obtain its next result. In this case, the range of the T function defines system output. However, under a different interpretation, the T function would be the procedure for constructing the next element to be considered by the evaluation function E. As some of the candidates proposed by the T function will be rejected by the evaluation function E, system output in this case is defined by the interaction between the T and the E functions. In this paper we consider this second interpretation. Modifications of the E function will therefore control system output. For the purposes of this paper, plain random modification of the rules can be considered. More refined solutions may be considered; however, the system should not rely on the quality of any particular method of transformation. At this new level, we will also shift the responsibility for obtaining acceptable results to the evaluation process, in this case, the evaluation of the effects of the modified rules. For this higher-level evaluation we resort to a panel of judges.

3.2 Social Interaction Between Humans and Computers for Controlling Transformation

The human judges that evaluate stories are asked to produce plain values which are decided when reading stories. For every generated story, a single numeric value in the range [−1, 1] could be received from humans reading a story, as long as the variable to be obtained is clearly defined and it is just dependent on human criteria regarding stories. The proposed method for evaluating stories involves checking the available set of evaluation rules against each story. Only some of these rules will have their preconditions met, and therefore be applied to contribute to the final rating that the system assigns to the story. For every story S that is finally selected as system output, a record is kept of which evaluation rules contributed to establish its internal rating: the particular subset of the evaluation rules (the FS set) that contributed to its being selected as output. By combining this record with the evaluations obtained from the human judges, each rule in this subset FS could be assigned the rating that humans assigned to the story S. In this way, rules would receive several ratings coming from humans indirectly. This could be used, for instance, to keep the rules that produce good stories and discard rules creating bad stories according to human evaluation.
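As a sketch of this rule-based evaluation and the indirect rating of rules, the following assumes rules are (name, precondition, effect) triples; the rule transformation step and the generator are elided, and every name here is ours rather than from the system described.

```python
RULES = [
    ("reward_conflict", lambda state, ev: "attack" in ev, +0.3),
    ("penalise_repeat", lambda state, ev: ev == state.get("last"), -0.4),
]

def evaluate(story):
    """Fold the rules over the events; return a score in [-1, 1] + fired rules."""
    state, score, fired = {}, 0.0, set()
    for event in story:
        for name, precondition, effect in RULES:
            if precondition(state, event):
                score += effect
                fired.add(name)
        state["last"] = event
    return max(-1.0, min(1.0, score)), fired

def credit_rules(fired, judge_ratings, ledger):
    """Assign the judges' ratings of a surviving story to the rules (the F_S
    set) that selected it; averages can later decide which rules to keep."""
    for name in fired:
        ledger.setdefault(name, []).extend(judge_ratings)

ledger = {}
score, fired = evaluate(["meet", "attack", "flee"])
if score > 0:                                # story selected as system output
    credit_rules(fired, [0.5, 0.2], ledger)  # ratings from the human judges
print(score, ledger)
```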
Ritchie [7] points out the need to keep humans isolated from the final objective of the system. In the present case, this corresponds to ensuring that the human participants play no direct role in the actual generation of the stories. At a more specific level, since the system is transforming evaluation rules, the human judges must not directly add knowledge concerning the transformation of rules. In the described setup, human judges do not at any stage come into contact with the set of evaluation rules or the method used for transforming them. This constitutes a certain safeguard of system autonomy. Another aspect to take into account is whether the role played by the human judges in the proposed system could be seen as modelling real phenomena that occur in human creativity. We believe that it closely emulates the role played by critics and teachers in the formation of the creative capabilities of human creators. Along these lines, improvements to the present proposal could be contemplated. According to Jennings [3], the influence that external individuals have on generators depends on the relation between the generators and the evaluators. Issues like past agreement or mutual admiration may play a significant role in tempering actual feedback. For instance, it might be interesting to consider whether the learning process of the system could be refined by giving priority to the opinions of judges that have awarded good ratings in the past. The proposed solution would be inefficient: although the system might explore candidate artifacts at a fast rate and transform evaluation rules at speed, it relies on a stage of feedback from human judges that takes time (a number of stories must be read and evaluated by the judges). The system would have to undergo a learning process equivalent to that of human story writers receiving feedback from knowledgeable mentors. The current proposal restricts system output to a very specific conceptual space, and all system operations, whatever transformations are applied to the evaluation rules and whatever feedback is received from the judges, cannot lead to outputs beyond that conceptual space. In that sense, it could only aspire to be considered creative in an exploratory manner. Nonetheless, the system is explicitly transforming its own procedures in a search for better-valued artifacts. This aspect of creative professions, the continuous search for improvement through modification of one's procedures, has yet to be addressed in the computational creativity literature. The present proposal constitutes a first step in this direction.
2010_12 !2010
MEXICA-Impro: a Computational Model for Narrative Improvisation Rafael Pérez y Pérez, Santiago Negrete, Eduardo Peñalosa, Rafael Ávila, Vicente Castellanos, Christian Lemaitre División de Ciencias de la Comunicación y Diseño Universidad Autónoma Metropolitana Unidad Cuajimalpa México { rperez, snegrete, eduardop, ravila, vcastellanos, clemaitre}@correo.cua.uam.mx
Abstract. This paper describes a system that dynamically generates narratives through improvisation. MEXICA-impro is based on a cognitive account of the creative process called engagement-reflection. Its architecture defines a framework where two agents participate in a simulated improvisation session to generate the plot of a story, each drawing knowledge from a different database representing a cultural background.
A worked example is explained in detail to show how this approach produces novel stories that could not be generated before.
1 Introduction
Storytelling has a relatively long history in computational creativity research. Some well-known models of storytelling include TALESPIN (Meehan 1981), MINSTREL (Turner 1994) and FABULIST (Riedl 2004). Improvisation is a means by which creativity can be exercised in storytelling. Two or more participants (agents) intervene in a session where each contributes pieces of stories that are combined with those of the others as time passes, until a whole story is constructed. The intervention of each individual agent in the process, with a certain degree of unexpectedness and a personal point of view, makes the whole process interesting and different from a creative process in which only one agent participates. When several agents participate in a process where their beliefs (rules) are tested by exposure to those of other agents, we enter the realm of social interaction. Here, the notion of creativity acquires a new meaning: the creative process is seen as a social process where a cultural clash might result in something new for both originators. The interaction among computational agents is a metaphor for the processes of communication in a socio-cultural context: under controlled observation and with clear parameters, it can shed light on how human systems of meaning get started. Social cognition processes have an especially important role in a model such as MEXICA-Impro. We consider three elements of social cognition that influence the collaborative improvisation of narratives in our model: 1) communication resources, including the understanding of symbolic systems such as languages and key meanings facilitating mutual comprehension (Clark & Brennan, 1991); 2) availability of pieces of knowledge shared by the agents, such as concepts, principles, procedures or strategies, forming a convergent (shared) knowledge base in the agents' minds (Jeong & Chi, 2007); and 3) multiple perspectives, i.e. divergent pieces of knowledge (not shared by the agents), leading to flexible ways of analyzing the same phenomena. In a collaborative creative task such as constructing narratives, the agents first interact and generate a common ground, or mutual understanding space, while directing their efforts towards a common goal; second, in order to construct new stories, they rely on their common knowledge base, which should contain the minimal knowledge schemas or mental models needed to discuss the topics under analysis; and finally, according to cognitive flexibility theory (Scott, 1962), they make associations from multiple representations of the same information, such as their own representations compared to the other agent's ideas of the same phenomenon, a process providing the mental scaffolding necessary to consider novel applications of knowledge, or the emergence of new ideas. There needs to be a balance between what both agents know (common knowledge) and what they know individually (unique knowledge) to construct creative stories: when the two agents' knowledge bases share a large number of knowledge pieces, the plots they generate are very similar; when the agents hold a large amount of divergent knowledge, the plots they generate are completely different (Jeong & Chi, 2007).
2 MEXICA-impro: a computer model of narrative improvisation
MEXICA (Pérez y Pérez & Sharples 2001, 2004) is a computational model of plot generation on top of which our system is built. It was inspired by the engagement-reflection (E-R) cognitive account of writing as creative design (Sharples 1999). MEXICA-impro, then, is a computer model of narrative improvisation formed by two MEXICA agents working together as a team to develop a story plot. Agent 1 is also called the leader, and agent 2 the follower. The leader starts the improvisation and decides when it finishes (although in future versions both should be able to decide when to finish). The leader generates material through one complete E-R cycle and then cues the follower to continue the narrative. The follower takes the material generated so far, progresses the story through one complete E-R cycle, and cues the leader to continue the narrative, and so on. Each agent in MEXICA-impro is formed by two main modules: the construction of knowledge structures (the K-S module) and the generation of plots through engagement-reflection cycles (the E-R module). The K-S module takes as input two text files defined by the user: a dictionary of valid story-actions and a set of stories known as Previous Stories. The dictionary of story-actions includes the names of all actions that can be performed by a character within a narrative, along with a list of preconditions and postconditions for each. The Previous Stories are sequences of story-actions that represent well-formed narratives. With this information the system builds its knowledge base from structures known as atoms. Atoms represent (in terms of emotional links and tensions between characters) potential situations that can happen in the story-world, and have an associated set of possible actions to be performed when that situation occurs. For example, an atom might represent the situation where a knight is in love with the princess, and it might have associated the action "the knight buys flowers for the princess" as a possible action to be performed by the lover. In MEXICA-impro each agent has different story-actions and/or a different set of previous stories. Thus, the same situation might lead each agent to perform different actions. The E-R module takes two main inputs: the knowledge base and an initial story-action provided by the user of the system (e.g., the princess heals jaguar knight) that sets the E-R cycle in motion. During engagement, an agent generates sequences of actions guided by rhetorical and content constraints; during reflection, the agent breaks impasses, evaluates, and, if necessary, modifies the material generated so far. It works as follows: the system starts in engagement; the postconditions of the initial action are triggered, generating a story-world context; the story-world context is employed as a cue to probe memory and match an atom; then, the system retrieves the actions associated with the atom, selects one at random and updates the story-world context. After that, the engagement cycle starts again: the system attempts to match an atom that is equal to the current story-world context, and if it fails, the agent looks for an atom that is similar to the current story-world context. After generating three actions (this number can be modified by the user), the system switches to reflection.
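One engagement step, as described above, can be sketched as follows. The representation is deliberately simplified: real MEXICA atoms encode emotional links and tensions between characters, and the similarity measure and threshold used here are our assumptions.

import random

class Atom:
    def __init__(self, context, actions):
        self.context = frozenset(context)   # emotional links / tensions
        self.actions = actions              # actions associated with it

def similarity(ctx, atom):
    # Fraction of the atom's context present in the current context.
    if not atom.context:
        return 0.0
    return len(atom.context & ctx) / len(atom.context)

def engagement_step(context, atoms, min_similarity=0.5):
    # Try an exact match first; fall back to the most similar atom.
    # Returns a next action, or None (an impasse is declared).
    ctx = frozenset(context)
    exact = [a for a in atoms if a.context == ctx]
    pool = exact or [a for a in atoms
                     if similarity(ctx, a) >= min_similarity]
    if not pool:
        return None                          # impasse
    atom = max(pool, key=lambda a: similarity(ctx, a))
    return random.choice(atom.actions)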
During reflection the system verifies that the preconditions of all actions are satisfied; if necessary, the agent inserts actions to satisfy them; then, it evaluates the material generated so far. At this point, an E-R cycle ends.
3 Generating narratives via improvisation
The MEXICA-impro project has an important goal: the stories generated by the collaborating agents should not be ones that either agent could develop alone. If, furthermore, the story produced by the collaborating agents cannot be found within their knowledge bases, we refer to it as a collectively-creative story. Collectively-creative narratives are produced by providing each of our agents with different computational representations of culture, experience (knowledge base) and personality. Let us elaborate on this idea. When one talks about knowledge bases in the context of the MEXICA-impro project, at least three situations can be described. The first has to do with providing both agents with the same knowledge base. In this case the stories generated cannot be classified as collectively-creative because each agent can develop the same story alone. The second consists of providing both agents with completely different knowledge bases. In this case we might be able to produce collectively-creative stories; however, there is a high risk that our agents cannot progress the story as a team due to the lack of shared experiences. The third possibility involves providing both agents with partially different knowledge bases. In this case, we expect to produce collectively-creative stories through fluid collaboration between the agents. This is the case we are interested in, and in this paper we report some tests performed employing partially different knowledge bases. There are several issues related to dissimilar knowledge bases that should be discussed: how can the dissimilarity between two knowledge bases be measured? What similarity/dissimilarity ratio between two knowledge bases generates the best collectively-creative stories? And so on. However, due to space limitations, in this document we focus exclusively on explaining the core characteristics of our examples. As explained earlier, the knowledge base is built from the dictionary of story-actions and the set of previous stories. For this example, both agents employ the same actions; only the previous stories are different. From now onwards, the file of previous stories of agent 1 will be referred to as PH1 and that of agent 2 as PH2. PH1 includes seven previous stories; PH2 includes six. Each story in PH1 shares a similar plot with at least one story in PH2; sometimes they are very similar and sometimes they share only a few elements.
[Figure 1 near here: a grid indexed by number of tensions (0-8, horizontal) and number of emotional links (0-8, vertical), each cell listing atom counts of the form "count1-count2-(shared)".]
Figure 1. Partial map of atoms. The first digit in each cell indicates the number of atoms in the knowledge base of agent 1; the second digit indicates the number of atoms in the knowledge base of agent 2; numbers in parentheses indicate the shared atoms.
For example, story four in PH1 and story three in PH2 differ only in their first action; the rest of the actions and the characters participating in the story are alike.
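The map of Figure 1 can be thought of as the output of a simple bucketing computation over the two knowledge bases. The following sketch assumes atoms expose counting functions n_tensions and n_emotions and a comparable context (reusing the simplified Atom above); these names and the counting scheme are ours, not the system's.

from collections import defaultdict

def atom_map(atoms1, atoms2, n_tensions, n_emotions):
    # Bucket atoms by (tensions, emotional links) and count, per cell,
    # how many belong to agent 1, to agent 2, and to both (shared).
    cells = defaultdict(lambda: [0, 0, 0])   # [agent1, agent2, shared]
    shared = {a.context for a in atoms1} & {a.context for a in atoms2}
    for i, atoms in enumerate((atoms1, atoms2)):
        for a in atoms:
            key = (n_tensions(a), n_emotions(a))
            cells[key][i] += 1
            if i == 0 and a.context in shared:
                cells[key][2] += 1
    return cells   # {(tensions, emotions): [count1, count2, shared]}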
However, story one in PH1 shares only a few actions with story one in PH2. The knowledge bases built from PH1 and PH2 are partially represented in Figure 1 (for reasons of space we only show half of the map). This map allows the atoms of each agent to be compared in terms of the number of components they have. All atoms are composed of emotional links and tensions between characters (see Pérez y Pérez 2007 for details on how atoms are built). The horizontal axis indicates the number of tensions and the vertical axis the number of emotional links that each atom contains. Each entry in the map contains figures that indicate the number of atoms at that position: the first digit in each cell indicates the number of atoms that belong to the knowledge base of agent 1; the second digit indicates the number that belong to the knowledge base of agent 2; numbers in parentheses indicate those atoms that are identical in both knowledge bases. For example, the position (0 tensions, 2 emotions) shows that both agents share one identical atom, and that each agent has 2 unique atoms. As the reader can observe, agent 1 and agent 2 share only nine identical atoms. Atoms located at the same position, or close to each other in the map, share some characteristics; they are similar but not identical. Finally, a few atoms are very different from the rest. In this way, we obtain two knowledge bases that are similar but not identical.
4 A story generated by MEXICA-impro
MEXICA-impro is set to produce three actions during engagement and then switch to reflection. From now on agent 1 is referred to as the leader and agent 2 as the follower. The user provides the following first action:
(0) princess cured jaguar knight
The number on the left indicates that this action was produced at time = 0. The leader starts an E-R cycle; during engagement the following actions are retrieved from memory:
(0) princess cured jaguar knight
(1) enemy kidnapped princess
(2) enemy attacked princess
(3) jaguar knight looked for and found enemy
At time = 1 the enemy kidnaps the princess, at time = 2 the enemy attacks her, and at time = 3 the knight decides to look for the enemy. Now, the leader switches to reflection.
(4) jaguar knight is introduced in the story
(5) princess is introduced in the story
(7) hunter is introduced in the story
(9) hunter tried to hug and kiss jaguar knight
(8) jaguar knight decided to exile hunter
(10) hunter went back to Texcoco Lake
(6) hunter wounded jaguar knight
(0) princess cured jaguar knight
(1) enemy kidnapped princess
(11) enemy got intensely jealous of princess
(2) enemy attacked princess
(3) jaguar knight looked for and found enemy
The system introduces the princess and the jaguar knight into the story. Then, the system needs to justify why the princess healed the knight, so it inserts the action where the hunter injured the knight at time = 6. Since there is a new actor, the system introduces the hunter into the story (time = 7). The leader now needs to justify why the hunter wounded the knight; so, it inserts the action where the knight exiled the hunter (time = 8). Why did the knight do that? Because the hunter attempted an excessive demonstration of love towards the knight (time = 9). However, because the hunter was exiled, his position inside the story-world changed. Therefore, in order to wound the knight, he first had to move back to the lake (where the knight is located).
The system detects this situation and moves the hunter back to Texcoco Lake (time = 10). Finally, MEXICA-impro needs to justify why the enemy decided to attack the princess, so it inserts the action at time = 11. At this point, all preconditions are satisfied; the leader ends its first E-R cycle and cues the follower to continue the story. The follower starts its E-R cycle and during engagement generates two actions:
(12) jaguar knight had an accident
(13) enemy decided to sacrifice jaguar knight
So, after the knight finds the enemy, the follower continues the story by inserting an action where the knight suffers an accident (time = 12) and the enemy decides to kill him (time = 13). The follower cannot match an atom in memory to continue the story and an impasse is declared. Thus, the system switches to reflection.
jaguar knight is introduced in the story
princess is introduced in the story
hunter is introduced in the story
hunter tried to hug and kiss jaguar knight
jaguar knight decided to exile hunter
hunter went back to Texcoco Lake
hunter wounded jaguar knight
princess cured jaguar knight
enemy kidnapped princess
enemy got intensely jealous of princess
enemy attacked princess
jaguar knight looked for and found enemy
(12) jaguar knight had an accident
(13) enemy decided to sacrifice jaguar knight
(14) hunter found by accident jaguar knight (breaking impasse)
All preconditions are satisfied, so the system appends to the story in progress the action where the hunter accidentally finds the knight (time = 14) to try to break the impasse. The E-R cycle ends and the follower cues the leader to continue the story. The leader attempts to match an atom in memory; however, it fails and an impasse is declared. This is not surprising, because none of the leader's previous stories includes a scene where a hero goes to rescue a victim and instead suffers an accident. Now the leader switches to reflection. All preconditions are fulfilled, and the system inserts the action where the hunter killed the knight to try to break the impasse.
(15) hunter killed jaguar knight (breaking impasse)
This produces an interesting situation. One would expect the enemy to kill the knight; however, the hunter, who hated the knight, is reintroduced into the story and performs the murder. There is no similar precedent in the previous stories of either agent. The leader cues the follower to continue the story. The follower tries to match an atom, but again an impasse is declared during engagement; the system switches to reflection to try to break the impasse and inserts the action where the hunter killed himself.
(16) hunter committed suicide (breaking impasse)
The follower evaluates the story in progress and decides that the story is complete. So, the follower cues the leader to continue the story and informs the leader of its decision to finish the story. The leader receives the information and nevertheless tries to advance the story. During engagement it cannot match an atom in memory and an impasse is declared. During reflection it cannot break the impasse. So, the leader decides to finish the story.
This is the plot that both agents built together:
*** Final Story
(4) jaguar knight is introduced in the story (l)
(5) princess is introduced in the story (l)
(7) hunter is introduced in the story (l)
(9) hunter tried to hug and kiss jaguar knight (l)
(8) jaguar knight decided to exile hunter (l)
(10) hunter went back to Texcoco Lake (l)
(6) hunter wounded jaguar knight (l)
(0) princess cured jaguar knight
(1) enemy kidnapped princess (l)
(11) enemy got intensely jealous of princess (l)
(2) enemy attacked princess (l)
(3) jaguar knight looked for and found enemy (l)
(12) jaguar knight had an accident (f)
(13) enemy decided to sacrifice jaguar knight (f)
(14) hunter found by accident jaguar knight (f)
(15) hunter killed jaguar knight (l)
(16) hunter committed suicide (f)
What makes this story original is its conclusion: a hero goes to rescue a victim but instead suffers an accident that leads to his murder, not by the enemy, but by an old, resentful rival who is suddenly reintroduced into the plot. There is no similar story in either PH1 or PH2. The letter on the right indicates whether the action was generated by the leader (l) or by the follower (f). Actions in italics were generated during reflection. In this example the leader performed three E-R cycles while the follower performed two. The leader contributed 12 actions while the follower contributed 4. This difference arose because the initial action provided by the user required the leader to insert several actions during reflection to satisfy preconditions. During its first E-R cycle the leader produced 11 actions, so almost its whole contribution to the narrative was generated during its first participation (i.e. during its first E-R cycle). During its second participation the leader inserted one action to try to break an impasse, and during its final participation the leader was not able to contribute to the story. The follower contributed three actions during its first participation and one during its final participation. Both agents generated more actions during their first E-R cycles because, as the narrative unfolded, the story-world context became more complex and novel, and it became more difficult to match an atom. The sequence of actions generated by the leader during its first participation (actions at times 1 to 11) produced a context that was novel to both the leader and the follower. That is, neither the leader's knowledge base nor the follower's contained an atom equal to the current story-world context. This novelty arose as a result of the heuristics employed to satisfy preconditions. The production of novel contexts is a normal and necessary situation when MEXICA generates stories: novel contexts arise, and MEXICA looks for atoms similar to the current story-world context and then retrieves their associated actions to advance the narrative in progress. In this way MEXICA is able to create novel narratives. However, in MEXICA-impro the follower receives unknown material (in this case produced by the leader) that must be progressed coherently: the actions chosen to continue the story must connect with the previous ones, the relationships between characters must be preserved, and so on. This is a difficult task. The E-R model provides the necessary elements to achieve this goal. We hypothesized that if the knowledge bases of the two agents were similar enough, the agents would interact without problems.
However, in this case, the sequence of actions generated at times 1 to 11 produces a novel context for both agents. This characteristic is positive because the system generates original situations to push the story forward instead of just copying the content of its knowledge base. However, if the context is "too novel", the system is not capable of matching an atom in memory and an impasse is declared. In this case, the follower was able to retrieve an action during engagement to continue the narrative. Due to lack of space, it is not possible to explain the details of how the follower matched the atom and retrieved the action at time 12, but it is important to mention that the matched atom only satisfied the minimum requirements to be considered similar to the current story context. That is, because the context was quite novel, the follower came close to declaring an impasse. Nevertheless, agent 2 was able to continue the story. Would the leader have been able to match an atom employing the same story context? This question leads to a more important one: would the leader be able to generate the same story alone? To answer these questions we ran a second test. We forced agent 1 to generate exactly the same first 11 actions and observed whether it could continue the story alone. The result was that, after generating the first eleven actions again, agent 1 was not capable of matching any atom in memory. Thus, an impasse was declared and the system switched to reflection to try to break it. Employing the heuristics designed for this purpose, the system inserted an action where the jaguar knight made the enemy a prisoner. Then, the system considered the story complete and decided to finish it. The story generated by agent 1 alone is thus shorter, and its conclusion is not as original as the conclusion of the story generated by both agents (one could easily expect the knight to take the enemy prisoner). Because its knowledge base does not include the necessary knowledge, the leader could never produce on its own the same tale produced by MEXICA-impro. Would agent 2 be capable of generating the same first 11 actions alone and then continuing the story? To answer this question we attempted to force agent 2 to produce the same initial sequence of actions. However, the content of its knowledge base made this impossible. Agent 2 could not come up with the proposal that, after the princess cured the knight, a logical next event was for the enemy to kidnap her. Thus, this agent could not generate the desired sequence of actions. As mentioned earlier, collectively-creative narratives are produced by providing each of our agents with different computational representations of culture, experience and personality. In the current version of MEXICA-impro, these characteristics are represented in the system's knowledge base. Because the story produced by MEXICA-impro is novel and could not be produced by either of the agents alone, we consider it a collectively-creative narrative. This is a nice example of how cooperation between the two agents allowed a novel story to be produced.
5 Conclusions
MEXICA and other systems have explored in the past how narratives can be created automatically according to different cognitive models and ideas.
MEXICA has been successful in representing in computational terms the engagement-reflection account of the human creative process, especially through its capability to 'reflect' on partial stories and adjust subsequent generation cycles. One of the main characteristics of creativity, however, one noted by many authors (e.g. Boden, 1990), is that products of the process need to be novel to a community, whether a particular group or society at large. This constraint on a product being deemed 'creative' takes the problem of building models and systems of creative processes into the realm of the social: not only do the outcomes need to be sound and interesting, they also need to be new to the community. Creative agents, then, need to take the community's knowledge into account when trying to come up with something new. Our project explores an approach to creativity, namely the use of improvisation in story generation as a metaphor of social reproduction. We believe that creativity is achieved by confronting established local (or global) lore with new, different knowledge and practice. Improvisation is a well-known creative experimental medium where two or more different worlds collide in an organised environment to establish the ground for innovative, amusing and relevant knowledge, artwork or otherwise. MEXICA-impro has provided a good starting point for our endeavour, since it allows us to redefine the creative process as a dialogue between two improvising agents that draw their information from different databases regarded as the contexts of different cultures. From the methodological point of view, our project establishes right from the outset a multidisciplinary approach to a multidisciplinary subject. What creativity is and how it can be modelled and studied can only be investigated by involving all the relevant disciplines. Our system possesses an architecture that provides all members of the group with a clear knowledge of all the relevant mechanisms and parameters at stake, in such a way that everyone can participate almost from the start in discussions about future design and experiments. MEXICA-impro simulates the interaction between two agents with separate cultural backgrounds. The resulting system has become an experimental zone at the crossroads of several disciplines. The members of the group developing it come from backgrounds as diverse as A.I., Film Studies, Sociology and Psychology. We believe that we, ourselves, have set out on an engagement-reflection journey to explore in a multidisciplinary way the possibilities of creativity.
2010_13 !2010
Curious Whispers: An Embodied Artificial Creative System Rob Saunders1, Petra Gemeinboeck2, Adrian Lombard1, Dan Bourke1, and Baki Kocabali1 1 Faculty of Architecture, Design and Planning, University of Sydney, Australia 2 College of Fine Arts, University of New South Wales, Australia
Abstract. Creativity, whether or not it is computational, does not occur in a vacuum; it is a situated, embodied activity that is connected with cultural, social, personal and physical contexts. Artificial creative systems are computational models that attempt to capture personal, social and cultural aspects of human creativity. The physical embodiment of artificial creative systems presents significant challenges and opportunities. This paper introduces the "Curious Whispers" project, an attempt to embody an artificial creative system as a collection of autonomous mobile robots that communicate through simple "songs".
The challenges of developing an autonomous robotic platform suitable for constructing artificial creative systems are discussed. We conclude by examining some of the opportunities of this embodied approach to computational creativity.
1 Introduction
Human creativity is situated within cultural, social and personal contexts. From a computational perspective this suggests that the processes involved in creativity should be open to the environment, other creative agents, and a history of creative works. Physical embodiment is an important aspect of human creativity that presents significant challenges and opportunities for the development of computational creativity. Katherine Hayles argues that embodiment is always contextual, enmeshed within the specifics of place, time, physiology and culture, which together compose enactment [1]. Following Pickering [2], creativity cannot be properly understood, or modelled, without an account of how it emerges from the encounter between the world and intrinsically active, exploratory and productively playful agents. The world offers opportunities, as well as presenting constraints: human creativity has evolved to exploit the former and overcome the latter, and in doing both, the structure of creative processes emerges. Why is embodiment important for computational creativity? The enactment described by Hayles emphasises creativity as a situated act, e.g., in personal histories, social relations and cultural identity. The computational study of situated cognition as proposed by Clancey [3] does not require physical embodiment, but many of the more successful examples of situated computational systems are robotic in nature. Perhaps this is because, despite every effort that a developer might make to maintain a separation, there is always the sense that agent and environment are of the same `type' within a simulation, and consequently that the agent is not truly situated within the environment. Physical embodiment requires that agents deal with the material nature of the creative activity they engage in; the importance of working with an external material in creative activity was highlighted by Schön's work studying designers and the process he termed reflection-in-action [4]. Schön's reflection-in-action illustrates the utility of ideas from distributed cognition [5] in understanding the creative acts of designers, providing insights into the situated nature of creative cognitive processes. Distributed cognition and reflection-in-action provide useful frameworks for designing artificial creative systems because they emphasise the relationship between the agent and its environment. The implementation of autonomous robots imposes constraints upon the hardware and software that can be incorporated. These constraints focus the development process on the most important aspects of the computational model. At the same time, embodiment provides opportunities for agents to experience the emergence of effects beyond the computational limits they must work within. Taking advantage of properties of the physical environment that would be difficult or impossible to simulate computationally expands the behavioural range of the agents [6]. Finally, embodiment allows computational agents to be creative in environments that humans can intuitively understand. As Penny [7] describes, embodied cultural agents whose function is self-reflexive engage the public in a consideration of the nature of agency itself.
In the context of the study of computational creativity, this provides an opportunity for engaging a broad audience in the questions raised by models of artificial creative systems. Curious Whispers is a project to investigate the nature of embodiment in an artificial creative system and to explore the potential of placing this artificial society within a human physical and social environment.
2 Background
In 1738, Jacques de Vaucanson exhibited his Flute Player automaton. In 1769, Baron Wolfgang von Kempelen presented to the public his chess-playing Mechanical Turk; it was not until 1834 that an article appeared in Le Magazin Pittoresque revealing its inner workings and the man hidden within [8]. In developing these machines, both Vaucanson and von Kempelen engaged the public in philosophical questions about the nature of creativity and the possibilities of automation [9]. Our apparent fascination with the prospect of building machines that can exhibit creative behaviour continues today with the development of embodied agents as robots. Following Vaucanson, many of these robotic experiments are within the domain of music. Ja'maa is a percussion ensemble for human and robotic players, including Haile, a robotic drummer that listens to the drumming of human players and responds with its own improvisations [10]. Eigenfeldt has developed software-based multi-agent systems to emulate improvised percussion ensembles [11] and has embodied these agents within a robotic performer, MahaDeviBot [12]. DrawBots [13] and Mbots [14] are two examples of recent attempts to develop robots capable of exhibiting creative behaviour in the production of abstract drawings. Portraitist robots have been implemented [15, 16] but, while these projects have overcome significant technical challenges, they have mostly neglected to examine issues associated with embodied creativity. Cagli et al. [17] propose studying the behaviour of realistic drawing in order to focus on the physical aspects of the creative process. In particular, they focus on visuomotor coordination and present a control architecture based on computational models of eye movements and the eye-hand coordination of expert draughtsmen. For the development of computational models of creativity, one of the key advantages of embodiment within a physical and social environment may be the access it brings to a cultural context beyond the confines of the computational elements. As Penny [7] observes in relation to his embodied cultural agents, "viewers (necessarily) interpret the behavior of the robot in terms of their own life experience. [...] The machine is ascribed complexities which it does not possess. This observation emphasises the culturally situated nature of the interaction. The vast amount of what is construed to be the `knowledge of the robot' is in fact located in the cultural environment, is projected upon the robot by the viewer and is in no way contained in the robot." In Penny's works, the robots are viewed within the context of their cultural environment, but this has no impact upon the intrinsic behaviour of the robots, which have no access to the situation that the audience brings.
2.1 Curious Agents
Martindale [18] proposes that the search for novelty is a key motivation for individuals within creative societies.
Curious agents embody a computational model of curiosity based on studies of humans and other animals, where curiosity is triggered by a perceived lack of knowledge about a situation and motivates behaviour to reduce uncertainty through exploration [19]. Unlike earlier models of creative processes that try to maximise some utility function, curious agents are motivated to discover something `interesting' based on their previous experiences, using a hedonic function, the Wundt curve (see Fig. 1). Curious agents provide a useful foundation for developing embodied agents to engage in an artificial creative system because they have been shown to be useful in modelling autonomous creative behaviour and have been used in robots to promote life-long learning in novel environments. Schmidhuber [20] presents a model of interest and curiosity based on the compressibility of information, and introduced a distributed model of curiosity based on a pair of agents competing to surprise each other [21]. Saunders focused on the role of curiosity in creativity, developing computational models of creativity that search for novelty and interest in design [22]. In these models, the computation of interest and boredom is based on novelty detection, a technology that was originally developed to detect potential faults in processes where it is critical to stay within "normal" operating limits. Unlike in monitoring applications, novelty is considered a desirable quality when modelling curiosity, and detected novelty is used as the basis for positive reinforcement of behaviour.
Fig. 1. The Wundt Curve: an example hedonic function for curious agents and robots.
Research developing embodied curious agents has focussed on the utility of modelling curiosity as a motivation for learning about physical environments and social relations. Marsland et al. [23] introduced the idea of "neotaxis", movement based on perceived novelty, as a useful behaviour for autonomous robots mapping physical spaces. Peters [24] presented the WRAITH algorithm as a layered architecture for building curious robots suitable for modelling creativity. Oudeyer and Kaplan [25] present the use of curiosity to support the discovery of communication in social robots. Merrick [26] presents an architecture for curious, reconfigurable robots for creative play that can learn new behaviours in response to changes in their structure. When computational models of curiosity are used as the model of motivation in intelligent environments, a new kind of space emerges: a curious place [27]. Curious places are intelligent environments that use curious agents to adapt to changing user behaviour and anticipate user demands. Curious places offer new opportunities for supporting and embodying creativity in the physical environment. In addition to supporting human activities, curious places work proactively to anticipate, identify and enact creative behaviour.
2.2 Artificial Creative Systems
The Domain Individual Field Interaction (DIFI) framework is a unified approach to studying human creativity that provides an integrated view of individual creativity within a social and cultural context [28]. According to this framework, a creative system has three interactive subsystems: domain, individual and field. A domain is an organised body of knowledge, including specialised languages, rules, and technologies. An individual is the generator of new works in a creative system, based on their knowledge of the domain.
A field contains all individuals who can affect the content of a domain, e.g., creators, audiences, critics, and educators. The interactions between individuals, fields and domains form the basis of the creative process in the DIFI framework: individuals acquire knowledge from domains and propose new knowledge that is evaluated by the field; if the field accepts a proposed addition, it becomes part of the domain and available for use by other individuals. Inspired by the DIFI model of creativity, Saunders and Gero used curious agents to develop artificial creative systems composed of curious design agents capable of independently generating, evaluating, communicating and recording works [29]. Other distributed approaches to computationally modelling creativity include McCormack's "ecosystemic" approach, which recognises the importance of the environment, and the agent's relationship with the environment, as primary concerns for modelling creative activity [30].
3 Implementation
Building on the previous work of Saunders [22], we are currently developing the Curious Whispers project in an attempt to build an artificial creative system using embodied curious agents, i.e., curious robots. Inspired by the thought experiments of Braitenberg [31], we have implemented the robots as simple vehicles with the addition of a loudspeaker, a pair of microphones and sufficient processing units to determine their `interest' in the sonic environment. The robot architecture has been developed as a set of function-specific modules: audio capture and processing, song categorisation and analysis, interest and boredom calculations, sound generation and output, and servo and motor control. The Curious Whispers robots have been built on top of the Ardubot bare-bones mobile robot platform developed by Sparkfun Electronics (http://www.sparkfun.com/). The Ardubot platform was designed as a minimal, low-cost platform for developing mobile robots using the Arduino (http://www.arduino.cc/) interface boards. It is based around an oversized expansion board for the Arduino integrating a DC motor driver integrated circuit (IC) and a pair of motor mounts. Arduino and Atmel ATmega168 microcontrollers are used for this application due to their relatively fast operation speed (20 MIPS), ample memory (16KB flash, 1KB SRAM), flexibility, compatibility and affordability. The Arduino acts as the primary interface between the ATmega168 microcontroller and the other components attached to the robot, e.g., the DC motor driver, sound generation chips, etc. This provides simple access for programming the ATmega168 and offers expandability through the use of "shields" that can be stacked on top of the Arduino to provide additional functionality. A custom shield has been developed to provide the sound generation, sound capture and sound processing functions to the Arduino. To produce the audio signal that drives the loudspeaker, each robot is equipped with an FM synthesis subsystem based around a Soundgin audio processor (http://oopic.com/soundgin/). The Soundgin processor has two independent sound engines, each with three oscillators and a mixer, providing a large variety of possible sounds. To allow the robots to move about their environment without damaging themselves, each robot has a pair of front-facing "whiskers" attached to a touch sensor, allowing the robot to stop and back away from obstacles it encounters.
3.1 Audio Capture and Processing
Two audio signals are captured by small microphones mounted on lightweight movable arms.
The ATmega168 microcontroller performs a 64-point, fixed-point Fast Fourier Transform (FFT) on each of the audio signals. Using this 64-point FFT, a sampling rate of 16kHz and a frequency resolution of 250Hz, we achieve a Nyquist frequency of 8kHz, i.e., we have 32 frequency bands at 0Hz, 250Hz, 500Hz, 750Hz, 1kHz, ..., 8kHz. This is a sufficient frequency range for our application, since we do not generate sounds above 8kHz. The onboard 16KB of memory can hold enough samples to perform two 64-point FFT calculations; therefore the robots can monitor a stereo pair of signals, enabling the left-right interest differencing suitable for driving neotaxis. The result of the FFT calculation is passed to the Arduino board for analysis and processing. The Arduino board monitors a stream of serial data from the FFT calculation on the left and right audio channels. Each sample is represented as an integer value between 0 and 63 representing the most active frequency detected by the FFT calculation: values close to 0 represent bass sounds. When the dominant frequency detected by the FFT changes, in either the left or right audio channel, the values for both channels are appended to short-term memory. The values in the short-term memory represent the "song" the robot is hearing.
3.2 Novelty, Interest and Boredom
When short-term memory contains a total of eight frequencies, the values are packaged as a vector and presented to a small Self-Organising Map (SOM) that serves as the robot's long-term memory [32]. Due to the limitations of the hardware platform, the SOM contains just 16 neurons, but this has proved sufficient for the task of categorising the eight-note songs that the robots are capable of producing. In contrast with typical applications of categorisation systems, the robots do not attempt to maintain a complete map of the space of all possible songs; rather, each robot constructs a local map of recently experienced songs. The novelty of a song is calculated as the shortest Euclidean distance between the vector representation of the song and all of the prototypes held in the SOM. To calculate the interest that the robot has in the current song, a non-linear function, which approximates the Wundt curve as the sum of two sigmoids, is used to transform the novelty value. Consequently, a song stored in short-term memory that exactly matches an existing song in the SOM is not particularly interesting, and a song that is radically different from anything the robot has previously experienced is also not very interesting. The most interesting songs for these robots are songs that are similar to, but different from, the songs recently experienced by the robot and held in the SOM. Interest values calculated for the audio signals received from the left and right microphones are translated into movement such that the interest value for the left channel is converted into a speed for the right wheel, and vice versa. Figure 2 illustrates a scenario for neotaxis as implemented in our robots, where one of the robots, having analysed the songs of two other robots (A and B), moves in the direction of the robot that has produced the more interesting song.
Fig. 2. The robots in the Curious Whispers project implement neotaxis, driving in the direction of the most interesting novelty.
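The novelty and interest computations described above can be sketched as follows. The normalisation, sigmoid slopes and midpoints are illustrative assumptions; the text specifies only that interest approximates the Wundt curve as a sum of two sigmoids and that left-channel interest drives the right wheel.

import math

def novelty(song, prototypes):
    # Shortest Euclidean distance between the song vector and any SOM
    # prototype, normalised to [0, 1] by the largest possible distance.
    d = min(math.dist(song, p) for p in prototypes)
    max_d = math.dist([0] * len(song), [63] * len(song))
    return d / max_d

def wundt(n, rise=5.0, fall=5.0, mid_reward=0.3, mid_penalty=0.7):
    # Two sigmoids: a reward for moderate novelty minus a penalty for
    # high novelty, so interest peaks at similar-but-different songs.
    reward = 1.0 / (1.0 + math.exp(-rise * (n - mid_reward)))
    penalty = 1.0 / (1.0 + math.exp(-fall * (n - mid_penalty)))
    return reward - penalty

def wheel_speeds(left_song, right_song, prototypes, gain=100.0):
    # Neotaxis: the left channel's interest drives the right wheel and
    # vice versa, turning the robot towards the more interesting side.
    left = wundt(novelty(left_song, prototypes))
    right = wundt(novelty(right_song, prototypes))
    return gain * right, gain * left   # (left wheel, right wheel)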
The architecture includes: audio analysis (FFT); short-term memory (STM); long-term memory (LTM); novelty detection (NVD); and song generation (GEN).
In the absence of interesting songs a robot becomes `bored'. Boredom is computationally modelled as a threshold on the long-term level of interest that the robot has had in recent songs. If the robot becomes bored, it changes from a listening mode to a generating mode.
3.3 Song Generation
Generative systems are often computationally expensive, in terms of both generation and analysis. The limited computational resources available on the autonomous robots have required that the generation of new works be handled differently from previous artificial creative systems. Firstly, a simple generative system has been implemented that takes advantage of the long-term memory of stored patterns to generate similar-but-different songs. To generate a new song, the agent either mutates a pattern randomly chosen from the prototypes stored in the SOM, or chooses two prototypes and combines them using an operation similar to the crossover operation used in genetic algorithms (a sketch of this generating mode is given at the end of this section). Secondly, the analysis of generated songs takes advantage of the embodied nature of the robots to reuse the analysis systems already present. In the generative mode the robot changes its physical configuration by moving its left and right microphones closer to its speaker and reducing the speaker volume. This reconfiguration allows the robot to listen exclusively to its own songs. To bootstrap the system, all robots begin in this generative mode. Using the random vectors initially assigned to the prototypes held in its long-term memory, each robot generates songs until it discovers one that is interesting enough to communicate to others.
4 Planned Experiments
Three robots are in the final stages of construction, and a series of experiments is planned to evaluate the utility of our approach. In particular, the experiments will examine:
1. whether embodiment has significant benefits over simulation for the study of artificial creative systems; and
2. how humans interacting with an artificial creative system construe the agency of the robots.
Comparing the behaviour of artificial creative systems is a difficult task. The behaviour of the system cannot be validated using the principles that underlie the approach, yet these principles are important indicators of creative behaviour. Behavioural diversity is a key factor in attaining creative behaviour, and one approach to evaluating creative behaviour is to quantify behavioural diversity. We will quantify the behavioural diversity of our embodied agents and of the artificial creative system as a whole, and compare these to simulations of the same agents to gain insights into the effects of embodiment on the creative processes. The simulation of the artificial creative system reuses as much of the code running on the robots as possible, interacting within a simulated environment. Human audiences will encounter the artificial society and its evolving tunes within the context of a gallery environment. This will allow them to share the same space with the robots and to engage with their activities and relations from within. To study how embodiment affects the way humans construe the agency of the robots, visitors will have the opportunity to interact with the robotic system using an FM synthesiser similar to the ones used by the robots.
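As flagged in Sect. 3.3, a sketch of the generating mode follows. It reuses the novelty and wundt functions from the previous sketch; the song representation (eight band indices in 0-63) follows the text, while the mutation rate, the mutate/crossover split and the interest threshold are assumptions.

import random

SONG_LEN = 8
N_BANDS = 64

def mutate(song, rate=0.25):
    # Randomly replace some notes of a prototype song.
    return [random.randrange(N_BANDS) if random.random() < rate else n
            for n in song]

def crossover(a, b):
    # Splice two prototype songs at a random cut point.
    cut = random.randrange(1, SONG_LEN)
    return list(a[:cut]) + list(b[cut:])

def generate_song(prototypes):
    # Propose candidates until one is interesting enough to sing,
    # mirroring the robot listening to itself in generating mode.
    while True:
        if random.random() < 0.5:
            candidate = mutate(random.choice(prototypes))
        else:
            candidate = crossover(*random.sample(prototypes, 2))
        if wundt(novelty(candidate, prototypes)) > 0.2:  # assumed threshold
            return candidate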
The goal is to encourage visitors to engage with the social creative process at work in the community of robots by playing simple tunes, allowing them to inject elements of their human cultural context into the artificial creative system.
5 Conclusion
This paper has described the design of Curious Whispers, a proof-of-concept implementation of an embodied artificial creative system. Unlike typical human-robot interactions, this project does not place the human in a privileged position, able to dictate what the robots should play. Instead, the human enters the artificial creative system as an equal to the robots, and must produce songs of interest to the robots for those songs to be picked up and reworked within the system. Curious Whispers is a system open to human engagement, potentially allowing the agents to take advantage of the social and cultural contexts that visitors bring.
2010_14 !2010
Elementary Social Interactions and Their Effects on Creativity: A Computational Simulation Andrés Gómez de Silva Garza1 and John Gero2 1 Instituto Tecnológico Autónomo de México (ITAM), Río Hondo #1, Colonia Progreso Tizapán, 01080, México, D.F., México, agomez@itam.mx (contact author) 2 Krasnow Institute for Advanced Study, 4400 University Drive, Mail Stop 2A1, George Mason University, Fairfax, VA 22030, U.S.A., john@johngero.com
Abstract. This paper presents a multi-agent computational simulation of the effects on creativity of designers' simple social interactions with both other designers and consumers. The model is based on ideas from situated cognition and uses indirect observation to produce potential changes to the knowledge that each designer and consumer uses to perform their activities. These changes result in global, social behaviors that emerge from the indirect interaction of players with relatively simple individual behaviors. The paper provides results to illustrate these emergent behaviors and how the social interactions affect creativity.
1 Introduction
Computational models of creativity typically simulate the reasoning process of one creative agent that produces designs (whether of residences [1], stories [2], software [3], artwork [4], or other artifacts, abstract or physically realizable). In this view, the simulation ends as soon as the design of the artifact is generated by the simulated designer, and this design generation concludes when the simulated design process converges to an acceptable solution. The simulated designer is programmed with a particular body of knowledge, which may or may not change over time, that embodies its expertise and that includes evaluation knowledge allowing the agent to determine the acceptability of the designs it proposes and therefore halt the simulated design process. This simulated design process might have some adjustable parameters, but it usually employs the same overall method to generate designs (whether evolutionary algorithms [5], analogy [6], constraint satisfaction [7], shape grammars [8], or other computational strategies). If the design process is run again on the same problem, the same solution, or at least the same type of solution, will be obtained as output. This is the "design as search or optimization" view, and it does not account for the fact that most designers are able to continue producing creative output throughout their lives. Designers do not just produce one design and stop, and the designs that they have produced in the past
influence the ones they produce in the future; each episode of producing a design is not carried out in complete isolation from all other episodes. As Boden has pointed out, creative products must be both novel and useful/valuable [9]. In order for a designer's output to be considered creative it must be sufficiently distinct from the designer's previous body of work, and in order for this to occur in a computational model, the simulated designer must be dynamic: at minimum, some aspect of the way it analyzes and/or produces designs must change over time. In addition, in order for a designer's output to be considered creative it must be valued by others, and to be successful in this a designer must be aware (as much as possible) of what others look for or value. It thus appears that traditional computational models of design are limited in their veridicality because they do not take into account a designer's social context in modeling design activity and the factors that drive it. This paper describes a computational model that embodies a broader view of design than single-designer simulations. In this view the design decisions that a designer makes (i.e., the evaluation criteria on which those decisions are based) are influenced by multiple factors, some of which are external to the agent. In particular, the knowledge that a designer uses both to produce and to evaluate designs changes over time as a result of the designer's interactions with other members of the world around it. The other members that can influence a designer's design decisions can be classified into competitors (other designers in the same industry or domain, producing the same kinds of designs) and consumers (of the type of artifact produced by the designer). The influence is not a result of direct communication between them, but rather results from each member being able to analyze the behaviors of the others around it, in particular their responses to the different designs being produced, and adjusting its own knowledge over time as a result. In broad terms, the computational model simulates a world in which consumers' purchasing behavior is independent of, but indirectly affects, the evaluation criteria used in producing new products. This view of designing as including a social phenomenon is influenced by research in the branch of cognitive science known as situated cognition [10, 11]. One of the observations of situated cognition is that reasoning occurs within a world and is influenced by a designer's current worldview, called a "situation" [12]. The same designer confronted by the same requirements at a different time, or different designers confronted by the same requirements at the same time, might make different decisions while reasoning and therefore come up with different solutions to the requirements. Basing this computational simulation on ideas from situated cognition allows for the explanation of, and experimentation with, many of the phenomena involving social influences that are related to design activity. The remainder of the paper is organized as follows. Section 2 briefly presents the mechanics of the computational simulation of a social environment in which creative agents are present, using ideas from multi-agent systems [13]. Section 3 presents some details about the makeup of the agents used in this simulation. Section 4 describes and presents the results of some experiments performed with this simulation.
The paper concludes by discussing, in Section 5, some of the important outcomes of this research.

2 Multi-agent Simulation

This simulation was implemented in MASON (Multi-Agent Simulation Of Networks), a multi-agent simulation platform developed at George Mason University [14]. In this simulated world there are 1,000 agents, of which 2.5% are designers (also called producers) and 97.5% are observers or consumers (also called receivers) of the designs produced by the designers. These proportions are based on statistics gathered by the U.S. Census Bureau [15] that show that approximately 2.5% of the U.S. population is involved in some sort of creative activity or industry. Each designer and consumer is modeled as a single agent in MASON, resulting in 25 designer agents and 975 consumer agents. Each of these agents has its own value system, modeling its situation at any time: a set of interests and preferences, or biases, that are used to evaluate designs. In addition, each of the designer agents has its own set of skills: generative knowledge that it uses to produce new designs. The sets of preferences and skills are different in each agent.

The "lives" of the agents are divided into time-steps, and each simulation runs for 1,000 time-steps. Within each time-step each designer agent produces a new design based on its set of generative skills and its evaluation criteria for deciding what makes a good design. The consumers then observe the produced designs and use their own evaluation criteria to assign a value to the quality of the designs. Once all the consumers have had a chance to evaluate the designs produced by all the designers, the results are gathered together to obtain mean values for the population of designs produced in that time-step. The mean values are used to rank the designs and the designers according to their success (the relative quality of the designs they produced, as judged by the consumers) and the consumers according to their enthusiasm (for the overall set of designs produced by the designers). The results of this procedure are used by the agents as a catalyst for potentially adjusting the knowledge that they use in their activities in the next time-step (evaluating designs and, in the case of designers, also producing designs).

In order to simulate the adoption of technologies and methods that have been used by others and proven successful in a previous time-step, the least successful designers change their situation by adopting some of the knowledge (both generative and evaluative) that the most successful designer used in the time-step that has just ended, and thus try to improve their own success in the future. In the real world this adopted knowledge could have been obtained through licensing, patents, reverse engineering, industrial espionage, or other means. In order to simulate the membership behavior of consumers, where consumers are influenced to adopt products based on which products have been adopted by large groups, the least enthusiastic consumers adopt some of the evaluative knowledge that the most enthusiastic consumer used in the time-step that has just terminated, in order to try to improve their enthusiasm for the overall set of designs in existence. The above procedure is then repeated for each subsequent time-step in the simulation. Fig. 1 schematically shows the simulation framework just described; a code sketch of one time-step follows.
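To make these mechanics concrete, the following is a minimal Python sketch of one time-step, under stated assumptions: designs are toy feature vectors, criteria are toy weight vectors, and only a single criterion is adopted per step by a single agent (the paper has the least successful designers adopt "some" knowledge; the amount is not specified). The agent counts come from the paper; everything else is illustrative.

```python
import random

NUM_DESIGNERS, NUM_CONSUMERS, NUM_FEATURES = 25, 975, 10

def make_agent(n_criteria=3):
    # A criterion here is just a preferred weight vector over design features,
    # a stand-in for the paper's geometric/color criteria.
    return {"criteria": [[random.uniform(-1, 1) for _ in range(NUM_FEATURES)]
                         for _ in range(n_criteria)]}

def evaluate(agent, design):
    # Score a design as its best match against any of the agent's criteria.
    return max(sum(w * f for w, f in zip(c, design)) for c in agent["criteria"])

def time_step(designers, consumers):
    # 1. Each designer produces a design (toy generator: noise around its
    #    own first criterion, in place of the paper's evolutionary algorithm).
    designs = [[random.gauss(w, 0.5) for w in d["criteria"][0]] for d in designers]
    # 2. Consumers evaluate all designs; means rank designers (success)
    #    and consumers (enthusiasm).
    table = [[evaluate(c, dsg) for dsg in designs] for c in consumers]
    success = [sum(col) / len(consumers) for col in zip(*table)]
    enthusiasm = [sum(row) / len(designs) for row in table]
    # 3. Least successful designer adopts knowledge from the most successful;
    #    least enthusiastic consumer adopts from the most enthusiastic.
    best_d = max(range(len(designers)), key=success.__getitem__)
    worst_d = min(range(len(designers)), key=success.__getitem__)
    designers[worst_d]["criteria"][0] = list(designers[best_d]["criteria"][0])
    best_c = max(range(len(consumers)), key=enthusiasm.__getitem__)
    worst_c = min(range(len(consumers)), key=enthusiasm.__getitem__)
    consumers[worst_c]["criteria"][0] = list(consumers[best_c]["criteria"][0])

designers = [make_agent() for _ in range(NUM_DESIGNERS)]
consumers = [make_agent() for _ in range(NUM_CONSUMERS)]
for _ in range(5):   # the paper runs 1,000 time-steps; 5 keeps the demo fast
    time_step(designers, consumers)
```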
The agents in the simulation undergo gradual changes in their way of viewing the world around them (and of producing designs, in the case of the designer agents) as the simulation proceeds. These gradual changes occur as a result of each agent observing the behavior (skills, evaluation criteria, and opinions) of others, rather than as a result of direct communication between the agents. Our hypothesis is that, as a result of these gradual changes, interesting global (social) behaviors that were not programmed directly into the simulation emerge from the elementary social behaviors of individual agents.

Fig. 1. Framework for the simulation of a society of producer and receiver agents.

3 Individual Agent Models

In this simulation the designer agents produce simple shapes consisting of sets of colored unit squares through an evolutionary algorithm. Any generative approach could be substituted for the evolutionary algorithm. Each agent uses several criteria in parallel to evaluate designs. In the case of the designer agents, these criteria are used to evaluate the designs that they themselves generate, and guide their generation towards convergence in each time-step. In the case of the consumer agents, the criteria are used to evaluate the designs that the designer agents produced during that time-step. The sets of evaluation criteria available to designer and consumer agents, which are used to model the notion of "situation," overlap but are distinct. The initial state of the agents is randomly set (choosing for each agent a fixed number of criteria from the set of possible criteria that corresponds to it) before commencing the simulation. In this example the evaluation criteria relate to geometric properties of the designed shapes (such as their tallness, flatness, area-to-perimeter ratio, bumpiness, degree of convexity, and symmetry) as well as to color properties of the shapes (such as degree of color saturation, contiguousness of the colors, and the existence of different color patterns within the unit squares that make up the shapes).

Each of the designer agents uses a set of genes in order to create genotypes that describe moves that can be made to construct a shape (design). The set of genes that each designer agent uses is initialized at random at the beginning of the simulation, and is chosen from a set of 32 possible genes. Each gene represents making a unit move from a given start position in one of eight possible directions (during the creation of a shape) and placing a unit square (of a particular color) in the position resulting from that move. A genotype is a sequence of such moves and placements of colored unit squares, read from left to right, that together creates an entire shape. The start position for each gene in the sequence (genotype) is the end position for the previous gene. Fig. 2 shows a subset of the set of genes available to designers (the subset shows the eight possible genes that can exist for a given color of unit square); a sketch of this genotype decoding appears below.

Fig. 2. Subset (for a given color) of the eight genes available to designer agents; each gene is a unit move in one of eight directions, from a start point to an end point, followed by the placement of a unit square.
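The following is a minimal sketch of this genotype decoding. The factoring of the 32 genes into eight directions times four colors is our assumption (it is consistent with "eight possible genes per color", but the paper does not specify the color count or the index layout), and the color names are illustrative.

```python
# Eight unit moves (dx, dy): E, NE, N, NW, W, SW, S, SE.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
COLORS = ["red", "green", "blue", "yellow"]  # assumed: 4 colors x 8 directions = 32 genes

def decode(genotype, start=(0, 0)):
    """Decode a genotype (a list of gene indices 0..31) into a shape:
    a dict mapping grid positions to the color of the unit square there."""
    shape, pos = {}, start
    for gene in genotype:
        direction, color = DIRECTIONS[gene % 8], COLORS[gene // 8]
        # Move one step, then place a colored unit square at the new position;
        # the end position becomes the start position for the next gene.
        pos = (pos[0] + direction[0], pos[1] + direction[1])
        shape[pos] = color
    return shape

print(decode([0, 2, 8, 10]))
# {(1, 0): 'red', (1, 1): 'red', (2, 1): 'green', (2, 2): 'green'}
```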
4 Experimental Results

Fig. 3 shows snapshots of the state of the simulated world after each of the first four time-steps in a typical simulation. In the snapshots, designers are shown as hollow squares distributed in five rows of five columns each, and consumers are shown as small circles that are rendered in the vicinity of the designer whose design they liked the most at the end of the corresponding time-step. From one time-step to the next the designers remain immobile, but the consumers travel within the window from their location in the previous time-step to the vicinity of the designer whose designs they evaluated most highly in the current time-step. The density of the cloud of consumers depicted in the vicinity of each designer is a measure of how popular/successful that designer's design was in the current time-step.

A wide range of responses can be observed in the sequence of snapshots shown in Fig. 3. If the designers are numbered from left to right and from top to bottom, Designer 8 (second row, third column) maintains an above-average level of popularity throughout the four time-steps. Designer 25 (last row, last column) has an above-average number of "followers" only in the third of the four time-steps shown in the sequence. Designer 5 (first row, last column) is not successful at all at the beginning of the simulation, then has an average number of followers during the next two time-steps, and then has very few by the fourth time-step. Designer 21 (fifth row, first column) oscillates between being relatively unpopular and being relatively popular across the four time-steps.

Fig. 3. Snapshots of the first four time-steps in a typical simulation. Designers are shown as hollow squares distributed in five rows of five columns each, and consumers are shown as small circles that are rendered in the vicinity of the designer whose design they liked.

This first experiment shows that many different types of social responses to creative agents can emerge in this computational simulation. This is despite the simplicity and indirectness of the knowledge transfer mechanism employed by the individual agents in each time-step (which is what gives rise to this range of behaviors), and despite the fact that only four time-steps were observed in detail in order to analyze the agents' individual behaviors.

The second set of experiments is designed to determine global emergent trends based on these simplified concepts of situated cognition. A Monte Carlo simulation [16] (with 1,000 time-steps, 20 runs) was run. The mean and standard deviation of the distributions of the evaluation knowledge used by the sets of agents, and of the genes used by the designer agents, were measured at each time-step in each run. The distributions of knowledge allow us to observe whether some evaluation criteria in both types of agent, and some production knowledge in the designers, tend to dominate over time (their initial distribution is set at random, and is therefore statistically uniform). The standard deviation of these distributions was used as a measure of the variability within the population of agents. The mean and the standard deviation of these standard deviations were measured to obtain a global measure of the variability within the population across all runs (i.e., the variability of the variabilities); a sketch of this measurement appears below. Without the concepts of situated cognition, the simulation should tend over time to produce agents that all use the same knowledge that only a few of them used at the beginning of the simulation (the ones that initially turned out to be the most successful or enthusiastic).
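The following is a minimal sketch of this variability measurement as we read it: per time-step, take the standard deviation of the usage distribution of knowledge items across the population, then summarize those values across the 20 runs by their mean and standard deviation. The representation of an agent's knowledge as a list of item indices ("criteria_ids") into a shared pool is our assumption.

```python
from collections import Counter
from statistics import mean, stdev

def knowledge_distribution(population, n_items):
    """Count how many times each knowledge item (criterion or gene index)
    is used across the whole population of agents."""
    counts = Counter(item for agent in population for item in agent["criteria_ids"])
    return [counts.get(i, 0) for i in range(n_items)]

def population_variability(population, n_items):
    # Std dev of the usage distribution: near 0 when every item is equally
    # used, large when a few items dominate the population.
    return stdev(knowledge_distribution(population, n_items))

def variability_of_variabilities(runs, n_items, checkpoints):
    """For each checkpoint time-step, gather one variability value per run,
    then summarize across runs by mean and std dev (cf. the paper's Figs. 4-5).
    `runs` is a list of run histories; run[t] is the population at step t."""
    return {t: (mean(vals), stdev(vals))
            for t in checkpoints
            for vals in [[population_variability(run[t], n_items) for run in runs]]}

# e.g. checkpoints = [0, 199, 200, 399, 400, 599, 600, 799, 800, 999, 1000]
```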
Situated cognition encompasses changes in the worldview of the participants. In the design world we are simulating, this could be brought about when new knowledge (methods, technologies, ideas, etc.) appears. This new knowledge may augment or supplant some of the knowledge that was being used earlier. To account for this change in worldviews, new values are introduced in an "onionskin" model of an open world. In the onionskin model the open world is modeled as a sequence of closed worlds, one embedded in the other. Each "skin" completely envelops the previous world; the previous world thus becomes an open world embedded within the next world, as the constants that define that world are turned into variables by the next closed world. In this work we treat both the criteria and the generation knowledge as fixed in each time-step. This makes each time-step operate within a closed world defined by those criteria and the generation knowledge. In the next skin the criteria that were previously fixed become part of a larger set, as does the generation knowledge. In this way the current time-step becomes an open world for the previous time-step. Here a set of new values is regularly introduced to account for changes that emerge from the current state of the world. These new values are added to, or substituted for, existing values at regular intervals: every 200 time-steps, new evaluation criteria are introduced for both designers and consumers, and new genes are added to the pool from which designs are produced by designers (a sketch of this injection schedule follows below).

Fig. 4 shows the graph of the resulting variability (for the distribution of both designer and consumer evaluation criteria). The eleven values shown on the horizontal (time) axis of Fig. 4 and Fig. 5 correspond to eleven key time-steps in the simulation: the initial (time-step 0) and final (time-step 1,000) states of the simulation, and the steps just before and just after the introduction of new knowledge at time-steps 199, 200, 399, 400, and so on. The effect of this introduction of new knowledge can be seen in Fig. 4, which shows that the variability of the evaluation knowledge is maintained and does not converge. Having the agents react to these changes in the world by altering the way they do things crudely models the way they construct situations (interpretations of the world around them) for themselves, and thus change, as they interact with other agents in that world in the course of performing their activities [10, 11].

Fig. 4. Graph of the variability in terms of the standard deviations of the standard deviations of the evaluation knowledge, expressed as criteria, used by the designers and consumers.

Fig. 5. Graph of the variability in terms of the means of the standard deviations of the designer and consumer criteria. The means of the designer criteria have been multiplied by 10 to make them viewable at the same scale as the consumer criteria.

Another measure of the variability is the mean of the standard deviations. If these means drop, that is an indication of a drop in variability; if they stay high, variability is sustained. Fig. 5 shows the means of the standard deviation values of the designer and consumer criteria. Both graphs show that the variability is maintained throughout the entire process.
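The following is a minimal sketch of the periodic knowledge-injection schedule described above. The interval of 200 time-steps is from the paper; the number of new items per injection and the factory functions are illustrative assumptions.

```python
import random

INJECTION_INTERVAL = 200  # new knowledge enters the world every 200 time-steps

def make_new_criterion(n_features=10):
    # Hypothetical stand-in: a new criterion is a fresh random weight vector.
    return [random.uniform(-1, 1) for _ in range(n_features)]

def make_new_gene(n_directions=8, n_colors=4):
    # Hypothetical stand-in: a new gene is a fresh (direction, color) pair.
    return (random.randrange(n_directions), random.randrange(n_colors))

def maybe_inject(t, criteria_pool, gene_pool, n_new=4):
    """Onionskin step: at regular intervals, widen the previously closed world
    by adding fresh criteria and genes to the pools agents draw from.
    n_new is an illustrative count, not a value from the paper."""
    if t > 0 and t % INJECTION_INTERVAL == 0:
        criteria_pool.extend(make_new_criterion() for _ in range(n_new))
        gene_pool.extend(make_new_gene() for _ in range(n_new))
```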
5 Discussion

In this paper we have presented a computational simulation that uses ideas from situated cognition to model some of the social aspects of creative activity. In our simulation, creativity does not stop as soon as an agent finishes producing some design for a particular set of requirements, as in many traditional computational models. Instead, we view creativity as an ongoing process that is influenced by factors external to the creative agents. Our model fits well within, and contributes a computational implementation of, the DIFI (Domain-Individual-Field-Interaction) framework proposed in [17], which views creativity as a property of the interaction between individuals in a society (field) that belong to a given culture (domain).

Another model that is conceptually similar to the one presented here is described in [18]. The model in [18] uses direct interaction between the agents, unlike the one we describe in this paper, but shares our interest in observing the emergence of complex social behavior from the elementary interactions of individual behaviors. Like ours, it does not keep agents' knowledge static, and agents' "lives" do not end as soon as they produce satisfactory designs; rather, agents' knowledge is modified based on their changing situation as they proceed with their activities and interact with other agents, and agents remain active throughout many problem-solving episodes. A preliminary version of the model described in this paper appeared in [19].

There are no causal models of the relationship between consumer preferences and the designers of the consumed designs. However, computational simulations like this one permit the testing of hypotheses and the observation of the resulting systemic behaviors. The focus of this paper has been on the hypothesis that peer pressure and market pressure are drivers of change in the way designers design creatively, and that this occurs through the indirect observations that designers make of the opinions that consumers and other designers hold about their previous designs, rather than through direct communication. The paper described and showed the results of experiments in which social behaviors emerged from this kind of indirect interaction.

Computational social science, from which this work is derived, provides the techniques to experiment in silico with behavior that is too difficult to experiment with in vivo. Complex social behavior can result from simple individual behaviors. The results produced here demonstrate that the hypothesis that creativity is both an individual and a social phenomenon can be tested. The results indicate that social interactions play a role in designers being continuously creative, and that the concepts of situated cognition play a role in our understanding of creativity. Further experiments will test different ideas about how new designs and design criteria substitute for existing ones, in order to model Schumpeter's [20] foundational concept of "creative destruction".

Acknowledgements. This research was supported by grants from the US National Science Foundation, grant numbers CNS-0745390 and SBE-0915482.

2010_15 !2010 Constructing Conceptual Spaces for Novel Associations

Kazjon Grace1, Rob Saunders1, and John Gero2
1 University of Sydney, New South Wales, Australia.
2 Krasnow Institute for Advanced Study, Virginia, USA.

Abstract. This paper reports on a system for computational analogy-making based on conceptual spaces. The system constructs conceptual spaces that express the relationships between concepts and uses them to build new associations.
A case for this conceptual-space-driven model of association-making is made, and its advantages and disadvantages are discussed. A prototype space-construction system is detailed and one method by which such a system could be used to make associations is proposed. The system forms concepts that are useful for describing a set of objects, then learns how those concepts relate to each other. These relationships can then be used to construct analogies.

1 Introduction

The generally accepted frameworks [1, 2] for computational analogy-making focus on three processes: representation, mapping and transfer. Representations of a source and a target object are constructed, mappings are built between them, and then knowledge is transferred from the source to the target. Existing models of the representation process [3, 4] build representations out of a set of provided components. Mappings produced by these systems must be constructed (by processes such as conceptual slippage or spreading activation) from relationships existing between those components. While the scope of representations in such a system can be broad, all possible kinds of relationship between representations must be provided with the representational components. Representation in analogy-making systems with a fixed set of representational components is thus reduced to "choosing" which of the pre-encoded relationships will underlie the mapping.

This research investigates an approach to computational association that addresses this restriction: a system that constructs the conceptual space in which it performs representation. If a system builds the relationships between its concepts through use, then potential avenues for mapping between those concepts need not be pre-encoded. We detail a system that learns concepts to describe its world, learns how those concepts relate, constructs a space using those relationships, and can then find mappings through the reinterpretation of objects in that space. In other words, it is a system in which the associations made are not just expressed in the representations constructed but are situated in the system's experientially-derived conceptual space. Our hypothesis is that this increased autonomy in representation and mapping will aid in producing potentially creative analogies.

2 Association

This research defines association as the process of constructing a new mapping between two objects. The process involves identifying a match and building a mapping between the two objects that reflects that match. This process is fundamental to analogy-making, metaphor and other related tasks. We assume that pattern recognition makes recognising mappings in existing representations virtually automatic. From this assumption we derive that associating two objects is fundamentally a process of re-representing the objects to express a connection between them. This is our notion of interpretation-driven association.

2.1 Interpretation-driven association

Modelling association as an interpretation-driven search has several benefits for an analogy-making system. Multiple associations between the same objects are possible through the development of multiple interpretations of those objects. Each association is situated within the interpretation used to construct it, and any knowledge learnt or transferred through that mapping is also specific to that interpretation.
Each association embodies a "new" match, in that the association process produces a mapping between representations that was not previously known to the system: it is s-creative [5]. The interpretation process involves concurrent re-representation of the objects via a search of the system's experiences with them until a viable representation can be found. In a system governed by this idea of association it must be possible to produce many different representations of one object. We model this by allowing the concepts used to represent objects to have mutable meanings, through a process analogous to "conceptual slippage" in the Copycat system [3]. In appropriate circumstances, the meanings of two concepts can "slip" together, allowing previously disparate objects to be matched. In Copycat, these slippages can only happen along predefined paths and under predefined circumstances. Our association system is freed from this constraint, as it autonomously develops the relations that cue the "slippage" process between concepts.

Our goal is to produce an analogy-making system that builds representations out of concepts that it has learnt, and that also learns relationships between those concepts. This would allow the system to "slip" the meanings of concepts without predefined paths along which to do so. Doing this requires the solution of two problems: we need to learn relationships between the concepts produced by the system, and we need to use those relationships to produce new interpretations and thus associations.

2.2 Conceptual spaces as a model of experience

In this research we use the notion of conceptual spaces to describe how concepts relate to each other and how those relationships can be used in association. A conceptual space is an abstract construct in which all the concepts of a system are located. A conceptual space contains knowledge about how concept meanings relate to each other and about how concepts have been used in conjunction with each other. The conceptual space is an abstraction of a system's experiences over the course of its operation, and it can be used to put the act of perceiving an object in the context of the system's past. Our system re-interprets objects by drawing on this knowledge of related past experiences to find another set of concepts that can be used to describe the object.

Conceptual spaces for analogy-making must contain rich and interrelated descriptions of the features that comprise objects. It is not sufficient to produce a conceptual space in which each object is represented by a single point, as the space must express relationships between the concepts used to describe objects, not between the objects themselves. Gärdenfors' "theory of conceptual spaces" [6] states that conceptual spaces are defined by quality dimensions, or "aspects or qualities of the external world that we can perceive or think about". If the relationships in a space can be expressed in terms of a few quality dimensions, then any mapping produced within the space will be derived from those few qualities. Our definition of conceptual spaces does not imply that the spaces contain any globally coherent organisation. The mechanism governing the location of concepts in the space varies by implementation, but at minimum our definition states that proximal concepts are in some way similar. In our system the spaces are defined by undirected multigraphs, with each node being a concept and each edge being a relationship.
Some idea of the similarity between concepts can be gained from the edge distance between any two concepts, but as each edge can represent a different kind of relationship there is no notion of moving in a defined "direction" in the space. Concept-to-concept relationships can be learnt from how the system acquires and uses concepts. Relationships in the conceptual space of our prototype take two forms: similarity between the meanings of concepts, and similarity between the usage of concepts. We can use these relationships to reinterpret objects.

2.3 Matching in conceptual spaces

Each object can be represented within the conceptual space as a set of nodes, one for each of the concepts that describe it. These concepts form a region in conceptual space that describes the object. Finding a way to reinterpret the concepts used in this representation involves finding another region of concepts that can be mapped onto this one. When two regions in conceptual space are mapped onto one another, one describing a source object and one describing a target object, it can be said that the concepts within those representations have had their meanings "slipped" together. This results in representations of the two objects that reflect an association between them.

If a structural similarity exists between the conceptual regions associated with two objects, then the ways the system models those two objects can be seen as alike. Once a mapping between the concepts in two regions is found, we can produce an interpretation of one object using the concepts associated with the other. The structural similarity between two conceptual regions indicates that the system's experiences with those two objects have had similar structure. We can say that there are concepts in both regions playing similar roles within their group of concepts, with similar patterns of relationships to their neighbours. This approach is syntactic in that it matches on the structure of the conceptual space rather than its content, but that structure is learnt through the system's interactions with its world. Therefore what is being mapped is semantic information at the object level, expressed as structural information within the conceptual space.

This research is concerned with developing a system that can both learn its own concepts and learn how those concepts relate to each other. The more removed the experimenter-provided data is from the analogies being made by the system, the more defensible is the claim that the system has autonomously constructed a new association. A system based on these principles would a) learn a set of concepts to describe the objects in its world, b) learn how those concepts relate to each other in both definition and usage, c) construct a conceptual space embodying the relationships between concepts, d) find a match between the structure of the regions in conceptual space that reflect the target object and a source, and e) interpret the target and source objects to reflect the mapping that has been constructed between the concepts used to describe them. We have developed a prototype of our approach to association construction that implements concept formation, conceptual interrelation, conceptual space construction and a limited form of matching. While this prototype does not yet produce compelling or interesting analogies, it serves as a proof of concept for our framework, and its behaviours offer some insight into our theories.
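The following is a minimal sketch of the undirected multigraph representation and the edge-distance proximity measure described in Sect. 2.2. The class and the example concept names are our own illustration; only the two relationship kinds ("similarity" and "co-occurrence") come from the prototype as described.

```python
from collections import defaultdict, deque

class ConceptualSpace:
    """Undirected multigraph: nodes are concepts; each edge carries a
    relationship kind ('similarity' or 'co-occurrence' in the prototype)."""

    def __init__(self):
        self.edges = defaultdict(list)  # concept -> [(neighbour, kind), ...]

    def relate(self, a, b, kind):
        self.edges[a].append((b, kind))
        self.edges[b].append((a, kind))

    def edge_distance(self, a, b):
        # Breadth-first search over edges: a rough proximity measure only,
        # since mixed edge kinds give no consistent 'direction' in the space.
        seen, frontier = {a}, deque([(a, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if node == b:
                return dist
            for neighbour, _ in self.edges[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, dist + 1))
        return None  # not connected

space = ConceptualSpace()
space.relate("tall", "narrow", "co-occurrence")
space.relate("narrow", "thin", "similarity")
print(space.edge_distance("tall", "thin"))  # 2
```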
3 A System for Constructing Spaces

We have developed a system capable of constructing conceptual spaces for analogy-making. An overview of the system can be seen in Figure 1. The system takes a set of objects, learns concepts to describe them, learns relationships between those concepts, constructs a graph of those relationships, and then searches for possible mappings within that graph. The system operates in a very simple shape-perception domain from which it receives symbolic perceptual input about objects. A future development goal is for the system to take lower-level sensory input and learn its own perceptual representations of objects, but symbolic input is sufficient for the purpose of testing the construction of spaces. The system then learns a set of concepts that can uniquely describe each of the objects, using a method based on the discrimination agent architecture developed by Steels [7]. Discrimination-based learning was chosen for its simplicity and prevalence as a reinforcement strategy in concept formation. Similarity relationships between concepts are then calculated based on shared percepts, while the experiential relationships between concepts are calculated based on which concepts co-occur with each other. These relationships are extracted from the set of concepts using the singular value decomposition process described in Sarkar et al. [8]. This method extracts an underlying set of structurally important vectors from the concept usage and definition data and then describes individual concepts in terms of those vectors. Concepts with similar composition in this "singular value" representation are similar in ways that are significant in the dataset. Concepts that are sufficiently similar by either the literal or the co-occurrence metric are judged to be related, and an edge connecting them is added to the conceptual space graph. This graph can then be searched for matching sub-regions.

Fig. 1. A diagram of our system, showing the process from perceptual input on the left to the generation of possible matches on the right: percepts (a sparse vector of active percepts) feed a discrimination learner that distinguishes objects by learning patterns of percepts; the resulting concepts (defined as percept sets) are organised into two spaces, one by literal similarity and one by how concepts have been used together; these are combined into a graph of concept relationships, within which regions of the source and target object graphs that share structure are sought.

3.1 Example domain

The Line Grid domain used in this research is designed to be a simple visual way to investigate concept formation and space construction. The emphasis is not on the potential for interesting associations, but on the utility for testing conceptual space construction. A line grid of size n is an n-by-n grid of points, each of which can be connected to any other point orthogonally or diagonally adjacent to it. Figure 2 shows four objects in the size-three line grid. Sufficient versatility exists in this domain to describe polygonal shapes, isometric depictions of 3D objects, line patterns, and a simple but complete typeface of capital letters. A line grid shape is described by a binary string indicating which of the possible edges exist in that shape. Our system has been tested on size three and size four line grids, which have twenty and forty-two possible edges respectively.
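The following is a minimal sketch of this edge enumeration and binary-string encoding; the particular edge ordering is an arbitrary choice of ours, since the paper only requires some fixed ordering. Running it confirms the edge counts of twenty and forty-two quoted above.

```python
from itertools import product

def line_grid_edges(n):
    """Enumerate the edges of a size-n line grid: every pair of grid points
    that are orthogonally or diagonally adjacent, listed in a fixed order so
    that a shape can be encoded as a binary string over these edges."""
    edges = []
    for x, y in product(range(n), range(n)):
        # Four directions suffice to cover each undirected adjacency once.
        for dx, dy in [(1, 0), (0, 1), (1, 1), (1, -1)]:
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n:
                edges.append(((x, y), (nx, ny)))
    return edges

def encode(shape_edges, all_edges):
    # Binary-string encoding of a shape, as described in the text; the
    # shape's edges must use the same point ordering as all_edges.
    present = set(shape_edges)
    return "".join("1" if e in present else "0" for e in all_edges)

print(len(line_grid_edges(3)), len(line_grid_edges(4)))  # 20 42
```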
Concepts in this domain are patterns of edge presence and absence that occur in multiple shapes. Relationships between these concepts show how the concepts are similar (identifying similar patterns of edges) or how they are used (identifying that they form discriminating sets together).

Fig. 2. A set of example objects in a 3x3 version of our line grid shape domain.

For example, in the set of 26 objects representing the capital alphabet, these relationships include things such as "objects containing an enclosed space in the top half of the letter" being used together with "objects containing a stroke down the left side", as in the letters P, B and R. These relationships would then be compiled into a conceptual space expressing the patterns of relationships between the concepts learnt by the system to describe the capital alphabet in the Line Grid domain. The system would then look for matches in the structure of regions of the conceptual space: areas in which other concepts play the same "role" in their groups of related concepts as the source object's concepts do in its conceptual region. If a group of concepts can be found that shares structure with the group that describes the target, then another object that is described by that group may be a potential source. An example of a proportional analogy that could be made in the Line Grid domain by a complete analogy-making system is seen in Figure 3. Given letters in a consistent typeface, the system would find that similar structures existed between pairs of letters. In this case, the difference between the letters `I' and `T' could be considered analogous to the difference between the letters `F' and `E'.

Fig. 3. Two examples of matches between pairs of objects in the domain that could be found by a complete analogy-making system and expressed as a proportional analogy of the form "I is to T as F is to E".

3.2 Concept formation

Our prototype concept formation system is designed to produce sets of concepts that are suitable for association in conceptual space. It is desirable that each object be described by many concepts, in order for conceptual spaces to be more interesting and for potential matches to be more varied. An Accuracy-Based Classifier System [9], modified to reinforce based on discriminative success, was chosen as the concept learning algorithm. This algorithm was chosen for its ability to extract patterns from representations and thus produce many concepts per object. Concepts produced by the system represent patterns of percepts that are useful for telling objects apart from their peers. Concepts use a representation similar to that of objects, but are defined as trinary strings, as each concept may require, forbid, or not care about each edge in the grid; a sketch of this matching rule follows below. The concepts are evolved to be able to discriminate an object from all others in the given set. Learning about a set of objects by attempting to tell them apart is a common approach to concept formation and is described in Steels [7], where the discrimination occurs for the purpose of a set of agents trying to co-operatively learn language. The principle has been applied to an analogy-making system based on the idea that it must first be possible to tell objects apart before any interesting ways can be found to put them together.
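The following is a minimal sketch of the trinary matching rule described above. The `#` don't-care symbol and the example bit layouts are our own choices; the paper specifies only that each position may require, forbid, or ignore an edge.

```python
def concept_matches(concept, shape):
    """Check a trinary concept string against a binary shape string.
    Per edge: '1' requires the edge, '0' forbids it, '#' doesn't care.
    ('#' is our choice of don't-care symbol; the paper doesn't name one.)"""
    return all(c == "#" or c == s for c, s in zip(concept, shape))

def discriminates(concept, target, others):
    # A concept helps discriminate the target if it matches the target
    # while failing to match at least one of the other objects.
    return concept_matches(concept, target) and any(
        not concept_matches(concept, o) for o in others)

# Toy 20-edge shapes on a size-3 line grid (hypothetical bit layouts):
square = "11110000110000111100"
cross = "00001111000011110000"
concept = "1" + "#" * 19  # "has the first edge"; don't care otherwise
print(concept_matches(concept, square), discriminates(concept, square, [cross]))
# True True
```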
Concepts can be combined to discriminate a chosen object from its context, with each concept discriminating that object from one or more other objects. This set-based reinforcement method means that each individual concept will be rewarded if it is part of any discriminating set. As the goal is to produce a rich set of general concepts, there are no limits on the size of each set or on the number of discriminating sets that can be found: this promotes the development of multiple divergent approaches to discriminative success. The classifier system was able to find a stable and compact set of general concepts to describe up to 100 objects in the 4x4 line grid domain. A plot of the system's performance over 10,000 generations on a twenty-object problem in the 4x4 domain can be seen in Figure 4. The system reached 100% discriminatory success after 1,300 steps with approximately 600 concepts, but the population continued growing to 3,950 concepts after 6,000 steps. The system then reached a saturation point where enough diversity existed in the population to subsume most new classifiers into existing, more general ones, and the population rapidly declined. After approximately 8,000 steps the system had found 125 general concepts and maintained 100% discrimination rates. The generalisation can be seen in the second data series, with the average number of objects matched per concept rising to 2.5 as generalisation proceeded.

Fig. 4. The results of a run over 10,000 generations with a 4x4 grid and 20 random objects. The population of concepts is shown at the left, while the average number of objects that each concept can be used to describe is shown on the right.

3.3 Inter-conceptual relationships

The construction of conceptual spaces depends on the system's ability to form relationships between the concepts that it has learnt. In our system we have identified two kinds of conceptual relationship to model: experiential co-occurrence, when two concepts are used together in discrimination tasks, and literal similarity, when two concepts describe similar properties of objects. Experiential co-occurrence relationships are designed to allow the association system to match between concepts that are "used" the same way: concepts that play a role in their group of concepts that is analogically equivalent to the role played by the source concept in its group. Similarity relationships are designed to allow the system to match between the patterns of differences that exist between the meanings of concepts in the two conceptual groups. A conceptual space graph is formed in which each relationship is described as either literal or experiential and is labelled by the difference between the concepts it connects. The structure of a region in conceptual space is then described by the structure of differences between its concepts. Similarly structured regions can then be found that contain potential mappings between pairs of concepts that play the same "role" in their local area of conceptual space.

We employ Singular Value Decomposition (SVD), a linear algebra method with uses in statistical natural language processing, data mining and signal processing. In our work SVD calculates connections between the meanings and usages of the concepts the system has learnt. Experiential co-occurrence is calculated by running the SVD algorithm on a co-occurrence matrix of concepts in discrimination sets. Literal similarity is calculated by running the SVD algorithm on a matrix of concept definitions in terms of which grid-line edges they match and which they forbid.
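The following is a minimal numpy sketch of how such SVD-based similarities might be computed and thresholded into graph edges. The reduced-rank cosine comparison is our reading of "similar composition in the singular-value representation"; the function names, rank k, and threshold are illustrative choices, not values from the paper or from Sarkar et al. [8].

```python
import numpy as np

def svd_similarity(matrix, k=5):
    """Embed each row (a concept) using the top-k singular vectors, then
    compare concepts by cosine similarity in that reduced space. The input
    may be concept definitions (literal) or concept co-occurrence counts
    (experiential)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    emb = u[:, :k] * s[:k]  # concepts located in singular-value space
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)
    return emb @ emb.T      # pairwise cosine similarities

def build_edges(matrix, kind, threshold=0.8):
    # Threshold the similarities into conceptual-graph edges of one kind.
    sim = svd_similarity(matrix)
    n = sim.shape[0]
    return [(i, j, kind) for i in range(n) for j in range(i + 1, n)
            if sim[i, j] >= threshold]
```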
The advantage of the SVD approach in calculating literal similarity is that the algorithm is able to extract which grid lines represent important differences between concepts and reflect that accordingly, which a literal distance measure would not do.

3.4 Constructing spaces

The space construction process takes the relationships identified by the SVD engine and compiles them into a coherent graph representation that can then be searched for matches. In the current prototype, conceptual graph edges are labelled only as "similarity" or "co-occurrence". Future versions of this system will label edges by how the concepts differ. The current system is able to see patterns and structures in the body of concepts it has learnt, but not the specifics of how those patterns relate to each other beyond the kind and number of relationships involving each concept. The correlations between concepts under the two metrics produced by the SVD algorithm are compared to a threshold, and sufficiently similar concepts are assigned an edge of the appropriate type. An example of part of a simple graph produced by the system can be seen in Figure 5. This graph shows some of the concepts learnt while discriminating a small set of objects. There are two broad groups of literally related concepts connected by solid lines, and between those groups are concepts connected by dashed lines indicating co-occurrence.

Fig. 5. Part of a graph describing relationships between concepts. Solid edges indicate concepts that are literally similar, while the dotted edges indicate co-occurring concepts.

4 Discussion

Conceptual relations and conceptual spaces can be constructed in the course of learning to describe a set of objects. We have performed simple matching between groups of concepts in constructed spaces, but producing more interesting associations in these spaces will require a richer description of concept relationships. The current system can only match between relations labelled as "similarity" or as "co-occurrence". Much richer information about the nature of the relationships between concepts exists in the singular values produced by the SVD system. A detailed set of relations extracted from the singular values will permit a more complete labelling of edges in conceptual graphs. Edges between related concepts can be labelled by what differs between them, allowing for matches to other concept groups with a similar pattern of differences. Incorporating a confidence attribute for relationships (the data for which exists in the SVD output) would allow the system to preferentially match between strongly related concepts but to search weaker links if no strong mappings were found. Association in the resulting conceptual space would then involve subgraph isomorphism between the labelled graphs: mapping between groups of concepts with similar patterns of relationships between them, with each relationship defined by its type, strength and the specifics of the difference between its concepts.

Like many concept formation systems, learning of concepts in our prototype system is grounded in the ability to discriminate between objects. Our system produces a set of general concepts that identify each of a set of objects by how it differs from its peers. As a result, the graphs produced by our system represent the `similarity between differences' and the `co-occurrence of similar differences'.
What is necessary for analogy-making is to extract common sub-components that, when combined, describe the objects themselves rather than the differences between objects. Therefore discrimination-based concept formation may not be suitable for analogy-making systems.

We have described the benefits of an analogy-making system that constructs its own conceptual spaces. In order to operate as a complete analogy-making system, the prototype described here requires additional features, most notably the ability to evaluate potential mappings both in terms of analogical quality and in terms of how they relate to previous analogies made by the system. With more detailed conceptual space construction and a revised concept formation process, such a system could produce interesting and potentially creative analogies.

2010_16 !2010 Search Strategies and the Creative Process

Kyle E. Jennings
Institute of Personality and Social Research
University of California, Berkeley
jennings@berkeley.edu

Abstract. The human creative process can be likened to searching for solutions to a problem. This work introduces a computerized aesthetic composition task that is inspired by the "creativity as search" metaphor. Data from this technique can illuminate how personality and situational influences affect the creative process, rather than merely noting that they affect the outcome. Beyond this, the technique can be used to highlight underlying similarities between human creativity and optimization, as well as the important differences. Early results with N = 34 participants suggest that people's search strategies do differ, and show connections between personality, evaluation criteria, and search strategy. Suggestions for future research are given.

1 Introduction

The creative process can be thought of as the search for an ideal solution to a problem. One way to understand creativity is to understand this search process. This paper presents early results from a new behavioral research technique that is based on the "creativity as search" metaphor. In the short term, this technique will allow researchers to understand how individual differences and situational influences affect the creative process, instead of merely noting that they affect the outcome. In the medium term, the technique will be used to understand similarities and differences between human creative search and optimization. In the long term, the hope is that this and related work will enable better communication among creativity researchers in the behavioral and computational traditions, eventually leading to a more integrative understanding of what creativity is and how it occurs.

The paper begins with a discussion of creativity and search. Then, the aims and design rationale for the new technique are presented, followed by illustrative results from an early application of the technique. Finally, future directions are discussed.

1.1 Creativity as Search

Search can either be seen as finding a path from a starting state to a specific end state, or as finding the best solution from among many other solutions. The former case is relevant when the desired outcome is known but the means for achieving it are not (for example, proving a mathematical theorem). The latter case is relevant when the desired outcome is unclear, such as during the "problem finding" stages of the creative process. At least in the arts, creative people seem to be distinguished by the problems they choose to solve, not by how they solve them [1].
Accordingly, this research focuses on how people choose the best solution from among competing alternatives, and not on how that solution is realized. In open-ended domains like the arts, choosing what solution to pursue is seldom a simple matter of deciding among a few known choices. Instead, the space of possibilities is usually too vast to be considered simultaneously, meaning that the search must proceed by iteratively considering subsets of the space. How people control this iterative process can be called a search strategy, and includes things like how people move from one subset to another and how people evaluate each solution. Though search strategies might be an important determinant of how creative the search outcome is, they are not directly observable. However, if the options under consideration at each stage can be at least partially observed, it becomes possible to trace how people move through the space of possibilities over time. This path is called a search trajectory, and offers clues as to what kind of search strategy people are using.

This research examines search trajectories and characterizes them by how complex they appear to be, which is tantamount to how straight a path people take from their starting solution to the solution they eventually settle upon. At first blush, simple trajectories might seem to reflect positive things like decisiveness and expertise. However, they may also reflect unsophisticated strategies that are not well matched to the nature of the problem. This is particularly likely when the aspects of a solution that can be manipulated (the control dimensions) have complex relationships to the criteria on which the solution is evaluated (the evaluation dimensions). In these cases, simple strategies like repeatedly making incremental improvements until nothing can be improved upon can backfire, since they might miss a drastically different solution that is far superior (see [2, 3]).

1.2 Instrument Design

As the foregoing suggests, a research instrument is needed that can track people's search trajectories. Because psychological studies involving personality and situational influences often require large samples, this technique should be as economical to apply as possible, and should be simple to apply consistently across studies. Also, while high-resolution data are needed, they must be tractable enough to gain insights about as the technique is developed. All of this must be achieved without unduly straining the connection to creativity. Existing creativity research techniques are not well suited to these requirements. Table 1 characterizes insight tasks (e.g., [4, 5]), holistic assessment of end products (e.g., [6]), divergent thinking tests (e.g., [7]), and protocol analysis (e.g., [8]) according to whether they provide trajectory data, are economical to apply, can be applied consistently, yield tractable data, and are face valid operationalizations of creativity.

Table 1. Comparison of creativity measurement techniques.

Instrument          | Trajectory | Economical | Consistent | Tractable | Face Valid
Insight tasks       | no         | yes        | yes        | yes       | mid1
Holistic assessment | no         | mid        | mid2       | yes       | yes3
Divergent thinking  | possibly4  | no         | mid5       | mid4      | mid1
Protocol analysis   | possibly6  | no         | possibly6  | no        | yes3
Exploration task    | yes        | yes        | yes        | yes       | mid

1 only represents one part of the creative process; 2 while findings can be replicated across different tasks and raters, ratings can't be compared across samples; 3 provided the task is a face valid creative task; 4 with techniques under development (see [9]); 5 norms available, but often not used; 6 depending on how applied
None of these techniques provides detailed trajectory data in an economical manner. The technique developed here is a computerized aesthetic composition task. Participants have a fixed amount of time to explore a three-dimensional scene on the computer, with the goal of finding the image that most captures their interest.1 Participants can manipulate two things: the camera position and the position of a light source. However, because of the reflection, refraction, and shadows caused by the interplay of the materials and the light, the task is both less straightforward and more amenable to creative outcomes. (See Figs. 1, 3.)

The exploration task results in a moment-to-moment map of the search trajectory. Since there are only two control dimensions (camera and light angle), the search trajectory can be visualized to develop intuitions about the data. The task itself can be economically and consistently applied within typical psychological experimental conditions. Various metrics have been defined for analyzing the search trajectory (discussed later), with more sophisticated ones to be developed over time. Perhaps the least satisfying aspect of the task is its relation to real-world creativity. However, nothing short of in vivo studies of working creators will give a perfect match. Laboratory tasks sacrifice this external validity in order to gain control. The exploration task encompasses more of the creative process than insight or divergent thinking tasks. While more constrained than typical tasks used with holistic assessments, the technique yields essential trajectory data.

Despite how constrained the task is, it is sufficiently complex to require more than ordinary problem solving. First, there is no single best solution. Instead, people will prefer different configurations based on the criteria they use, and would likely find that many configurations satisfied their criteria. Second, provided that people attend to the interplay among the materials in the scene, there is no simple relationship between the two control dimensions and the many evaluation dimensions. If data visualization is not a major concern, more control dimensions can be added to increase the complexity.

1 "Interest" incorporates aesthetic concerns [10] but admits more solutions than "aesthetically pleasing" without attracting merely odd solutions as "creative" might.

2 Early Results

2.1 Methods

A preliminary experiment was run with N = 34 people, who participated in exchange for course credit. Though it is possible that the experiment description ("perform an aesthetic composition task") attracted more aesthetically oriented individuals, none of the participants majored in the arts. After signing consent forms, participants were seated at a computer and instructed to begin the experiment, which proceeded automatically. To become familiar with the user interface, participants had up to two minutes to complete the exploration task using a simple scene consisting of a non-reflective, monochromatic arch on a checkered surface with a monochromatic sky. Next, they had up to five minutes to complete the exploration task using the more complex scene shown in Fig. 1, with the goal of finding the image that most captured their interest. In both scenes, the camera and light were a constant distance from the center, with the angles adjustable in four-degree increments.
Participants could explore the 3D scenes by manipulating the camera and light angles using either a knob that could be rotated to any angle, or buttons that moved one step clockwise or counterclockwise. A timer showed the elapsed and remaining time, along with a button to press when finished; participants could choose to finish before the time limit expired. After the exploration task, participants rated their liking of a subset of images from the scene. Due to problems with this measurement, these data are not analyzed here. Participants then wrote a few sentences describing how they approached finding the image that most captured their interest. Finally, they completed four questionnaires in a random order (item order was also random). Overall personality was assessed using the Big Five Inventory (BFI) [11]. Cronbach's α for the Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness dimensions was .82, .79, .79, .91, and .84, respectively. Three additional scales were included, but since no relationships were found with these scales, they are not discussed. Participants were debriefed upon completion.

Fig. 1. Exploration interface, showing the experimental scene.

2.2 Results

Metrics. The following metrics are used to characterize the search trajectory (a computational sketch of several of them appears at the end of this subsection). Where applicable, care was taken to ensure that these metrics properly reflect the circularity of the coordinates.

Time. Time elapsed between the first movement and the last movement. The median time was 1:10 (minutes:seconds), and the lower and upper quartiles were 0:51 and 1:43, respectively. The maximum time was 2:55, indicating that the five-minute time limit was more than sufficient.

Coverage. Percentage of the search space encountered, M = 1.84% and SD = 0.70%. Unsurprisingly, each person explored only a small part of the space.

Fixations. Number of points where the person lingered, determined by doing a Gaussian kernel density estimate over the time spent per coordinate (σ1 = σ2 = 8 degrees) and then counting the local maxima, M = 15.8, SD = 5.75.

Fixation Diversity. The mean inter-fixation-point distance, calculated for the upper 50% of each trajectory's fixation durations (which tended to be less similar to each other in duration than the lower 50%), M = 117, SD = 16.6.

Dimension Changes. Times that the search switched control dimensions. Follows a Poisson distribution with λ = 2.26. The modal value was one, indicating that most people searched one dimension and then the other.

Rate. The average number of new views per second, M = 3.09 and SD = 1.12.

Reversals. Times that a trajectory switches direction along a single dimension, M = 10.56 and SD = 6.92.

Additionally, the outcome of the search can be characterized by how unusual the final point is, which will be called unusualness. The calculation is based on the average distance between the current search's final point and every other search's final point. To make unusualness more interpretable, the average distance is divided by the mean of the average distances, and the log (base 2) taken. The mean is approximately zero, though in principle it needn't be. The intercorrelations between the metrics are shown as part of Table 2.

Table 2. Intercorrelations between metrics and personality.

                        2    3     4     5     6     7     8     9    10    11    12    13
Metrics
1. Total Time         .59  .27  .08   .15   .37  -.35  -.11  -.02  .36   .10   .20   .15
2. Coverage                .84  .24   .59   .51   .30+  .22   .09  .46   .27   .28+ -.05
3. Fixations                    .32+  .67   .47   .37   .33+  .22  .36   .31+  .26  -.04
4. Fix. Diversity                     .21   .21   .04  -.25   .23 -.04   .39   .18   .07
5. Dim. Changes                             .28+ -.00   .16   .31+ .10   .38   .12   .02
6. Reversals                                      .32+  .42   .03  .38   .08   .38   .16
7. Rate                                                 .46  -.05  .34  -.04   .05  -.06
8. Unusualness                                               -.05  .34   .04  -.04  -.13
Big Five
9. Extraversion                                                    .31+  .41  -.06  -.12
10. Agreeableness                                                        .24  -.08  -.05
11. Conscientiousness                                                         -.28  -.43
12. Neuroticism                                                                      .18
13. Openness

+p < .1, *p < .05, **p < .01, ***p < .001
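The following is a minimal sketch of several of these metrics under our reading of their definitions. The circular-difference helper and the treatment of zero-movement steps are our assumptions (the interface moves one dimension per step); the kernel-density fixation count is omitted for brevity.

```python
import math

def circ_diff(a, b, period=360):
    """Signed shortest angular difference a - b on a circular coordinate."""
    return (a - b + period / 2) % period - period / 2

def dimension_changes(traj):
    # traj: list of (camera_angle, light_angle) samples in 4-degree steps.
    # Assumes each sample-to-sample step moves exactly one dimension.
    dims = [0 if circ_diff(c2, c1) != 0 else 1
            for (c1, l1), (c2, l2) in zip(traj, traj[1:])]
    return sum(d1 != d2 for d1, d2 in zip(dims, dims[1:]))

def reversals(traj):
    # Count sign changes of movement along each dimension separately.
    count = 0
    for dim in (0, 1):
        steps = [circ_diff(b[dim], a[dim]) for a, b in zip(traj, traj[1:])]
        steps = [s for s in steps if s != 0]
        count += sum(s1 * s2 < 0 for s1, s2 in zip(steps, steps[1:]))
    return count

def unusualness(final_points):
    """log2 of each final point's mean circular distance to all other final
    points, normalized by the grand mean of those mean distances."""
    def dist(p, q):
        return math.hypot(circ_diff(p[0], q[0]), circ_diff(p[1], q[1]))
    avg = [sum(dist(p, q) for q in final_points if q is not p) /
           (len(final_points) - 1) for p in final_points]
    grand = sum(avg) / len(avg)
    return [math.log2(a / grand) for a in avg]
```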
Criteria and Complexity. The key question is how complex people's searches are, and what determines their complexity. One source of complexity is the nature of the problem itself. The intent in designing these scenes was to introduce problem complexity via the interplay between materials. However, people were free to choose what criteria they used, and if they did not notice or care about this interplay, their criteria may have been simpler. For illustration, two sample trajectories are shown in Fig. 2. Each trajectory starts at the cross and ends at the `X', with color indicating the passage of time (light blue to black). The size of the circle at each point is proportional to how long the person spent looking at that image. The participant on the left did not mention material properties when describing his/her criteria, while the participant on the right did.

To test whether criteria involving the interplay between materials were associated with more complex search trajectories, participants' open-ended descriptions of their search process were coded for whether they mentioned material properties (e.g., reflection, refraction, transparency, and color). Comparisons were made between people who mentioned material properties (N = 14) and those who did not (N = 20). Statistically controlling for time, people who mentioned material properties made more dimension changes (M = 1.77 vs. M = 2.96, adjusted). No other effects were significant.

Individual Differences. There were some interesting individual differences in search strategies. Most notably, the time spent searching was significantly and positively related to the trait "agreeableness" (a tendency to be compassionate and cooperative), basically suggesting that nice people took the experiment seriously. People who are more conscientious (self-disciplined, duty-bound, and achievement oriented) trended toward exploring more of the space, exploring more diverse regions of the space, and making more dimension changes. Finally, people who are more neurotic (prone to stress and anxiety) showed more reversals in their search.

Fig. 2. Two sample search trajectories, with their metrics:

        Time  Coverage  Fixations  Fix. Div.  Dim. Chgs.  Rate  Revs.  Unusual
Left    1:45  1.67%     11         131        2           2.03  8      -.19
Right   1:52  3.74%     32         127        9           3.37  19      .13

Fig. 3. Final points for all 34 participants, plotted by camera angle against light angle (0 to 360 degrees), with four example images (A-D).

Overall, these results show that search strategy is largely dependent upon how thoroughly the participant approached the task, which appeared to be higher for people who were either nicer (agreeable) or careful and duty-bound (conscientious). Conscientious people in particular appeared to "leave no stone unturned", as evidenced by more dimension changes. Beyond this, more anxiety-prone (neurotic) people reversed their search direction more often. All of these effects appear to be independent of the effect of criteria complexity, which was itself unrelated to personality.

Final Points. Fig. 3 shows the final points for all 34 participants. As images B, C, and D illustrate, there was a strong preference for images where the three objects were composed evenly.
(The apparent diagonal line does not correspond to any regular pattern when examined further.) As shown in Table 2, the unusualness of the final point appears to be positively related to the rate of the search and the number of reversals, and to agreeableness. After controlling for rate or reversals, the effect for agreeableness is no longer significant, suggesting that there may be a mediating effect. If replicable, this would suggest a mechanism by which agreeable people might reach more unusual points. The ability to detect mediating relationships between external variables (like personality or situational influences) and outcomes (like unusualness or creativity) via search trajectory characteristics is a strength of this approach.

Taken together, these early results show areas of promise and room for improvement. First, while some people did appear to notice the material properties, and while this did appear to have some influence on search strategies, the effect was not very large. In future experiments, the scene should be designed to make the material interplay more apparent. Second, while there were interesting relationships between search strategies and personality, the strong effect of agreeableness says more about the experimental setting than about the nature of the task itself. Future experiments should find ways to encourage people to take the task more seriously without inducing undue demand characteristics. Third, while the metrics themselves have intuitive meanings, more work needs to be done to find and understand the most relevant metrics for characterizing differences between trajectories. Despite these problems, the initial experiment was able to find meaningful relationships among variables and sufficient interindividual variability to suggest there is more to be found in future studies.

3 Discussion

This paper describes a new research technique for making detailed observations of the human creative process. While not as face-valid as protocol analysis or holistic assessment, the technique is more economical and offers more detailed information, making it well suited to the aims of investigating how personality and situational influences affect the creative process, and to exploring connections between creativity and optimization. Preliminary results using the technique show that there are many differences in how people approach the search task, some of which stem from personality variables, and some of which stem from what sorts of images people prefer.

Future Directions. The next step in this research is to better understand the experimental task itself, which includes honing the user interface and experimental setting, refining and better understanding the search trajectory metrics, and experimenting with scenes of varying complexity. From here, specific questions can be explored that will add detail to current psychological knowledge about how various personality and situational influences affect creativity. Beyond the exploration user interface, three additional user interfaces have been constructed. One interface selects representative points from the search trajectory and asks participants to rate their interest in each image. Another interface plays the entire search trajectory back at low speed, allowing participants to provide a continuous rating of what they are seeing. The final interface asks participants to rate the similarity of pairs of images from the space, which can be analyzed with multidimensional scaling.
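As a sketch of that last analysis step: pairwise similarity ratings can be converted to dissimilarities and embedded with multidimensional scaling, for instance with scikit-learn's MDS. The data layout below is a made-up placeholder, not the authors' pipeline.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical input: mean similarity ratings in [0, 1] for image pairs.
ratings = {(0, 1): 0.8, (0, 2): 0.3, (1, 2): 0.4}
n_images = 3

# Convert similarities to dissimilarities in a symmetric matrix.
dissim = np.zeros((n_images, n_images))
for (i, j), s in ratings.items():
    dissim[i, j] = dissim[j, i] = 1.0 - s

# Embed into 2D; how many dimensions are actually needed is an
# empirical question about participants' evaluation criteria.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
print(coords)
```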
These tools are designed to reconstruct participants' overall evaluations of representative points in the space, and to determine what evaluation dimensions participants use. With these additional interfaces, the goal is to demonstrate that the scene being explored has two features: interdependencies and local maxima. Interdependencies are desirable properties that conflict with each other (such as brightness diminishing reflections), in turn making the search less straightforward. Local maxima are points in the space that are better than similar points, but worse than very different points. As stated at the outset, finding the best overall point is more difficult for problems that have interdependencies and local maxima. Metaheuristics are a class of non-deterministic algorithms for optimizing in such cases, and work by carefully tilting the balance from diversification (exploring many possibilities) toward intensification (pursuing a single local maximum) [2]. The exploration task should yield data suitable for detecting similar tendencies in human creators. By showing links between the nature of creativity and optimization, as well as between how humans and computers approach each, this research will help expand the "creativity as search" metaphor.

While the aim of this technique is to be comprehensive yet economical, there is nothing preventing more complex applications. One such avenue would be to have participants think aloud as they search, which could then be analyzed and correlated with their search behavior. While time-consuming, this work could help determine things like whether and when people's criteria change mid-search, and how aware people are of their exploration strategies. This kind of work will be particularly useful for determining where creative search and optimization differ, and could even suggest new insights for authors of optimization algorithms, creative artificial intelligence, or creativity simulations.

2010_17 !2010 Automatic Generation of Music for Inducing Emotive Response Kristine Monteith, Tony Martinez, and Dan Ventura Computer Science Department Brigham Young University kristine.perry@gmail.com, martinez@cs.byu.edu, ventura@cs.byu.edu

Abstract. We present a system that generates original music designed to match a target emotion. It creates n-gram models, Hidden Markov Models, and other statistical distributions based on musical selections from a corpus representing a given emotion, and uses these models to probabilistically generate new musical selections with similar emotional content. This system produces unique and often remarkably musical selections that tend to match a target emotion, performing this task at a level that approaches human competency.

1 Introduction

Music is a significant creative achievement. Every culture in history has incorporated music into life in some manner. As Wiggins explains, "musical behavior is a uniquely human trait...further, it is also ubiquitously human: there is no known human society which does not exhibit musical behaviour in some form" [1]. Perhaps one of the reasons musical behavior is tied so closely to humanity is its ability to profoundly affect human physiology and emotion. One study found that, when subjects were asked to select music that they found to be particularly pleasurable, listening to this type of music activated the same areas of the brain activated by other euphoric stimuli such as food, sex, or illegal drugs.
The authors highlight the significance of the fact that music has an effect on the brain similar to that of "biologically relevant, survival-related stimuli" [2]. Computing that possesses some emotional component, termed affective computing, has received increased attention in recent years. Picard emphasizes the fact that "emotions play a necessary role not only in human creativity and intelligence, but also in rational human thinking and decision-making. Computers that will interact naturally and intelligently with humans need the ability to at least recognize and express affect" [3]. From a theoretical standpoint, it seems reasonable to incorporate emotional awareness into systems designed to mimic (or produce) human-like creativity and intelligence, since emotions are such a basic part of being human. On a more practical level, affective displays on the part of a computerized agent can improve function and usability. Research has shown that incorporating emotional expression into the design of interactive agents can improve user engagement, satisfaction, and task performance [4][5]. Users may also regard an agent more positively [6] and consider it to be more believable [7] when it demonstrates appropriate emotional awareness.

Given music's ability to alter or heighten emotional states and affect physiological responses, the ability to create music specifically targeted to a particular emotion could have considerable benefits. Calming music can aid individuals in dealing with anxiety disorders or high-anxiety situations. Joyful and energizing music can be a strong motivating force for activities such as exercise and physical therapy. Music therapists use music with varied emotional content in a wide array of musical interventions. The ability to create emotionally targeted music could also be valuable in creating soundtracks for stories and films.

This paper presents a system that takes emotions into account when creating musical compositions. It produces original music with a desired emotional content using statistical models created from a corpus of songs that evoke the target emotion. Corpora of musical data representing a variety of emotions are collected for use by the system. Melodies are then constructed using n-gram models representing pitch intervals commonly found in the training corpus for a desired emotion. Hidden Markov Models are used to produce harmonies similar to those found in the appropriate corpus. The system also selects the accompaniment pattern and instrumentation for the generated piece based on the likelihood of various accompaniments and instruments appearing in the target corpus. Since it relies entirely on statistics gathered from these training corpora, in one sense the system is learning to imitate the emotional musical behavior of other composers when producing its creative works. Survey data indicate that the system composes selections that are as novel as, and almost as musical as, human-composed songs. Without creating any rules for emotional music production, it manages to compose songs that convey a target emotion with surprising accuracy relative to human performance of the same task.

Multiple research agendas bear some relation to our approach. Conklin summarizes a number of statistical models which can be used for music generation, including random walk, Hidden Markov Models, stochastic sampling, and pattern-based sampling [8]. These approaches can be seen in a number of different studies.
For example, Hidden Markov Models have been used to harmonize melodies, considering melodic notes as observed events and a chord progression as a series of hidden states [9]. Similarly, Markov chains have been used to harmonize given melody lines, focusing on harmonization in a given style in addition to finding highly probable chords [10]. Genetic algorithms have also been used in music composition tasks. De la Puente and associates use genetic algorithms to learn melodies, employing a fitness function that considers differences in pitch and duration in consecutive notes [11]. Horner and Goldberg attempt to create more cohesive musical selections using a fitness function that evaluates generated phrases according to their agreement with a thematic phrase [12]. Tokui and Iba focus their attention on using genetic algorithms to learn polyphonic rhythmic patterns, evaluating patterns with a neural network that learns to predict which patterns the user would most likely rate highly [13].

Musical selections can also be generated through a series of musical grammar rules. These rules can either be specified by an expert or determined by statistical models. For example, Ponsford, Wiggins, and Mellish use n-gram statistical methods for learning musical grammars [14]. Phon-Amnuaisuk and Wiggins compare genetic algorithms to a rule-based approach for the task of four-part harmonization [15]. Delgado, Fajardo, and Molina-Solana use a rule-based system to generate compositions according to a specified mood [16]. Rutherford and Wiggins analyze the features that contribute to the emotion of fear in a musical selection and present a system that allows for an input parameter that determines the level of "scariness" in the piece [17]. Oliveira and Cardoso describe a wide array of features that contribute to emotional content in music and present a system that uses this information to select and transform chunks of music in accordance with a target emotion [18]. Like these previously mentioned systems, our system is concerned with producing music with a desired emotional content, and it employs a number of the statistical methods discussed in the previously mentioned papers. Rather than developing rule sets for different emotions, however, it composes original music based on statistical information in training corpora.

2 Methodology

In order to produce selections with specific emotional content, a separate set of musical selections is compiled for each desired emotion. Initial experiments focus on the six basic emotions outlined by Parrott [19] (love, joy, surprise, anger, sadness, and fear), creating a data set representative of each. Selections for the training corpora are taken from movie soundtracks due to the wide emotional range present in this genre of music. The MIDI files used in the experiments can be found at the Free MIDI File Database (http://themes.mididb.com/movies/). These files were rated by a group of research subjects. Each selection was rated by at least six subjects, and selections rated by over 80% of subjects as representative of a given emotion were then selected for use in the training corpora.

Next, the system analyzes the selections to create statistical models of the data in the six corpora. Selections are first transposed into the same key. Melodies are then analyzed and n-gram models are generated representing what notes are most likely to follow a given series of notes in a given corpus.
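For illustration, the following is a minimal sketch of how such an n-gram model can be built and sampled, assuming melodies are given as sequences of MIDI pitch numbers. It is our own sketch of the general technique, not the authors' code.

```python
import random
from collections import Counter, defaultdict

def build_ngram(melodies, n=2):
    """Count which pitch follows each length-n context in the corpus."""
    model = defaultdict(Counter)
    for melody in melodies:
        for i in range(len(melody) - n):
            model[tuple(melody[i:i + n])][melody[i + n]] += 1
    return model

def sample_next(model, context):
    """Draw the next pitch in proportion to its corpus frequency."""
    counts = model.get(tuple(context))
    if not counts:
        return None  # unseen context; the real system would resample
    pitches, weights = zip(*counts.items())
    return random.choices(pitches, weights=weights)[0]

# Toy corpus of MIDI pitch numbers.
model = build_ngram([[60, 62, 64, 62, 60], [60, 62, 64, 65, 64]])
melody = [60, 62]
for _ in range(8):
    nxt = sample_next(model, melody[-2:])
    if nxt is None:
        break
    melody.append(nxt)
print(melody)
```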
Statistics describing the probability of a melody note given a chord, and the probability of a chord given the previous chord, are collected for each of the six corpora. Information is also gathered about the rhythms, the accompaniment patterns, and the instrumentation present in the songs.

Since not every melody produced is likely to be particularly remarkable, the system also makes use of multilayer perceptrons with a single hidden layer to evaluate the generated selections. Inputs to these neural networks are the default features extracted by the "Phrase Analysis" component of the freely available jMusic software (http://jmusic.ci.qut.edu.au/). This component returns a vector of twenty-one statistics describing a given melody, including factors such as number of consecutive identical pitches, number of distinct rhythmic values, tonal deviation, and key-centeredness. A separate set of two networks is developed to evaluate both generated rhythms and generated pitches. The first network in each set is trained using analyzed selections in the target corpus as positive training instances and analyzed selections from the other corpora as negative instances. This is intended to help the system distinguish selections containing the desired emotion. The second network in each set is trained with melodies from all corpora versus melodies previously generated by the algorithm. In this way, the system learns to emulate melodies which have already been accepted by human audiences.

Once the training corpora are set and analyzed, the system employs four different components: a Rhythm Generator, a Pitch Generator, a Chord Generator, and an Accompaniment and Instrumentation Planner. The functions of these components are explained in more detail in the following sections.

2.1 Rhythm Generator

The rhythm for the selection with a desired emotional content is generated by selecting a phrase from a randomly chosen selection in the corresponding data set. The rhythmic phrase is then altered by selecting and modifying a random number of measures. The musical forms of all the selections in the corpus are analyzed, and a form for the new selection is drawn from a distribution representing these forms. For example, a very simple AAAA form, where each of four successive phrases contains notes with the same rhythm values, tends to be very common. Each new rhythmic phrase is analyzed by jMusic and then provided as input to the neural network rhythm evaluators. Generated phrases are only accepted if they are classified positively by both neural networks.

2.2 Pitch Generator

Once the rhythm is determined, pitches are selected for the melodic line. These pitches are drawn according to the n-gram model constructed from the melody lines of the corpus with the desired emotion. A melody is initialized with a series of random notes, selected from a distribution modelling which notes are most likely to begin musical selections in the given corpus. Additional notes in the melodic sequence are randomly selected based on a probability distribution of what note is most likely to follow the given series of n notes. The system generates several hundred candidate series of pitches for each rhythmic phrase. As with the rhythmic component, features are then extracted from these melodies using jMusic and provided as inputs to the neural network pitch evaluators. Generated melodies are only selected if they are classified positively by both neural networks.

2.3 Chord Generator

The underlying harmony is determined using a Hidden Markov Model, with pitches considered as observed events and the chord progression as the underlying state sequence. The Hidden Markov Model requires two conditional probability distributions: the probability of a melody note given a chord and the probability of a chord given the previous chord. The statistics for these probability distributions are gathered from the corpus of music representing the desired emotion. The system then calculates which set of chords is most likely given the melody notes and the two conditional probability distributions. Since many of the songs in the training corpora had only one chord present per measure, initial attempts at harmonization also make this assumption, considering only downbeats as observed events in the model.
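The most likely chord sequence given the melody can be computed with the Viterbi algorithm, a standard decoding method for HMMs. The sketch below works under the paper's one-chord-per-measure assumption, with downbeat pitches as observations; the probability tables in the example are illustrative placeholders, not the learned corpus statistics, and the paper does not state which decoding algorithm is used.

```python
import math

def viterbi(downbeats, chords, p_first, p_trans, p_emit):
    """Most likely chord per measure given downbeat melody notes.
    Assumes all probabilities are non-zero (as a smoothed model would)."""
    # Log-probability of the best chord path ending in each chord.
    best = {c: math.log(p_first[c]) + math.log(p_emit[c][downbeats[0]])
            for c in chords}
    back = []
    for note in downbeats[1:]:
        prev, best, ptr = best, {}, {}
        for c in chords:
            p, argmax = max(
                (prev[c0] + math.log(p_trans[c0][c]), c0) for c0 in chords)
            best[c] = p + math.log(p_emit[c][note])
            ptr[c] = argmax
        back.append(ptr)
    # Trace the best path backwards from the best final chord.
    path = [max(best, key=best.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

chords = ["I", "V"]
p_first = {"I": 0.7, "V": 0.3}
p_trans = {"I": {"I": 0.6, "V": 0.4}, "V": {"I": 0.7, "V": 0.3}}
p_emit = {"I": {60: 0.5, 62: 0.2, 67: 0.3}, "V": {60: 0.1, 62: 0.5, 67: 0.4}}
print(viterbi([60, 62, 60], chords, p_first, p_trans, p_emit))
```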
2.4 Accompaniment and Instrumentation Planner

The accompaniment patterns for each of the selections in the various corpora are categorized, and the accompaniment pattern for a generated selection is probabilistically selected from the patterns of the target corpus. Common accompaniment patterns included arpeggios, chords sounding on repeated rhythmic patterns, and a low bass note followed by chords on non-downbeats. (A few of the accompaniment patterns, such as those of "Star Wars: Duel of the Fates" and "Addams Family", had to be rejected or simplified; they were so characteristic of the training selections that they were too recognizable in the generated song.) Instruments for the melody and harmonic accompaniment are also probabilistically selected based on the frequency of various melody and harmony instruments in the corpus.

3 Results

Colton [20] suggests that, for a computational system to be considered creative, it must be perceived as possessing skill, appreciation, and imagination. The system could be considered "skillful" if it demonstrates knowledge of traditional music behavior. This is accomplished by taking advantage of statistical knowledge to train the system to behave according to traditional musical conventions. The system may be considered "appreciative" if it can produce something of value and adjust its work according to the preferences of itself or others. This is addressed through the neural network evaluators. The "imaginative" criterion can be met if the system can create new material independent of both its creators and other composers. Since all of the generated songs can be distinguished from the songs in the training corpora, this criterion is met at least on a basic level. However, to further evaluate all of these aspects, the generated songs were subjected to human evaluation.

Twelve selections were generated for testing purposes (available at http://axon.cs.byu.edu/emotiveMusicGeneration). Each selection was then played for thirteen individuals, who were asked to answer the following questions:

1. What emotions are present in this selection (circle all that apply)?
2. On a scale of one to ten, how much does this sound like real music?
3. On a scale of one to ten, how unique does this selection sound?

The first two questions target the aspects of skill and appreciation, ascertaining whether the system is skillful enough to produce something both musical and representative of a given emotion. The third question evaluates the imagination of the system, determining whether or not the generated music is perceived as novel by human audiences.
To provide a baseline, two members of the campus songwriting club were asked to perform the same task as the computer: compose a musical selection representative of one of six given emotions. Each composer provided three songs. These selections were also played, and subjects were asked to evaluate them according to the same three questions. Song order was randomized, and while subjects were told that some selections were written by a computer and some by a human, they were not told which selections belonged to which categories.

Table 1 reports on how survey participants responded to the first question. It gives the percentage of respondents who identified a given emotion in computer-generated selections in each of the six categories. Table 2 provides a baseline for comparison by reporting the same data for the human-generated pieces. Tables 3 and 4 address the other two survey questions. They provide the average score for musicality and novelty (on a scale from one to ten) received by the various selections.

In all cases, the target emotion ranked highest or second highest in terms of the percentage of survey respondents identifying that emotion as present in the computer-generated songs. In four cases, it was ranked highest. Respondents tended to think that the love songs sounded a little more like joy than love, and that the songs portraying fear sounded a little sadder than fearful. But surprisingly, the computer-generated songs appear to be slightly better at communicating an intended emotion than the human-generated songs. Averaging over all categories, 54% of respondents correctly identified the target emotion in computer-generated songs, while only 43% of respondents did so for human-generated songs.

Human-generated selections did tend to sound more musical, averaging a 7.81 score for musicality on a scale of one to ten as opposed to the 6.73 scored by computer-generated songs. However, the fact that a number of the computer-generated songs were rated as more musical than the human-produced songs is somewhat impressive. Computer-generated songs were also rated at roughly the same novelty level as the human-generated songs, receiving a 4.86 score as opposed to the human score of 4.67. As an additional consideration, the computer-generated songs were produced in a more efficient and timely manner than the human-generated ones. Only one piece in each category was submitted for survey purposes due to the difficulty of finding human composers with the time to provide music for this project.

Table 1. Emotional Content of Computer-Generated Music. Percentage of survey respondents who identified a given emotion for songs generated in each of the six categories. The first column gives the target emotion for which songs were generated; column headers give the emotions identified by survey respondents.

           Love   Joy    Surprise  Anger  Sadness  Fear
Love       0.62   0.92   0.08      0.00   0.00     0.00
Joy        0.38   0.69   0.15      0.00   0.08     0.08
Surprise   0.08   0.46   0.62      0.00   0.00     0.00
Anger      0.00   0.00   0.08      0.46   0.38     0.69
Sadness    0.09   0.18   0.27      0.18   0.45     0.36
Fear       0.15   0.08   0.00      0.23   0.62     0.23

Table 2. Emotional Content of Human-Generated Music. Percentage of survey respondents who identified a given emotion for songs composed in each of the six categories.

           Love   Joy    Surprise  Anger  Sadness  Fear
Love       0.64   0.64   0.00      0.09   0.09     0.00
Joy        0.77   0.31   0.15      0.00   0.31     0.00
Surprise   0.00   0.27   0.18      0.09   0.45     0.27
Anger      0.00   0.09   0.18      0.27   0.73     0.64
Sadness    0.38   0.08   0.00      0.00   0.77     0.08
Fear       0.09   0.00   0.00      0.27   0.55     0.45
Table 3. Musicality and Novelty of Computer-Generated Music. Average score (on a scale of one to ten) received by selections in the various categories in response to survey questions about musicality and novelty.

           Musicality  Novelty
Love       8.35        4.12
Joy        6.28        5.86
Surprise   6.47        4.78
Anger      5.64        4.96
Sadness    7.09        4.40
Fear       6.53        5.07
Average    6.73        4.86

Table 4. Musicality and Novelty of Human-Generated Music. Average score (on a scale of one to ten) received by selections in the various categories in response to survey questions about musicality and novelty.

           Musicality  Novelty
Love       7.73        4.45
Joy        9.15        4.08
Surprise   7.09        5.36
Anger      8.18        4.60
Sadness    9.23        4.08
Fear       5.45        5.45
Average    7.81        4.67

4 Discussion and Future Work

Pearce, Meredith, and Wiggins [21] suggest that music generation systems concerned with the computational modeling of music cognition be evaluated both by the music they produce and by their behavior during the composition process. The system discussed here can be considered creative both in that it can produce fairly high-quality music, and in that it does so in a creative manner. In Creativity: Flow and the Psychology of Discovery and Invention (Chapter 2), Csikszentmihalyi includes several quotes by the inventor Rabinow outlining three components necessary for being a creative, original thinker [22]. The system described in this work meets all three criteria.

As Rabinow explains, "First, you have to have a tremendous amount of information... If you're a musician, you should know a lot about music..." Computers have a unique ability to store and process large quantities of data. They have the potential even to have some advantage over humans in this particular aspect of the creative process, if the knowledge can be collected, stored, and utilized effectively. The system discussed in this paper addresses this aspect of the creative process by gathering statistics from the various corpora of musical selections and using this information to inform choices about rhythm, pitch, and harmony.

The next step is generation based on the domain information. Rabinow continues: "Then you have to be willing to pull the ideas...come up with something strange and different." The system described in this work can create a practically unlimited number of unique melodies based on random selections from probability distributions. Again, computers have some advantage in this area: they can generate original music quickly and tirelessly. Some humans have been able to produce astonishing numbers of compositions; Bach's work alone fills sixty volumes. But while computers are not yet producing original work of Bach's creativity and caliber, they could easily outdistance him in sheer output.

The final step is evaluation of these generated melodies, Rabinow's third suggestion: "And then you must have the ability to get rid of the trash which you think of. You cannot think only of good ideas, or write only beautiful music..." Our system addresses this aspect through the neural network evaluators. It learns to select pieces with features similar to musical selections that have already been accepted by human audiences, and ones most like selections humans have labeled as expressing a desired emotion. It even has the potential to improve over time by producing more negative examples and learning to distinguish these from positive ones. But finding good features for use in the evaluating classifiers poses a significant challenge.
First attempts at improving the system will involve modifications in this area. As previously mentioned, research has been done to isolate specific features that are likely responsible for the emotional content of a song [17, 18]. Incorporating such features into the neural network evaluators could give these evaluators significantly more power in selecting the melodies most representative of a desired emotion. Despite the possible improvements, it is quite encouraging to note that even naive evaluation functions are able to produce fairly musical and emotionally targeted selections.

Additional improvements will involve drawing from a larger corpus of data for song generation. Currently, the base seems to be sufficiently wide to produce songs that were considered to be as original as human-composed songs. However, many of the generated pieces tend to sound somewhat similar to each other. On the other hand, sparseness of training data actually provides some advantages. For example, in some cases, the presence of fewer examples in the training corpus resulted in similar musical motifs in the generated songs. Phrases would often begin with the same few notes before diverging, particularly in corpora where songs tended to start on the same pitch of the scale. Larger corpora will allow for the generation of more varied songs, but to maintain musicality, the evaluation mechanism might be extended to encourage the development of melodic motifs among the various phrases.

The type and magnitude of emotions can often be indicated by concurrent physiological responses. The format of these experiments lends itself to the additional goal of generating music targeted to elicit a desired physiological response. Future work will involve measuring responses such as heart rate, muscle tension, and skin conductance, and how these are affected by different musical selections. This information could then be used to create training corpora of songs likely to produce desired physiological responses, which could in turn be used to generate songs with similar properties. The format also allows for the generation of songs that can switch emotions at a desired point in time, simply by switching to statistical data from a different corpus.

The system described here is arguably creative by reasonable standards. It follows a creative process as suggested by Rabinow and others, producing and evaluating reasonably skillful, novel, and emotionally targeted compositions. However, our system will really only be useful to society if it produces music that not only affects emotions, but that people will listen to long enough for that effect to take place. This is difficult to demonstrate in a short-term evaluation study, but we do appear to be on the right track. A few of the generated pieces received musicality ratings similar to those of the human-produced pieces. Many of those surveyed were surprised that the selections were written by a computer. Another survey respondent announced that the program had "succeeded" because one of the computer-generated melodies had gotten stuck in his head. These results show promise for the possibility of producing a system that is truly creative.

Acknowledgments. This work is supported by the National Science Foundation under Grant No. IIS-0856089. Special thanks to Heather Hogue and Paul McFate for providing the human-generated music.
2010_18 !2010 Real-Time Emotion-Driven Music Engine Alex Rodríguez Lopez, Antonio Pedro Oliveira, and Amílcar Cardoso Centre for Informatics and Systems, University of Coimbra, Portugal lopez@student.dei.uc.pt, apsimoes@student.dei.uc.pt, amilcar@dei.uc.pt

Abstract. Emotion-Driven Music Engine (EDME) is a computer system that intends to produce music expressing a desired emotion. This paper presents a real-time version of EDME, which turns it into a standalone application. A real-time music production engine, governed by a multi-agent system, responds to changes of emotions and selects the most suitable pieces from an existing music base to form song-like structures, through transformations and sequencing of music fragments. The music base is composed of fragments classified in two emotional dimensions: valence and arousal. The system has a graphic interface that provides a front-end that makes it usable in experimental contexts of different scientific disciplines. Alternatively, it can be used as an autonomous source of music for emotion-aware systems.

1 Introduction

Adequate expression of emotions is a key factor in the efficacy of creative activities [16]. A system capable of producing music expressing a desired emotion can be used to influence the emotional experience of the target audience. Emotion-Driven Music Engine (EDME) was developed with the objective of having such a capability. The high modularity and parameterization of EDME allow it to be customized for different scenarios and integrated into other systems. EDME can be controlled by the user or used in an autonomous way, depending on the origin of the input source (an emotional description). A musician can use our system as a tool to assist the process of composition. Automatic soundtracks can be generated for other systems capable of making an emotional evaluation of the current context (i.e., computer games and interactive media, where the music needs to change quickly to adapt to an ever-changing context). The input can be fed from ambient intelligence systems: sensing the environment allows use in installations where music reacts to the public. In a healthcare context, self-report measures or physiological sensors can be used to generate music that reacts to the state of the patient.

The next section reviews related work. Section 3 presents our computer system. Section 4 draws some conclusions and highlights directions for further work.

2 Related Work

The developed system is grounded on research made in the areas of computer science and music psychology. Systems that control the emotional impact of musical features usually work through the segmentation, selection, transformation and sequencing of musical pieces. These systems modify emotionally relevant structural and performative aspects of music [4, 11, 22], by using pre-composed musical scores [11] or by making musical compositions [3, 10, 21]. Most of these systems are grounded on empirical data obtained from works of psychology [8, 19]. Scherer and Zentner [18] established parameters of influence for the experienced emotion. Meyer [13] analyzed structural characteristics of music and their relation with emotional meaning in music. Some works have tried to measure emotions expressed by music and to identify the effect of musical features on emotions [8, 19]. From these, relations can be established between emotions and musical features [11].
3 System

EDME works by combining short MIDI segments into a seamless music stream that expresses the emotion given as input. When the input changes, the system reacts and smoothly fades to music expressing the new emotion. There are two stages (Fig. 1). At the off-line stage, pre-composed music is segmented and classified to build a music base (Section 3.1); this makes the system ready for the real-time stage, which deals with selection, transformation, sequencing and synthesis (Section 3.2). The user interface lets the user select in different ways the emotion to be expressed by the music. Integration with other systems is possible by using different sources as the input (Section 3.3).

3.1 Off-line Stage

Pre-composed MIDI music (composed on purpose, or compiled as needed) is input to a segmentation module. An adaptation of LBDM [2] is used to attribute weights according to the importance and degree of proximity and change of five features: pitch, rhythm, silence, loudness and instrumentation. Segmentation consists in a process of discovery of fragments, by looking at the note onsets with the highest weights. The resulting fragments are input to a feature extraction module. The extracted musical features are used by a classification module that grades the fragments in two emotional dimensions: valence and arousal (pleasure and activation). Classification is done with the help of a knowledge base implemented as two regression models that consist of weighted relations between each emotional dimension and music features [14]. The regression models are used to calculate the values of each emotional dimension through a weighted sum of the features obtained by the feature extraction module. The emotionally classified MIDI music is then stored in a music base.

Fig. 1. The system works in two stages: an off-line stage (segmentation, feature extraction, classification into the music base) and a real-time stage (selection, transformation, sequencing, synthesis).

3.2 Real-Time Stage

Real-time operation is handled by a multi-agent system, where agents with different responsibilities cooperate in simultaneous tasks to achieve the goal of generating music expressing desired emotions. Three agents are used: an input agent, which handles commands between the other agents and the user interface; a sequencer agent, which selects and packs fragments to form songs; and a synthesizer agent, which deals with the selection of sounds to convert the MIDI output from the sequencer agent into audio.

In this stage, the sequencer agent has important responsibilities. This agent selects the music fragments whose emotional content is closest to the desired emotion. It uses a pattern-based approach to construct songs with the selected fragments. Each pattern defines a song structure and the harmonic relations between the parts of this structure (e.g., popular song patterns like AABA). Selected fragments are arranged to match the tempo and pitch of a selected musical pattern, through transformations and sequencing. The fragments are scheduled so that they are perceived as one continuous song during each complete pattern. This agent also crossfades between patterns, and when there is a change in the emotional input, in order to allow a smooth listening experience.
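The fragment-selection step can be pictured as a nearest-neighbour lookup in the valence-arousal plane. The following is a minimal sketch under that assumption; the Fragment type and its field names are our own illustration, not EDME's actual data model.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Fragment:
    midi_path: str
    valence: float  # assigned by the off-line classification module
    arousal: float

def select_fragments(base, valence, arousal, k=4):
    """Return the k fragments emotionally closest to the desired emotion."""
    return sorted(base, key=lambda f: hypot(f.valence - valence,
                                            f.arousal - arousal))[:k]

base = [Fragment("a.mid", 0.8, 0.6), Fragment("b.mid", -0.3, 0.1),
        Fragment("c.mid", 0.7, 0.4)]
print([f.midi_path for f in select_fragments(base, 0.9, 0.5, k=2)])
```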
3.3 Emotional Input

The system can be used under user control through an interface, or act autonomously with other input. The input specifies values of valence and arousal.

User Interface. The user interface serves the purpose of letting the user choose, in different ways, the desired emotion for the generated music. It is possible for the user to directly type the values of valence and arousal the music should have. Another way is through a list of discrete emotions the user can choose from. It is possible to load several lists of words denoting emotions to fit different uses of the system. For example, Ekman [6] has a list of generally accepted basic emotions. Russell [17] and Mehrabian [12] both have lists which map specific emotions to dimensional values (using 2 or 3 dimensions). Juslin and Laukka [9] propose a specific list for emotions expressed by music. Another way to choose the affective state of the music is through a graphical representation of the valence-arousal affective space, based on FeelTrace [5]: a circular space with the valence dimension on the horizontal axis and the arousal dimension on the vertical axis. The coloring follows that of Plutchik's circumplex model [15].

Other Input. EDME can stand as an autonomous source of music for other systems by taking their output as emotional input. With the growing interest in computational models of emotions and affective systems, and a demand for interfaces and systems that behave in an affective way, it is becoming frequent to adapt systems to show or perceive emotions. EmoTag [7] is an approach to automatically mark up affective information in texts, marking sentences with emotional values. Our system can serve the musical needs of such systems by taking their emotional output as the input for real-time soundtrack generation. Sensors can serve as input too. Francisco et al. [20] present an installation that allows people to experience and influence the emotional behavior of their system. EDME is used in this interactive installation to provide music according to values of valence and arousal.

4 Conclusion

Real-time EDME is a tool that produces music expressing desired emotions, with application in theatre, films, video games and healthcare contexts. Currently, we have applied our system in an affective installation [20]. The real-time usage of the system by professionals of music therapy, and the integration of EDME with EmoTag [7] for emotional soundtrack generation, are also being analysed. The extension of EDME to an agent-based system increased its scalability, which makes its expansion and integration with external systems easier. Listening tests are needed to assess the fluency of the obtained songs.

2010_19 !2010 On the Impact of Chat Communication on Computer-Supported Idea Generation Processes Florian Forster, Marc René Frieß, Michele Brocco and Georg Groh, Technische Universität München, Lehrstuhl XI, Boltzmanstr. 3, 85748 Garching, Germany {forster,friess,brocco,grohg}@in.tum.de

Abstract. It has been shown that traditional forms of communication negatively affect idea generation processes. Creativity support systems can help to avoid these impacts and provide alternative means of interaction for which the known negative effects do not apply. In this paper, we investigate the impact of chat communication on computer-supported idea generation processes. The results show that idea quantity, idea quality and group satisfaction are not affected by the option to use a chat as an additional communication channel. This implies that the positive as well as negative influence factors of communication described in previous studies seem to offset each other in computer-supported idea generation processes.
We discuss several of these potential positive and negative influence factors. Furthermore, we discuss why a chat feature can help to lower acceptance barriers for creativity support systems.

1 Introduction

In the 1950s, Alex Osborn [1] proposed four central guidelines for groups to follow in idea generation processes: criticism is ruled out, freewheeling is welcome, quantity is wanted, and combinations / improvements are sought. As of today, brainstorming is probably the most popular and most often applied (group) creativity technique for idea generation processes. According to a study by Fernald and Nickolenko [2], 92% of American companies use the brainstorming technique in meetings. Studies on the effectiveness of this technique consistently show that brainstorming groups yield better results than groups conducting traditional meetings [3]. This is mainly due to the strict separation between a divergent phase, where ideas are only generated but not yet discussed, and a consecutive convergent phase, where ideas are evaluated.

Taylor et al. [4] found that for brainstorming, nominal groups1 are more effective (in both quantity and quality of ideas) than groups where the participants communicate. Conducting creativity techniques as a real group process seems to imply some negative consequences that outweigh the potential benefits. Three factors can explain the decrease of efficiency for interacting groups [3]:

1 In nominal groups, participants work separately from each other and at the end their ideas are merged.

1. Group pressure: Fear of judgment by the other group members and power imbalance (e.g. when hierarchies exist) inhibit participation and can lead to unwanted conformity of idea proposals.

2. Social loafing: Social loafing describes the tendency of group members to do less than their potential would allow them to do. It can occur either if a group member feels isolated from the group or if he feels too submerged.

3. Production blocking: Diehl and Stroebe see production blocking as the dominant factor for efficiency losses in group brainstorming processes [5]. Production blocking refers to the fact that in interacting groups only one member can speak at a time, while the others have to listen; hence all but one member of the group are blocked and cannot work on their own ideas in this time.

Computer support for idea generation can help to mitigate the negative effects of interacting groups. Carte et al. [6] state that because a creativity support system (CSS) is able to anonymize the users and their contributions, the group pressure on the participants is lowered. Shepherd et al. [7] have shown that by improving participation awareness in the process, social loafing effects can be reduced. Finally, when the team members do not communicate verbally, they can enter ideas in parallel using keyboards, and in doing so are not interrupted by others; hence production blocking does not occur in computer-supported idea generation processes. All these factors lead to improved team effectiveness in idea generation sessions [8].

The studies mentioned above clearly indicate that it is preferable to avoid rather than allow verbal communication in idea generation processes, which can be well explained by the effect of production blocking. However, this does not imply that other kinds of communication that enable direct interaction between the participants must have a negative impact on team effectiveness.
For example, using communication means that allow parallel (non-blocking) communication, such as a chat, could be an improvement, as suggested in recent studies on computer-supported group work (see Section 2). In this paper, we present and discuss an empirical study we conducted on the research question of whether a chat is beneficial for computer-supported idea generation processes.

2 Communication in Idea Generation Processes

An idea generation process is a series of activities leading to creative ideas. According to Sternberg [9], an idea is creative if it is "both novel (i.e. original, unexpected) and appropriate (i.e. useful, adaptive concerning task constraints)". Creativity techniques are guidelines that structure idea generation processes, e.g. by defining distinctive phases or by regulating the participants' behavior. These restrictions often affect the communication between the group members, as is the case for the brainstorming technique, where criticism is forbidden during idea generation. The strong empirical evidence on the effectiveness and result quality of creativity techniques with strong restrictions on communication (such as brainstorming) indicates that direct communication may not be a positive or necessary factor for idea generation processes.

Applegate et al. [10] investigated a group decision support system for idea generation and issue analysis in organization planning. The participants could exchange ideas using the system, but had no means of direct communication via the system. Even though the participants were allowed to communicate verbally, the authors observed that approximately 96.6% of the participants' time was spent working on ideas using the system, and only 3.4% was used for non-electronic group interaction and communication. Of this non-electronic communication, a large share (47.54%) was relatively short and technology-oriented. They also noted that the majority of verbal interaction was directed to the session facilitator (57%). Given these small percentage values for "true" task-related direct group communication acts, one could assume that the overall role of direct communication in the process is negligible.

Hilliges et al. [11] investigated the effects of using a tabletop interface in combination with a large wall display for face-to-face group brainstorming. They compared the results of this setting with control groups using the traditional paper-based method. When counting the groups' ideas for the analysis of the experiment, they differentiated between new independent ideas, ideas that built on the participants' own earlier ideas, ideas that resulted from seeing somebody else write down an idea, and ideas that resulted from talking about an idea. 29% of the ideas of the paper-based groups and 26% of the ideas of the electronically supported groups emerged as a result of a communication process. This implies that direct communication is the source of a substantial percentage of ideas, which (in contrast to the studies and arguments presented above) may advocate the point of view that direct communication is beneficial in creative processes.

Comparing more than 40 studies on teams that interact mainly or exclusively using computer systems, Powell et al. [12] come to the conclusion that communication is the key success factor in virtual team situations.
They argue for a high media richness of the communication channels, which contrasts with the principle of creativity techniques for idea generation processes, which tend to limit the communication channels to allow only task-related idea exchanges and to avoid direct communication between the team members. Summarizing major findings in GSS research, Nunamaker et al. [13] point out that support systems increase the number of ideas generated during a divergent (generation) process. Participation tends to be more equally distributed than in traditional meeting scenarios, which is mainly due to anonymity and the possibility to input ideas in parallel. While Nunamaker refers only to parallel input of the generated ideas, we want to investigate the impact of a chat as a channel for direct communication between the participants of an idea generation process. In conclusion, theories and studies of creative processes give ambiguous signals with respect to the question of whether direct communication, as provided by a chat, actually supports idea generation processes or not.

3 Study

3.1 Setting

In order to find out more about the impact of chat communication on computer-supported idea generation processes, we conducted an experimental lab study. As participants we selected computer science students, mainly pursuing their Bachelor's degree. A total of 60 students, divided into 18 different groups, took part in the experiment. The groups were composed randomly by picking the students who had signed up for an experiment date and dividing them into groups of four persons each. This was done in advance. Because not all people showed up, the final groups split into twelve groups with three students each and six groups with four students. Each participant had his own PC, and all PCs were set up with the same configuration (Windows XP, Firefox web browser). Any kind of explicit (e.g. verbal, visual) communication outside of the tool was strictly forbidden. A dedicated facilitator monitored the strict adherence to this rule.

3.2 Experiment

The students used a creativity support system named IdeaStream to find ideas for the given problem "For which purpose could the student fees be used?". IdeaStream allows teams to collaborate on a virtual whiteboard using a broad set of different creativity techniques [14]. The user interface is shown in Fig. 1.

Fig. 1. Screenshot of the virtual whiteboard and the chat in IdeaStream.

On the whiteboard, people can work collaboratively on their ideas by creating new ones, and changing, moving or deleting them. All ideas on the whiteboard are publicly accessible, but without any information on who created or changed them. An idea is represented by a title and a set of components called aspects that represent pieces of information that in turn compose the idea. These pieces of information consist of texts, uploaded images or sketches. The participants used pseudonyms of the form "user x" that were randomly assigned by the system. In half of the sessions, the groups were able to chat with each other by using the integrated chat function. This set consisted of a total of 29 students partitioned into seven groups of three students and two groups of four students. The other half (31 students) had no means of direct communication at all. Those students were divided into five groups of three and four groups of four. At the beginning of each session, the facilitator explained the user interface and the applied creativity techniques.
After that, a 10-minute test case was played through, in order to let the participants get familiar with the user interface and the features of the tool. After this introduction, the main part of the experiment took place. The idea generation process of the experiment consisted of three connected phases, each following a different creativity technique for idea generation (Brainwriting, Unrelated Stimuli and Forced Combination; see [15] for detailed explanations of the techniques). The duration of each phase was 10 minutes, so each group spent 30 minutes on idea generation, which is a typical period for idea generation sessions [18][19]. During those phases, all participants' activities were logged. After completion of all phases, the students evaluated their generated ideas with respect to creativity and feasibility using a score from 0 (worst) to 4 (best).

4 Results

In analyzing the results of the lab study, we were particularly interested in idea quantity, idea quality and participant satisfaction. When describing the results, we refer to the groups having a chat as "chat" and the groups having no chat as "no-chat".

4.1 Idea Quantity

Chat produced a total of 172 ideas, while no-chat generated 215 ideas. In relation to the number of participants, chat averaged 5.9 ideas per group member, being slightly outscored by no-chat with 6.9 ideas per participant.

4.2 Idea Quality

The members of chat rated their ideas with respect to creativity with an average of 2.2 of 4 points, and with respect to feasibility with an average of 2.6 of 4 points. No-chat evaluated their ideas' creativity with an average of 2.0 and their ideas' feasibility with an average of 2.3. So for both criteria, chat assessed slightly higher scores than no-chat. For an external measure, we asked three researchers from our group to rate the ideas as objectively as possible. In this external rating, chat scored 2.1 for both creativity and feasibility, while the ideas of no-chat were rated with an average of 2.2 for creativity and 2.1 for feasibility. So the external rating showed no significant difference in the quality of contributions between the two groups.

4.3 Participant Satisfaction

In a survey that was conducted immediately after the experiment, the participants were able to suggest improvements to the IdeaStream application. In 7 of the 9 groups of no-chat, at least one member requested a chat. However, both groups equally enjoyed working with the IdeaStream tool (4.6 of 6, where 0 means worst and 6 means best). No-chat rated their satisfaction with the group slightly higher than chat (5.0 / 4.8 of 6).

Table 1. Results of the experiment comparing groups with chat and without chat during computer-supported creativity techniques for idea generation.

4.4 Summary

Even though there are numerical differences in the means of most of the relevant variables, t-tests showed that they are not statistically significant. Hence, the hypothesis that chat communication affects a collaborative idea generation process is not supported (H0, saying that there is no difference, cannot be rejected). This is true for idea quantity, idea quality (internal and external) and group satisfaction.
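As a minimal sketch of the significance check reported above: a two-sample t-test compares the condition means. The per-participant idea counts below are made-up placeholders, not the study's raw data.

```python
from scipy import stats

# Hypothetical per-participant idea counts (placeholders; the reported
# means were 5.9 ideas for chat and 6.9 for no-chat).
chat = [5, 7, 6, 4, 8, 6, 5, 7]
no_chat = [6, 8, 7, 5, 9, 7, 6, 8]

# Two-sample t-test; H0: the condition means do not differ.
t, p = stats.ttest_ind(chat, no_chat)
print(f"t = {t:.2f}, p = {p:.3f}")  # H0 is rejected only if p < .05
```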
5 Discussion

The design of the IdeaStream application and its use and configuration in our experiment provide means to counteract the three well-known negative influences that are usually credited with negatively impacting collaborative creative processes. The user awareness functions can help to inhibit social loafing. To prevent negative effects from social pressure, the users had pseudonyms assigned; hence, the fear of negative judgment from others was lowered. Lastly, production blocking effects were minimized, since the system allows parallel input of ideas (and, for the chat groups, of chat messages) and oral communication was strictly forbidden. In this way we provided an environment in which the known negative influence factors on group creativity are mostly suppressed. Based on this, we were able to investigate the question of whether groups in idea generation processes can benefit from a chat as a channel for direct communication, besides the means to enter ideas and view the ideas of others.

5.1 Impact on the Process

Our experiment shows that neither the overall quality nor the quantity of the ideas generated by groups in computer-supported creativity techniques is influenced by the group having the possibility to chat. So, looking at the overall process results, there is no net impact of this additional direct communication at all. However, this does not imply that the process itself is not influenced by the communication. The theories presented in the previous sections provide reasonable arguments on these effects and support the assumption that communication does actually affect the creative group process. Rather, our experiment suggests that the different process effects implied by communication cancel each other out in creative group processes, since both settings yielded the same results. To be able to better support the creative process with creativity support systems, it will be necessary to gain an understanding of the factors which influence communication. As a starting point for future research, we summarized potential positive and negative influence factors, which emerge from direct observations in the experiment as well as personal considerations (see Fig. 2).

Fig. 2. Potential positive and negative influence factors of communication on computer-supported idea generation processes.

Potential positive factors:

1. Information sharing: Letting others participate in your knowledge on a particular subject may enable them to create new or improved ideas. The stimulus through direct communication between the participants may be different from, and additional to, the stimulus received by seeing other people's ideas. Furthermore, people may have an information need with respect to the given problem that they can express via a channel such as a chat. That may also help in producing good-quality ideas for that particular problem.

2. Socio-emotional effects: It can be assumed that direct communication can improve emotional states of mind in a very general way, not only with respect to the initial acceptance of the tool (see below) but also with respect to, e.g., expressing and thus alleviating temporary emotional indispositions.

3. Consensus or commitment building: While in idea generation processes consensus or "general acceptability" with respect to ideas may not be necessary, or in most cases may even be unwanted, direct communication may contribute to a certain consensus or commitment to some ideas that may give participants incentives to improve, structure or reposition the idea in question during the creative process without diluting it or neglecting alternative ideas.

Potential negative factors:
1. Softening and dilution of ideas: An effect that applies to all forms of "free" communication in creative processes, and therefore may also affect chat, is that communication may contribute to the softening and dilution of innovative, radical ideas. One possible reason is the attitude in groups that induces a desire for agreement on proposed solutions. In order to reach this state, the described softening of ideas in view of general acceptability often takes place, which sometimes leads to decreased quality or novelty compared to the original ideas.

2. Distraction: While it can be assumed that a chat is a parallel medium and that electronically mediated communication requires less adherence to the social norms that enforce listening to others while they speak, participants still have to take time to read the chat contributions, which may distract them from their actual task of idea generation.

3. "Unprofessional" use: In the groups that were able to chat we observed tendencies towards unfocused behavior such as joking, which may exert a social force on other participants to join this unfocused and thus potentially distracting behavior.

5.2 Impact on acceptance

As our survey shows, the majority of the groups that were not able to communicate with a chat suggested improving the system by adding a chat feature. This is particularly interesting given the fact that, after all, these groups were not less satisfied with the computer support system or with the group than the groups that actually had the chat feature, so we must assume that actually providing them with a chat would not have positively affected their satisfaction level. Nevertheless, there seems to be an inner need to communicate in group settings in general, or a need for the reassurance of a communication channel in view of the fear of having to use an unknown tool together with the pressure to produce good ideas. As Dennis and Reinicke [17] point out, acceptance of creativity support systems in practice is still weak, despite the positive research results in the field. Our experimental findings suggest that there is a strong a priori demand for having a chat in a creativity support system, so providing a chat can help lower acceptance barriers.

6 Conclusion

In this article we presented the results of a study concerning the use of chat for communication in computer-supported idea generation processes. First we introduced some of the typical communication problems in interacting groups and showed how using creativity support systems can decrease the impact of these factors. Then we reviewed what role recent literature assigns to communication in idea generation processes. Related work gave valid arguments for both positions: that direct communication via a chat during idea generation processes is necessary and beneficial, and that direct communication has a negative influence and thus has to be avoided. For this reason we conducted an experiment, which resulted in approximately equal performance values for teams having chat as a communication tool and teams without a chat function. In the discussion we then interpreted our results and addressed possible causes for them. However, further research is needed, first of all to comprehensively identify possible other effects. Another important question regarding these effects is what influence they have on the idea generation process and its results. This could be investigated by designing new experiments to isolate each of the effects.
In addition, there may be hidden context variables that change the weights of these factors depending on the situation, which makes the investigation even more difficult. Such context variables include, for example, time, place, setting, problem statement and/or type, creativity technique used, etc. For example, the weight of "unprofessional" use may increase if the team members know each other well or if the computer support system is buggy. It is also important to consider that experimental settings have restrictions, especially regarding relationships between participants, motivation and other important aspects that can influence the results of experiments significantly. Therefore we want to emphasize the need to conduct field studies as well, with whose help new effects and aspects of intra-team communication may be discovered. On the other hand, the complexity of field studies is far higher than that of experiments. Trying to isolate effects (as described above) may be very hard or sometimes even impossible in a field study, so we believe that additional lab experiments still have to be used (as far as possible) in a complementary way.

2010_2 !2010 Development of Techniques for the Computational Modelling of Harmony

Raymond Whorley, Geraint Wiggins, Christophe Rhodes, and Marcus Pearce
Centre for Cognition, Computation and Culture, Goldsmiths, University of London, New Cross, London SE14 6NW, UK.
Wellcome Laboratory of Neurobiology, University College London, London WC1E 6BT, UK.
{r.whorley,g.wiggins,c.rhodes}@gold.ac.uk, marcus.pearce@ucl.ac.uk

Abstract. This research is concerned with the development of representational and modelling techniques employed in the construction of statistical models of four-part harmony. Multiple viewpoint systems have been chosen to represent both surface and underlying musical structure, and it is this framework, along with Prediction by Partial Match (PPM), which will be developed during this work. Two versions of the framework are described, starting with the strictest possible application of multiple viewpoints and PPM, and then extending and generalising a little. Some implementation details are reported, as are some preliminary results.

1 Introduction

The problem we are attempting to solve by computational means is this: given a soprano part, add alto, tenor and bass such that the whole is pleasing to the ear. This is not as easy as it might initially appear, as there are many rules of harmony to be followed, which have arisen out of composers' common practice. Rather than providing the computer with rules [1], however, we wish to investigate the process of learning such rules. The idea is to write a program which allows the computer to learn for itself how to harmonise in a particular style, by creating a model of harmony from a corpus of existing music in that style. In our view, however, present techniques are not sufficiently well developed for models to generate stylistically convincing harmonisations (or even consistently competent harmony) from both a subjective and an analytical point of view, although Allan and Williams [2] have demonstrated the potential of this sort of approach. A means of representing music which, when combined with machine learning and modelling techniques, shows particular promise, is multiple viewpoint systems [3]. This framework allows us to model different aspects of the music, and then combine the individual predictions of these models to give an overall prediction.
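As a concrete illustration of this combination step, the sketch below merges the prediction distributions of two hypothetical viewpoint models by weighted geometric combination, one of the techniques discussed in Section 2 below. The event domain, probabilities and weights are invented for illustration and are not taken from any corpus used in this paper.

```python
from math import prod

def combine_geometric(distributions, weights):
    """Weighted geometric combination: the combined probability of an
    event is proportional to the product of the individual viewpoint
    probabilities raised to their respective weights."""
    events = set().union(*distributions)
    combined = {e: prod(d.get(e, 1e-6) ** w  # small floor for unseen events
                        for d, w in zip(distributions, weights))
                for e in events}
    total = sum(combined.values())
    return {e: p / total for e, p in combined.items()}

# Hypothetical next-pitch predictions from two viewpoint models.
p_from_cpitch = {60: 0.5, 62: 0.3, 64: 0.2}
p_from_cpint = {60: 0.2, 62: 0.6, 64: 0.2}
print(combine_geometric([p_from_cpitch, p_from_cpint], [1.0, 1.0]))
```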
Our research aims to make a theoretical contribution to the field of computational creativity in the domain of music by extending the multiple viewpoint framework in order to cope with the complexities of harmony, such that improved computational models of four-part harmonisation can be created. This is not merely an application to harmony of the framework as it stands. This paper is concerned with two versions of the framework, beginning with a very strict application, and then extending and generalising a little.

2 Brief Description of Multiple Viewpoint Systems and Their Evaluation

See Table 1 for a list of basic and derived viewpoints (not exhaustive) and their meanings. Basic types are the fundamental attributes that are predicted, such as cpitch and dur. Derived types such as cpint and dur-ratio are derived from, and can therefore predict, basic types (in this case cpitch and dur respectively). Threaded types are defined only at certain positions in a sequence, determined by Boolean test viewpoints such as tactus; for example, (cpitch ⊖ tactus) has a defined cpitch value only on tactus beats (i.e., the main beats in a bar). A linked type, or product type, is the conjunction of two or more viewpoints; for example, dur-ratio ⊗ cpint is able to predict both dur and cpitch. See also [3] for more details.

Table 1. Basic and derived viewpoint types (not exhaustive).

dur: duration of event
barlength: number of time units in a bar
cont: event continuation, or not
phrase: event at start or end of phrase
cpitch: chromatic pitch
piece: event at start or end of piece
ioi: difference in start-time
contour: descending, level, ascending
posinbar: position of event in the bar
cpintfref: pitch interval from tonic
metre: metrical importance of event
inscale: event in major scale, or not
cpint: sequential pitch interval
dur-ratio: sequential duration ratio
fib: on first beat of bar, or not
liph: last event in phrase, or not
tactus: event on tactus pulse, or not
fip: first event in piece, or not
fiph: first event in phrase, or not

N-gram models are Markov models employing sub-sequences of n symbols. The probability of the nth symbol, the prediction, depends only upon the previous n − 1 symbols, the context. The number of symbols in the context is the order of the model. See [5] for more details. What we call a viewpoint model is a weighted combination of various orders of n-gram model of a particular viewpoint type. The n-gram models can be combined by, for example, Prediction by Partial Match (PPM) [6]. PPM makes use of a sequence of models, which we call a back-off sequence, for context matching and the construction of complete prediction probability distributions. The back-off sequence begins with the highest order model, proceeds to the second-highest order, and so on. An escape method determines prediction probabilities at each stage in the sequence.

A multiple viewpoint system comprises more than one viewpoint. The prediction probability distributions of the individual viewpoint models are combined by employing a weighted arithmetic or geometric [10] combination technique. See [7] for more information. Conklin [7] introduced the idea of using a combination of a long-term model (LTM), which is a general model of a style derived from a corpus, and a short-term model (STM), which is constructed as a piece of music is being predicted or generated. The latter aims to capture musical structure particular to that piece.
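The following toy sketch shows the mechanics just described: n-gram counts of several orders over a single viewpoint sequence, with prediction backing off from the highest order to shorter contexts. It deliberately omits PPM's escape probabilities, so it is only a rough stand-in for the real back-off sequence, and the training sequence is invented.

```python
from collections import defaultdict

class BackoffNGram:
    """Toy viewpoint model: n-gram counts of orders 0..max_order with
    naive back-off (no PPM escape probabilities)."""
    def __init__(self, max_order=2):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequence):
        for i, symbol in enumerate(sequence):
            for n in range(self.max_order + 1):
                if i - n >= 0:  # count symbol after each context of length n
                    self.counts[tuple(sequence[i - n:i])][symbol] += 1

    def predict(self, context):
        # Back-off sequence: try the highest order first, then shorter ones.
        for n in range(self.max_order, -1, -1):
            ctx = tuple(context[-n:]) if n else ()
            if ctx in self.counts:
                total = sum(self.counts[ctx].values())
                return {s: c / total for s, c in self.counts[ctx].items()}
        return {}

# A long-term model is trained once on a corpus; a short-term model would
# be trained incrementally on the piece being predicted, and the two
# resulting distributions combined as sketched earlier.
ltm = BackoffNGram(max_order=2)
ltm.train([60, 62, 64, 62, 60, 62, 64, 65])
print(ltm.predict([60, 62]))
```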
An information-theoretic measure, cross-entropy, is used to guide the construction of models, to evaluate them, and to compare generated harmonisations. The model assigning the lowest cross-entropy to a set of test data is likely to be the most accurate model of the data. See [5] for more details.

3 Development of the Multiple Viewpoint and PPM Frameworks

Version 1: Strict Application of Multiple Viewpoints and PPM. The starting point for the definition of the strictest possible application of viewpoints is the formation of vertical viewpoint elements [8]. An example of such an element is {69, 64, 61, 57}, where all of the values are from the domain of the same viewpoint, and all of the parts (soprano, alto, tenor and bass) are represented. This method reduces the entire set of parallel sequences to a single sequence, thus allowing an unchanged application of the multiple viewpoint framework, including its use of PPM. Only those elements containing the given soprano note are allowed in the prediction probability distribution, however. This is the base-level model, to be developed with the aim of substantially improving performance.

Version 2: Dividing the Harmonisation Task into Sub-tasks. In this version, it is hypothesised that predicting all unknown symbols in a vertical viewpoint element (as in version 1) at the same time is neither necessary nor desirable. It is anticipated that by dividing the overall harmonisation task into a number of sub-tasks [2] [9], each modelled by its own multiple viewpoint system, an increase in performance can be achieved. For example, given a soprano line, the first sub-task might be to generate the entire bass line. This version allows us to experiment with different arrangements of sub-tasks. For example, having generated the bass line, is it better to generate the alto and tenor lines together, or one before the other? As in version 1, vertical viewpoint elements are restricted to using the same viewpoint for each part. The difference is that not all of the parts are now necessarily represented in a vertical viewpoint element.

4 Implementation

At present, the corpus comprises fifty major key hymn tunes, and the test data five, harmonised as in [4]. The Lisp implementation of version 1 is capable of predicting or generating the attributes dur (note duration), cont (note continuation, which is the part of an already sounding note that continues to be heard when a new note is sounded) and cpitch (chromatic pitch) for the alto, tenor and bass parts, given the soprano. More than forty viewpoints have been implemented, and any link between two viewpoints which is capable of predicting dur, cont or cpitch is allowed. A modification of the feature selection algorithm described in [10], which involves ten-fold cross-validation of the corpus, is used to optimise multiple viewpoint systems for the long-term model alone, the short-term model alone, or for both together (in which case the same system is used for both). The maximum order of the n-gram models can be varied, as can the method of combining prediction probability distributions, which are initially created using PPM with escape method C. Parameters (biases) affecting the weighting of distributions during combination can also be varied. Version 2 extends version 1, and is implemented as described in Section 3.
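Since cross-entropy is the yardstick used throughout the results below, a minimal sketch of its computation may be useful. It assumes a model object with a predict(context) method returning a distribution over the next symbol, as in the earlier back-off sketch; the floor probability for unseen symbols is our own addition, to keep the logarithm defined.

```python
from math import log2

def cross_entropy(model, test_sequence, max_order=2):
    """Average number of bits per event the model needs to encode the
    test sequence: lower values indicate a better model of the data."""
    bits = 0.0
    for i, symbol in enumerate(test_sequence):
        context = test_sequence[max(0, i - max_order):i]
        p = model.predict(context).get(symbol, 1e-6)  # floor for unseen symbols
        bits -= log2(p)
    return bits / len(test_sequence)
```

Computed over held-out data, this is the quantity reported in the results that follow, for example 4.46 bits per event for the best LTM + STM system.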
5 Preliminary Results

Table 2 shows the lowest cross-entropy version 1 multiple viewpoint systems found so far for prediction of dur, cont and cpitch. These are for a combination of long-term and short-term models (LTM and STM, with a cross-entropy of 4.46 bits per event), LTM only (with a cross-entropy of 4.54 bits per event), and STM only (with a cross-entropy of 6.20 bits per event), using weighted geometric combination. This confirms the findings of previous research, for example that of Pearce [10], that using both LTM and STM results in a lower cross-entropy than the use of either of them alone. What is particularly interesting, however, is the fact that the STM system does not share a single viewpoint with the LTM + STM system, and has only one viewpoint in common with the LTM system; this is in stark contrast with the substantial overlap between the LTM + STM system and the LTM system. This prompted us to try using two different multiple viewpoint systems together, one optimised for the LTM and the other separately optimised for the STM; but with a cross-entropy of 4.51 bits per event, this turned out to be not as good a model as LS in Table 2. For prediction of cpitch only, the best version 1 LTM system found so far results in a cross-entropy of 3.29 bits per event. By comparison, the best version 2 LTM system found so far predicts the bass first (1.70 bits per prediction), followed by the alto and tenor together (1.55 bits per prediction), giving a total cross-entropy of 3.25 bits per event. For prediction of cpitch only, then, version 2 appears to be very slightly better than version 1. It is worth noting that the best version 2 system reflects the usual human approach to harmonisation: bass first, followed by alto and tenor together.

Table 2. Best version 1 multiple viewpoint systems (predicting dur, cont and cpitch) for LTM + STM (LS), LTM only (L) and STM only (S); each × marks membership of one of these systems.

cont ⊗ cpint: × ×
(cpintfref ⊖ fiph) ⊗ piece: ×
cont ⊗ (cpintfref ⊖ tactus): × ×
cpitch: × ×
dur ⊗ (cpintfref ⊖ liph): × ×
dur-ratio ⊗ (ioi ⊖ fib): ×
cont ⊗ metre: × ×
dur-ratio ⊗ phrase: ×
dur ⊗ posinbar: × ×
dur ⊗ cont: ×
cpintfref: × ×
cont ⊗ (cpitch ⊖ tactus): ×
dur ⊗ liph: × ×
inscale: ×
(cpintfref ⊖ liph): × ×
contour: ×
(cpintfref ⊖ fiph) ⊗ fip: × ×
cpitch ⊗ tactus: ×
cpint ⊗ cpintfref: ×
cpitch ⊗ (cpintfref ⊖ liph): ×
(cpintfref ⊖ fib): ×
inscale ⊗ barlength: ×
cont ⊗ (cpintfref ⊖ liph): ×
cpitch ⊗ (cpintfref ⊖ fiph): ×

6 Conclusions and Future Work

We have described two versions of the multiple viewpoint framework and PPM, motivated by our aim to take account of the complexities of four-part harmony. The preliminary results weakly indicate that version 2 is better than version 1 for the prediction of cpitch only. They also suggest the perhaps counter-intuitive conclusion that optimising the LTM and STM together leads to a better model than optimising them separately. This latter result opens interesting routes for further work. Finally, using the LTM alone is less good still; and the STM alone is, as expected, by far the least good model. In the immediate future, we intend to implement other versions which push the development of the multiple viewpoint/PPM framework further.

2010_20 !2010 Live Coding Towards Computational Creativity

Alex McLean and Geraint Wiggins
Goldsmiths, University of London, United Kingdom
ma503am@gold.ac.uk, WWW home page: http://doc.gold.ac.uk/ma503am/

Abstract. Live coding is a way of improvising music or video animation through live edits of source code, using dynamic language interpreters.
It requires artists to represent their work within a formal computer language, using a higher level of abstraction than is the norm. Although the only creative agents in live coding performance are human, this abstraction makes the practice of interest to the field of computational creativity. In this paper live coders are surveyed for their thoughts on live coding and creativity, related to the aims of building creative agents.

1 Introduction

Live coding is the writing of rules in a Turing-complete language while they are followed, in order to improvise time-based art such as music, video animation or dance. This is a relatively new approach, receiving a surge of interest since 2004, through both practice and research [1-6]. Live coding is most visible in performance; however, the `live' in live coding refers not to a live audience but to live updates of running code. Conventionally, humans write the code and software follows it, although some experimental dance improvisations have used both human rule makers and human rule followers. Whether human live coders can be replaced by software creative agents is a question for the field of computational creativity, which we hope to have at least clarified by the end of this paper. In contrast to live coding, generative art is output by programs unmodified during execution, which often have no user interface at all. The lack of control over such programs has led to a great deal of confusion around the question of authorship. When watching a piece of software generate art without guidance, onlookers ask "is the software being creative?" There is no such confusion with live coding: there is a human clearly visible, making all the creative decisions and using source code as an artistic medium. In fact there is no difference of authorship between live coded and generative art. A programmer making generative art goes through creative iterations too; only, after each edit, they have to restart the process before reflecting on the result. This stuttering of the creative process alone is not enough to alter authorship status. If the computer's role in a live coding performance is uncreative, then what is this paper doing submitted to a computational creativity conference? Well, as a new way of producing art using formal systems, it is hoped that live coding can give a unique clarifying perspective on issues of computational creativity, and perhaps even become a stepping stone towards a creative software agent.

2 Live coders on computational creativity

A survey was carried out with the broad aim of gathering ideas for the study of computational creativity. The members of TOPLAP [3], an active live coding community, were asked to fill out an on-line survey, and 32 responded. To avoid prejudice, the word `creativity' was not used in the invitation or survey text, and pertinent questions were mixed with more general questions about live coding.

2.1 Results

The subjects. Users of the six pre-eminent live coding environments were represented, between five and fourteen for each system (many had used more than one). Background questions indicated a group with a generally rich musical background. There was a diverse range of approaches to the question of how to define live coding in one sentence, and the reader is referred to the on-line appendix to read the responses (http://doc.gold.ac.uk/ma503am/writing/icccx/).
While the responses show some diversity of approach, because the subjects had all used at least one of the main languages it seems safe to assume that they are working to largely the same technical definition.

Creating language. Computer users often discuss and understand computer programs as tools, helping them do what they need efficiently. For a programmer it would instead seem that a computer language is an immersive environment to create work in. It is interesting then to consider to what extent live coders adapt their computer languages, personalising their environments, perhaps to aid creativity. Over two thirds (69.0%) had collected functions into a library or made an extensive suite of libraries. This is analogous to adding words to a language, and shows the extent of language customisation. A smaller proportion (20.7%) had gone further and implemented their own language interpreter, and fewer still (17.2%) had designed their own language. That these artists are so engaged with making fundamental changes to the language in which they express their work is impressive.

Code and style. From the perspective of computational creativity, it is interesting to focus on the code that live coders produce. Their code is not their work, but a high level description of how to make their work. A creative computational agent would certainly be concerned with this level of abstraction. An attempt at quantifying how live coders feel about their code was made by asking "When you have finished live coding something you particularly like, how do you feel towards the code you have made (as opposed to the end result)?" Over half (56.7%) indicated that the code resulting from a successful live coding session was a description of some aspect of their style. This suggests that many feel they are not encoding a particular piece, but how to make pieces in their own particular manner. Around the same number (50.0%) agreed that the code describes something they would probably do again, which is perhaps a rephrasing of the same question. A large number (83%) answered yes to either or both questions. There are many ways in which these questions can be interpreted, but the suggestion remains that many subjects feel they have a stylistic approach to live coding that persists across live coding sessions, and that this style is somehow represented in the code they make.

Live coding as a novel approach. The subjects were asked the open question "What is the difference between live coding a piece of music and composing it in the sequencer (live coding an animation and drawing one)? In other words, how does live coding affect the way you produce your work, and how does it affect the end result?" Some interesting points relevant to computational creativity are selectively quoted for comment here; the reader is again directed to the on-line appendix to read the full responses.

"I have all but [abandoned] live coding as a regular performance practice, but I use the skills and confidence acquired to modify my software live if I get a new idea while on stage."

The admission that getting new ideas on stage is infrequent makes an important and humble point. In terms of the Creative Systems Framework (CSF) [8, 7] we can say that live coding is useful in performance if you need to transform your conceptual space (the kind of work you want to find or make), or your traversal strategy (the way you try to search for or make it).
However, as with this subject, transformational creativity is not always desirable in front of a paying, risk-averse audience.

"When I work on writing a piece ... I can perfect each sound to be precisely as I intend it to be, whereas [when] live coding I have to be more generalised as to my intentions."

This makes the point that live coders work at least one level of abstraction away from enacting individual sounds.

"Perhaps most importantly the higher pace of livecoding leads to more impulsive choices which keeps things more interesting to create. Not sure how often that also creates a more interesting end result but at least sometimes it does."

Live coding allows a change in code to be heard or seen immediately in the output, with no forced break between action and reception. This would be a surprise to those whose experience of software development is slow and arduous.

"Live coding has far less perfection and the product is more immediate. It allows for improvisation and spontaneity and discourages over-thinking."

This may also come as a surprise; live coding has a reputation for being cerebral and overly technical, but in reality, at least when compared to other software based approaches, the immediacy of results fosters spontaneous thought.

"Live Coding is riskier, and one has to live with [unfit decisions]. You can't just go one step back unless you do it with a nice pirouette. Therefore the end result is not as clean as an "offline-composition", but it can lead you to places you [usually] never would have ended."

This comment is particularly incisive; the peculiar relationship that live coders have with time does indeed give a certain element of risk. Thinking again within the CSF [7], such riskier ways of making music are more likely to produce aberrant output, providing the opportunity to adjust your style through transformational creativity.

"... while Live Coding is a performance practice, it also offers the tantalising prospect of manipulating musical structure at a similar abstract level as 'deferred time' composition. To do this effectively in performance is I think an entirely different skill to the standard 'one-acoustic-event-per-action' physical instrumental performance, but also quite different to compositional methods which typically allow for rework."

This really gets to the nub of what live coding brings to the improvising artist: an altered perspective of time, where a single edit can affect all the events which follow it.

Live coding towards computational creativity. The subjects were given a series of statements and asked to guess when each would become true. Regrettably there was a configuration error early on in the survey period, requiring that the answers of two subjects were discarded. Optimism for the statement "Live coding environments will include features designed to give artistic inspiration to live coders" was very high, with just over half (51.9%) claiming that it was already true, and two fifths (40.7%) agreeing it would become true within five years. This indicates strong support for a weak form of computational creativity as a creative aide for live coders. Somewhat surprisingly, optimism for the stronger form of creativity in "Live code will be able to modify itself in an artistically valued manner" was also high, with two fifths (40.7%) claiming that this was already possible. If that is the case, it would be appreciated if the live code in question could make itself known, although it seems more likely that ambiguity in the question is at fault.
A little more pessimism is seen in response to "A computer agent will be developed that produces a live coding performance indistinguishable from that of a human live coder", with a third (34.6%) agreeing this will never happen. This question is posed in reference to the imitation game detailed by Alan Turing [9]. However, as one subject commented, "the test indistinguishable from a human is very loose and there can be some very bad human live coding music." That would perhaps explain why half (50.0%) thought the statement was either already true or would become so within five years.

3 Conclusion

What if a musicology of live coding were to develop, where researchers deconstruct the code behind live coding improvisations as part of their work? Correlations between expressions in formal languages and musical form in sound could be identified, and the development of new ways of expressing new musical forms could be tracked. If successful, the result need not be a new kind of music, but could be a music understood in a novel way. It is this new computational approach to understanding music that could prove invaluable in the search for a musically creative software agent. In looking at creativity through the eyes of live coders, we can see some promise for computational creativity even at this early stage of development of both fields. Live coders feel their musical style is encoded in the code they write, and that their language interfaces provide them with inspiration. They are actively developing computer languages to better express the music they want to make, creating computer language environments that foster creativity. From here it is easy to imagine that live coding environments could become more involved in the creation of higher order conceptual representations of time-based art. Perhaps this will provide the language, environment and application in which a computational creative agent will thrive.

2010_21 !2010 On Two Desiderata for Creativity Support Tools

Wai K. Yeap1, Tommi Opas1 and Narges Mahyar2
1 Centre for Artificial Intelligence Research, Auckland University of Technology, New Zealand
2 Department of Artificial Intelligence, Faculty of Computer Science and Information Technology, University of Malaya, Malaysia

Abstract. This paper discusses two important desiderata for developing creativity support tools, namely ideation and empowerment. We then use them to guide us in designing a new individual creativity support tool codenamed Creative-Pad. Creative-Pad is designed to assist individual advertising creatives to develop creative ideas for advertisements. For ideation, Creative-Pad searches and filters information automatically from the internet to present the user with related words and exemplar sentences. For empowerment, Creative-Pad is designed in such a way that the user is neither distracted nor burdened by any tasks unrelated to conjuring up a creative idea for a new advertisement. Creative-Pad is fully implemented, and some preliminary results of its use by advertising creatives are reported.

1 Introduction

Developing a creativity support tool is an exciting and a very challenging problem for HCI researchers. This is because the interaction between the computer and the human in this task is almost magical. That is, despite significant past research into the nature of creativity (for example, see [1, 4, 19, 20]), we do not understand how the mind works in a creative way.
Yet, the challenge here is to develop tools to assist the mind in its creative endeavour. Shneiderman [17, p. 116] remarked: "Developing software tools to support creativity is an ambitious, but some would say, vague goal." Developers of such tools thus face some acute problems. For example, when users of such tools fail to develop a creative solution, it is difficult to know where the problem lies. There are many other factors, such as a lack of attention, skills, or interest, which could affect the user's performance. The interplay of these factors occurs in the mind of the user and is therefore difficult to disentangle. Without doing so, it would be difficult to develop a set of criteria or a framework for developing and evaluating these tools (although attempts have been made, see [16, 18]). Another example is a general lack of distinction between a creativity support tool and a problem-solving tool. For instance, if one were to use a sketch pad to help sketch out various ideas, should that be a creativity support tool or a drawing tool? Distinguishing between them might not be a straightforward task. This is because a creativity support tool is often perceived to be very much a part of a problem-solving tool. Yet, without making this distinction one could complicate the design of such tools or, worse, be confused about the kind of tool that one is supposed to design. Attempts to define creativity support tools in the past thus tend to be quite comprehensive. For example, Lubart [10] considered four categories: computer as nanny, pen-pal, coach and colleague, while Johnson and Carruthers [7] considered three classes which range from tools that do not produce creative ideas/artefacts to those that could assist in many different ways. Although these classifications provide a good scope for discussing work in this area in general, they lack definitive statements about these tools, in particular those designed to assist individuals in solving a particular problem. In this paper, we discuss two desiderata for developing creativity support tools, namely ideation and empowerment. The former emphasizes generating new ideas for the user; the latter, empowering the user to be creative. Section 2 discusses these two desiderata in detail. We then show how they guide us in our design and implementation of a new creativity support tool, codenamed Creative-Pad. Creative-Pad is designed to assist the individual advertising creative. An advertising creative (or, in short, a creative) is a person working for an advertising agency who is responsible for developing creative ideas for a new advertisement. For ideation, we emphasize developing a process which generates ideas that bear some relation to the problem at hand. For empowerment, we emphasize providing the user ample time to conjure up his/her idea for a new advertisement while being hinted with some "seed" ideas. Section 3 discusses the design and implementation of Creative-Pad. Section 4 concludes the paper with a general discussion of future work and the lessons learned from developing Creative-Pad.

2 Desiderata

The first desideratum, ideation, emphasizes a process which has been well observed to be an inherent part of creative thinking, namely the ability to generate/discover new ideas. Much has already been said in the literature regarding the way in which ideas emerge in a creative process, first as a set of divergent ideas and then as a set of convergent ideas.
However, we argue that the ideation process implemented in any creativity support tool should focus only on generating a set of divergent ideas. In particular, we consider the convergent part, for now, to be the responsibility of the user. Partly this is because we lack an understanding of how creative thinking arises, and partly this provides a clear goal for designing these tools. Otherwise, the design of these tools would become too intertwined with the two roles and thus unnecessarily complicated. Furthermore, the ideation process should be able to generate a new set of ideas when used repeatedly, and the ideas generated must somehow be able to inspire the user to then work towards a creative solution. If not, the tool itself will be limited in its ability to support creative thinking, both in terms of quantity and quality of ideas generated. It is worthwhile distinguishing creative thinking tools from creativity support tools as defined here. The former incorporate methods (such as Osborn's [13] brainstorming, de Bono's [5] lateral thinking, and MacCrimmon and Wagner's [11] techniques for "making connections") which encourage users to come up with new ideas themselves, whereas the latter automatically generate ideas to inspire/lead the users to develop a more creative solution for the problem at hand. Developing such an ideation process suggests that its design needs to be crafted in a way that combines both the need for fresh ideas and the need for ideas that are in creative ways linked to the problem at hand. Consequently, attention needs to be paid to the exact nature of the creative aspect of the problem for which the tool is designed, and to where the possible source of inspiration lies. Bonnardel [2, p. 158], in analysing the use of analogies in creative activities, also emphasized the importance of knowing "the nature of the situations that can be used as sources of inspiration". The second desideratum, empowerment, relates to the tool's usability. Following from the first desideratum, it becomes clear that this should be about empowering the users to develop their ideas freely. By "freely", it is meant with as little interruption as possible to the user's thought process. Again, this is important because we lack an understanding of how creative thinking arises. The user is best left alone to develop a creative solution.

Fig. 1. Two different views of a creativity support tool: (a) creativity support embedded in a problem-solving environment; (b) creativity support viewed as independent moments in a problem-solving process; after each session, the tool could be updated (dotted arrow).

One way to ensure that this desideratum is met is to separate one's initial thinking process from one's later "action" process. The latter is when the user begins to implement his/her initial ideas fully. A creativity support tool designed to assist only the former activity would require an interface whereby the user literally has to do nothing. In contrast, researchers who focus on developing a suitable environment to support creative thinking (for example, [6]) often tend to provide many tools, or a tool with many functions. These environments often allow the users to experiment with alternative ideas prior to moving on to developing the final idea. Creativity support in this latter case is very much embedded as part of the problem-solving process (see Fig. 1a), whereas in our approach, creativity support is very much an independent process (see Fig. 1b).
It captures the moments when users take time out to think about certain aspects of the problem. There could be more than one such moment, and each might require the use of a different creativity support tool.

3 Creative-Pad: Design and Implementation

The process of creating an advertisement can succinctly be described as having three key elements, namely a message, an idea, and an execution. A creative first draws out a message from the brief describing the product. Then he/she develops an idea, which in turn is executed to produce an advertisement. It is clear that much of the process of finding some initial good ideas for developing an advertisement involves word association. It is interesting to note that this method is also one of the simplest and most popular methods used for generating ideas in many commercial creativity programs. For example, IdeaFisher is one such program; it has an idea bank of more than 700,000 word associations to generate ideas for its user. However, creatives working in this area have often noted that if the word associations are generated in a manner unrelated to the problem at hand, there is a danger that the ideas generated may not be of much use [15]. Furthermore, Poltrack [15] noted that for advertising, related ideas should come from "all corners of life" and one needs to "stay tune with the world". These observations suggest the need for a rich source of contemporary ideas. Without doubt, one rich source of such contemporary ideas is the World Wide Web, or in short, the web. It also has the added advantage of being a huge resource which is readily available, constantly updated, and whose information comes literally from "all walks of life". However, using search tools to retrieve information from the web often produces an overwhelming amount of information [9]. Consequently, researchers are constantly designing new ways to help filter the information. Of particular interest here are Otsubo's Goromi-Web [14] and Koh et al.'s combinformation [8]. The former extracts and displays keywords that frequently appear in the search results and also displays images and blocks of text as floating images on the screen. The latter extracts text and image clippings from the found documents. These clippings are then presented to the user in a composition space as a group of related ideas with which the user can interact. For instance, he or she could mouse over a clipping to retrieve further information. However, unlike Goromi-Web and combinformation, the ideation process for Creative-Pad must not distract the user unnecessarily. Consequently, we designed a process which does not engage the user in its search for ideas. The user simply enters the keywords from the message, and the extraction of ideas is then done automatically. Our current ideation algorithm is as follows (a sketch of these steps is given below):

1. The user enters keywords from the message;
2. A request is sent to a search engine (currently Altavista.com) for related information on the web;
3. Links are extracted from the results returned, and HTML files are downloaded from each link;
4. All sentences that contain the keywords are extracted from the HTML files;
5. "Interesting" words are extracted from these sentences; for now, a simple algorithm is used: we extract all adjectives and verbs found in these sentences.

Fig. 2. Creative-Pad interface.

Thus, the ideation process produces as ideas a set of words and sentences.
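A minimal sketch of steps 3-5 of this algorithm follows. The search-engine query of step 2 is abstracted away as a list of result URLs, and the part-of-speech step is reduced to a hypothetical is_adjective_or_verb() predicate supplied by the caller (the paper does not specify how adjectives and verbs are identified); none of the names here come from the Creative-Pad implementation itself.

```python
import re
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def fetch_text(url):
    """Step 3: download one HTML file and strip its markup."""
    with urllib.request.urlopen(url, timeout=10) as response:
        parser = TextExtractor()
        parser.feed(response.read().decode("utf-8", errors="ignore"))
    return " ".join(parser.chunks)

def ideate(keywords, result_urls, is_adjective_or_verb):
    """Steps 4-5: keep sentences containing the keywords and extract
    'interesting' words from them."""
    sentences, words = [], set()
    for url in result_urls:  # links extracted from the search results
        text = fetch_text(url)
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if any(k.lower() in sentence.lower() for k in keywords):
                sentences.append(sentence.strip())
                words.update(w for w in re.findall(r"[A-Za-z]+", sentence)
                             if is_adjective_or_verb(w))
    return words, sentences
```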
To provide an environment conducive for the user to develop his/her own creative ideas for an advertisement, we developed an interface whereby the demand on the user is again kept to a minimum. Basically, Creative-Pad projects ideas onto the screen, and the user constantly works on his/her ideas with little interruption. Each session using Creative-Pad consists of the following steps:

1. Words, in randomly assigned colours, are beamed to the user one at a time, up to a maximum of 150 words per screen (see Fig. 2a). While the words are presented, music is played in the background. Note that words are deliberately presented in an overlapping fashion to discourage the user from reading the words.
2. As each word appears, the user can select it if he/she finds it interesting or inspiring. This is done by either typing the word in the box at the bottom of the screen or clicking the save button (see Fig. 2a).
3. When the maximum number of words is displayed, the screen pauses. The user is given time to work on any ideas brewing in his/her mind.
4. If there are still words to be displayed, the user can repeat step 2; if the user has had enough ideas, he/she can go to the next step.
5. The user-selected words are then re-beamed to the user together with some randomly generated words. The latter are added partly to add some random ideas and partly to increase the number of ideas in case the user has selected insufficient words.
6. When all the words are displayed, they are rotated for a few seconds and redisplayed in random positions. During this time, the user is supposed to continue developing his/her ideas. He/she might record his/her ideas by typing a succinct sentence at the bottom of the screen.
7. When the user has had sufficient time to develop his/her ideas, he/she presses the continue button to move to the next step.
8. Creative-Pad then displays sentences to the user, moving them from the top right corner to the bottom left corner of the screen. These sentences contain the user-selected words, and each sentence is numbered (see Fig. 2b). The user can select those sentences of interest by entering their numbers in the box at the bottom of the screen.
9. Finally, the user views all words and sentences that he/she has selected or created. He/she continues to develop his/her ideas further.
10. When the session concludes, a report is generated which contains all the words and random words generated during the session, and all the words, sentences and idea sentences selected or created by the user during the session.

Fig. 3 shows our basic model for Creative-Pad. Step 1 is the ideation process, which consists of retrieving information from the web using some keywords from the message and then processing the information for relevant ideas. In Step 2, ideas are presented to the user in four phases. Firstly, the retrieved words, W1…Wt, are presented to the user to help evoke possible ideas for an advertisement. The user at this stage only needs to select the words that are of interest to him/her. Secondly, the selected words, Ws1…Wsm, are re-displayed, if necessary together with some randomly generated words, R1…Rp. The user at this stage has a chance to refine his/her thoughts further and describe each of them with a single sentence. Thirdly, sentences, S1…So, which contain the selected words, Ws1…Wsm, are displayed as exemplars of ideas using those words. The user at this stage can select sentences which are of interest to him/her.
Finally, both selected words and sentences are displayed, and the user can continue to refine his/her ideas.

Fig. 3. Basic model of Creative-Pad.

Several experiments were conducted in which creatives were asked to use Creative-Pad to develop ideas for some imaginary advertisements. Details of these experiments are reported in [12]. One such experiment, with two creatives from two different advertising companies, is reported here. The experiment was conducted as follows. Duration: 30 minutes. Instruction given: You will need to enter the phrase "car + family + space" into Creative-Pad and, during the experiment, develop ideas for an advertisement with the message "a car with more family space". At the end of the experiment, use an A4-sized paper and a black marker to draw or write whatever you think you need to describe your concept.

Note that although ideas generated using Creative-Pad are not intended to be used immediately to generate a graphic display, we nonetheless asked the creatives to do so. This might help us to understand better the ideas currently in the minds of the creatives. Ideally though, they should put the ideas in the drawer and re-visit them later to produce the advertisement required. Figs. 4a and 5a show the words and sentences selected, and the ideas generated by the user (arrows indicate a possible connection between the idea generated and the words/sentences selected). Figs. 4b and 5b show the graphic output of a possible ad.

Fig. 4. Result obtained using keywords: car, family, space. (a) Selected words, user's ideas and selected sentences; (b) sketch of a possible ad.

Fig. 5. Another result obtained using keywords: car, family, space.

For Fig. 4, two themes emerge from the words selected by the creative: room and them. Interestingly, the word "room" appears in one of the selected sentences. The number of words displayed was: 234 adjectives, 43 verbs, and 30 random words. The creative selected 9 words and 2 sentences, and generated 4 ideas. For Fig. 5, the emerging theme is a camper car. The number of words displayed was: 283 adjectives, 78 verbs, and 20 random words. The user selected 22 words and 4 sentences, and generated 2 ideas.

4 Discussion and Conclusion

Given that creative thinking is still very much a mysterious process, we argued that the development of creativity support tools should focus on:

1. Ideation: the tool must be able to generate a new set of ideas when used repeatedly, and the ideas generated must somehow be able to inspire the user to then work towards a creative solution. To achieve the latter, we find words which have some relevance to the problem at hand.
2. Empowerment: the tool must be designed to support the user in being creative in deriving a solution to his/her problem.
It aids the "thinking" part as opposed to the "action" part of solving a problem, and it should cause minimum interference to the user's thought process.

With the above properties, a creativity support tool can generate ideas that are more relevant than those of a tool which supports creative thinking in general, and it supports finding a solution without solving the problem itself. The web undoubtedly provides a rich source of information for the ideation process in Creative-Pad. However, unlike many of the earlier approaches which use this resource, we do not involve the user in searching and filtering the information retrieved. The ideation process automatically extracts what it believes is useful, and the user's role is to focus on developing a creative solution based upon the information presented to him/her. In this implementation, a simple algorithm is used for extracting words. Nonetheless, the resource is so rich that our simple mechanism proves to be adequate, in the sense that the creatives found the information useful and interesting. In addition to the words, we also present sentences that contain the words that the creatives have chosen. One surprising finding is that many of them found the sentences generated interesting. One possible explanation is that these sentences are like tidbits of news or opinions or commentaries related to the product at hand, and thus they keep the creatives informed. They enable the creatives to "stay tune with the world", as if they were gathering the information from a stroll down the street. The creatives could develop an advertisement to position the product in light of these tidbits of information. Our initial experiments with Creative-Pad were not intended to test the effectiveness of Creative-Pad in helping the creatives to generate creative ideas. As noted earlier in the introduction, such a test is ill-defined, at least for the moment. Rather, they were designed to test whether the ideas generated are sufficiently interesting for the creatives. This, we argue, proved to be the case. Creative-Pad can now serve as a platform for developing and testing these algorithms further. However, a closer inspection of the words generated shows that many of the words are judged not interesting. Furthermore, if we treat each word generated as an idea for the creatives, Creative-Pad generated on average 200 words per experiment, and the creatives chose, on average, about 10-15 words. Does the ideation process need to be more "thoughtful" in generating ideas? The initial phase of our ideation process closely resembles a brainstorming session, an extremely popular idea in creative thinking which has influenced our implementation of the ideation process. Is brainstorming, where quantity of ideas is of the essence, the only and best way to interact with the creatives? Or would it be better to present fewer but better developed ideas? Creative-Pad provides a suitable platform to experiment with these different alternatives in the future.
In our framework, one way to empower the user to think freely is to identify clearly the different areas within the problem where creative thinking is needed, and then to develop separate creativity support tools for them. In advertising, there are two different sub-problems, namely getting an idea for the advertisement and developing the final advertisement itself. Creative-Pad was successfully developed for the former. In implementing the interface, we provide ample time for the users to develop their ideas. However, this is done in an ad hoc fashion; more study is needed to develop an interface that better suits the way creatives work. In summary, creative thinking is such a remarkable feat of the human mind that researchers attempting to develop tools to support such an endeavour must develop a multitude of approaches for experimentation. One such approach is presented in this paper, whereby the focus is on developing an ideation process and an interface which requires the user to do almost nothing except focus on generating creative ideas for the problem at hand. Using this approach, a tool for advertising creatives has been developed and tested successfully. The tool now provides a platform for future experimentation and discovery about creativity support tools in general, and creativity support tools for advertising in particular. Much more research needs to be carried out to establish whether these two desiderata are essential for all such tools or whether we need different desiderata for different tools. If the latter, what are they?

2010_22 !2010 Bisociative Knowledge Discovery for Microarray Data Analysis

Igor Mozetič1, Nada Lavrač1,2, Vid Podpečan1, Petra Kralj Novak1, Helena Motaln3, Marko Petek3, Kristina Gruden3, Hannu Toivonen4, Kimmo Kulovesi4
1 Jožef Stefan Institute, Jamova 39, Ljubljana, Slovenia {igor.mozetic, nada.lavrac, vid.podpecan, petra.kralj}@ijs.si
2 University of Nova Gorica, Vipavska 13, Nova Gorica, Slovenia
3 National Institute of Biology, Večna pot 111, Ljubljana, Slovenia {helena.motaln, marko.petek, kristina.gruden}@nib.si
4 Department of Computer Science, University of Helsinki, Finland {hannu.toivonen, kimmo.kulovesi}@cs.helsinki.fi

Abstract. The paper presents an approach to computational knowledge discovery through the mechanism of bisociation. Bisociative reasoning is at the heart of creative, accidental discovery (e.g., serendipity), and is focused on finding unexpected links by crossing contexts. Contextualization and linking between highly diverse and distributed data and knowledge sources is therefore crucial for the implementation of bisociative reasoning. In the paper we explore these ideas on the problem of analysis of microarray data. We show how enriched gene sets are found by using ontology information as background knowledge in semantic subgroup discovery. These genes are then contextualized by the computation of probabilistic links to diverse bioinformatics resources. Preliminary experiments with microarray data illustrate the approach.

1 Introduction

Systems biology studies and models complex interactions in biological systems, with the goal of understanding the underlying mechanisms. Biologists collect large quantities of data from wet lab experiments and high-throughput platforms. Public biological databases, like the Gene Ontology and the Kyoto Encyclopedia of Genes and Genomes, are sources of biological knowledge.
Since the growing amounts of available knowledge and data exceed human analytical capabilities, technologies that help in analyzing and extracting useful information from such large amounts of data need to be developed and used. The concept of association is at the heart of many of today's ICT technologies, such as information retrieval and data mining (for example, association rule learning is an established data mining technology [1]). However, scientific discovery requires creative thinking to connect seemingly unrelated information, for example by using metaphors or analogies between concepts from different domains. These modes of thinking allow the mixing of conceptual categories and contexts, which are normally separated. One of the functional bases for these modes is the idea of bisociation, coined by Arthur Koestler half a century ago [7]:

"The pattern . . . is the perceiving of a situation or idea, L, in two self-consistent but habitually incompatible frames of reference, M1 and M2. The event L, in which the two intersect, is made to vibrate simultaneously on two different wavelengths, as it were. While this unusual situation lasts, L is not merely linked to one associative context but bisociated with two."

Koestler found bisociation to be the basis for human creativity in seemingly diverse human endeavors, such as humor, science, and the arts. The concept of bisociation in science is illustrated in Figure 1.

Fig. 1. Koestler's schema of bisociative discovery in science ([7], p. 107).

We are interested in creative discoveries in science, and in particular in computational support for knowledge discovery from large and diverse sources of data and knowledge. To this end, we participate in the European FP7 FET-Open project BISON (http://www.bisonet.eu/), which investigates possible computational realizations of bisociative reasoning. The project is based on the following, somewhat simplified, assumptions:

- A bisociative information network (named BisoNet) can be created from available resources. A BisoNet is a large graph, where nodes are concepts and edges are probabilistic relations. Unlike semantic nets or ontologies, the graph is easy to construct automatically, since it carries little semantics. To a large extent it encodes just circumstantial evidence that concepts are somehow related, through edges with some probability.
- Different subgraphs can be assigned to different contexts (frames of reference).
- Graph analysis algorithms can be used to compute links between distant nodes and subgraphs in a BisoNet.
- A bisociative link is a link between nodes (or subgraphs) from different contexts.

In this paper we thus explore one specific pattern of bisociation: long-range links between nodes (or subgraphs) which belong to different contexts. More precisely, we say that two concepts are bisociated if:

- there is no direct, obvious evidence linking them,
- one has to cross contexts to find the link, and
- this new link provides some novel insight into the problem domain.

We have to emphasize that context crossing is subjective, since the user has to move from his ‘normal' context (frame of reference) to a habitually incompatible context to find the bisociative link [2]. In terms of Koestler (Figure 1), a habitual frame of reference (plane M1) corresponds to a BisoNet subgraph as defined by a user or his profile. The rest of the BisoNet represents different, habitually incompatible contexts (in general, there may be several planes M2).
The creative act here is to find links (m2) which lead `out-of-the-plane' via intermediate, bridging concepts (L). Thus, contextualization and link discovery are two of the fundamental mechanisms in bisociative reasoning as implemented in BISON. Finding links between seemingly unrelated concepts from texts was already addressed by Swanson [10]. Swanson's approach implements closed discovery, the so-called A-B-C process, where A and C are given and one searches for intermediate B concepts. In open discovery [16], on the other hand, only A is given. One approach to open discovery, RaJoLink [8], is based on the idea of finding C via B terms which are rare (and therefore potentially interesting) in conjunction with A. Rarity might therefore be one of the criteria for selecting links which lead out of the habitual context (around A) to known, but non-obviously related, concepts C via B.
In this paper we present an approach to the bisociative discovery and contextualization of genes which should help in the analysis of microarray data. The approach is based on semantic subgroup discovery (using ontologies as background knowledge in microarray data analysis) and the linking of various publicly available bioinformatics databases. This is ongoing work, in which some elements of bisociative reasoning are already implemented: creation of the BisoNet graph, identification of relevant nodes in a BisoNet, and computation of links to indirectly related concepts. Currently, we are expanding the BisoNet with textual resources from PubMed, and implementing open discovery from texts through BisoNet graph mining. We envision that the open discovery process will identify potentially interesting concepts from different contexts which will act as the target nodes for the link discovery algorithms. Links discovered in this way, crossing contexts, might provide instances of bisociative discoveries.
The currently implemented steps of bisociative reasoning are the following. The semantic subgroup discovery step is implemented by the SEGS system [14]. SEGS uses as background knowledge data from three publicly available, semantically annotated biological data repositories: GO, KEGG and NCBI. Based on the background knowledge, it automatically formulates biological hypotheses: rules which define groups of differentially expressed genes. Finally, it estimates the relevance (or significance) of the automatically formulated hypotheses on experimental microarray data. The link discovery step is implemented by the Biomine system [9]. Biomine weakly integrates a large number of biomedical resources, and computes the most probable links between elements of diverse sources. It thus complements the semantic subgroup discovery technology, due to the explanatory potential of additional link discovery and Biomine graph visualization. While this link discovery process is already implemented, our current work is devoted to the contextualization of Biomine nodes for bisociative link discovery.
The paper is structured as follows. Section 2 gives an overview of the five steps in exploratory analysis of gene expression data. Section 3 describes an approach to the analysis of microarray data, using semantic subgroup discovery in the context of gene set enrichment. A novel approach, a first attempt at bisociative discovery through contextualization that combines SEGS and Biomine (SEGS+Biomine for short), is presented in Section 4. An ongoing experimental case study is presented in Section 5. We conclude in Section 6 with plans for future work.
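To make the bisociation pattern concrete, the first two criteria can be checked mechanically on a small graph. The following is a minimal sketch (ours, not BISON code) using the networkx library; node names, contexts and edges are invented, and since the third criterion (novel insight) is inherently subjective, reachability across contexts stands in for it:

```python
import networkx as nx
from itertools import combinations

G = nx.Graph()
# Two habitually incompatible frames of reference (contexts) M1 and M2,
# joined only through a bridging concept; all names are invented.
G.add_nodes_from(["geneA", "geneB", "proteinL"], context="M1")
G.add_nodes_from(["disease", "drug"], context="M2")
G.add_edges_from([("geneA", "geneB"), ("geneA", "proteinL"),
                  ("proteinL", "disease"), ("disease", "drug")])

def bisociation_candidates(g):
    """Pairs meeting the first two criteria, with their bridging path."""
    for u, v in combinations(g, 2):
        if g.has_edge(u, v):
            continue  # criterion 1: no direct, obvious link
        if g.nodes[u]["context"] == g.nodes[v]["context"]:
            continue  # criterion 2: the link must cross contexts
        if nx.has_path(g, u, v):
            # criterion 3 (novel insight) is left to the human analyst
            yield u, v, nx.shortest_path(g, u, v)

for u, v, path in bisociation_candidates(G):
    print(f"{u} ~ {v} via {path}")
```

On a real BisoNet the interesting candidates are long-range links, so a path-length or path-probability cutoff would replace the simple reachability test used here.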
2 Exploratory gene analytics
This section describes the steps which support bisociative discovery, targeted at the analysis of differentially expressed gene sets: gene ranking, the SEGS method for enriched gene set construction, linking of the discovered gene sets to related biomedical databases, and finally visualization in Biomine. The schematic overview is in Figure 2.
Fig. 2. Microarray gene analytics proceeds by first finding candidate enriched gene sets, expressed as intersections of GO, KEGG and NCBI gene-gene interaction sets. Selected enriched genes are then put in the context of different bioinformatic resources, as computed by the Biomine link discovery engine. The '+' and '-' signs under Microarray data indicate over- and under-expression values of genes, respectively.
The proposed method consists of the following five steps:
1. Ranking of genes. In the first step, class-labeled microarray data is processed and analyzed, resulting in a list of genes, ranked according to differential expression.
2. Ontology information fusion. A unified database, consisting of GO6 (biological processes, functions and components), KEGG7 (biological pathways) and NCBI8 (gene-gene interactions) terms and relationships, is constructed by a set of scripts, enabling easy updating of the integrated database (details can be found in [12]).
3. Discovering groups of differentially expressed genes. The ranked list of genes is used as input to the SEGS algorithm [14], an upgrade of the RSD relational subgroup discovery algorithm [3, 4, 13], specially adapted to microarray data analysis. The result is a list of the most relevant gene groups that semantically explain differential gene expression in terms of gene functions, components, processes, and pathways as annotated in biological ontologies.
4. Finding links between gene group elements. The elements of the discovered gene groups (GO and KEGG terms or individual genes) are used to formulate queries for the Biomine link discovery engine. Biomine then computes the most probable links between these elements and entities from a number of public biological databases. These links help the experts to uncover unexpected relations and biological mechanisms potentially characteristic of the underlying biological system.
5. Gene group visualization. Finally, in order to help in explaining the discovered out-of-the-context links, the discovered gene relations are visualized using the Biomine visualization tools.
6 http://www.geneontology.org/ 7 http://www.genome.jp/kegg/ 8 ftp://ftp.ncbi.nlm.nih.gov/gene/GeneRIF/interaction sources
3 SEGS: Search for Enriched Gene Sets
The goal of gene set enrichment analysis is to find gene sets which form coherent groups and are distinguished from the rest of the genes. More precisely, a gene set is enriched if the member genes are statistically significantly differentially expressed as compared to the rest of the genes. Two methods for testing the enrichment of gene sets have been developed: Gene Set Enrichment Analysis (GSEA) [11] and Parametric Analysis of Gene Set Enrichment (PAGE) [6]. Originally, these methods take individual terms from GO and KEGG (which annotate gene sets), and test whether the genes annotated by a specific term are statistically significantly differentially expressed in the given microarray dataset.
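For illustration, the simplest such enrichment test compares membership in a term's gene set against differential expression in a 2x2 contingency table. Below is a minimal scipy sketch using Fisher's exact test (one of the evaluation procedures SEGS applies, as described in the next section); all gene identifiers and counts are invented:

```python
from scipy.stats import fisher_exact

def enrichment_p(term_genes, diff_expressed, all_genes):
    """One-sided Fisher's exact test for over-representation of
    differentially expressed genes inside a term's gene set."""
    term, de = set(term_genes), set(diff_expressed)
    a = len(term & de)                   # in term and diff. expressed
    b = len(term - de)                   # in term only
    c = len(de - term)                   # diff. expressed only
    d = len(set(all_genes) - term - de)  # in neither
    _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    return p

genes = [f"g{i}" for i in range(100)]       # invented genome
term = genes[:10]                           # genes annotated by one GO term
de = genes[:8] + genes[50:55]               # top of the ranked gene list
print(round(enrichment_p(term, de, genes), 6))  # small p => enriched set
```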
Fig. 3. Schematic representation of the SEGS method.
The novelty of the SEGS method, developed by Trajkovski et al. [12, 14] and used in this study, is that it does not only test existing gene sets for differential expression but also generates new gene sets that represent novel biological hypotheses. In short, in addition to testing the enrichment of individual GO and KEGG terms, the method tests the enrichment of newly defined gene sets constructed by the intersection of GO terms, KEGG terms, and gene sets defined by also taking into account the gene-gene interaction data from NCBI. The SEGS method has four main components:
- the background knowledge (the GO, KEGG and NCBI databases),
- the SEGS hypothesis language (the GO, KEGG and interaction terms, and their conjunctions),
- the SEGS hypothesis generation procedure (generated hypotheses in the SEGS language correspond to gene sets), and
- the hypothesis evaluation procedure (the Fisher, GSEA and PAGE tests).
The schematic workflow of the SEGS method is shown in Figure 3.
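To make the hypothesis language concrete, here is a minimal sketch (ours, not the SEGS implementation) of hypothesis generation by conjunction: candidate gene sets are built as intersections of a GO term, a KEGG term, and an interaction constraint. All annotation data are invented, and the real system enumerates far richer conjunctions over the integrated database:

```python
from itertools import product

# Invented annotations: term -> set of genes it annotates.
go = {"GO:'DNA replication'": {"g1", "g2", "g3", "g7"},
      "GO:'nucleus'":         {"g2", "g3", "g5", "g7"}}
kegg = {"KEGG:'Cell cycle'":  {"g2", "g3", "g4", "g7"}}
interacts_with = {"g2": {"g9"}, "g3": {"g4", "g9"}, "g7": {"g4"}}

def interact(gene_set):
    """Genes that interact with at least one member of gene_set
    (the NCBI gene-gene interaction component)."""
    return {g for g, partners in interacts_with.items() if partners & gene_set}

hypotheses = []
for (g_term, g_genes), (k_term, k_genes) in product(go.items(), kegg.items()):
    genes = g_genes & k_genes & interact(k_genes)
    if len(genes) >= 2:      # discard sets too small to test
        hypotheses.append((f"{g_term} & {k_term} & INTERACT({k_term})", genes))

for rule, genes in hypotheses:
    print(rule, "->", sorted(genes))
```

Each surviving candidate set would then be scored against the ranked gene list with the Fisher, GSEA and PAGE tests.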
4 SEGS+Biomine: Contextualization of genes
We made an attempt at exploiting bisociative discoveries within the biomedical domain by the explicit contextualization of enriched gene sets. We applied two methods that use publicly available background knowledge for supporting the work of biologists: the SEGS method for searching for enriched gene sets [14] and the Biomine method for contextualization by finding links between genes and other biomedical databases [9]. We combined the two methods in a novel way: we used SEGS for hypothesis generation in the form of interesting gene sets, and then formulated queries for Biomine for out-of-the-context link discovery and visualization (see Figure 4). We believe that by forming hypotheses with SEGS, constructed as intersections of terms from different ontologies (different contexts), discovering links between them with Biomine, and visualizing the SEGS hypotheses and the discovered links with the Biomine graph visualization engine, the interpretation of the biological mechanisms underlying differential gene expression becomes easier for biologists.
Fig. 4. SEGS+Biomine workflow.
In the Biomine project [9] (http://biomine.cs.helsinki.fi/), data from several publicly available databases were merged into a large graph, a BisoNet, and a method for link discovery between entities in queries was developed. In the Biomine framework nodes correspond to entities and concepts (e.g., genes, proteins, GO terms), and edges represent known, probabilistic relationships between nodes. A link (a relation between two entities) is manifested as a path or a subgraph connecting the corresponding nodes. The Biomine graph data model consists of various biological entities and annotated relations between them. Large, annotated biological data sets can be readily acquired from several public databases and imported into the graph model in a relatively straightforward manner. Some of the databases used in Biomine are summarized in Table 1.
Vertex Type           Source Database     Nodes      Degree
Article               PubMed              330,970    6.92
Biological process    GO                  10,744     6.76
Cellular component    GO                  1,807      16.21
Molecular function    GO                  7,922      7.28
Conserved domain      ENTREZ Domains      15,727     99.82
Structural property   ENTREZ Structure    26,425     3.33
Gene                  Entrez Gene         395,611    6.09
Gene cluster          UniGene             362,155    2.36
Homology group        HomoloGene          35,478     14.68
OMIM entry            OMIM                15,253     34.35
Protein               Entrez Protein      741,856    5.36
Total                                     1,968,951
Table 1. Databases included in the Biomine snapshot used in the experiments.
The snapshot of Biomine we use consists of a total of 1,968,951 nodes and 7,008,607 edges. This particular collection of data sets is not meant to be complete, but it is certainly sufficiently large and versatile for real link discovery.
5 A case study
In the systems biology domain, our goal is to computationally help the experts to find a creative interpretation of wet lab experiment results. In the particular experiment, the task was to analyze microarray data in order to distinguish between fast and slowly growing cell lines through the differential expression of gene sets responsible for cell growth. Table 2 gives the top rules resulting from the SEGS search for enriched gene sets. For each rule, there is a corresponding set of over-expressed genes from the experimental data.
Enriched Gene Sets
1. SLOW-vs-FAST: GO Proc('DNA metabolic process') & INTERACT(GO Comp('cyclin-dep. protein kinase holoenzyme complex'))
2. SLOW-vs-FAST: GO Proc('DNA replication') & GO Comp('nucleus') & INTERACT(KEGG Path('Cell cycle'))
3. SLOW-vs-FAST: . . .
Table 2. Top SEGS rules found in the cell growth experiment. The second rule states that one possible distinction between the slow and the fast growing cells is in genes participating in the process of DNA replication, which are located in the cell nucleus and which interact with genes that participate in the cell cycle pathway.
Figure 5 shows a part of the Biomine graph which links a selected subset of an enriched gene set to the rest of the nodes in the Biomine graph.
Fig. 5. Biomine subgraph related to five genes from the enriched gene set produced by SEGS. Note that the gene and protein names are not explicitly presented, due to the preliminary nature of these results.
The wet lab scientists have assessed that SEGS in combination with Biomine provides additional hints about what to focus on when comparing the expression data of cells. Additionally, such an in-silico analysis can considerably lower the costs of the in-vitro experiments with which wet lab researchers try to get a hint of a novel process or phenomenon. This is especially true in situations where, knowing only the final outcome, one cannot satisfactorily explain the drug effect, organ function, or disease: the gross yet important characteristics of the cells are hidden (they do not affect visual morphology) or cannot be recognized soon enough. A precondition for this approach is the wide accessibility and low cost of high-throughput microarray analysis, which generates appropriate data for in-silico analysis.
6 Conclusions
We presented SEGS+Biomine, a bisociation discovery system for exploratory gene analytics. It is based on the non-trivial steps of subgroup discovery (SEGS) and link discovery (Biomine). The goal of SEGS+Biomine is to enhance the creation of novel biological hypotheses about sets of genes. A prototype version of the gene analytics software, which enhances SEGS and creates links to Biomine queries and graphs, is available as a web application at http://zulu.ijs.si/web/segs ga/. In future work we plan to enhance the contextualization of genes with contexts discovered by biomedical literature mining. We will add PubMed article data to the BisoNet graph structure. To this end, we already have a preliminary implementation of software, called Texas [5], which creates a probabilistic network (a BisoNet, compatible with Biomine) from textual sources. By focusing on different types of links between terms (e.g., frequent and rare co-occurrences) we expect to get hints at unexpected relations between concepts from different contexts.
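Whether the underlying probabilistic network is built from biological databases (Biomine) or from texts (Texas), link discovery between two concepts can be approximated as a shortest-path problem: if edges carry independent probabilities, the most probable path maximises their product, which is equivalent to minimising the sum of -log(p) edge weights. A minimal networkx sketch of this reduction (ours; not the actual Biomine algorithm or API), with an invented graph:

```python
import math
import networkx as nx

B = nx.Graph()
B.add_edge("gene:g3", "protein:P1", p=0.9)
B.add_edge("protein:P1", "GO:cell_cycle", p=0.6)
B.add_edge("gene:g3", "article:pmid42", p=0.3)
B.add_edge("article:pmid42", "GO:cell_cycle", p=0.8)

for u, v, d in B.edges(data=True):
    d["w"] = -math.log(d["p"])       # product of p's -> sum of weights

path = nx.shortest_path(B, "gene:g3", "GO:cell_cycle", weight="w")
prob = math.prod(B[u][v]["p"] for u, v in zip(path, path[1:]))
print(path, round(prob, 3))          # most probable link, probability 0.54
```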
Our long-term goal is to help biologists better understand inter-contextual links between genes and their role in explaining (at least qualitatively) the underlying mechanisms which regulate gene expression. The proposed approach is a first step toward a computational realization of bisociative reasoning for creative knowledge discovery in systems biology.
7 Acknowledgements
The work presented in this paper was supported by the European Commission under the 7th Framework Programme FP7-ICT-2007-C FET-Open project BISON-211898, by the Slovenian Research Agency grants Knowledge Technologies and Systems Biology J4-2228, and by the Algorithmic Data Analysis (Algodan) Centre of Excellence of the Academy of Finland. We thank Igor Trajkovski for his previous work on SEGS, and Filip Železný and Jakub Tolar for their earlier contributions leading to SEGS.
2010_23 !2010 Domain Bridging Associations Support Creativity
Tobias Kötter, Kilian Thiel, and Michael R. Berthold
Nycomed-Chair for Bioinformatics and Information Mining, University of Konstanz, Box M712, 78484 Konstanz, Germany Tobias.Koetter@uni-Konstanz.de
Abstract. This paper proposes a new approach to support creativity through assisting the discovery of unexpected associations across different domains. This is achieved by integrating information from heterogeneous domains into a single network, enabling the interactive discovery of links across the corresponding information resources. We discuss three different patterns of domain-crossing associations in this context.
1 Data-driven Creativity Support
The amount of available data scientists have access to (and should consider when making decisions) continues to grow at a breath-taking pace. To make things worse, scientists increasingly work in interdisciplinary teams where information needs to be considered not only from one research field but from a wide variety of different domains. Finding the relevant piece of information in such environments is difficult since no single person knows all of the necessary details. In addition, individuals do not know exactly where to look or what to look for. Classical information retrieval systems enforce the formulation of questions or queries which, for unfamiliar domains or domains that are completely unknown, is difficult if not impossible. Methods that suggest unknown and interesting pieces of information, potentially relevant to an already-known domain, can help to find a focus or encourage new ideas and spark new insights. Such methods do not necessarily answer given queries in the way traditional information retrieval systems do, but instead suggest interesting and new information, ultimately supporting creativity and outside-the-box thinking. In [1] Weisberg stipulates that a creative process is based on the ripeness of an idea and the depth of knowledge. According to Weisberg, this means that the more one knows, the more likely it is that innovation is produced. According to Arthur Koestler [2], a creative act, such as producing innovation, is performed by operating on several planes, or domains of information.
In order to support creativity and help trigger new innovations, we propose the integration of data from various different domains into one single network, thus enabling us to model the concept of domain-crossing associations. These domain-bridging associations do not generate new hypotheses or ideas automatically, but aim to support creative thinking by discovering interesting relations between seemingly unconnected concepts, therefore helping to fuse diverse domains.
Fig. 1. Association vs. Bisociation
2 Bisociation and Bisociative Networks
The term bisociation was introduced by Arthur Koestler in [2] as a theory to describe the creative act in humor, science and art. In contrast to an association, representing a relation between concepts within one domain, a bisociation fuses the information in multiple domains by finding a (usually indirect) connection between them (see Fig. 1). Generally, a domain can be seen as a set of concepts from the same field or area of knowledge. A popular example of a bisociation is Isaac Newton's theory of gravity, which fused the previously separate Aristotelian two-world system of sub-lunar and super-lunar physics. Even though not all creative discoveries are based on bisociation, many of them have been made by associating semantically distant concepts. Once such a connection has been found, it is no longer an unexpected connection and frequently even turns into `common sense'. A quotation from Henri Poincaré also describes the combination of semantically distant concepts: `Among chosen combinations the most fertile will often be those formed of elements drawn from domains which are far apart... Most combinations so formed would be entirely sterile; but certain among them, very rare, are the most fruitful of all.'
In order to find bisociations, data from different domains has to be integrated. Bisociative Networks (BisoNets) [3] aim to address this problem by supporting the integration of both semantically meaningful information as well as loosely coupled information fragments. They are based on a flexible k-partite graph structure, which consists of nodes representing units of information or concepts and edges representing their relations. Each partition of a BisoNet contains a certain type of concepts or relations, e.g. terms, documents, genes or experiments. BisoNets model the main characteristics of the integrated information repositories without storing all the more detailed data underneath. By focusing on the concepts and their relations alone, BisoNets therefore allow huge amounts of data to be integrated.
3 Patterns of Bisociation
Once the information, in the form of concepts and relations, is combined in the network, it can be analyzed and mined for new, unexpected, and hopefully interesting pieces of information to support creative discoveries. One way of doing this is by identifying interesting patterns in the BisoNets. One class of patterns is bisociation. A formal definition of a bisociation in the context of BisoNets is the following1: `A bisociation is a link that connects concepts from two or more domains, which are unconnected depending on the specific view by which the domains are defined.'
1 Result from discussions within the EU FP7 Project BISON.
A domain in a BisoNet is a set of concepts. Depending on the view, a domain can either consist of concepts of one type, or bundle concepts of many types. So far, we have considered two different view types, one depending on the user's interest and a second depending on the applied graph analysis algorithms.
The first view creates the domain according to the user's specifications. Here the subjective view of the data plays an important role: fields of knowledge vary and hence are defined differently for each user. The second view is defined by the structure of the graph, e.g. the level of detail, and is extracted by a graph summarization or abstraction algorithm, leading to a user-independent view. Different types of such algorithmic views can be defined. Once the domains have been defined by a given view, the main part of a bisociation, the link that connects concepts from different domains, can be identified. A link can be a single concept, a subgraph or any other type of relation.
Fig. 2. Example of a Bisociative Network
Figure 2 depicts an example BisoNet. The view is the surrounding frame that defines the domains. Each domain is depicted in a different shade and contains concept types represented by dotted lines. The concepts and relations of the BisoNet are depicted as circles, and links connect concepts and their relations. An example of a bisociation that connects concepts from different domains is depicted by the bold path in the network. The different types of bisociation are described in more detail below.
Fig. 3. Bridging concept example
Fig. 4. Bridging graph example
Fig. 5. Example of structural similarity
Bridging Concepts. Bridging concepts are mostly ambiguous concepts or metaphors. In contrast to ambiguous concepts, which can lead to incorrect conclusions, metaphors can lead to new discoveries by connecting seemingly unrelated subjects. Bridging concepts are often used in humor [2] and riddles [4]. Bridging concepts connect dense subgraphs from different domains. Figure 3 depicts the homonym Ice as an example of a bridging concept: Ice is the name of a gene but also the name of the protein it encodes, so the concept belongs to both the gene and the protein domain.
Bridging Graphs. Bridging graphs are subgraphs that connect concepts from different domains. They lead to new insights by connecting domains that at first glance do not appear to have anything in common. An example of a bridging graph is the discovery Archimedes made while having a bath. As he got into the tub he noticed that the level of the water rose. By connecting the rise of the water level with the immersion of his body in the water, he realized that this effect could be used in general to determine the volume of a body; this is known today as Archimedes' Principle. A bridging graph could also connect two concepts from the same domain via a connection running through a previously unknown domain. Figure 4 depicts a bridging graph that connects several genes of the same domain via documents that all describe the same disease.
Structural Similarity. Bisociations based on structural similarity are represented by subgraphs of two different domains with a similar structure. This is the most abstract pattern of bisociation discussed here, and it potentially leads to new discoveries by linking domains that do not have any connection. Figure 5 depicts the structural similarity between a prodrug that passes the blood-brain barrier and the soldiers who pass the gate of Troy hidden in a wooden horse. In both scenarios the barrier can only be passed by altering the appearance of the intruder.
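As a concrete illustration of how such patterns might be mined (anticipating the betweenness-centrality suggestion in the conclusion below), the following minimal networkx sketch scores the nodes of an invented two-domain graph, loosely modelled on the Ice example of Fig. 3; a bridging concept connecting two dense subgraphs receives the highest betweenness:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("g1", "g2"), ("g1", "g3"), ("g2", "g3")])  # gene domain
G.add_edges_from([("p1", "p2"), ("p1", "p3"), ("p2", "p3")])  # protein domain
G.add_edges_from([("Ice", "g1"), ("Ice", "p1")])              # bridging concept

bc = nx.betweenness_centrality(G)
best = max(bc, key=bc.get)
print(best, round(bc[best], 2))   # 'Ice': every gene-protein path crosses it
```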
4 Conclusion and Future Work
In this paper we discuss a new approach that aims to support creative thinking, ultimately leading to new insights. Bisociative networks (BisoNets) provide an environment that fosters the curiosity to dig deeper into newly discovered insights by allowing users to discover new connections between concepts and bridging the gap between previously unconnected domains. We have discussed three different notions of bisociation: bridging concepts, bridging graphs and structural similarity. In addition to defining more patterns of bisociation, we will evaluate existing graph-mining algorithms to find different types of bisociations, such as betweenness centralities [5] to discover bridging concepts or minimum spanning trees [6] to identify bridging graphs. Structural similarity might be discovered by using role detection algorithms [7] or graph kernels that take the neighborhood of each node into account.
Acknowledgment
We would like to thank the members of the EU Bison project for many fruitful discussions during the development of BisoNets. The work presented in this paper was supported by a European Commission grant under the 7th Framework Programme FP7-ICT-2007-C FET-Open, project no. BISON-211898.
2010_24 !2010 Measuring Creativity in Software Development
C. Nelson, B. Brummel, D.F. Grove, N. Jorgenson, S. Sen, R. Gamble
University of Tulsa, 800 S. Tucker Drive, Tulsa OK, USA {courtney-nelson, bradley-brummel, dean-grove, noah-jorgenson, sandip, gamble@utulsa.edu}
Abstract. Creativity involves choosing to direct resources toward developing novel ideas. Information technology development, including software engineering, requires creative discourse among team members to design and implement a novel, competitive product that meets the usability, performance, and functional requirements set by the customer. In this paper, we present results that correlate metrics of creative collaboration with successful software product development in a Senior Software Projects class, a capstone course in accredited Computer Science programs. An idea management and reward system, called SEREBRO, provides measurement opportunities to develop metrics of fluency, flexibility, originality, elaboration, and overall creativity. These metrics incorporate multiple perspectives and sources of information into the measurement of creativity in software design. The idea management portion of SEREBRO is a Web application that allows team members to initiate asynchronous, creative discourse through the use of threads. Participants are rewarded for brainstorming activities that start new threads for creative discourse and for spinning new ideas from existing ones.
1 Introduction
Creativity can be understood and measured from a variety of perspectives. Two of the major approaches to measuring creativity are the psychometric approach and the confluence approach [1]. In this paper, we are concerned with creativity as a process, not as an individual difference. We follow Sternberg and Lubart's [2] investment model of creativity, which conceptualizes creativity as a decision that anyone can make, but that few people do make because of potential costs. We also incorporate Amabile's [3] componential model of creativity, which requires three critical components for creativity: domain-relevant skills, creativity-relevant processes, and intrinsic task motivation.
We have implemented a software tool called SEREBRO (Software Engineering REwards for BRainstorming Online) to promote creative discourse in a Senior Software Projects class, the capstone course in the Computer Science curriculum at the University of Tulsa. SEREBRO organizes students around creativity-relevant processes related to software engineering to motivate and assist teams of students in making creative software design and implementation decisions. Our primary interest, as presented in this paper, is examining whether encouraging the infusion of more creativity into the software development process results in more creative output with respect to the resulting software projects.
Essential cognitive elements for divergent thinking in the creative process include fluency, flexibility, originality, and elaboration [4, 5]. Fluency is the frequency of ideas generated. Flexibility involves seeing the problem in new ways or reconceptualizing the problem. Originality is devising novel ideas or syntheses of the problem. Elaboration is extending or fleshing out details. Creativity can be demonstrated in a specific part of a project or evaluated within the full project. Both objective and subjective metrics can be established to measure these elements in the creative process. The overall creativity encompassing the four cognitive elements can also be appraised subjectively. Our project uses both objective and subjective metrics of each element in the creative process to understand and foster creativity in collaborative software engineering projects.
Software engineering is the process of designing and developing a software product. It often involves a team-based approach that is oriented around specific tasks such as communication, planning, modeling, construction, and deployment [6]. The class undergoes training to acquire domain-relevant skills in each task to derive a set of work products, also called artifacts, which include customer requirements, documentation, designs, code, and packaging. Creativity-relevant processes manifest themselves in a variety of forms in software development, such as:
· Interface design to achieve the proper functional use and aesthetic appeal
· Combining reusable entities in novel ways to meet innovative product requirements
· Implementation of functionality that is new to the developer's skill set
· Novel strategies for code refinement to increase performance, usability, and security
· Inventive ways to describe how to use a product in the final packaging delivered to the customer
We address the issue of fostering creativity as part of collaborative software design in the context of a Senior Software Projects course. Since a team of students is involved, discussing what the novelty is and how it is placed in the product is as essential as the tangible design and implementation itself. We are most concerned with the creativity exhibited in the software development process, during which a number of intermediate artifacts are produced, as well as in the final software project and demonstration provided by the team at the end of the semester.
Fostering creativity in software engineering student deliberations poses unique challenges. First, it is uncommon for seniors in computer science to be forced to think creatively in their area of study. Though liberal arts and science courses have creative and critical thinking components, in a computing curriculum less attention is paid to creativity and "thinking outside the box."
Second, sharing and furthering ideas among team members in a collaborative fashion is very dependent on the openness of the team members, the time available to meet, and the different skill sets of the team members with respect to classifying requirements, analyzing a system design, and coding the various components (interface, processing, databases, etc.) that comprise the software. This problem has also been noted by information technology experts, leading to new software development processes such as agile development [7]. Third, creativity is often not directly rewarded. Even if ideas are explored, captured and used in idea management tools [8, 9], only the resulting artifacts are examined by the instructor (i.e. management) and the customer. Therefore, since the creative discourse is not credited to the participants and is not explicit as an artifact itself, there is limited extrinsic motivation to engage in it. To address these problems and provide a mechanism that supports, recognizes and fosters creativity, student teams use SEREBRO throughout the software development process. SEREBRO combines idea expression and management with a reward system designed to reinforce the usefulness of ideas and their contribution to creative discourse during the software development process [10]. Our objective is to explore the ability of SEREBRO both to capture and to enhance creativity. The outcomes of our experiments can in turn be used to improve future information technology development, including by introducing appropriate rewards.
2 Team Interaction on SEREBRO
Teams are an essential part of software development. Even when a single member is more creative or has a more advanced skill set, the success of the project requires the contribution of all members, especially within a small team. Therefore, our initial experiments focus on evaluating creativity at the team level. While there may be variances in the contributions to the overall project by each member of the team, our analysis ignores that variance in favor of comparing team creativity with overall performance and outcomes.
Team interaction on SEREBRO starts with the designation of a list of topics. These topics are loosely based on the tasks to be performed and the artifacts required for hand-in at each milestone. The software engineering class used for testing SEREBRO had four major milestones over which the artifacts were produced, culminating in a final product presentation and demonstration. Discussions on SEREBRO commence within a specific topic when any team member posts a brainstorm node. Upon reviewing the node, another team member can agree or disagree with the post and then input their own idea (in the case of a disagree node) or an enhancement of the idea (in the case of an agree node). We call the agree/disagree process spinning an idea. Multiple brainstorms can start the discourse within a single topic. Figure 1 shows an idea network with seven brainstorm nodes as blue circles and agree nodes as green triangles. Disagree nodes, though not included in this figure, are upside-down orange triangles. The box at the bottom shows the actual post that is seen when a node is moused over.
Figure 1: SEREBRO Idea Network
Discourse on SEREBRO is exhibited by a thread of postings, much like an online forum. A short thread from an actual team discussing a concept for a video game is shown in Figure 2. The students who are the focus of this experiment are highly computer literate and very familiar with this style of forum posting.
Asynchronous postings to SEREBRO, along with email alerts when posts are made, give students the freedom to work on the project from anywhere and at any time. Though face-to-face meetings are encouraged, they are not essential to the progress of the project, and this setup mimics distributed teams. Threads can be pruned when they are no longer considered valid. Those threads that embody essential ideas regarding a specific topic, artifact, or direction are finalized. Finalization in SEREBRO involves naming the emerging concept as a new topic, generally for the next milestone, and tagging all of the contributing ideas, including brainstorm, agree, and disagree nodes.
Figure 2: A Thread of Postings
To externally motivate creative contributions, SEREBRO implements a multiagent system of rewards [10]. The reward system provides an egalitarian method of point distribution based on the structure of the idea network generated. Each user within the system is assigned a user-agent that distributes reward points to users based on their creative contribution, following a specific protocol (see Section 3.1). In our current implementation, the agents reward according to the node type (brainstorm, agree, disagree, and finalized) and its position in the network, but not the node content (contextual information), in performing these point allocations. When users post with a certain node type, they effectively classify the node with a positive or negative rating relative to its parent node. Based on these classifications, each user agent propagates reward points through each idea thread to produce cumulative rewards for each user. Idea nodes tagged during the finalization of a concept for the next milestone receive additional rewards. As points are incrementally allocated, the instructor and SEREBRO generate rewards for the top-performing students. Since nodes are classified by the participants and point allocations depend on the length and structure of each thread, we expect the resulting point totals to correlate with expressed creativity. We hypothesized that such grounded and prudent external rewards, relying on peer feedback about creative contributions, would motivate further creative expression from team members to attain a better team product. To ensure adequate participation within SEREBRO, students are encouraged to attain a certain level of reward points per milestone in the project.
The targeted class comprised six teams with a minimum of three and a maximum of four members. Each member had a role on the team, such as lead, analyst, or programmer, which gave him or her specific responsibilities. These roles were negotiated by the team members when the teams were formed in the first semester of the course [11]. Three non-trivial projects were defined and their requirements were passed to the teams by the product customers. Thus, two teams competed to create the best version of each product to meet the requirements. The first project was Chemistry Lab Creator and Submission software that allowed the instructor to create and post a lab assignment and students to complete the assignment and submit it online. The second was an Online Emergency Center that provided information on gas, food, and shelter during a disaster. The third project was a game to be used as part of a recruiting package for computer science students choosing a university. Initial requirements were sparse, forcing the teams to work with the customer and delve into the domain of application.
In addition to the software artifacts produced, each team had to devise a novel product name and logo.
3 Methods of Creativity Measurement
We measure team creativity from multiple perspectives and sources of information about the software project as it is developed. We seek to determine the degree to which the discourse among the team is creative. This approach reveals unique information on the creativity of the overall software design process. We describe each of these sources and perspectives with respect to SEREBRO's contribution toward measuring creativity, thereby understanding certain differences in how creativity emerges in software design.
3.1 SEREBRO Threads
Team postings to SEREBRO are examined as consolidated threads related to a topic. Each thread stems from a single brainstorm node. A topic can have multiple evaluated threads, depending on the number of brainstorm nodes that start the various conversations within the topic. Therefore, threads form the basis for creativity analysis and are rated by experts at the end of the semester for the type and level of creativity.
Points. SEREBRO assigns reward points to each individual and team for their creative input based on the number and type of node of each thread posting. The node types are designed to align with the elements in the creative process. Fluency can be measured by the number of brainstorming nodes, flexibility by the fan-out of the brainstorming nodes, and originality and elaboration by the fan-in to spinning nodes. An in-depth examination of the idea management tool can be found in [10]. Points are allocated through the agent-based protocol, which has the following four basic rules. When a user replies with an Agree node to another post, the agent allocates k points to the author of the parent node. Similarly, when a user disagrees with its parent node, the parent node's author is charged k points. In order to reward the progenitors of the thread tree, when an agent receives points at a node, it passes (½ * k) points to its parent node. This process effectively propagates the points throughout local areas of the idea network, yet discounts the reward as the distance from the node being considered increases. This rule also applies to the reward distributed at a Finalized node. Finalized nodes represent an accumulation of ideas that have been implemented within the project or identified as creative and useful to the project. Therefore, rewards at Finalized nodes are magnified by a factor m. This variable factor allows the instructor or project lead to adjust the importance of Finalized nodes within the reward system. These (k * m) points are also distributed to other nodes that are tagged with the same tag as the Finalized node. Thus, users directly correlate nodes that are related in content but not necessarily by distance in the network. In combination, these rules distribute points throughout an idea thread, concentrating point totals on nodes that are well received by other users on the team. A user who authors positive nodes that are accepted by the team gains a higher point total than those who do not participate or do not create novel content. We hypothesize that the users with higher reward point totals have contributed more to the creativity of the group. For our experiments, k = 1 and m = 2. In our reported experiment, team point totals ranged from 1520 to 2210 with an average of 1857.
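The four rules can be summarised in a short sketch. This is our reading of the protocol, not the SEREBRO implementation: details such as whether a Finalized node credits its own author, and how far the half-point propagation recurses, are assumptions (here it recurses to the thread root with geometric discounting), and the thread data is invented:

```python
K, M = 1, 2          # experiment settings reported above

class Node:
    def __init__(self, author, kind, parent=None, tags=()):
        self.author, self.kind = author, kind
        self.parent, self.tags = parent, set(tags)
        self.points = 0.0

def credit(node, amount):
    # Award points at a node, passing half upward so that thread
    # progenitors receive geometrically discounted credit (rule 3).
    if node is not None:
        node.points += amount
        credit(node.parent, amount / 2)

def score(nodes):
    for n in nodes:
        if n.kind == "agree":        # rule 1: parent's author gains k
            credit(n.parent, K)
        elif n.kind == "disagree":   # rule 2: parent's author is charged k
            credit(n.parent, -K)
        elif n.kind == "finalized":  # rule 4: magnified reward, also sent
            credit(n, K * M)         # to every node sharing a tag
            for other in nodes:
                if other is not n and other.tags & n.tags:
                    credit(other, K * M)
    totals = {}
    for n in nodes:
        totals[n.author] = totals.get(n.author, 0.0) + n.points
    return totals

b  = Node("alice", "brainstorm", tags={"ui"})
a1 = Node("bob",   "agree", parent=b)
a2 = Node("carol", "agree", parent=a1, tags={"ui"})
f  = Node("dana",  "finalized", parent=a2, tags={"ui"})
print(score([b, a1, a2, f]))
```

On this toy thread the brainstorm author accumulates the largest total, matching the stated intent that progenitors of well-received threads are rewarded most.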
Expert Ratings. Each SEREBRO thread for each team was rated by five subject matter experts (SMEs) on fluency, flexibility, originality, elaboration, overall creativity, and the relevance of the thread to the development of the project. The subject matter experts were faculty members and graduate students involved with the development of the SEREBRO program. Threads indicated as irrelevant by at least one of the raters were removed from further rating; these included reposts of the same content and tests of the system. Though the sample size of the teams was small, a substantial number of threads was generated for rating. Overall, the SMEs showed sufficient levels of agreement across raters, with internal consistencies (Cronbach's alpha) on the ratings of each area of creativity ranging from 0.83 for originality to 0.89 for fluency. Consistency ratings (ICC single-measure scores) for each thread ranged from 0.49 on flexibility to 0.63 on fluency. This statistic indicates the proportion of variance in each of the ratings across the five raters that can be attributed to variance in the construct being rated. The scores for each thread were averaged across the teams and the raters to calculate the scores assigned to each team.
The second step in the analyses was to examine the correlations among the different types of creativity expressed in the threads. The correlations between specific creativity metrics and overall creativity were quite high, indicating that threads were rarely rated highly on one metric of creativity and not on the others. This result was expected and may have been influenced by the prevalence of threads with few entries, which were typically rated low on all components of creativity. The highest correlation was between originality and overall creativity (r = 0.95): essentially, the more original a thread was rated, the higher it was rated on creativity. This was followed by elaboration (r = 0.91). Fluency and flexibility (both r = 0.88) had the same correlation with rated creativity, and they predicted creativity to a lesser extent than the other two elements. The lack of differentiation between types of creativity within threads indicated that the raters typically judged threads to be creative or not, rather than, say, more fluent than flexible. Due to these results, further analyses relied only on the overall SME ratings of creativity. Team overall creativity ratings ranged from 1.99 to 3.09 on a five-point scale from Poor to Excellent.
3.2 Final Projects and Demonstrations
At the final demonstrations of the team projects, each team was asked to evaluate their own team and the other teams on the overall creativity and overall quality of the projects. Additional survey questions addressed the perception of the creative process within each software team. Figure 3 indicates the different evaluations performed.
Team Colleague Ratings. Teams rated both the overall quality and the creativity of their own and other project teams based on their perceptions and understanding of the demonstrations of the final projects. The course instructor also performed the same ratings at the final demonstrations. Creativity ratings ranged from 3.18 to 4.84 and quality ratings ranged from 3.41 to 4.76. In general, teams with higher quality ratings received high creativity ratings as well. The highest-rated quality team, however, was second on creativity.
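For concreteness, the inter-rater consistency reported for the SME thread ratings above can be reproduced from a thread-by-rater matrix. A minimal numpy sketch of the Cronbach's alpha statistic (the rating data below is invented; rows are threads, columns are the five raters):

```python
import numpy as np

def cronbach_alpha(ratings):
    """alpha = k/(k-1) * (1 - sum of per-rater variances / variance of totals)."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                      # number of raters
    rater_vars = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - rater_vars / total_var)

threads = [[2, 3, 2, 3, 2],
           [4, 4, 5, 4, 4],
           [1, 2, 1, 2, 2],
           [3, 3, 4, 3, 3]]
print(round(cronbach_alpha(threads), 2))      # high agreement on this toy data
```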
Team Grades. The course instructor graded teams at each milestone on a particular set of artifacts to provide feedback and assess progress. Final team grades ranged from 78 to 94 with a mean grade of 89.
Survey Questions. Each team's own experiences were further investigated through a final survey asking about team performance and experiences with the SEREBRO program. This survey included questions about each of the creativity metrics with respect to the team's final project. The team members evaluated their own team on a 1 to 5 scale with anchors of Poor, Neutral, Good, Very Good, and Excellent. These ratings constitute a reference-shift approach to team-level variables, in which individual scores about the team are combined to represent a team-level construct [12].
Figure 3: Perspectives and Sources of Information in the Creativity Metrics
4 Overall Results and Discussion
Table 1 displays the average scores and standard deviations of the various ratings. Table 2 displays the correlations between the creativity metrics.
Table 1: Average Scores and Standard Deviations
        SEREBRO  Expert   Other Team   Other Team  Own Team    Own Team  Team
        Points   Ratings  Creativity   Quality     Creativity  Quality   Grades
Mean    1857     2.43     3.85         3.08        3.98        4.11      89.10
SD      270      0.48     0.41         1.09        0.60        0.52      5.52
The correlations among the various creativity metrics reveal both substantial agreement across the metrics and differentiation between project creativity and quality. Because only 6 teams were involved in this project, none of the correlations at the team level are statistically significant at the p < 0.05 level, but they do reflect a general level of agreement on the rank order of team creativity and quality across the various metrics. These correlations are still the best estimate of the relationships between these measures of team creativity and quality. Specifically, SEREBRO point totals were related to all other measures of both team creativity and quality. Expert ratings were able to distinguish between creativity and quality, as only the correlations with other measures of creativity (including team grade) were positive and of medium size. Creativity demonstrated by the process metrics of SEREBRO points, expert ratings, and team creativity was also demonstrated in the final project, as indicated by the other teams' ratings of creativity as well as by the team grades, as shown in Table 2. Overall these results provide promising directions for using these metrics to continue to develop and measure creativity in computer science courses and software development teams.
Table 2: Correlations among team-level creativity metrics
                              1.     2.     3.     4.     5.     6.
1. SEREBRO Points
2. Expert Ratings             0.43
3. Presentation Creativity*   0.53   0.47
4. Presentation Quality*      0.60  -0.04   0.81
5. Team Grades                0.60   0.43   0.53   0.33
6. Team Creativity            0.74   0.32  -0.02   0.25   0.42
7. Team Quality               0.64  -0.11  -0.03   0.54   0.86   0.62
*Includes results for only 5 teams. One team did not demonstrate their project on the assigned day.
5 Conclusion and Future Work
Ideas and novel software properties can take many forms in software engineering. Students who are not accustomed to conversing over creative material have difficulty expressing their own ideas in a group setting, such as the one chosen for experimentation. SEREBRO is a software tool that includes both idea management and motivational rewards to foster creativity within a team developing a non-trivial software product.
The results discussed in this paper indicate that SEREBRO performed well with respect to providing insight into creativity in a Senior Software Projects class, as well as into the correlation of product and team criteria with creativity metrics. The positive outcomes of the research are driving the expansion of SEREBRO with additional software process tools. We are examining different reward mechanisms to determine whether certain protocols may be more suitable than others with respect to an individual's role on the team, his or her skill set, team size, and general attitudes about the class and about exposing creative ideas. Further research into the creative process and the team's perception of that process is currently taking place with a new set of students in the Software Projects class setting. In addition, content-based assessment of SEREBRO nodes is being undertaken during the software development process.
Acknowledgments. This material is based upon work supported by the National Science Foundation under Grant No. 0757434. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
2010_25 !2010 Clap-along: A Negotiation Strategy for Creative Musical Interaction with Computational Systems
Michael Young and Oliver Bown
1 Music Department, Goldsmiths, University of London, New Cross, London SE14 6NW, UK m.young@gold.ac.uk
2 Centre for Electronic Media Art, Monash University, Clayton 3800, Australia oliver.bown@infotech.monash.edu.au
Abstract. This paper describes Clap-along, an interactive system for theorising about creativity in improvised musical performance. It explores the potential for negotiation between human and computer participants in a cyclical rhythmic duet. Negotiation is seen as one of a set of potential interactive strategies, but one that ensures the most equitable correspondence between human and machine. Through mutual negotiation (involving listening/feature extraction and adaptation) the two participants attempt to satisfy their own and each other's target outcome, without knowing the other's goal. Each iteration is evaluated by both participants and compared to their target. In this model of negotiation, we query the notion of `flow' as an objective of creative human-computer collaboration. This investigation suggests the potential for sophisticated applications for real-time creative computational systems.
1 Introduction
Music performance is a creative, `real-time' process that often entails collaborative activity. Creative potential in a performance (i.e. the opportunity for individual or collective actions that appear innovative) is contingent on context and style, the presence of a priori agreements (whether explicit or tacit), and other cultural and procedural elements [1]. Various aspects of performance practice constrain or afford opportunities for immediate creative input, such as recourse to pre-existing materials, the relative emphasis on individual responsibility and the means by which information is exchanged while performing. In human-computer performance the capacities of software to exchange information with other participants and to take responsibility (i.e. act autonomously) are highly significant factors. Any given approach to these factors impacts greatly on the performance practice as a whole. Free collective improvisation is an effective testing ground for computer-based creativity.
The computer must be able to produce sonic events that appear intrinsically valid, and it must be able to collaborate appropriately (responsively and proactively) with one or more human musicians. Both properties satisfy a working definition of computational creativity, such as the ability to exhibit behaviour `which would be deemed creative if exhibited by humans' [2] as well as being `skillful, appreciative and imaginative' [3]. These criteria are only truly satisfied if the system is not directly reliant on a human `user'; there must be an equitable correspondence of ideas between collaborators. Such challenges are not easily met, but at least they apply equally well to human-only improvisation, where the behaviour of one performer would never be expected to depend entirely on another's contribution, or to depend on rules agreed in advance. Ideally at least, group improvisation avoids organisational procedures that determine or influence musical content, structure or the interactions and mutual dependencies between performers. Implicit procedures may develop through a process of negotiation in performance. As in other forms of process-orientated art, this process may not be directed towards a known outcome. Rather, the process itself forms both the central problem and the focus of interest that both enables and constitutes the performance. We investigate negotiation as a musical process in a wider context of interaction strategies that can demonstrate performance-based computational creativity. We devise a simple system, Clap-along, defined by a number of constraints, which we believe demonstrates the challenges of performative, computational creativity and offers a promising model for future, more elaborate computational systems for interactive musical performance.
2 Interaction strategies
We regard negotiation as a specific strategy for human-machine interaction, a member of a larger set that includes the following list. (The terms `source' and `result' are used to refer to actions in an asymmetric relationship, and may apply equally to human and machine depending on context.)
Shadowing: The source and result move together. There is a clear temporal simultaneity that produces layering within a coordinated motion. The coordination of motion between a body and its shadow is simultaneous but also distorted, because the shadow is projected into a different `geometrical' space. Various musical strategies for textural organisation (homophony, micro-polyphony) entail shadowing. Real-time digital effects and simple interaction systems commonly exhibit shadowing methods. Timbral matching techniques can be thought of as employing this strategy, as found in the Cata-RT [4] and Soundspotter systems [5]. The system as a whole may be weakly or strongly integrated, but in general interactivity is likely to be readily verifiable to participants and listeners.
Mirroring: The source is reflected in the result. Synchronicity is not required. There is a more elaborate re-interpretation of information received from the source than in shadowing, and this is more telling in the absence of temporal synchrony. Innumerable compositional and improvisational approaches are analogous to mirroring, to be found in structural repetition, motivic development and call-and-response strategies. Delay effects are elementary mirrors, projecting back an image of the musical source. Systems that seek to analyse an existing style to generate music (e.g.
by Markov modelling) are mirroring at a high level of musical organisation, as in the Continuator system [6], even though the method may be sub-symbolic. Mirroring can establish a cohesive musical identity between source and result, and may also be readily verifiable to participants as an interactive process.
Coupling: Two sources are distinct but connected. There is (or appears to be) mutual influence, but the roles of `source' and `result' may be unstable and unequally balanced. In live performance, coupling can constitute `procedural liveness' and/or `aesthetic liveness' [7]. For instance, coupling can be trivial and procedural, when one system controls another and receives appropriate feedback, as in the laptop-as-instrument paradigm. Or it can be illusory and `aesthetic', as in the (increasingly rare) genre of music for live instrument with tape, where binding and apparently causal links between the performer and tape are in reality entirely controlled and pre-determined. More abstract couplings that use virtual modelling and dynamical systems offer a more open-ended and genuine relationship between sources (agents), potentially integrating the procedural and the aesthetic. Examples include music systems where sources share a virtual environment, as in the Kinetic Engine [8] and Swarm Granulator [9]. In coupling, equal relationships are possible, but verification of the degree of true interaction is problematic.
Negotiating: The roles of `source' and `result' are conflated. Participating elements have equal status and are engaged in a series of transactions. Negotiation treats performer and computational system as equal in status and overall approach to the interaction. It can be seen as a unified system based on equivalence, and this contrasts with the categories above, which suggest an unequal architecture of source and result, in which the most likely scenario is a musician (source) acting upon a computational system (result). According to OED definitions3, `negotiation' refers to transactions directed towards an objective:
1. To communicate or confer (with another or others) for the purpose of arranging some matter by mutual agreement; to discuss a matter with a view to some compromise or settlement
2. To arrange for, achieve, obtain, or bring about (something) by negotiation
3. To find a way through, round, or over (an obstacle, a difficult path, etc.)
3 Accessed online, 18th September 2009
To negotiate, participants engage in a series of transactions that are guided by local goals directed towards an individually desired outcome (expectation). The transactions can be understood to involve two mutually informing strategies, one externalised, the other internal: `action' and `description' [10]. Actions executed within a system may instigate further changes of state in the system. An assessment of these changes (especially in relationship to anticipated or desired changes) forms a description of action-outcome in the system. This empirical description may inform the next action. If so, a cyclical process of experiential accumulation develops, as the total description becomes more detailed and complex. In a pre-determined and constrained context, this accumulation might be understood as a straightforward `making sense of it'. But in a more process-orientated context there may be an emergent formulation of knowledge that is not external to the system (i.e. the system is not pre-determined).
Hamman [10] uses Foucault's term `episteme' to describe interactive processes that are "immanent in the very particularity of the thoughts, actions, and descriptions made with respect to a hypothesised object of interaction" (p. 95). This is an open-ended process of negotiation, orientated by variable or uncertain expectations. Local goals (intentions) might change as a product of the interactive transactions underway. Desired outcomes (expectations) need not be static either, so the OED characterisation of a "compromise or settlement" may remain theoretical and notional. A reciprocal negotiation entails a degree of equivalence between human and machine capacities to act and formulate descriptions. Both participants must form their own descriptions of the system that incorporates the other participant. Both must be able to modify their actions, short-term goals and intentions, given new information. In other words, they should also be able to formulate an expectation about the overall musical output and modify their contribution, given the other's, in order to best satisfy the expectation. Verification that transactions are underway is, in itself, a part of this process, but the accuracy or efficacy of any "description" is not relevant to the fact that negotiation occurs.

We regard "optimal flow" [11] as a directly relevant but problematic concept. Flow is the human enjoyment derived in undertaking a task that becomes autotelic, achieved when there is an optimal level of difficulty relative to the skills of the subject. This balance requires the subject to form an internal description of the task's demands and an assessment of his/her skills in meeting them. Flow has been explored in human-computer interaction [12] and in the creative process [13], including group-based creativity [1]. Particularly in the case of creativity, Csikszentmihalyi notes a number of factors contributing to flow, some of which could be modelled (clarity of goals, availability of immediate feedback) and others perhaps not (no worries of failure, no distractions, etc.) [13]. Whereas this might describe a positive and productive psychological state, flow perhaps does not take fully into account other facets of creativity, or for that matter the experience of negotiating: the role of randomness, unpredictability and happenstance [14], the use of haphazard trial and error, periods of incubation and, subsequently, innovative "behavioural mutations" [15]. More emotively, consider the pursuit of the impossible, the thin borderline between absorption and obsessive compulsion [15], and perhaps resultant periods of boredom or frustration. Flow describes a settled state that may be too effortless in itself to be central in establishing creativity. So we attempt to avoid an easily achievable sense of `flow' in designing the Clap-along system.

3 Implementation of Clap-along

Our aims in Clap-along are to explore process, expectation and verification in a negotiation-based system, and to consider how these elements might ultimately be extended to produce more aesthetically complex and musically valid results. Negotiation occurs both in a feature space and in the foreground surface of actual rhythms. Clap-along is a duet system for human-computer interaction. Both participants produce continuous, synchronised 4-bar clapping patterns in 4/4. The musical context is as minimal as we can conceive: a fixed tempo, a fixed metrical structure, and single sound events quantised to beats.
In any loop instance n there is a human clapping pattern, Hn, a computer pattern, Cn, and a composite of the two patterns, Rn. A feature set, Fn, is extracted from Rn and compared by the system to a target feature set Tn, and this comparison is the basis for the next iteration. Rn is the reality of the current state, and Tn represents an expectation that is unknown to the human performer. The computer maintains patterns as a sequence of 0s and 1s, representing either a rest or a clap on each beat. The initial pattern C0 is generated randomly. Machine claps are generated from a sample bank of human clap recordings that have some natural variation in sound; this offers some semblance of human clapping. The human performer claps into a microphone, allowing the system to build a second binary sequence that represents the human's clapping rhythm, Hn. Human claps are quantised by rounding down to the previous beat, unless within 200 ms of the following beat, in which case they are rounded up. The initial expectation, T0, is obtained from a randomly created composite rhythm Rtarget that is immediately discarded.

At the end of each 4-bar pattern, the system takes the composite of the two rhythms, Rn, and calculates a feature set Fn that forms a minimal internal representation of the musical output. The four features used in this version are:

- density: the total number of claps as a fraction of the maximum possible number.
- homophony: the number of coincident claps as a fraction of the maximum possible value.
- position weighting: the normalised average position in the cycle over all claps.
- clumping: the average size of continuous clap streams as a fraction of the maximum possible value.

In this multi-dimensional feature space, the system calculates the Euclidean distance between Fn and the target feature set Tn. If the distance exceeds a pre-defined threshold, this is deemed to indicate a significant musical difference between reality and the expectation. We use a threshold s of 0.001 for satisfaction of expectation, measured in the feature space where each feature was normalised to the range [0,1]. To create Cn+1 the system runs a generate-and-test loop, producing 20 variations of Cn. Variations were generated by flipping each bit in the rhythmic representation with a probability of 0.1. Each variation is combined with the previous human pattern Hn to produce a candidate composite rhythm R′n, with features F′n. The pattern with the nearest features to the target is chosen as Cn+1.

The human performer is invited to negotiate with the system in a comparable way. As each loop occurs, the performer contributes Hn to the total pattern Rn and assesses the machine contribution. He/she may introduce a modification to the next contribution Hn+1. Any modification might be experimental and pseudo-random. Alternatively, it may constitute an intentional action based upon his/her internal description of how the two contributions are co-dependent, and so contribute to developing a better understanding of the expectation, the target point Tn. This implementation could in theory allow the performer and machine to quickly settle upon a rhythm that satisfies the target Tn, so any further changes would need to be entirely elective; this might be likened to a state of `optimal flow'.
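As a concrete summary of the machine's side of this update cycle, the following minimal Java sketch implements the quantisation rule, the four features and the generate-and-test loop described above. All class and method names are our own, and the composite rhythm is assumed to be the bitwise OR of the two patterns, with position weighting normalised by the last slot index; the paper does not pin down these details.

```java
import java.util.Random;

// A sketch of the Clap-along update step over 4 bars of 4/4 (16 beat slots).
public class ClapAlong {
    static final int SLOTS = 16;
    static final Random RNG = new Random();

    // Quantise a clap onset (ms since the bar-cycle start) to a beat index:
    // round down, unless within 200 ms of the following beat.
    static int quantise(double onsetMs, double beatMs) {
        int beat = (int) (onsetMs / beatMs);
        double toNextBeat = (beat + 1) * beatMs - onsetMs;
        return toNextBeat < 200 ? beat + 1 : beat;
    }

    static double density(int[] r) {                 // claps / max possible
        int claps = 0;
        for (int b : r) claps += b;
        return claps / (double) SLOTS;
    }

    static double homophony(int[] h, int[] c) {      // coincident claps / max
        int hits = 0;
        for (int i = 0; i < SLOTS; i++) hits += h[i] & c[i];
        return hits / (double) SLOTS;
    }

    static double positionWeighting(int[] r) {       // normalised mean position
        int claps = 0, sum = 0;
        for (int i = 0; i < SLOTS; i++) { claps += r[i]; sum += r[i] * i; }
        return claps == 0 ? 0.0 : (sum / (double) claps) / (SLOTS - 1);
    }

    static double clumping(int[] r) {                // mean clap-run length / max
        int runs = 0, len = 0, total = 0;
        for (int b : r) {
            if (b == 1) len++;
            else if (len > 0) { runs++; total += len; len = 0; }
        }
        if (len > 0) { runs++; total += len; }
        return runs == 0 ? 0.0 : (total / (double) runs) / SLOTS;
    }

    // Feature set Fn of the composite Rn (assumed here to be H OR C).
    static double[] features(int[] h, int[] c) {
        int[] r = new int[SLOTS];
        for (int i = 0; i < SLOTS; i++) r[i] = h[i] | c[i];
        return new double[] { density(r), homophony(h, c),
                              positionWeighting(r), clumping(r) };
    }

    static double distance(double[] f, double[] t) { // Euclidean distance
        double d = 0;
        for (int i = 0; i < f.length; i++) d += (f[i] - t[i]) * (f[i] - t[i]);
        return Math.sqrt(d);
    }

    // Generate-and-test: 20 mutants of Cn (each bit flipped w.p. 0.1); keep
    // the one whose composite with Hn lands nearest the target features Tn.
    static int[] nextComputerPattern(int[] c, int[] h, double[] target) {
        int[] best = c;
        double bestDist = Double.MAX_VALUE;
        for (int v = 0; v < 20; v++) {
            int[] cand = c.clone();
            for (int i = 0; i < SLOTS; i++)
                if (RNG.nextDouble() < 0.1) cand[i] ^= 1;
            double d = distance(features(h, cand), target);
            if (d < bestDist) { bestDist = d; best = cand; }
        }
        return best;
    }
}
```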
To avoid this settled state, and to ensure a continuing creative negotiation, we introduce an additional device that keeps the negotiation open-ended: if the distance between Fn and Tn is under the threshold s, i.e. if reality is sufficiently close to the expectation, the system introduces random variation to its expectation to produce Tn+1, with mutation along each feature dimension drawn from a Gaussian distribution. In other words, as the contributing rhythms approach the target, and as the human performer has developed a relatively accurate set of descriptions about the system (based only on the musical surface), the expectation changes. As the feature description Fn is obtained from the composite of both rhythms, both the machine and human performers have a potential role in initiating this change of expectation, whether deliberately or unwittingly, requiring a change of descriptions and resultant actions. A continually diverging system is possible, fostering a mutual creative negotiation that avoids `optimal flow'.

4 Evaluation in performance

The system awaits testing with a number of human collaborators, to build upon informal, proof-of-concept tests undertaken by the designers. It is evident that, with care, a performer can induce a stable and sustainable behaviour in the computer. For the human performer, there are a number of common-sense actions to be attempted. For any rhythmic cycle Rn, possible actions include:

1. varying: an intuitive rhythmic variation of a previous pattern.
2. matching: attempting to follow Cn homo-rhythmically.
3. repeating: so that Hn+1 = Hn.
4. complementing: an attempt to insert events or gaps that mirror the system pattern in Cn or remembered from Cn−1.
5. parsing: rhythmic patterns c, where c is a substring of C, that are either complementary or matching parts of Rn−1.

For each of these possible actions, outcomes are heard in the next outputted rhythm. So the performer attempts to form a description of the system based on how this new pattern deviates from the last. Upsetting the system with a marked variation of pattern (action 1) can cause unstable changes in the output. In this event the system's attempt to update its behaviour in any progressive way is frustrated. Consequently, the performer struggles to verify logical interaction. Clapping the exact same pattern repeatedly (including not clapping at all) and matching patterns (action 2 or 3) can cause the system to slowly evolve its output towards the expectation, allowing a more accurate description of the system to develop, ultimately initiating a change in expectation (feature target point). However, since the features specified are not independent, it cannot be guaranteed that a given expectation can actually be achieved in the musical foreground. In this case the system is stuck and, to move on, requires a more radical intervention from the performer.

These scenarios are thought experiments as much as real tests. We intend to look at how different performers go about negotiating with these kinds of simple `black boxes' under different scenarios, and how strategies to deal with the computer's behaviour are developed. Further development of the system could involve additional, or more effective, feature extraction, to provide a richer feature space. Use of alternative methods, such as autocorrelation, would allow an open definition of the length of any given loop, currently defined as 16 units, and the system could be expanded to involve expressive timing and other more complex rhythmic features. Future systems could offer a more sophisticated and rich musical environment that incorporates many other elements of musical organisation beyond metrical rhythm.
In all cases, however frustrating or rewarding, these procedures manipulate the expectations and actions of human and machine performers alike. An unresolvable negotiation is fostered. This points to more complex, and possibly creatively valid, negotiation processes that could produce real-time, computational performances of manifest interest and integrity.

5 Conclusion

We have outlined a minimal musical context for investigating computational creativity in improvised performance. We have offered a framework for interaction between human and machine that comprises four categories: shadowing, mirroring, coupling and negotiating. We adopt a critical approach to the notion of `optimal flow' in creative interaction. We have developed a test system that explores a process of negotiation in practice, which uses an adaptive system with hidden expectations for varying rhythmic cycles. This creates a demanding context for interactive negotiation. This simple study suggests that the negotiation paradigm could be used to test the dimensions of musical interaction in greater detail (this includes comparing human-human, human-computer and computer-computer interactions using the same paradigm), and could be built up from this minimal form to a critical level of complexity where meaningful and verifiable interaction does occur.

Acknowledgements

This work was developed at the July 2009 Dagstuhl Seminar, `Computational Creativity: An Interdisciplinary Approach'. We would like to thank the organisers, Margaret Boden, Jon McCormack and Mark d'Inverno, and the other members of our `interactivity' discussion group: Iris Asaf, Rodney Berry, Daniel Jones, Francois Pachet and Benjamin Porter. Oliver Bown's contribution to this research was funded by the Australian Research Council under Discovery Project grant DP0877320.

2010_26 !2010 A Fitness Function for Creativity in Jazz Improvisation and Beyond

Anna Jordanous
Creative Systems Lab / Music Informatics Research Centre, School of Informatics, University of Sussex, UK
a.k.jordanous at sussex.ac.uk

Abstract. Can a computer evolve creative entities based on how creative they are? Taking the domain of jazz improvisation, this ongoing work investigates how creativity can be evolved and evaluated by a computational system. The aim is for the system to work with minimal human assistance, as autonomously as possible. The system employs a genetic algorithm to evolve musical parameters for algorithmic jazz music improvisation. For each set of parameters, several improvisations are generated. The fitness function of the genetic algorithm implements a set of criteria for creativity proposed by Graeme Ritchie. The evolution of the improvisation parameters is directed by the creativity demonstrated in the generated improvisations. From preliminary findings, whilst Ritchie's criteria do guide the system towards producing more acceptably pleasing and typical jazz music, the criteria (in their current form) rely too heavily on human intervention to be practically useful for computational evaluation of creativity. In pursuing more autonomous creativity assessment, however, this system is a promising testbed for examining alternative theories about how creativity could be evaluated computationally.

1 Introduction

The motivation for this work is to move towards achieving the goal of autonomous evaluation of creativity. It is initially intended as a test scenario in which to evaluate the usefulness of existing proposals for creativity assessment [1, 2].
Longer term, it provides an environment in which to implement and assess theories about how best to evaluate creativity computationally. The computational system presented in this paper is designed to develop increasingly creative behaviour over time. This behaviour is in the domain of jazz improvisation: evolving jazz improvisors based on maximising the level of creativity exhibited by the improvisor.

Ritchie [1, 2] has proposed a set of 18 formal criteria with which to evaluate the level of creativity exhibited by a creative system, using the artefacts which the system produces. Each criterion formally states a condition to be met by the products of the creative system, based on ratings of (at least one of) how typical the set of products are of the domain in which the system operates and how valuable those products are considered to be. The criteria have been adopted by a number of researchers for reflective evaluation of the creativity demonstrated by their creative systems [3, 4, 5]. This work explores the theory that Ritchie's criteria can be adapted and exploited for the purposes of implementing a fitness function for creativity. The criteria are applied to a generation of jazz improvisors and used to select which of these improvisors should be carried forward to the next generation. There has been some interesting work on using evolutionary techniques such as genetic algorithms (GAs) to generate music [6, 7, 8, 9, 10].

Ritchie's criteria require that a creative system should be able to generate artefacts in a specified style or domain. Jazz music improvisation has been chosen for this domain as it encompasses a wide variety of styles of music under the umbrella term of "jazz", from "trad jazz", through the "bebop" style exemplified by Charlie Parker, to free improvisation. This lays the foundation for many creative opportunities to be exploited by evolutionary tangents taken by the system.

2 Evolving Creative Improvisation: Implementation

This system is written in Java, using the jGap1 package to implement the genetic algorithm and the jMusic2 package for music generation. The system evolves a population of "Improvisors" in the form of a set of values for musical parameters. In each generation of the genetic algorithm, these parameters are used to generate MIDI music. The parameters control the maximum number of notes that can sound at any one time, the total number of notes in the piece, the key of the music, the range of pitches used, note durations, tempo markings and proportions of notes to rests, as well as the amount of variability allowed in several of these areas. Notes used are restricted to those in the blues scale3 for that key4. Within the constraints of the musical parameters, random choices are used for the generation of musical improvisations.

1 http://jgap.sourceforge.net
2 http://jmusic.ci.qut.edu.au
3 The blues scale is traditionally used for jazz music. In the key of C, the scale consists of the pitches: C, Eb, F, F#, G, Bb, C
4 Future plans for the system are to allow some chromaticism in notes used, or to allow it to evolve the notes that should be used, as another parameter.
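To make the shape of an "Improvisor" concrete, the following Java sketch lists one field per musical parameter named above, with a random initialiser for the first generation. The field names, types and ranges are our own illustrative assumptions; the actual system encodes these parameters as genes within the jGap framework.

```java
import java.util.Random;

// Hypothetical parameter set for one Improvisor; ranges are illustrative.
public class ImprovisorParameters {
    int maxSimultaneousNotes;        // maximum notes sounding at any one time
    int totalNotes;                  // total number of notes in the piece
    int key;                         // 0..11: tonic pitch class of the blues scale
    int lowestPitch, highestPitch;   // MIDI bounds of the pitch range used
    double meanNoteDuration;         // average note duration, in beats
    int tempo;                       // tempo marking, beats per minute
    double restProportion;           // proportion of rests to notes
    double variability;              // variation allowed in several of the above

    static ImprovisorParameters random(Random rng) {
        ImprovisorParameters p = new ImprovisorParameters();
        p.maxSimultaneousNotes = 1 + rng.nextInt(4);        // 1..4
        p.totalNotes = 20 + rng.nextInt(181);               // 20..200
        p.key = rng.nextInt(12);
        p.lowestPitch = 36 + rng.nextInt(24);               // roughly C2..B3
        p.highestPitch = p.lowestPitch + 12 + rng.nextInt(25);
        p.meanNoteDuration = 0.25 + rng.nextDouble();       // quarter note upward
        p.tempo = 80 + rng.nextInt(121);                    // 80..200 bpm
        p.restProportion = 0.5 * rng.nextDouble();
        p.variability = rng.nextDouble();
        return p;
    }
}
```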
2.1 Using Ritchie's criteria as a fitness function

Ritchie's criteria rely on two ratings of the improvisations produced: how valuable these are as jazz improvisations and how typical they are of the genre. These ratings are made using information about the artefacts. Ritchie makes "no firm proposals on what this information should be" (p. 75) and leaves open the question of how the rating scheme should be implemented. In the present version of this improvisation system, these ratings are provided by human assessment, following the example set in [3] (although as discussed in Section 4, if these ratings can be automatically generated, this would speed up the evolution process).

For each set of parameters, a number of improvisations are generated (the exact number is determined by one of the parameters). Two improvisations are selected and played to the human evaluator, who rates each improvisation on its typicality as an example of jazz and on how much they liked it. Ratings are recorded for the two selected individual improvisations. If there are further improvisations by that Improvisor, the mean values for the two pairs of ratings are used as ratings for the remaining improvisations that had not been rated. In this way, the evaluator is presented only with a selection of improvisations to rate, making the process more time-efficient [10]. This is analogous to the evaluator being given a "demo" of the Improvisor rather than having to listen to all their productions.

At this stage, all improvisations have a value rating and typicality rating and the 18 criteria can be applied to the products of each Improvisor. Each criterion is specified formally in [1] such that a criterion is either true or false for a given Improvisor, depending on whether scores derived from the typicality and value ratings are greater than some threshold θ, by setting suitable parameters to represent high/low typicality and value ratings (α, β and γ). In [1] Ritchie chooses not to specify what values the threshold and parameters should take, but does highlight discussions on this [3, 5]. For simplicity, in the current implementation θ = α = β = γ = 0.5, but experimentation with these values may be profitable. The fitness value for an individual Improvisor is a score between 0 (no creativity) and 1 (maximally creative). Again a simple approach5 is taken:

fitness = (number of criteria satisfied) / (total number of criteria)    (1)

After all Improvisors have been evaluated for fitness, the highest-scoring Improvisor parameters are used to generate a new set of Improvisors, to act as the new generation of this population of Improvisors. The whole process is then repeated once per generation, until the user wishes to halt evolution.

5 This approach assumes all 18 criteria contribute equally to the creativity of a system; again though this is open to experimentation.
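A minimal Java sketch of this fitness computation is given below, assuming each of Ritchie's 18 criteria has already been reduced to a boolean test over the ratings gathered for one Improvisor's products. The Criterion interface and the example criterion are our own illustrative assumptions, not Ritchie's formal definitions; typ and val hold one rating in [0,1] per improvisation, and theta plays the role of θ above.

```java
// Boolean form of one criterion, evaluated over an Improvisor's ratings.
interface Criterion {
    boolean satisfied(double[] typ, double[] val, double theta);
}

public class CreativityFitness {

    // Example criterion in the spirit of "the average typicality is high".
    static final Criterion AVG_TYPICALITY_HIGH = (typ, val, theta) -> {
        double sum = 0;
        for (double t : typ) sum += t;
        return sum / typ.length > theta;
    };

    // Equation (1): the fraction of criteria satisfied, a score in [0, 1].
    static double fitness(Criterion[] criteria,
                          double[] typ, double[] val, double theta) {
        int satisfied = 0;
        for (Criterion c : criteria)
            if (c.satisfied(typ, val, theta)) satisfied++;
        return satisfied / (double) criteria.length;
    }
}
```

With theta = 0.5, as in the current implementation, each criterion contributes 1/18 to the fitness of an Improvisor.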
3 Preliminary Results

The current implementation of the evolutionary improvisation system was tested informally, with a jazz musician (the author) providing the ratings required by Ritchie's criteria. Over several runs, it was able to produce jazz improvisations which slowly evolved from what was essentially random noise to become more pleasing and sound more like jazz to the human evaluator's ears. The question of whether the system was able to evolve more creative behaviour is still unresolved and is the main focus of further work on this project.

Some interesting comments can be made on the implementation of Ritchie's criteria for creativity. Each criterion manipulates one or both of the ratings for typicality and value of the music produced during run-time. In [1], Ritchie left unresolved the issue of generating this rating information (p. 75), concentrating on how the ratings should then be processed once obtained. This work suggests, though, that these ratings are crucial to the success of evaluation; without reliable, accurate rating schemes, the application of the criteria becomes less useful.

Using a human evaluator as a "rating scheme" is easy for the system implementor but causes problems at run time. Even with restrictions placed on how many products the human must evaluate, any reliance on human intervention introduces a fitness bottleneck [6] into the system, such that the progress of evolution is significantly slowed down by having to wait for the evaluator to listen to and rate the music samples. Levels of expertise, fatigue during system runtime, individual bias in preferences and varying levels of concentration also affect the reliability of using human interaction in this way. Issues with using a human as part of a fitness function are discussed in greater depth in [10].

The system generates random improvisations using evolved parameters, without making use of any examples to guide the production of improvisations. A side effect of this is that there is no "inspiring set" of examples, as used by many of the criteria. Therefore those criteria currently do not add any new information to the creativity evaluation and do not contribute to the fitness function.6 Although nothing extraordinary has been generated thus far by the system, the need for human intervention has restricted any longer-term evolution of parameters from being attempted. It is proposed that at least some of the information currently supplied by user interaction can be derived or estimated automatically; then the system (and Ritchie's criteria) can be tested over more generations of evolution. This is discussed further in Section 4.

6 If machine learning methods are used to automate typicality ratings, though, the inspiring set will then consist of the examples used during the learning process.

4 Plans for Future Work

To attempt to escape the problems caused by the fitness bottleneck, automated methods of rating the improvisations for typicality and value are being explored.

- Genre classification methods are being investigated to help judge how typical an improvisation is of a specified genre.
- In this work, the value of an improvisation is interpreted as how pleasing an improvisation is to listen to. Currently, a value rating function is being implemented based on the perceptual principles described in [11].

Once the system has been extended to a degree where evolution can take place over a reasonable time frame, it will be tested over several runs. The results of evolution will be compared to similar tests carried out with human participants who will be asked to rate the creativity of several of the improvisations. This will allow a fairer investigation of the appropriateness and accuracy of Ritchie's criteria for evaluating creativity.

On a longer-term basis, this approach could also be used to test other theories of how best to evaluate the creativity of the products of a creative system (by implementing them as a fitness function for creativity, in various domains). The various theories can be compared and contrasted to each other and to human judgements. We can also consider the efficacy of evaluating the creativity of a system based solely on the artefacts it produces, in comparison to evaluative frameworks that also take into account the creative process, or details about the system itself or the environment it operates in (e.g. as discussed briefly in [12]).
Hence this should prove a useful tool to enable us to move closer towards the goal of discovering how best to replicate human evaluation of creativity.

5 Acknowledgements

This work has benefitted from discussions with Nick Collins and Chris Thornton. Chris Kiefer has provided useful advice during the implementation of this system.

2010_27 !2010 Learning to Create Jazz Melodies Using Deep Belief Nets

Greg Bickerman1, Sam Bosley2, Peter Swire3, Robert M. Keller1
1 Harvey Mudd College, Claremont, CA, USA
2 Stanford University, Stanford, CA, USA
3 Brandeis University, Waltham, MA, USA
gbickerman@hmc.edu, sbosley@stanford.edu, swirepe@brandeis.edu, keller@cs.hmc.edu

Abstract. We describe an unsupervised learning technique to facilitate automated creation of jazz melodic improvisation over chord sequences. Specifically, we demonstrate training an artificial improvisation algorithm based on unsupervised learning using deep belief nets, a form of probabilistic neural network based on restricted Boltzmann machines. We present a musical encoding scheme and specifics of a learning and creational method. Our approach creates novel jazz licks, albeit not yet in real-time. The present work should be regarded as a feasibility study to determine whether such networks could be used at all. We do not claim superiority of this approach for pragmatically creating jazz.

1 Introduction

Jazz musicians strive for innovation and novelty in creating melodic lines, in the context of chord progressions. Because of the structural characteristics of typical chord progressions, it is plausible that a machine could be taught to emulate human jazz improvisation. To this end, one might explicitly state the rules for jazz improvisation, e.g. in the form of grammars [1]-[2]. But structural rules may risk losing some of the flexibility and fluidity for which jazz is known. Here we explore a more organic approach: instead of teaching a machine rules for good jazz, we give the machine examples of the kind of melodies we want to hear stylistically, and let it determine for itself the features underlying those melodies, so that it can create similar ones.

Our current exposition concentrates on a single approach to learning, heretofore not applied to music creation as far as we are aware: deep belief networks (DBNs), a multi-layered composition of restricted Boltzmann machines (RBMs), a specific type of stochastic neural network. We focus on the creation of melodies, and do not attempt to tackle broader issues of real-time collaborative improvisation. In other words, our work tries to explore the application of a specific neural net technology, as opposed to trying to solve the general problem of creating an improvising agent by any means necessary. At present, our learning method is necessarily off-line due to a fairly slow training method, but we hope this can be improved in the future.

We were attracted to DBNs by recent expositions of Hinton, et al. [3]-[7]. Such machines learn to recognize by attempting to create examples (in the form of bit vectors), comparing those examples to training examples, and adjusting their parameters to produce examples closer to the given examples, a form of unsupervised learning. This seemed to us to be very similar to the way some humans learn to improvise melodies by emulation.
Although the stochastic nature of DBNs might be considered a liability in some application fields, we try to leverage that nature to achieve novelty in our generated melodies, a characteristic of the creativity required for jazz improvisation. Thus our objective is different from that of Hinton; we want to create interesting melodies and are less concerned about their recognition.

2 Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) is a type of neural network introduced by Smolensky [8] and further developed by Hinton, et al. [3]-[7]. It consists of two layers of neurons: a visible layer and a hidden layer. Each visible neuron is connected to each hidden neuron, and vice versa, through a series of symmetric, bi-directional weights. A single training cycle for the machine takes a binary data vector as input, activating its visible neurons to match the input data. It then alternates activating its hidden nodes based on its visible nodes, and activating its visible nodes based on its hidden nodes. Each node is activated probabilistically based on a weighted sum of all nodes connected to it. Since nodes within a layer are not connected to each other, activation of the hidden nodes depends only on the states of the visible nodes, and vice versa. After the network has stabilized, the new configuration of visible nodes can be viewed as output.

Figure 1: A restricted Boltzmann machine. The first node B of each layer is a fixed bias node.

The objective of an RBM is to learn features in sets of data sequences. Toward this end, we implemented the contrastive divergence (CD) learning algorithm, as described by Hinton [3]. We modeled our implementation on an excellent tutorial supplied by Radev [9]. The CD algorithm allows for relatively inexpensive training given the large number of nodes and weights in our networks. Once trained, an RBM can take a random data sequence and, through a series of activations, generate a new sequence that emulates features from the training data.

While a single RBM is capable of learning some patterns in the training data, multiple RBMs can be layered together to form a much more powerful machine known as a deep belief network (DBN) [4]. Multiple RBMs are combined by identifying the hidden layer of each RBM with the visible layer of the one below. The second RBM is able to learn features about the features learned by the first RBM, and thus the entire layered machine should be able to learn far more intricate patterns than a single RBM could. Figure 2 illustrates the structure of a DBN.

Figure 2: An illustration of a 3-layer Deep Belief Network
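As a rough illustration of a single training step, here is a minimal Java sketch of an RBM with a one-step contrastive divergence (CD-1) update, following the standard recipe in Hinton's expositions. This is our own simplified reading, not the authors' implementation: bias units (node B in Figure 1) are omitted for brevity, and all names are illustrative.

```java
import java.util.Random;

// A minimal RBM over binary vectors, trained by CD-1.
public class Rbm {
    final int nVisible, nHidden;
    final double[][] w;                 // symmetric weights, visible x hidden
    final Random rng = new Random();

    Rbm(int nVisible, int nHidden) {
        this.nVisible = nVisible;
        this.nHidden = nHidden;
        w = new double[nVisible][nHidden];
        for (double[] row : w)
            for (int j = 0; j < nHidden; j++)
                row[j] = rng.nextGaussian() * 0.01;   // small random init
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // P(hidden j = 1 | visible): weighted sum of connected visible nodes.
    double[] hiddenProbs(double[] v) {
        double[] p = new double[nHidden];
        for (int j = 0; j < nHidden; j++) {
            double sum = 0;
            for (int i = 0; i < nVisible; i++) sum += v[i] * w[i][j];
            p[j] = sigmoid(sum);
        }
        return p;
    }

    // P(visible i = 1 | hidden): same weights used in the other direction.
    double[] visibleProbs(double[] h) {
        double[] p = new double[nVisible];
        for (int i = 0; i < nVisible; i++) {
            double sum = 0;
            for (int j = 0; j < nHidden; j++) sum += h[j] * w[i][j];
            p[i] = sigmoid(sum);
        }
        return p;
    }

    double[] sample(double[] probs) {   // probabilistic activation
        double[] s = new double[probs.length];
        for (int i = 0; i < probs.length; i++)
            s[i] = rng.nextDouble() < probs[i] ? 1.0 : 0.0;
        return s;
    }

    // One CD-1 step on a single training vector v0.
    void trainStep(double[] v0, double learningRate) {
        double[] h0 = hiddenProbs(v0);                    // positive phase
        double[] v1 = sample(visibleProbs(sample(h0)));   // reconstruction
        double[] h1 = hiddenProbs(v1);                    // negative phase
        for (int i = 0; i < nVisible; i++)
            for (int j = 0; j < nHidden; j++)
                w[i][j] += learningRate * (v0[i] * h0[j] - v1[i] * h1[j]);
    }
}
```

Stacking such machines, with each trained layer's hidden activations serving as the next layer's visible data, yields the DBN structure of Figure 2.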
3 Data Representation

In order to train DBNs on musical data, we first encode the music as bit vectors. We divide each beat into beat subdivisions called slots, with the number of slots dependent on the smallest note duration to be represented. For our experiments, we chose twelve slots per beat, which allows us to represent all duplet or triplet note durations down to a sixteenth-note triplet. Each slot is filled by a block of thirty bits, divided into twelve chord bits and eighteen melody bits. A description of the melody bits follows. Twelve bits are used as a one-hot encoding for the chromatic pitch classes from C to B over one octave, four bits are used as a second one-hot encoding to designate one of four octaves, one bit designates a sustained extension of the previous note, i.e. the note is not attacked anew, and one bit represents a rest. If a note is being attacked at a given slot, its corresponding pitch and octave bits are on and all other bits are off. If a note is being sustained, then the pitch bits are ignored but the sustain bit is on. Representing octaves this way, rather than using a single one-hot encoding to represent a four-octave chromatic range, gave us a significant improvement in training time, by reducing the number of pitch nodes in the input layer.

The sustained note bit is used to represent the same pitch value as the note previously played. Thus notes of long duration will be seen as chains of sustain bits being on. Figure 3 shows an example of a melody and its corresponding encoding at a coarser resolution of two slots per beat for brevity.

Figure 3: A short melodic segment with a coarse encoding (only two slots per beat). To improve readability, 0 values are left blank. [The figure tabulates, slot by slot, the chromatic pitch, octave, sustain and rest bits of the encoded melody.]

Each chord is encoded as twelve bits representing the chromatic pitches from C to B. If a pitch is present in a chord, its corresponding bit is on. Melody and chord vectors are concatenated to form part of the input to the network corresponding to one slot. Thus the machine ideally learns to associate specific chords with various melodic features. Because the machine will be seeing more than one slot at a time, as we later describe, it can also learn about chord transitions.
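To make the layout concrete, this small Java sketch builds one 30-bit slot under the scheme just described. The bit ordering (the twelve chord bits first, then pitch, octave, sustain and rest) and all names are our own assumptions for illustration.

```java
// Builds 30-bit slots: bits 0-11 chord pitch classes, 12-23 one-hot melody
// pitch class (C..B), 24-27 one-hot octave, 28 sustain, 29 rest.
public class SlotEncoder {
    static final int BITS = 30;

    // A newly attacked note: pitchClass in 0..11 (C..B), octave in 0..3.
    static boolean[] attack(boolean[] chord, int pitchClass, int octave) {
        boolean[] slot = new boolean[BITS];
        System.arraycopy(chord, 0, slot, 0, 12);   // 12 chord bits
        slot[12 + pitchClass] = true;              // one-hot pitch class
        slot[24 + octave] = true;                  // one-hot octave
        return slot;
    }

    // A sustained continuation of the previous note: pitch bits are ignored
    // by the decoder, only the sustain bit is on.
    static boolean[] sustain(boolean[] chord) {
        boolean[] slot = new boolean[BITS];
        System.arraycopy(chord, 0, slot, 0, 12);
        slot[28] = true;
        return slot;
    }

    static boolean[] rest(boolean[] chord) {
        boolean[] slot = new boolean[BITS];
        System.arraycopy(chord, 0, slot, 0, 12);
        slot[29] = true;
        return slot;
    }
}
```

A held note thus appears as an attack slot followed by a chain of sustain slots, as described above.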
4 Training Data

We initially trained on a small set of children's melodies such as "Twinkle, Twinkle, Little Star" and "Frère Jacques." These melodies were all in the same key and generally consisted of simple rhythms and notes that were in their respective chords. Once we taught a machine to learn from, and then create, similarly simple melodies, we moved on to teaching larger networks jazz. Our primary dataset was a large corpus of 4-bar jazz licks (short coherent melodies) cycling over the common ii-V-I-VI7 "turnaround" chord progression in a single key. The ii-V-I is a very common cadence in jazz; the VI7 chord is a connecting chord that leads one lick into the next for the same progression, VI7 being the dominant relative to the ii chord that follows. Most of the licks were either transcribed from notable jazz solos, or hand constructed, some with the help of the grammar-based "lick generator" of the Impro-Visor software tool [10].

5 Learning Method

Part of our goal is for the machine to learn how to create melodies that transition between chords in a progression. To add flexibility, rather than training our machine on inputs of all 4 bars of a lick at once, we break our data up into smaller windows of 1 measure each. For each 4-bar lick, we start the "window" at the beginning of the first bar. Then we move the window forward by one beat and look at the next 4 beats starting at beat 2 of the measure for the next window. We move the window forward by a beat at a time, taking measure-long snapshots of the window, until we reach the end of the 4-bar lick. In this way a single 4-bar lick is broken up into 13 overlapping shorter windows that are used sequentially as the inputs to the network. The scenario is analogous to that shown in Figure 4, except there are no question marks during training.

For creating new melodies, we start the machine with a "seed" consisting of specified chord bits defining our desired chord progression, and random melody input bits. The chord bits in the first layer of the machine are clamped so that, during any given creation cycle, they cannot be modified by the stochastic nature of the machine. In creating a new melody, we use a procedure analogous to windowing during training. We start by generating the first few beats of a new melody and then clamping their corresponding bits. As each successive beat is generated, the whole melody and chord sequence is shifted forward to make room for the next beat. So in general, the machine only generates one beat at a time, but uses clamped chords and clamped beats of the preceding melody to influence the note choices. This process is illustrated in Figure 4.

Figure 4: An illustration of the process of windowed generation. The RBM generates small segments of melody over a fixed chord seed. A newly generated segment is then fixed and used to generate the next segment of melody.

During the machine's final activation of its visible layer (which constitutes the newly generated melody), we group certain bits together for special consideration. Rather than letting the machine activate every bit probabilistically, we look at each slot individually and activate only the pitch bit and octave bit with the highest probabilities of activation among their group. Thus the machine is forced to choose whether to sustain, rest, or start a new pitch. We found that this approach allows for good variety of created melodies, while still resonating well with the given chords.
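The following Java sketch, under the encoding assumptions above, illustrates (a) slicing a 4-bar lick into the 13 overlapping one-measure training windows, and (b) the group-wise decoding rule applied to the visible layer during generation: per slot, only the most probable pitch bit and octave bit may switch on, and the slot is forced to choose between attacking, sustaining and resting. The three-way tie-breaking below is our own illustrative reading of that rule, not the exact one used.

```java
// Windowing and slot decoding, assuming 12 slots per beat and 30-bit slots.
public class WindowedLicks {
    static final int SLOTS_PER_BEAT = 12, BITS = 30, WINDOW_BEATS = 4;

    // (a) A 4-bar lick (16 beats = 192 slots) stepped one beat at a time
    // yields 13 overlapping windows of 4 x 12 x 30 = 1440 bits each.
    static boolean[][] trainingWindows(boolean[][] lick) {  // lick: 192 x 30
        int beats = lick.length / SLOTS_PER_BEAT;           // 16
        int n = beats - WINDOW_BEATS + 1;                   // 13 windows
        int windowSlots = WINDOW_BEATS * SLOTS_PER_BEAT;    // 48 slots
        boolean[][] windows = new boolean[n][windowSlots * BITS];
        for (int w = 0; w < n; w++)
            for (int s = 0; s < windowSlots; s++)
                for (int b = 0; b < BITS; b++)
                    windows[w][s * BITS + b] = lick[w * SLOTS_PER_BEAT + s][b];
        return windows;
    }

    // (b) Decode one slot's 30 visible-layer probabilities into bits; indices
    // follow the encoding sketch. Chord bits are clamped elsewhere.
    static boolean[] decodeSlot(double[] p) {
        boolean[] slot = new boolean[BITS];
        int pitch = argmax(p, 12, 24);                      // best pitch class
        int octave = argmax(p, 24, 28);                     // best octave
        // Take the weaker of the two winning bits as the attack strength
        // (an assumption), then choose among sustain, rest and attack.
        double attack = Math.min(p[pitch], p[octave]);
        if (p[28] >= attack && p[28] >= p[29]) slot[28] = true;   // sustain
        else if (p[29] >= attack) slot[29] = true;                // rest
        else { slot[pitch] = true; slot[octave] = true; }         // new note
        return slot;
    }

    static int argmax(double[] p, int from, int to) {
        int best = from;
        for (int i = from + 1; i < to; i++) if (p[i] > p[best]) best = i;
        return best;
    }
}
```

Note that 1440 bits per window plus one bias node matches the 1441 input nodes of the network described below.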
We also want to know if the machine can learn to create licks over a ii-V-I-VI7 chord progression in an arbitrary key. Thus, we included the option to transpose each input into different keys and train on the transpositions simultaneously. We implemented all of the functionality described thus far as a stand-alone tool we call "RBM-provisor" that we have made publicly available [11]. The tool is written in Java and supports input and output via the leadsheet format [12] used by Impro-Visor, so that the user can work with readable, symbolic encodings, rather than bit-vectors.

6 Results

Our initial experiments used our dataset of short segments of children's melodies, training on small 2-layer machines for 100 epochs. Results were encouraging, with chosen notes fitting well into the simple chords and flowing together melodically. Figure 5 shows a children's melody created over a simple chord progression.

Figure 5: An example of a created children's melody over a specified chord progression.

After achieving the ability to create stylistically similar melodies from a set of simple examples, we moved on to the more complex problem of learning jazz. In attempting to produce a successful jazz creation network, we experimented with various aspects of networks, including number of layers, number of nodes per layer, number of training epochs, and many others. We ultimately settled on a 3-layer network containing 1441 input nodes (4 beats x 12 slots per beat x 30 bits per slot + 1 bias), with 750, 375 and 200 hidden nodes respectively. A typical training involved 250 epochs on about 100 four-measure licks, which takes about nine hours on an inexpensive desktop computer. The first stave of Figure 6 shows a sample of training data, with the second stave showing a typical lick created by the network. For comparison, the third stave shows random notes at the same resolution of 12 slots per beat.

When analyzing the created music using Impro-Visor [10], we found the vast majority of generated notes were in the chord, with occasional color tones (tones not in the chord, but sonorous with it), which is totally acceptable. Foreign tones were hardly ever present. Created melodies tended to avoid large interval jumps and rarely skipped octaves. Additionally, we found the training method was able to deal well with transpositions. After training on four copies of each of our inputs, transposed up 0, 1, 2, and 3 semitones from the original, the machine still created chord-compatible music regardless of the set of chords that was provided as a seed. We have yet to test jazz generation on more than four transpositions due to the extensive added training time required for transposing inputs to all twelve keys. Nonetheless, we are optimistic regarding our machine's ability to handle any number of transpositions, given sufficient nodes and adequate training time.

The reason that the ability to transpose is viewed as important is that, in jazz music, chord progressions often have implied abrupt key changes that are not labeled as such explicitly. Ideally, an improvisational algorithm would be able to respond to chord changes based on the chords in whatever relative transpositions they occur, rather than relative to a fixed reference key. For example, in the standard tune "Satin Doll", one finds an extended cadence Am7 D7 Abm7 Db7 C. The sub-progression Abm7 Db7 is the same as Am7 D7 transposed down a half-step. It would be more economical and modular to train a network on all transpositions of Am7 D7 than it would be to train it on all contexts that might surround that two-chord sequence.

We noticed some differences between input data and generated music. While half-step intervals were common in our inputs, generated licks tended to avoid them, skirting off-chord approach tones and opting instead for more familiar chord tones. The most striking difference between the two sets of music related to rhythms. While our inputs contained notes of duplet and triplet rhythms, our outputs contained almost exclusively duplet rhythms. This issue will be discussed in greater detail in the next section.

Figure 6: The first stave is a sample from our training data licks. The second stave is a lick that was generated by a trained deep belief network. The third stave shows random notes generated at the resolution of the network. The fourth stave shows incoherence using selection not based on maximum probability. In all cases, red notes represent discords.

Other approaches tried included selecting bits in proportion to the neuron probability distribution, rather than always choosing the maximum probability. However, this produced melodies that were more disjointed and less coherent rhythmically, as in the bottom stave of Figure 6. We also experimented with encodings that included beat information, such as which beats were stronger. The results for such encodings were not superior to those for the chosen encoding presented here.

At this juncture, using deep belief networks would not be our first choice for a lick generator in a jazz education tool such as Impro-Visor [10]. The quality of licks generated by Impro-Visor's grammatical approach is sufficiently superior qualitatively to those generated by our DBN that it would be pointless to conduct a third-party blindfold test. The other drawback to DBNs is the large training time.
On the other hand, DBNs may eventually prove to be less algorithmically biased than an unsupervised approach such as that in [2], which relies on clustering and Markov chains, and it is possible that the training time issue can be alleviated.

7 Future Work

The successes of our initial deep-belief improvisor are encouraging, but there is still much potential for improvement. Despite training on inputs containing both triplet and duplet rhythm patterns, our machine created mostly duplet rhythm patterns. We hypothesize that this results from a predominance of duplet rhythms in our training set, overshadowing the examples of triplet rhythms. Ideally, our machine should be able to generate triplet patterns at a lower frequency than duplet patterns, rather than excluding them from generation altogether. It is possible that a different note generation rule might yield more variety, but we have yet to find one that doesn't also result in less coherence.

Additionally, the music generated by our trained DBN tends to produce disproportionate numbers of repeated pitches, instances in which the same note is played twice in a row, compared with their relatively low frequency of occurrence in the training data. Repeated notes in jazz may tend to sound static and immobile, and we would like to avoid them if possible. One solution we implemented involved post-processing our generated music to merge all repeated notes. Ideally the machine should avoid producing as many of them in the first place. It is possible that a different encoding might resolve some of these issues.

Finally, we believe that our work naturally lends itself to the open problem of chord inference. Currently, we give our machine chords as input, and it creates a suitable melody. If we instead provide a melody as input, a DBN similar to ours might be able to determine one or more chord progressions that fit the melody.

8 Related Work

Geoffrey Hinton and his associates are responsible for much previous work related to restricted Boltzmann machines. They used RBMs and DBNs for various purposes, including handwritten digit recognition [3], facial recognition [7], and movie recommendation [6]. These contrast with our use, which is generation. A particularly useful tutorial for implementing an RBM has been written by Rossen Radev [9]. Our RBM implementation was largely influenced by these sources.

Early work on generation of music by neural networks includes Mozer [13], who used back propagation through time. See Todd and Loy [14] for other early examples. Bellgard and Tsang [15] used a different form of extended Boltzmann machine for the harmonization and analysis of chorales. Eck and Lapalme [16] describe an approach using LSTM (Long Short-Term Memory) neural networks. Additionally, Page [17] utilized neural networks for musical sequence recognition. Please see Todd and Werner [18] for a more extensive survey.

Various other approaches have been taken towards artificial composition. Biles [19] used genetic algorithms. Jazz generation using a grammar-based approach was demonstrated by Keller and Morrison [1], and learning by Gillick, Tang and Keller [2]. Please consult these papers for further references on related approaches. Please see Cope [20] for a broad survey of approaches to musical creativity, including neural networks.

9 Summary

The results of our experiments show that a deep belief network is capable of learning certain concepts about a set of jazz licks and in turn creating new melodies.
The ability of a single machine to generate licks over a chord progression in several different keys demonstrates the power and flexibility of the approach and suggests that a machine could be taught to generate entire solos over more complex chord progressions given a sufficient dataset. While the licks created by our networks sometimes under-represented features of the training set, their novelty and choice of notes seem adequate to characterize them as jazz. Despite a moderately successful proof of concept, deep belief networks would not be our first choice for a practical lick-generation tool at this stage of our understanding. Our initial objective of exploring the possibility has been achieved, and further exploration is anticipated. We continue to be attracted to this approach as the basis for an algorithmically unbiased machine learning method.

Acknowledgment

This research was supported by grant 0753306 from the National Science Foundation. We are grateful to the anonymous referees for several helpful suggestions for revision and future work.

2010_28 !2010 Experiments in Objet Trouvé Browsing

Simon Colton, Jeremy Gow, Pedro Torres, and Paul Cairns†
Department of Computing, Imperial College, London, UK.
† Department of Computer Science, University of York, UK.

Abstract. We report on two experiments to study the use of a graphic design tool for generating and selecting image filters, in which the aesthetic preferences that the user expresses whilst browsing filtered images drive the filter generation process. In the first experiment, we found evidence for the idea that intelligent employment of the user's preferences when generating filters can improve the overall quality of the designs produced, as assessed by the users themselves. The results also suggest some user behaviours related to the fidelity of the image filters, i.e., how much they alter the image they are applied to. A second experiment tested whether evolutionary techniques which manage fidelity would be preferred by users. Our results did not support this hypothesis, which opens up interesting questions about how user preferences can be intelligently employed in browsing-based design tools.

1 Introduction

The Objet Trouvé (Found Object) movement in modern art gained notoriety by incorporating everyday objects, often literally found discarded in the streets, into visual art and sculpture pieces. This is a two-stage process, whereby the original object is first found, and then manipulated into a piece of art. This process is analogous to certain practices in computer-supported graphic design. In particular, both amateur and expert designers will often find themselves browsing through libraries of image filters, or brush shapes, or fonts, or colour palettes, or design templates, etc. Once they have found some possibilities, these are pursued further and manipulated into a final form. This analogy with Objet Trouvé methods is most pronounced in the field of evolutionary art, where artistic images (or more precisely the image-generating processes) are evolved, e.g., [5]. Here, the software initially leads the user, through its choices of processes to employ and the way in which it combines and/or mutates those processes as the session progresses. However, as the user begins to exert their aesthetic preferences through their choices, the software should enable them to quickly turn their found processes into a final form. We investigate here the behaviour of "amateur creators" [2] when using such design software.
Our motivation is to ultimately build software which acts as a creative collaborator in design processes. We present here the results from two experiments where participants were asked to undertake graphic design tasks using a simple design tool which allows users to browse and select image-filtered versions of a source image in an Objet Trouvé fashion. Various techniques may be used to supply new filters on demand. As described in section 2, our tree-based image filtering method enables evolutionary, database lookup and image retrieval techniques to be used in providing a user with new filters.

Fig. 1. Filter tree with transforms in blue circles: (A)dd colour, (C)onvolution, (I)nverse, (M)edian and (T)hreshold; and compositors in red squares: (A)nd, (F)ade, (M)in and (O)r. Image inputs are in green diamonds. An original image and filtered version are shown.

We incorporated six such techniques into a very pared-down user interface which enables the user to undertake simple graphic design tasks. As described in section 3, the first experiment was designed to test the hypothesis that employing the user's current choices to intelligently determine what to show them next is more effective than a random selection method. The data revealed that users ascribe on average a higher score to designs produced by the intelligent methods than produced randomly. In addition, by studying user preferences towards the six generation methods, we hypothesised certain user behaviours, largely involving their preferences towards more conservative image filters, i.e., ones with high fidelity which don't radically change the original image. Studying these behaviours enabled us to design a second experiment involving evolutionary techniques where evolved filters are supplemented with other filters. As described in section 4, this experiment tested the hypothesis that choosing the supplementary filters to manage the overall average fidelity of the filters would be more effective than choosing them randomly. The data did not support this hypothesis, which opens up interesting questions about how to analyse the behaviour of a user to improve the quality of the content they are shown as their browsing session progresses. In section 5, we suggest more intelligent methods for future Objet Trouvé design approaches.

2 Image Filtering

An image filter such as blurring, sharpening, etc. manipulates the bitmap information of an original digital image into bitmap information for a filtered version of the image. We represent image filters as a tree of fundamental (unary) image transforms such as inverse, lookup, threshold, colour addition, median, etc., and (binary) image compositors such as add, and, divide, max, min, multiply, or, subtract, xor, etc. In the example tree of figure 1, the overall filter uses 7 transform steps and 6 compositor steps, and the original image is input to the tree 7 times. Using this representation, image filters can be generated randomly, as described in [3]. We used such random generation to produce a library of 1000 hand-chosen filters, compiled into 30 categories according to how the filtered images look, e.g., there are categories for filters which produce images that are blurred, grainy, monotone, etc. The time taken to apply a filter is roughly proportional to the size of the input image multiplied by the size of the tree.
Over the entire library of filters, the average number of nodes in a tree is 13.62, and the average time (on a Mac OS X machine running at 2.6GHz) to apply a filter to an image of 256 by 384 pixels is 410 milliseconds.

2.1 Filter Generation Methods

As described in the experiments below, we investigate how best to supply a user with a set of novel filters (N) given a set of filters (C) for which they have already expressed an interest. We have implemented the following methods.

- Database methods. These two methods use the 30 hand-constructed categories from our image filter library. The Random From Category (RFC) method supplies filters for N which are chosen randomly from the library. To do this, firstly a category is chosen at random, and then a filter is chosen at random from the category. Given that some of the categories have up to 100 filters in them, we found that choosing evenly between the categories gave the user more variety in the filters shown than simply choosing from the 1000 filters at random (which tends to bias towards filters in the most popular few categories, which can look fairly similar). Whenever a filter has been shown to the user, it is removed from a category, so it will not be shown again, and if a category has been exhausted, then a new one is chosen randomly. The More From Categories (MFC) method takes each filter in C with an equal probability, finds the library category from which it came and then chooses a filter from this category at random to add to N. As before, filters which have been shown to the user are removed from the category. Exhausted categories are re-populated, but when a filter is used from the category, the filter is mutated (see below), to avoid repetitions.

- Image retrieval methods. An alternative way to retrieve filters from the library is to search for filters which are closest to the user choices in terms of colour or texture. We call these the Colour Search (CS) and Texture Search (TS) retrieval methods. We modified standard image retrieval techniques to perform this search, using information derived offline about how the filters alter the colour histogram and texture of some standard images, as described in further detail in [11]. While the search techniques work with approximate information about filters (to perform efficiently), they are fairly accurate, i.e., in [11], we report a 93% probability of retrieving a given filter in a set of 10. Again, a record of the library filters shown to the user is kept to avoid repetitions.

- Evolutionary methods. Representing image filters as trees enables both the crossing over of branches into offspring from each of two parents, and the mutation of trees. To perform crossover, we start with two parent trees (called the top and bottom parent), and we choose a pair of nodes: Nt on the top parent and Nb on the bottom parent. These are chosen randomly so that the size of the tree above and including Nt, added to the size of the tree gained from removing the tree above and including Nb from the bottom parent, is between a predefined minimum and maximum (3 and 15 for the experiments here). After 50 failed attempts to choose such a pair of nodes, the two trees are deemed incompatible, and a new couple is chosen. If they are compatible, however, the tree above and including Nt is substituted for the tree above and including Nb to produce the offspring. An example crossover operation is shown in figure 2a.
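As an illustration, the following Java sketch implements this size-constrained crossover over a generic filter-tree node type. The FilterNode class and all names are our own stand-ins for the authors' representation, assuming node labels such as transform and compositor names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// A node of a filter tree: a transform (1 child), compositor (2 children)
// or image input (0 children), identified here only by a label.
class FilterNode {
    final String op;                               // e.g. "Inverse", "Min"
    final List<FilterNode> children = new ArrayList<>();

    FilterNode(String op) { this.op = op; }

    int size() {                                   // nodes in this subtree
        int n = 1;
        for (FilterNode c : children) n += c.size();
        return n;
    }

    List<FilterNode> allNodes() {                  // preorder traversal
        List<FilterNode> nodes = new ArrayList<>();
        nodes.add(this);
        for (FilterNode c : children) nodes.addAll(c.allNodes());
        return nodes;
    }

    FilterNode deepCopy() {
        FilterNode copy = new FilterNode(op);
        for (FilterNode c : children) copy.children.add(c.deepCopy());
        return copy;
    }
}

class FilterCrossover {
    static final Random RNG = new Random();
    static final int MIN_SIZE = 3, MAX_SIZE = 15;

    // Offspring = copy of bottom with the subtree at Nb replaced by the
    // subtree at Nt, subject to the offspring-size constraint; returns null
    // when the parents are deemed incompatible after 50 attempts.
    static FilterNode crossover(FilterNode top, FilterNode bottom) {
        List<FilterNode> topNodes = top.allNodes();
        List<FilterNode> botNodes = bottom.allNodes();
        for (int attempt = 0; attempt < 50; attempt++) {
            FilterNode nt = topNodes.get(RNG.nextInt(topNodes.size()));
            FilterNode nb = botNodes.get(RNG.nextInt(botNodes.size()));
            int offspringSize = bottom.size() - nb.size() + nt.size();
            if (offspringSize < MIN_SIZE || offspringSize > MAX_SIZE) continue;
            return copyReplacing(bottom, nb, nt);
        }
        return null;
    }

    // Copy node's subtree, substituting the subtree at target for replacement.
    static FilterNode copyReplacing(FilterNode node, FilterNode target,
                                    FilterNode replacement) {
        if (node == target) return replacement.deepCopy();
        FilterNode copy = new FilterNode(node.op);
        for (FilterNode c : node.children)
            copy.children.add(copyReplacing(c, target, replacement));
        return copy;
    }
}
```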
We have implemented various mutation techniques, which alter the filter by different amounts, with details given in [3]. For the experiments here, we employ a mutation method which randomly mutates a single transform/compositor node in the filter tree into a different transform/compositor node. This almost guarantees that a visible change will occur, and through experience we have found that it can produce a range of changes to the filter, from mild to fairly strong, depending on where the mutated node is in the tree. With the Cross-Over (CO) method, we choose pairs of filters randomly from the set of chosen filters C and perform one-point crossover to produce new filters for N. With the Mutate (MT) method, filters from C are chosen randomly and mutated as above to produce filters for N. Note that filters generated by the CO or MT methods are checked to determine if they produce single-colour images or the same image as another child of the same parents. If so, the filter is rejected and another one is generated.

Fig. 2. a) Example crossover operation. b) Filter Trouvé design interface screenshot.

2.2 The Filter Trouvé User Interface

The user interface employed for the experiments described below is portrayed in figure 2b. In each session, the user works on a single design which involves filtering a single image which is incorporated (possibly multiple times) into a design, such as a magazine cover, etc. The user is shown the expression of 24 filters on the image in successive screens. They can click on a filtered image to see it in the design (in figure 2b, a filtered version of a tree image is shown in a magazine cover design). The user can also click a tick sign on each image to express their preference for it. When a user has chosen the images they like from a sheet, they click on the `next' button, which supplies them with another sheet of 24 images. At the end of a session, after clicking a `finalise' button, users are shown all the designs that they ticked, one by one, in a random order. For each design, they are asked to judge it by choosing one of: (1) would definitely not use it, (2) would probably not use it, (3) not sure, (4) would probably use it, and (5) would definitely use it. These choices essentially provide a score between 1 and 5 for each design. When each design has been given a score, the session ends.

There has been considerable interest recently in creativity support tools like Filter Trouvé, which allow users to quickly generate, explore and compare multiple alternatives [9]. Our design embraces Shneiderman's principle of "low thresholds" for novices, but intentionally avoids the "high ceilings and wide walls" of advanced and comprehensive functionality. Filter Trouvé is designed for the amateur creator [2], who is not motivated to gain domain expertise, as opposed to the novice, who intends to become an expert. However, we see no reason why Objet Trouvé methods could not support other user groups.

3 Experiment 1

To recap, we are interested in comparing methods for presenting users with successive sets of image filters, so that they can drive a browsing session via their aesthetic choices. In this experiment, we investigate whether techniques which choose new filters based on the user's choices perform better than techniques which choose them randomly. To do this, we compared the Random From Category (RFC) generation technique with a hybrid technique which we call Taster.
This supplies the user with: 4 filters from MFC; 4 from CO; 4 from MT; 3 from CS; 3 from TS; and 6 from RFC. Note that we included filters returned by RFC in the Taster method, as we found in initial tests that users appreciate the variety provided by RFC. Note also that providing only 3 images from the CS and TS methods was a mistake: while we had intended to provide 4 images for each of these methods, the mistake does not affect the conclusions we draw. The two hypotheses we proposed were:

- The Taster method will produce better images than the RFC method, as measured by the scores ascribed by participants to their designs.
- Taster will be quicker to use than RFC, as measured by the time taken to complete tasks and the number of images viewed before participants decide they have finished the task.

We asked 29 participants with varying levels of graphic design experience (no professionals) to undertake 4 design tasks using the Filter Trouvé interface. The design tasks were: a gallery installation, where a filtered image of a cityscape was included four times with wooden frames; a magazine cover, which involved a filtered image of a woman's face behind text; a Facebook profile, which involved a filtered version of a man's face; and a book cover, where a filtered version of a (haunted) house appears on the front and back. We instructed participants to tick any filters they liked on a sheet and to stop when either around 10 minutes had passed, or they felt they had enough designs they were pleased with, or they felt the search was futile and wanted to stop. We balanced the two experimental conditions (i.e., the Taster method and the RFC method) in such a way that each participant had both conditions twice. This meant that there were either 14 or 15 participants in each pairing of design task and condition. The measures for each task were the time taken to complete the task, the number of sheets viewed by the participant, the number of ticks and expansions (viewing the filtered image in the overall design) a participant made, and the score for each design.

3.1 Quality and Efficiency Results

Some summary statistics about Taster and RFC are presented in table 1. The data were analysed using SPSS v17.0. With all measures, it was noted that there was substantial positive skew in many of the task conditions. For this reason, non-parametric tests were used to compare the conditions. Additionally, as each participant completed four separate tasks but not in a full-factorial design, the measures for each task are considered separately as a between-subjects design for the two conditions. However, to account for possible correlations between the performance on the different tasks, we make a Bonferroni correction.

Condition   Mean Score   Mean Ticks   Mean Expands   Mean Time (s)   Mean Sheets
RFC         3.23         17.59        43.07          497.21          9.48
Taster      3.47         15.97        39.53          463.55          6.97

Table 1. Mean score per chosen design; average number of ticks per design task; mean number of expansions per design task; mean time per design task; mean number of sheets of 24 images per design task, for both RFC and Taster conditions.

Design Task   RFC          Taster
Gallery       536 (251)    504 (123)
Magazine      409 (92.0)   473 (242)
Facebook      373 (179)    372 (128)
BookCover     673 (298)    508 (230)

Design Task   RFC            Taster
Gallery       6.87 (2.48)    4.71 (2.37)
Magazine      7.73 (2.69)    6.07 (2.76)
Facebook      12.1 (5.72)    10.5 (6.29)
BookCover     11.50 (5.20)   6.33 (2.82)

Table 2. a) mean (standard deviation) of task times in seconds for each task in each condition;
b) mean (sd) of number of sheets viewed for each task in each condition.

Thus, for all tests, we use the Mann-Whitney test with significance level α = 0.05/4 = 0.0125, and SPSS was used to produce exact p values. The means and standard deviations of the task times are given in table 2a. There are modest differences between the means, but overall there are no significant differences: for the gallery task, U = 102, p = 0.914; for the magazine task, U = 95.5, p = 0.691; for the Facebook task, U = 104, p = 0.974; and for the book cover task, U = 57.5, p = 0.038. Note that the book cover task tends towards being completed quicker in the Taster condition. Turning to the number of filters viewed: in all tasks, participants viewed fewer sheets in the Taster condition than in the RFC condition. The means and standard deviations are shown in table 2b. Significant differences are seen for Gallery (U = 49, p = 0.011) and for BookCover (U = 40, p = 0.003), but not for Magazine (U = 63.5, p = 0.067) or Facebook (U = 86.5, p = 0.429).

Analysing scores is more complicated, as each participant was able to tick, and therefore score, as many designs as they wished. The number of ticks depends on personal strategies for using the system, hence it is useful to see whether participants differed in the number of ticked designs in one condition over the other. As tasks were fully counterbalanced across the two conditions, for each participant, the number of scores produced was summed across tasks in each condition. In the RFC condition, participants ticked a mean of 17.59 designs, whereas in the Taster condition, the mean is lower, at 15.97. A Wilcoxon Signed Ranks test indicates that these differences are not significant (Z = −1.14, p = 0.261). Taking instead the average (mean) of all scores for each participant over the two tasks in a single condition, the mean score in the RFC condition is 3.23, and in the Taster condition it is higher, at 3.47. The difference in mean scores over the two conditions is significant (Wilcoxon Z = −2.649, p = 0.007). It is worth noting, though, that looking only at the number of score 5s, i.e., the images people would definitely use, a similar analysis showed no significant difference. In summary, users will on average be more satisfied (in terms of scores) with designs produced by Taster. However, they will not necessarily tick more designs nor give more designs the maximum score in a session. While users will not necessarily finish quicker, they may be presented with fewer images in carrying out a design task with Taster than with RFC, though this is task dependent.

Fig. 3. a) Mean sheet scores for the RFC and the Non-RFC submethods in the Taster method. b) tick and expand probabilities per sheet for RFC and Non-RFC submethods.

3.2 User Behaviour Analysis

As Taster is a combination of six filter selection submethods, we can look at the individual contribution each submethod makes. Splitting the submethods into those using the participant's choices (Non-RFC) and the RFC method, and looking at figure 3a, we see that for Non-RFC methods, the mean score for designs ticked in sheets 1 to 5 is 3.23, whereas in sheets 6 to 10, the mean score is 3.4. The same effect is not present in the sheets produced by the RFC method. This suggests that users may appreciate the ability to drive the search with their tick choices.
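For readers wishing to replicate the Sect. 3.1 analysis outside SPSS, the per-task comparison reduces to a few lines. The sketch below uses SciPy's mannwhitneyu function and invented placeholder data; it illustrates the Bonferroni-corrected protocol described above, not the study's actual numbers.

from scipy.stats import mannwhitneyu

ALPHA = 0.05 / 4      # Bonferroni correction over the four design tasks

task_times = {        # invented task times (s): (RFC condition, Taster condition)
    "Gallery":   ([536, 512, 560], [504, 495, 510]),
    "BookCover": ([673, 640, 700], [508, 520, 495]),
}

for task, (rfc, taster) in task_times.items():
    u, p = mannwhitneyu(rfc, taster, alternative="two-sided")
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{task}: U = {u:.1f}, p = {p:.3f} ({verdict} at alpha = {ALPHA:.4f})")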
Let us define the probability of ticking, pt(n), for sheet number n as the number of images ticked on the n-th sheet across all sessions divided by 24 (the number of images shown on any sheet). Plotting this in figure 3b, we see that for both RFC and Non-RFC methods, pt(n) consistently falls as n increases from 1 to 10. Defining the probability of expanding a design similarly as pe(n), we see that this also decreases over sheets 1 to 10. This suggests that, in general, participants became more discerning about the images they expanded and ticked as they progressed through the first 10 sheets. After sheet 10, the pattern is less clear, perhaps due to the small number of sessions that lasted that long. Table 3 shows that ranking the methods by mean image score is equivalent to ranking them by pt and (almost) by pe. This suggests a ranking by popularity, i.e., if participants are ticking/expanding more designs during a session, they will give a higher score on average to the designs they choose. We also note that the two evolutionary techniques (CO and MT) perform the best in terms of the mean score that users ascribe to the designs they produce.

To further analyse the difference between the submethods, we investigated the fidelity of the image filters. For a given filter f, we let D(f) denote the average Euclidean RGB-distance between pairs of pixels in the original and filtered image at the same co-ordinate, normalised by division by the maximum RGB-distance possible. Note that we have experimented with measures based on the HSV colour model, but we saw little difference in the results. For a given design session, S, we let D(S) denote the average of D(f) over all the filters f shown to the user in the session. We let T(S) be the average of D(f) over all the filters ticked by the user in session S. We also denote by P(S) the average Euclidean RGB-distance between pairs of filters (f1, f2) in a session S, where f1 is a filter ticked by the user in sheet n and f2 is any of the 24 filters shown to the user in sheet n+1.

Rank   Method   Mean Score   pt     pe     Mean D(S)   Mean T(S)   Mean P(S)
1      CO       3.92         0.20   0.34   0.33        0.25        0.32
2      MT       3.73         0.17   0.33   0.35        0.27        0.33
3      CS       3.30         0.09   0.27   0.37        0.30        0.37
4      RFC      3.13         0.07   0.20   0.47        0.35        0.51
5      MFC      3.13         0.06   0.16   0.45        0.35        0.50
6      TS       3.00         0.04   0.19   0.47        0.37        0.51

Table 3. Taster submethods ranked by mean score; probability of being ticked (pt) or expanded (pe); mean distance from original D(S); mean distance of ticked filters from original T(S); mean distance from the ticked filters in the previous sheet P(S).

Table 3 shows D(S), T(S) and P(S) for the submethods used in the Taster sessions. We see that the mean score increases as P(S) decreases, hence participants seem to appreciate filters more if they are more similar to those ticked in the previous sheet. Also, the mean score decreases as D(S) increases, which suggests that participants may have preferred more conservative image filters, i.e., filters which change the original image less. This is emphasised by the fact that in all but 3 of the 116 design sessions, T(S) was less than D(S), i.e., participants ticked more conservative filters on average than those presented to them in 97% of the design sessions. The extent of this conservative nature differs by design task: Gallery: D(S) = 0.44, T(S) = 0.34; Magazine: D(S) = 0.44, T(S) = 0.28; Facebook: D(S) = 0.46, T(S) = 0.30; BookCover: D(S) = 0.45, T(S) = 0.35.
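The fidelity measure D(f) driving these comparisons is straightforward to compute. A minimal sketch, assuming 8-bit RGB images held as numpy arrays of shape (H, W, 3) (the image representation is our assumption; the paper does not specify one):

import numpy as np

MAX_RGB_DIST = np.sqrt(3 * 255.0 ** 2)    # largest possible per-pixel distance

def fidelity_distance(original, filtered):
    """Average per-pixel Euclidean RGB distance, normalised to [0, 1];
    0 means the filter left the image untouched."""
    diff = original.astype(float) - filtered.astype(float)
    per_pixel = np.sqrt((diff ** 2).sum(axis=-1))
    return per_pixel.mean() / MAX_RGB_DIST

D(S) and T(S) are then just means of this value over the filters shown and the filters ticked in a session, respectively.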
Participants were particularly conservative with the Magazine and Facebook tasks, as these require the filtering of faces, which was generally disliked (as expressed through some qualitative feedback we recorded).

4 Experiment 2

To explore the observation that scores seem to be correlated with the fidelity of the filters, we implemented further retrieval techniques which manage the overall fidelity of filters presented to users. In particular, we implemented another hybrid technique, Evolution (EVO), which returns 8 filters produced by CO, 8 filters produced by MT and 8 filters produced by RFC. This choice was motivated by the fact that CO and MT were appreciably the best submethods from experiment 1. We produced two variants of EVO to test against it. Firstly, the EVO-S method replaces the 8 filters produced by RFC with filters chosen from the library in such a way that the average D(f) value for the filters on each sheet remains static at 0.25. This choice was motivated by 0.27 and 0.25 being the means of the D(f) values over the ticked filters produced by the CO and MT submethods respectively. The EVO-D method is the second variant. In order to supply filters for sheet number n, this method calculates the average, A, of D(f) over the ticked filters on sheet n−1. Then, EVO-D chooses 8 filters from the library to replace those produced by the RFC submethod in EVO, in such a way that they each have a D(f) value as close to A as possible. The aim of the second experiment was to test the hypothesis that EVO-S, EVO-D or both would be an improvement on the plain EVO method.

Method   D(f)   Score   Ticks   Exps   Time    Sheets
EVO      0.28   3.45    18.9    30.6   347.1   6.3
EVO-S    0.25   3.32    22.2    32.3   392.0   6.7
EVO-D    0.27   3.15    20.1    27.8   362.2   6.4

Table 4. Statistics for exp. 2: mean RGB distance per design (fidelity); mean score per chosen design; mean ticks per design task; mean expansions per design task; mean time (s) per design task; mean sheets viewed per design task.

A similar experimental setup as before was employed, involving 24 participants asked to undertake 6 new design tasks, namely: more stationery; another gallery; another magazine cover, shown in figure 3; a poster; a calendar; and a menu (note that we used no faces in the designs, to avoid any biasing as in the first experiment). The EVO, EVO-S and EVO-D methods were balanced around the six design tasks evenly, so that each participant was given each method twice. The results are shown in table 4. A statistical analysis revealed that the hypotheses that EVO-S or EVO-D is better (in terms of efficiency and mean score) are not supported by the data. In fact, EVO has a higher mean score and lower mean time than both the other methods. We speculate that EVO-S and EVO-D's balancing of the average image fidelity results in many very high fidelity filters (i.e., very low RGB distance) being introduced to achieve the balance, and that in general these filters are less satisfying to the user than the random selection used by EVO.
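The EVO-D replacement step can be pictured as a nearest-target selection. The function below is our own illustrative reading, assuming each library filter has a precomputed D(f) value; the random fallback for sheets with no ticks is also an assumption.

import random

def evo_d_fill(library, ticked_d_values, shown, k=8):
    """Pick the k unseen library filters whose D(f) is closest to the mean
    D(f) of the filters ticked on the previous sheet (the value A above).
    `library` maps filter ids to precomputed D(f) values."""
    candidates = [f for f in library if f not in shown]
    if not ticked_d_values:        # nothing ticked on the previous sheet
        return random.sample(candidates, k)
    target = sum(ticked_d_values) / len(ticked_d_values)
    return sorted(candidates, key=lambda f: abs(library[f] - target))[:k]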
5 Conclusions and Further Work

To the best of our knowledge, there has been little study of user behaviour with browsing systems for creative tasks such as evolutionary art [7]. A notable exception is [5], where user interaction with the NEvAr evolutionary art tool is described. We introduce the phrase Objet Trouvé browsing to acknowledge the push and pull between software leading the user and the user leading the software in such systems. This raises the question of whether software could learn from the user or, more ambitiously, take a creative lead in design projects. Such behaviour might be appreciated by novice or amateur designers perhaps lacking inspiration. We are taking deliberately small steps towards such creative systems, with experiments involving amateur designers, to understand the nature of both the different methods and user behaviour with respect to those methods. In particular, we started with the straightforward hypothesis that using intelligent techniques to deliver new image filters based on those chosen by the user would be an improvement over supplying the filters randomly. (The truth of this is rather taken for granted in evolutionary art and image retrieval systems.) Working with image filters enables us to compare and contrast methods from different areas of computing: database, image retrieval and evolutionary methods in browsing for resources (filters) in design tasks. In experiment 1, we found that more intelligent methods will lead to greater satisfaction in the designs produced and may lead to the completion of the design task with less effort (i.e., having to consider fewer possibilities). We also observed some user behaviours, such as becoming more discerning as a session progresses and appreciating the progression afforded by the intelligent techniques. Furthermore, while there is some correlation between filter fidelity and user satisfaction, we were unable to harness this for improved browsing techniques, as shown in experiment 2. Simply giving users more of what they like, whether statically or dynamically, is not sophisticated enough, raising interesting questions about managing novelty.

We plan to study other browsing systems, e.g., [10], which employs emotional responses to web pages, and other evolutionary image filtering systems, e.g., that of Neufeld, Ross and Ralph (chapter 16 of [7]), which uses a fitness function based on a bell curve model of aesthetics. Moreover, to improve our experiments, we will study areas of computer supported design such as: the influences of reflection and emergence [8]; the use of analogy and mutation [4]; and how serendipity can be managed [1]. We will test different browsing mechanisms involving different image analysis techniques, such as edge information and moments, and measures based on novelty, such as those prescribed in [6]. Despite the failure of the EVO-D method in experiment 2, we believe that software which dynamically employs information about a user's behaviour to intelligently suggest new artefacts can improve upon less sophisticated methods. In particular, we intend to use the data from experiments 1 and 2 to see whether various machine learning techniques can make sensible predictions about user preferences during image filtering sessions. Our ultimate goal is to build and investigate software which acts as a creative collaborator, with its own aesthetic preferences and goals, able to work in partnership with both amateur and expert designers.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable input. This work is supported by EPSRC grants EP/F067127 and TS/G002835.

2010_29 !2010

Evolving Expression of Emotions through Color in Virtual Humans using Genetic Algorithms

Celso M. de Melo1 and Jonathan Gratch1
1 Institute for Creative Technologies, University of Southern California, 13274 Fiji Way, Marina Del Rey, CA 90292, USA
demelo@usc.edu, gratch@ict.usc.edu

Abstract.
For centuries artists have been exploring the formal elements of art (lines, space, mass, light, color, sound, etc.) to express emotions. This paper takes this insight to explore new forms of expression for virtual humans which go beyond the usual bodily, facial and vocal expression channels. In particular, the paper focuses on how to use color to influence the perception of emotions in virtual humans. First, a lighting model and filters are used to manipulate color. Next, an evolutionary model, based on genetic algorithms, is developed to learn novel associations between emotions and color. An experiment is then conducted where non-experts evolve mappings for joy and sadness, without being aware that genetic algorithms are used. In a second experiment, the mappings are analyzed with respect to their features and how general they are. Results indicate that the average fitness increases with each new generation, thus suggesting that people are succeeding in creating novel and useful mappings for the emotions. Moreover, the results show consistent differences between the evolved images of joy and the evolved images of sadness.

1 Motivation

Virtual humans are embodied agents which inhabit virtual worlds and act and look like humans [1]. Inspired by the human face-to-face conversation paradigm, virtual humans are capable of expressing themselves using verbal and non-verbal modalities in an integrated and synchronized fashion. To further increase the believability, naturalness and efficiency of communication, virtual humans have been endowed with models of emotions. In particular, research on expression of emotions has tended to focus on the modalities people use in daily interaction: gesture, face and voice. In contrast, this work explores a new form of expression which capitalizes on accumulated knowledge from the visual arts and goes beyond the usual bodily, facial and vocal forms of expression. In fact, artists have been exploring for centuries the idea that it is possible to perceive emotions in line, space, mass, light, color, texture, pattern, sound and motion [2]. In a simpler conception, art is seen as the expression of the artist's feelings [3, 4]. However, John Hospers [5] refined this view by noting that the work of art need not reflect the emotions of its creator but can be said to possess emotional properties in its own right. Thus, first, the creator manipulates the formal elements of art (line, space, mass, light, color, texture, pattern, sound and motion) to convey felt or imagined emotions. Then, the audience relies on analogies with the internal and external manifestations of emotions they experienced in the past to interpret the work of art. This work takes this insight and explores color to manipulate the perception of emotions in virtual humans.

Color has been widely manipulated by artists in the visual arts to convey emotion [2, 6]. Color is the result of the brain's interpretation of the perception of light in the human eye. Thus, the manipulation of light in the visual arts, called lighting, has always been a natural way of achieving specific effects with color [7, 8]. In this work, color is manipulated using a lighting model. Moreover, color can also be treated as an abstract property of a scene and manipulated explicitly, with no particular concern for the physics of light. This has been explored in abstract painting [2] and, more recently, in the visual media [9].
The work presented in this paper also explores this form of manipulation and uses filters to achieve such color effects. Filters post-process the pixels of a rendered image according to user-defined programs [10]. Having defined the expression modality, the following question ensues: how to find novel mappings of emotions into color which are useful both for the individual and for society (i.e., that generalize beyond the individual)? A first difficulty is that the perception of emotion in color is influenced by biological, individual and cultural factors [2, 6]. Secondly, looking at the literature on lighting, it is possible to find general principles on how to convey moods or atmosphere [7, 8, 11, 12], but these are not sufficient to differentiate between emotions and usually reflect not the character's mood but the narrative (such as the climax, for instance). The literature on filters is far scarcer and tends to focus on technical aspects or typical uses rather than on their affective properties [8, 9, 13]. Therefore, this work pursues an approach which is not dependent on the existing literature and tries, instead, to learn such mappings directly from people.

Moreover, the interest here is in learning intuitions about expression of emotion through color from non-experts. This is in contrast to previous approaches, which attempt to learn the affective properties of lighting from artists [15, 16] or from the existing literature [17, 18]. Effectively, being able to learn from non-experts is a necessity when new forms of expression are being explored. As noted above, this is especially the case with respect to finding expertise on the affective properties of filters. Furthermore, this will later facilitate extending the proposed system to other elements of art. Therefore, the system needs to be responsible for generating the alternatives, which a non-expert is unlikely to be proficient in doing, and the user should only be responsible for evaluating them (as to how well they convey the emotion).

An evolutionary approach, which relies on genetic algorithms, is used to learn mappings between emotions and color. The focus is on joy and sadness; whether the approach is applicable to other emotions is a topic of future work. Genetic algorithms [14] are appropriate for several reasons. The clear separation between generation and evaluation of alternatives is convenient. Alternatives can be generated using biologically inspired operators (mutation, crossover, etc.). Evaluation, in turn, relies on feedback from people. Finally, the expression space defined by lighting and filters is very large, and genetic algorithms deal well with intractable search spaces.

The rest of the paper is organized as follows: Section 2 describes the lighting and filters model used to manipulate color; Section 3 describes the evolutionary model used to learn the mappings of emotions into color; Section 4 describes two experiments which were conducted to define and understand the mappings of joy and sadness; finally, Section 5 discusses the results and draws conclusions.

2 The Expression Model

The lighting model defines local pixel-level illumination of the virtual human. Among the supported parameters, the following are used in this work: (a) type, which defines whether the light source is directional, point or spotlight; (b) direction, which defines the illumination angle; (c) ambient, diffuse and specular colors, which define the light color for each component.
Color can be defined in either RGB (red, green, blue) or HSB (hue, saturation, brightness) space; and (d) ambient, diffuse and specular intensities, which define a value that is multiplied with the respective component color. Setting the value to 0 disables the component. Filters are used to post-process the pixels of the illuminated rendered image of the virtual human. Several filters are available in the literature [19], and this work uses the following subset: the color filter, Fig. 1(b) and (c), sets the virtual human's color to convey a stylized look such as black & white, sepia or inverted colors; the HSB filter, Fig. 1(d) and (e), manipulates the virtual human's hue, saturation or brightness. Filters can also be concatenated to create compound effects. Further details about the expression model can be found elsewhere [20].

Fig. 1. Filters used to post-process the rendered image of the illuminated virtual human. No filter is applied in (a). The color filter is used to invert the colors in (b) and create the sepia look in (c). The HSB filter is used to reduce saturation in (d) and to increase the saturation and brightness in (e). Both virtual humans used in this work are shown.

3 The Evolutionary Model

Building on the expression model, the evolutionary model uses genetic algorithms to evolve, for a certain emotion, a population of hypotheses, which define specific configurations of lighting and filter parameters. Evolution is guided by feedback from the user as to how well each hypothesis conveys the intended emotion. The fitness function, in this case, is the subjective criterion of the user. At the core lies a standard implementation of the genetic algorithm [14]. The algorithm is characterized by the following parameters: (a) the stopping criterion to end the algorithm, i.e., the maximum number of iterations; (b) the size of the population, p, to be maintained; (c) the selection method, sm, used to select probabilistically among the hypotheses in a population when applying the genetic operations. Two methods are supported: roulette wheel, which selects a hypothesis according to the ratio of its fitness to the sum of all hypotheses' fitness, and tournament selection, which selects with probability p' the more fit of two hypotheses selected using roulette wheel; (d) the crossover rate, r, which defines the percentage of the population subjected to crossover; (e) the mutation rate, m, which defines the percentage of the population subjected to mutation; (f) the elitism rate, e, which defines the percentage of the population which propagates unchanged to the next generation. The rationale behind elitism is to avoid losing the best hypotheses from the previous population in the new population [14].

The algorithm begins by setting up the initial population with random hypotheses. Thereafter, the algorithm enters a loop, evolving populations, until the stopping criterion is met. In each iteration: first, (1−r)p hypotheses are selected for the next generation; second, r·p/2 pairs of hypotheses are selected for crossover and the offspring are added to the next generation; third, m percent of the population is randomly mutated; fourth, e percent of the hypotheses are carried over unchanged to the next generation. Evaluation is based on feedback from the user.
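The iteration just described can be sketched in a few lines. The Python below is our own scaffolding in the style of a textbook genetic algorithm, not the authors' code; only the parameter names r, m, e and p are taken from the text, and the crossover and mutate operators are assumed to be supplied.

import random

def next_generation(population, fitness, crossover, mutate, r, m, e):
    """One iteration of the evolutionary loop. `fitness` maps a hypothesis
    to the user's classification of it; `crossover` returns two offspring."""
    p = len(population)

    def roulette():
        """Select a hypothesis with probability proportional to its fitness."""
        total = sum(fitness(h) for h in population)
        pick, acc = random.uniform(0, total), 0.0
        for h in population:
            acc += fitness(h)
            if acc >= pick:
                return h
        return population[-1]

    next_pop = [roulette() for _ in range(round((1 - r) * p))]    # survivors
    for _ in range(round(r * p / 2)):     # r*p/2 crossovers, two offspring each
        next_pop.extend(crossover(roulette(), roulette()))
    for i, h in enumerate(next_pop):      # mutate a fraction m of the population
        if random.random() < m:
            next_pop[i] = mutate(h)
    elite = sorted(population, key=fitness, reverse=True)[:round(e * p)]
    return elite + next_pop[:p - len(elite)]   # elites pass through unchanged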
The hypothesis is structured according to the lighting and filter parameters. Lighting uses the common three-point configuration [7, 8], which defines a primary key light and a secondary fill light. The backlight is not used in this work. Both lights are modeled as directional lights and are characterized by the following parameters: (a) direction, a two-dimensional floating-point vector defining angles about the x and y axes with respect to the camera-character direction. The angles are kept in the range [−75.0°, 75.0°], as these correspond to good illumination angles [5]; (b) diffuse color, an RGB vector; (c) Kd, which defines the diffuse color intensity in the range [0.0, 5.0]; (d) Ks, which defines the specular color intensity in the range [0.0, 3.0]. The HSB and color filters are also applied to the virtual human. Thus, four more parameters are defined: (a) HSB.hue, HSB.saturation and HSB.brightness, which define the HSB filter's hue (in the range [0.0, 10.0]), saturation (in the range [0.0, 5.0]) and brightness (in the range [0.5, 3.0]); (b) color.style, which defines whether to apply the black & white, sepia or inverted colors style for the color filter. Both filters can be applied simultaneously. Further details on the evolutionary model can be found in another article [21].

4 Results

4.1 Learning the Mappings

In a first experiment, non-experts evolve mappings for joy and sadness. The experiment is designed so that subjects are unaware that genetic algorithms are being used. They are asked to classify five `sets' (i.e., populations) of `alternatives' (i.e., hypotheses) for the expression of each emotion. Classification of alternatives goes from 0.0 (`the image does not express the emotion at all', or low fitness) to 1.0 (`the image perfectly expresses the emotion', or high fitness). The sets are presented in succession, the first being generated randomly and the succeeding ones evolved by the genetic algorithm. The experiment is automated in software. The user can save the session and continue at any time. A random name is given to the session so as to preserve anonymity. The parameters for the genetic algorithm are: p = 30, sm = tournament selection, r = 0.70, m = 0.15 and e = 0.10.

Two virtual humans are used: a male and a female. The rationale for using multiple virtual humans is to minimize geometry effects in the analysis of the results (e.g., the illusion of a smile under certain lighting conditions even though no smile is generated). Participants are evenly distributed among the virtual humans. The virtual human assumes the anatomical position, and Perlin noise and blinking are applied. No gesture, facial or vocal expression is used throughout the whole experiment. Transition between hypotheses is instantaneous. The camera is fixed and frames the upper body of the virtual human. The study was conducted in person at the University of Southern California campus and related institutions. Thirty subjects were recruited. Average age was 26.7 years; 46.7% were male, mostly with higher education (93.3% college level or above) in diverse fields. All subjects were recruited in the United States, though of diverse origins (North America: 50.0%; Asia: 20%; Europe: 20%; South America: 6.7%). Average survey time was around 20 minutes. The evolution of the average population fitness for joy and sadness is shown in Fig. 2. Fourteen (out of a possible thirty) of the highest-fit hypotheses, one per subject, for joy and sadness are shown in Figures 3 and 4, respectively.

Fig. 2. Average fitness per set (with standard deviations) for joy and sadness.
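The hypotheses evolved in this experiment can be pictured as flat records of the parameters listed above. The sketch below shows one possible encoding and a random initialisation like that used for the first set; the parameter ranges come from the text, while the field names and the dataclass framing are our own.

import random
from dataclasses import dataclass

def rand(lo, hi):
    return random.uniform(lo, hi)

@dataclass
class Hypothesis:
    key_dir: tuple       # (x, y) angles in [-75, 75] degrees, key light
    key_rgb: tuple       # diffuse color
    key_kd: float        # diffuse intensity, [0.0, 5.0]
    key_ks: float        # specular intensity, [0.0, 3.0]
    fill_dir: tuple      # the same four parameters for the fill light
    fill_rgb: tuple
    fill_kd: float
    fill_ks: float
    hsb: tuple           # (hue [0,10], saturation [0,5], brightness [0.5,3])
    color_style: str     # "bw" | "sepia" | "invert" (hypothetical labels)

def random_light():
    return ((rand(-75, 75), rand(-75, 75)),
            (rand(0, 1), rand(0, 1), rand(0, 1)),
            rand(0, 5), rand(0, 3))

def random_hypothesis():
    key, fill = random_light(), random_light()
    return Hypothesis(*key, *fill,
                      hsb=(rand(0, 10), rand(0, 5), rand(0.5, 3)),
                      color_style=random.choice(["bw", "sepia", "invert"]))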
4.2 Understanding the Mappings

The goals of a second experiment are to understand: (a) what features differentiate the mappings evolved in the first experiment; (b) how general the mappings are. Regarding the first goal, features refer to characteristics of the image generated by the respective hypothesis. The idea, then, is to differentiate the best images for joy and sadness using these features. These images are the union of, for each emotion, for each subject in the first study, the one with the highest classification. Thus, in total, 60 images are used: the 30 best for joy, one per subject, and the 30 best for sadness, one per subject. Now, if the first experiment already provided a measure of value for the individuals, the second goal seeks to assess how valuable the mappings are beyond the individuals that generated them. The idea is to understand whether there are common patterns in the mappings evolved by each individual and how these mappings relate to the existing literature. The existing literature is used here as a standard which represents knowledge that has already been shown to be of value to the field.

Fig. 3. Fourteen of the highest-fit hypotheses for joy. Each hypothesis is from a different subject.

Fig. 4. Fourteen of the highest-fit hypotheses for sadness. Each hypothesis is from a different subject.

Three features were chosen from the literature that measure properties of the pixels in the images generated by the hypotheses: brightness, saturation and number of colors. The brightness of an image is defined, in the range [0.0, 1.0], as the average brightness of its pixels. The brightness of a pixel is the subjective perception of luminance in the pixel's color. The saturation of an image is defined, in the range [0.0, 1.0], as the average saturation of its pixels. The saturation of a pixel refers to the intensity of the pixel's color. Standard formulas are used to calculate brightness and saturation [22]. Finally, the number of colors of an image is defined to be the number of different colors among its pixels; however, the maximum number of colors was reduced by rounding the RGB components to one decimal place. Intuitively, this means the feature is only sensitive to relatively large differences in color.

Having calculated the feature values, the dependent t test was used to compare means between the joy and sadness hypotheses with respect to each feature. The results are shown in Table 1.

                   Brightness*   Saturation*   Number of Colors*
Mean Diff.         0.12          0.25          199.23
Std. Deviation     0.15          0.29          326.14
Std. Err. Mean     0.03          0.05          59.55
95% CI Lower       0.06          0.14          77.45
95% CI Upper       0.17          0.35          321.02
t                  4.26          4.70          3.35
Sig. (2-tailed)    0.00          0.00          0.00
* Significant difference, p < 0.05

Table 1. Dependent t test statistics (df = 29) for difference in means between the joy and sadness images with respect to brightness (BRIG), saturation (SAT) and number of colors (NCOL).

The results in Table 1 show that:
• The average brightness in joy images (M = 0.36, SE = 0.02) is higher than in sadness images (M = 0.24, SE = 0.02), t(29) = 4.26, p < 0.05, r = 0.62;
• The average saturation in joy images (M = 0.44, SE = 0.04) is higher than in sadness images (M = 0.19, SE = 0.04), t(29) = 4.70, p < 0.05, r = 0.66;
• The average number of colors in joy images (M = 302.20, SE = 374.46) is higher than in sadness images (M = 102.97, SE = 29.93), t(29) = 3.35, p < 0.05, r = 0.53.
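The three features are easy to compute. Since the paper only says that standard formulas are used [22], the exact formulas in the following sketch (Rec. 601 luma for brightness, HSV-style saturation) should be read as plausible assumptions rather than the authors' choices; images are assumed to be numpy arrays with RGB values in [0, 1].

import numpy as np

def brightness(img):
    """Mean perceived luminance over all pixels, in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return float((0.299 * r + 0.587 * g + 0.114 * b).mean())

def saturation(img):
    """Mean HSV-style saturation over all pixels, in [0, 1]."""
    mx, mn = img.max(axis=-1), img.min(axis=-1)
    s = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-9), 0.0)
    return float(s.mean())

def number_of_colors(img):
    """Distinct colors after rounding RGB components to one decimal place."""
    rounded = np.round(img, 1).reshape(-1, 3)
    return len({tuple(px) for px in rounded})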
Finally, to assess how general the mappings are, supervised learning techniques were used to learn models that differentiate the images of joy and sadness. In particular, decision trees [23] were used to classify the 60 images with respect to the three features. The J48 implementation of decision trees in Weka [24] was used with default parameters and 10-fold cross-validation. The resulting tree correctly classifies 47 (78.3%) of the images and is shown in Fig. 5. Further details on this and the previous experiment can be found in another paper [25].

NCOLORS <= 26: sadness (23.0/3.0)
NCOLORS > 26
|   BRIGHTNESS <= 0.302
|   |   SATURATION <= 0.413: sadness (7.0)
|   |   SATURATION > 0.413: joy (10.0/2.0)
|   BRIGHTNESS > 0.302: joy (20.0/1.0)

Fig. 5. Decision tree that distinguishes joy from sadness.

5 Discussion

This paper proposes to use accumulated knowledge from the arts to explore new forms of expression of emotions which go beyond the usual bodily, facial and vocal channels in virtual humans. In particular, the work focuses on how to convey emotion through one formal element of art: color. Color is manipulated using a sophisticated lighting model and filters. The paper further proposes an evolutionary approach, based on genetic algorithms, to learn novel and useful mappings of emotion into color. The model starts with a random set of hypotheses, i.e., configurations of lighting and filters, and then uses genetic algorithms to evolve new populations of hypotheses according to feedback provided by non-experts.

In a first experiment, subjects are asked to evolve mappings for joy and sadness using the evolutionary model. Subjects successively classify five sets of hypotheses, for each emotion, without being informed that a genetic algorithm is being used to generate the sets. The results show that the average set fitness for both emotions is monotonically increasing with each succeeding set (Fig. 2). This suggests that: (a) subjects are succeeding in finding a novel mapping for the expression of emotions through color; (b) the genetic algorithm is succeeding in providing more useful hypotheses with each successive generation. The fact that subjects are unaware that an evolutionary approach is being used allows us to exclude the possibility that they are classifying later hypotheses better just because that is what is expected of them in an evolutionary approach. Nevertheless, the results also show that the average fitness of the fifth and final set is well below the perfect score of 1.0. This might be explained by two factors: (a) too few sets were evolved; this would then have been an experimental constraint, imposed to limit survey time, and not a fundamental limit on the expressiveness of color; (b) no gesture, facial or vocal expression is used. Effectively, these channels have already been shown to play an important role in the expression of emotions in virtual humans [1], and this paper is not arguing otherwise.

A second experiment analyzes which features characterize the mappings for joy and sadness. Three features were drawn from the literature: brightness, saturation, and number of colors. The results show consistency between the mappings evolved by different subjects. In particular, the results show that images of joy tend to be brighter, more saturated and have more colors than images of sadness (Table 1 and Fig. 5). This suggests that the mappings also reflect values which are shared among the individuals and, therefore, that the mappings have the potential to generalize beyond the individuals that created them. Moreover, these results are in line with the lighting literature [7, 8, 11, 12].
This provides further support that the mappings reflect values which generalize beyond the individuals. Finally, the fact that it was possible to learn, using 10-fold cross-validation, a decision tree model which explains the data with a relatively high success rate also suggests that there is potential for generalizing beyond the particular examples that were used to learn the decision tree. In summary, if the first experiment suggested that the proposed evolutionary approach is capable of producing novel mappings that are useful at least for the individual, the second experiment suggests that those mappings are also useful for society.

Regarding future work, it would be interesting to explore whether the evolutionary approach generalizes to more emotions. From our experience and the feedback from subjects, we believe this might be so for some, but not all, emotions. Finally, color is but one of the many elements that have been widely explored in the arts. Other elements include: line, space, mass, texture, shape, pattern, sound, motion, etc. It should, therefore, be worth exploring whether the proposed approach also generalizes to these other formal elements of the visual arts [2].

Acknowledgments

This work was sponsored by the Fundação para a Ciência e a Tecnologia (FCT) grant #SFRH-BD-39590-2007. This work was also sponsored by the U.S. Army Research, Development, and Engineering Command and the National Science Foundation under grant #HS-0713603. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

2010_3 !2010

Realtime Generation of Harmonic Progressions Using Constrained Markov Selection

Arne Eigenfeldt1 and Philippe Pasquier2
1School for the Contemporary Arts, 2School of Interactive Arts and Technology, Simon Fraser University, Canada
{eigenfel, pasquier}@sfu.ca

Abstract. We present a method for generating harmonic progressions using case-based analysis of existing material that employs a Markov model. Using a unique method for specifying desired harmonic complexity, tension between chord transitions, and a desired bass-line, the user specifies a three-dimensional vector, which the realtime generative algorithm attempts to match during chord sequence generation. The proposed system thus offers a balance between user-requested material and coherence within the database.

1 Introduction

Generative systems have had a long history within computer music [1] and interactive realtime performance [2]. One standard model for such systems has been that of improvisation [3, 4], in which the software interacts with either a composer or performer. Such models have tended to restrict harmonic movement, by employing a static, modal harmony [5] or by ignoring harmony altogether in favour of a free-jazz approach [6]. These restrictions are necessitated because harmony cannot, by its very nature, be improvised collectively: it requires a clear goal (although this goal can be achieved through a variety of progressions). Several computer music systems have been developed that do allow the generation of harmony, although few are in use within realtime computer music, with the notable exception of Rowe [7]. Such systems have tended to be stylistically motivated, in that they attempt to reproduce specific progressions from within a defined stylistic period: for example, Baroque chorales [8].
As Pachet and Roy point out [9], harmonic language is style-specific; as such, any system that relies upon specific rules will restrict itself stylistically, and thus limit its potential expressiveness. Furthermore, the same authors note that a harmonic language's rules tend to outline specific combinations and linear progressions that should be avoided, rather than followed. Markov models offer a straightforward method of deriving correct harmonic sequences based upon a specific corpus, since they are essentially quoting portions of the corpus itself. Furthermore, since the models are unaware of any rules themselves, they can be quickly adapted to essentially "change styles" by switching source data. However, as Ames points out [10], while simple Markov models can reproduce the surface features of a corpus, they are poor at handling higher-level musical structures. This research offers a method for the composer to specify high-level control structures, influencing the generative algorithm that chooses from the generated Markov transition tables. Using a unique method of specifying desired harmonic complexity, tension between chord transitions, and a bass line, the user can specify a three-dimensional vector which the realtime generation algorithm attempts to match during sequence generation. As such, the proposed system offers a balance between user-requested material and coherence with the database.

2 Related Work

Harmonic analysis is a well-researched field, particularly in relation to recent advances in Music Information Retrieval (MIR). Harmonic generation, specifically realtime generation for compositional and/or improvisational systems, is less researched, or at least less documented. Composers of interactive music have written only marginally about their harmonic generation algorithms; those systems that are well documented [22, 23] tend to be non-creative systems that attempt to apply correct (stylistic) harmonic practices to a given melody. This may be a good exercise for music students or musicologists attempting to formulate stylistic rules, but it is less useful for creative composers.

2.1 Harmonic Analysis

Theoretical models of tonality have existed for decades, if not centuries, one of the most influential in recent years being Lerdahl's Tonal Pitch Space [11]. Anglade and Dixon used Inductive Logic Programming to extract harmonic rules from a large database of existing songs [12]. Ogihara and Li used n-grams of chord sequences to construct profiles of jazz composers [13]. There has been significant research in chord recognition and automatic labeling (for reviews, see [14] and [15]). Similarity of chord sequences has been researched by Liu et al. using string matching [16], while both Pachet [17] and Steedman [18] used rewriting rules. Mauch [19] analysed the frequencies of chord classes within jazz standards. de Haas et al. [20] used a method called Tonal Pitch Step Distance, which is based upon Lerdahl's Tonal Pitch Space, to measure chord sequence similarities.

2.2 Harmonic Generation

Methods of harmony generation have included n-gram statistical learning for learning musical grammars [21], as well as several using genetic algorithms [8, 22]. Chuan and Chew [23] created automatic style-specific accompaniment for given melodies, using musical style as the determining factor in the type of harmonization. Whorley et al. [24] used Markov models for the generation of 4-part harmonization of "hidden" melodies.
Similarly, Chan and Ventura [25] harmonize a given melody by allowing user input for parameters that govern the overall mood of the composition. Several systems have used probabilistic models for chord generation, including Paiement et al. [26], whose system was used as an analysis engine for jazz harmony to determine stochastic properties of such harmony. This system is extended in Paiement [27], which takes a machine learning perspective and attempts to predict and generate music within arbitrary contexts given a training corpus, with specific emphasis on long-term dependencies. Allan and Williams [28] used a data set of chorale harmonisations composed by Bach to train an HMM, then used a probabilistic framework to create a harmonization system which learned from examples.

2.3 Differences from Previous Research

Our work differs from previous research in that it is based neither in music information retrieval nor in cognitive science, but in creative practice. Our particular approach has been informed by a number of heuristic choices stemming from the first author's expertise in composition. As it is a creative system, our interest is not in modeling a specific musical style; thus, a rule-based system is not useful. Machine learning strategies offer great potential; however, their usefulness has thus far been limited to rather pedestrian activities of melody harmonization. Furthermore, they do not, at this time, offer the flexibility and speed required by realtime computer music. In fact, the realtime nature of our system is one of its distinguishing qualities, in that it can quickly change direction in performance based upon user control. Lastly, we offer a useful measure for harmonic complexity and voice-leading tension that can be used to define harmonic progressions outside of functional harmony. This research does not attempt to construct correct harmonic sequences within the context of functional harmony; it is a creative system based within the `post-tonal' harmony found in certain 20th-century musical styles.

3 Description

This system uses a case-based system [29] to generate Markov conditional probability distributions, using either first-, second-, or third-order chains. However, rather than allowing the generative algorithm to choose freely from the derived transitions, user-specified vectors suggesting bass-line movement, harmonic complexity, and voice-leading tension are overlaid in order to stochastically choose from the best-matching solutions. The system is written in MaxMSP.

3.1 Source Data

For the purposes of this research, the database consisted of chords derived from jazz standards by Miles Davis (4 tunes), Antonio Carlos Jobim (4 tunes), and Wayne Shorter (6 tunes), all taken from the Real Book [30]. Thirty-three compositions by Pat Metheny taken from the Pat Metheny Songbook [31], drawn equally from tunes written in the 1970s, 80s, and 90s, were also used. Source data are standard MIDI files, consisting only of harmonic data at chord change locations (see Section 4).

3.2 Representation

The terms set and chord are used interchangeably in this research. In strict terms, every chord is a set, but not every set is a chord. Chords usually refer to vertical collections of pitches that contain a root, 3rd, 5th, and possibly further extensions (i.e., sevenths, ninths) and their alterations (i.e., lowered ninths, raised elevenths, etc.); sets are any combination of unique pitch classes that need not contain specific relationships.
Similarly, set types are unique sets, or chords; for example, the set (0 4 7 11) is a major seventh chord. Chords are represented as pitch classes [32], although not in normal or prime form. In pitch-class theory, the minor triad (0 3 7) is the inversion of the major triad (0 4 7), and is thus considered identical in normal form (i.e., Forte 3-11); however, in tonal music, major and minor chords function very differently. For this reason, the decision was made not to use Forte's set theory representations; instead, the major triad is represented as (0 4 7), and the minor triad as (0 3 7). Extensions beyond the octave are folded within the octave; therefore, the dominant ninth chord is represented as (0 2 4 7 10). Transpositions of chords are not considered unique; instead, bass movement, in pitch classes, between chords is acknowledged. Thus, the chord progression Cmaj7 to Fmaj7 is considered a movement between identical chords, but with a bass movement of +5. Chords with alternate bass notes (Cm/F) or inversions (Cm/G) are considered unique; thus, Cm/F is represented as (0 2 7 10), and Cm/G is represented as (0 5 8).

Chords are represented within chord vectors as indices into an array of recognized pitch-class sets. Currently, this array contains 93 unique chords; for example, the minor seventh chord (0 3 7 10) is the first element in this array and is considered set type 1, while the major seventh (0 4 7 11) is the eleventh element, and is considered set type 11. When combined with the bass-note movement between chords, transitions can be defined as a two-element vector: for example, the pair (2 11) (-2 1) represents a major seventh built on D, followed by a minor seventh chord two semitones lower.

4 Analysis

The database requires an initial analysis, executed prior to performance, to be done on specially prepared MIDI files. These files contain only harmonic data at points of harmonic change, with user-defined markers (controller data) specifying phrase beginning and ending points. Individual files are written for each tune in the database, consisting of the sequential chords (see Section 4.1) and the relative duration of chord types (see Section 4.2). The generation of the Markov transition tables occurs at performance time, as these are dependent upon a user-selected corpus from the larger database (see Section 4.3).

4.1 Harmonic Data

Within the MIDI file, chords are written in root position, with a separate staff containing the bass line (see Fig. 1). This is done for human analysis of the original notation file, since the chord analysis algorithm can identify chords other than those in root position. No analysis is done on voice-leading, since voice-leading is a performance, rather than a compositional, decision within improvised music; as such, a voice-leading algorithm was created for performance, which controls registral spacing and individual pitch transitions.

Figure 1. Example notation for analysis.

The four chords found in Fig. 1 are represented in Table 1.

Chord name      MIDI notes       Stored values    Set Type
AbMaj7 b5 / G   55 56 60 62 67   7 8 12 14 19     31
Gbmaj7#5 / F    53 54 58 62 65   5 6 10 14 17     75
Em9b5           52 54 55 58 62   4 6 7 10 14      76
A7b9            57 58 61 64 67   9 10 13 16 19    25

Table 1. Different representations of the four chords from Fig. 1. Only the third column is stored in the individual data files.
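This encoding is easy to make concrete. In the sketch below, the 93-entry set-type array is reduced to two hypothetical entries, and the signed mod-12 bass movement is our own reading of the examples (note that the stored values in Table 1 keep register information, whereas the set-type lookup folds chords into pitch classes relative to the bass). The printed result reproduces the (-2 1) pair from the example in Sect. 3.2.

SET_TYPES = {(0, 3, 7, 10): 1,      # minor seventh
             (0, 4, 7, 11): 11}     # major seventh

def to_pitch_classes(midi_notes):
    """Fold a chord into pitch classes relative to its bass note."""
    bass = min(midi_notes)
    return tuple(sorted({(n - bass) % 12 for n in midi_notes}))

def transition(chord_a, chord_b):
    """Encode a chord change as (bass movement in semitones, set type)."""
    move = (min(chord_b) - min(chord_a)) % 12
    move = move - 12 if move > 6 else move      # prefer small signed steps
    return (move, SET_TYPES[to_pitch_classes(chord_b)])

# Example: Dmaj7 followed by Cm7, i.e., a minor seventh two semitones lower.
dmaj7 = [62, 66, 69, 73]
cm7 = [60, 63, 67, 70]
print(transition(dmaj7, cm7))       # (-2, 1)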
4.2 Duration Data

A mean duration for each chord is calculated in order to give harmonic duration context for generation. Thus, a separate file is created for each composition in the database that contains the mean harmonic rhythm of the composition, and each individual chord's relative ratio to this mean. For example, if the harmonic rhythm of the composition consisted entirely of half notes, the average duration would be 2.0, or two beats. Each chord type in the composition would then receive a ratio of 1.0. Ratios to an overall harmonic rhythm are used, instead of discrete timings, since it was felt that a chord's relative duration within a composition is more important than its duration in comparison to other compositions. For example, chords that function as (dissonant) passing chords tend to have shorter durations than stable chords anchoring a tonality, and will thus produce smaller ratios.

4.3 Probability Table Generation

The user can select individual compositions from the database as the specific corpus for chord generation. From this corpus, the initial-chord array is generated, consisting of the first chord of each phrase, as well as the final-chord-pair array: the last two chords in each phrase. First-, second-, and third-order transition probabilities are then calculated for every chord transition, and compiled separately. The tables store root movement and set type as a pair; thus, using only the four chords from Fig. 1, the first-order table is shown in Table 2.

Initial Set   Bass Movement + Set   Occurrences
(0 31)        (-2 75)               1
(0 75)        (-1 76)               1
(0 76)        (5 25)                1
(0 25)                              1

Table 2. First-order transition table for chords from Fig. 1.

The third-order transition table for the four chords from Figure 1 contains only one entry (root movements are relative to the first chord), illustrated in Table 3.

Index                    Bass Movement + Set   Occurrences
(0 31) (-2 75) (-3 76)   (2 25)                1

Table 3. Third-order transition table for chords from Fig. 1.

After analysing four Miles Davis compositions, the three transition tables are summarized statistically in Table 4.

                          First-order   Second-order   Third-order
# of chains               14            52             64
# of transitions          170           179            184
# of unique transitions   54            79             89

Table 4. Transition tables for four Miles Davis compositions.

Since there are only 14 unique set types in these four compositions, there are only 14 first-order chains; however, these chords appear in 64 different 4-chord combinations, thus there are 64 third-order chains. Variety in the generated progressions depends strongly upon the size of the database. The nature of the user-selected corpus will also influence the generation. Obviously, variety in generation depends on the number of potential transitions: if a corpus is heavily redundant, there will be limited variety in the output. On the other hand, selecting a corpus from two composers of very different styles will result in a small intersection of transitions, especially within the higher-order transitions. In such a case, the generated progressions will tend to consist of material from one composer or the other, with any transitions between the two occurring only when a chain from the intersection of the databases is stochastically selected.
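The analysis step can be pictured as a single pass over the phrases of the selected corpus. The function below builds the initial-chord array, the final-chord-pair array and a first-order count table; the data layout (chords given as (bass pitch class, set type) pairs) is our own assumption, not the system's MaxMSP implementation.

from collections import Counter, defaultdict

def build_tables(phrases):
    """phrases: list of phrases, each a list of (bass_pc, set_type) chords.
    Returns the initial-chord array, the final-chord-pair array, and
    first-order transition counts keyed by set type."""
    initial, final_pairs = [], []
    table = defaultdict(Counter)
    for phrase in phrases:
        initial.append(phrase[0])
        final_pairs.append(tuple(phrase[-2:]))
        for (b1, s1), (b2, s2) in zip(phrase, phrase[1:]):
            move = (b2 - b1) % 12
            move = move - 12 if move > 6 else move   # signed semitone step
            table[s1][(move, s2)] += 1               # continuations of s1
    return initial, final_pairs, table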
5 Chord Progression Generation

Once the transition tables have been generated for the specific corpus, harmonic progressions can be generated using a mixture of stochastic choice and user request. An initial chord is selected from the initial-chord array; given this initial context, the complete harmonic progression is then generated by selecting from the available continuations.

5.1 User-Defined Phrase Vectors

Selections are influenced by user-defined vectors for bass line, complexity, and tension, over a user-provided phrase length (see Fig. 2).

Figure 2. User-defined bass line vector.

Given a first chord, the available symbols in the transition table are compared to the user-defined vector. Available symbols are scored based upon the distance between their values and the user-defined vector. Distance vectors are created for bass line, complexity and tension (see Section 5.2). The three vectors are each scaled by a user-defined weighting for each feature (e.g., bass line 0.7, complexity 0.4, tension 0.1): this allows the user to balance the request against coherence with the corpus. The scaled vectors are then summed, and a roulette-wheel selection is made from the highest 5% of these scores. This selection method ensures that, given the same request, a variety of harmonic progressions can result, a desirable attribute in generative systems.

5.2 Harmonic Complexity and Transition Tension

Every set type has a pre-calculated harmonic complexity, which is a distance function between the pitches of the set and those of a major triad and octave (0 4 7 12). A vector is created of the smallest distance of each note of the set to each note of the triad. Within this vector, each instance of pitch class 1 (a semitone) is multiplied by 0.3, and each instance of pitch class 2 (a whole tone) is multiplied by 0.1. Since all possible pitches within the octave are within 2 pitch classes of one of the notes of the major triad and octave, sets that contain more notes will be considered more dissonant (since they contain more pitch-class differences of 1 and 2 between their pitches and the triad) than smaller sets. These scores are summed to create the set's harmonic complexity. See Tables 5 and 6 for example ratings of the most consonant and most dissonant set types.

Consonant sets   Chord name    Harmonic Complexity
0 7              no 3          0.0
0 4 7            major triad   0.0
0 7 10           7 no 3        0.1
0 2 7            sus2          0.1
0 4 7 9          add6          0.1

Table 5. Harmonic complexity ratings for the most consonant sets within the database.

Dissonant sets   Chord name          Harmonic Complexity
0 3 5 6 8 9      13b9 / third        1.3
0 1 3 5 8        maj9 / seventh      1.2
0 3 6 8 11       7#9 / third         1.2
0 1 3 5 7 8      maj9#11 / seventh   1.2
0 1 3 6 10       m7 / sixth          1.0

Table 6. Harmonic complexity ratings for the most dissonant sets within the database.

The tension rating, tr, compares adjacent sets, dividing c, the number of common tones between the two sets, by l, the length of the second set:

tr(s1, s2) = 1 − c(s1, s2) / l(s2).
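Both measures are simple enough to sketch directly. In the Python below, the weights 0.3 and 0.1 and the anchor set (0 4 7 12) come from the text, while the rest is our own scaffolding; the complexity function reproduces the Table 5 ratings for the sets shown there.

ANCHOR = (0, 4, 7, 12)                  # major triad plus octave

def harmonic_complexity(set_type):
    """Sum of weighted smallest distances from each set note to the anchor:
    a semitone contributes 0.3, a whole tone 0.1."""
    score = 0.0
    for pc in set_type:
        d = min(abs(pc - a) for a in ANCHOR)
        score += 0.3 if d == 1 else 0.1 if d == 2 else 0.0
    return round(score, 2)

def tension(s1, s2):
    """tr(s1, s2) = 1 - c(s1, s2) / l(s2), with c = common tones."""
    return 1.0 - len(set(s1) & set(s2)) / len(s2)

print(harmonic_complexity((0, 4, 7)))   # 0.0 for the major triad
print(harmonic_complexity((0, 2, 7)))   # 0.1 for sus2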
5.3 Generated Harmonic Progressions

Each phrase has a user-specified suggested length in number of chords. During sequence generation, once the generated length reaches 75% of this value, the algorithm begins testing whether the last two chords generated are in the final-chord-pair array. If the test returns true, the phrase generation algorithm exits. The user-defined vectors influence the selection from the Markov transition tables, but there is no guarantee that the actual generated progression will match the user vector, due to the available values within the tables and the roulette-wheel selection from those values. For example, Fig. 3 displays a user-defined bass line, and the resulting third-order generated bass line, using a four-song database containing 108 chains and a requested phrase length of five chords. A larger database will result in a closer approximation, due to the potentially greater available choices. Lastly, the request may not, and need not, be in the style of the corpus; the result will be a stochastically chosen correction of the request given the actual corpus.

Figure 3. A user defined vector, left, and generated bass line, right, given a 4-song database.

Harmonic rhythm (chord duration) during performance is a ratio to the performance tempo, since every chord in the database stores a mean duration ratio taken from each composition in which it appears. Thus, relative duration will be consistent, allowing realtime harmonic rhythm to be adjustable, yet independent of the pulse.

6. Conclusions

We have presented a realtime chord sequence generation system that employs a unique user influence over variable-order Markov transition tables. The algorithm described here can be used as a compositional assistant, or embedded within a realtime generative system [33]. In such cases, a large part of the musical success of the system resides in the voice-leading algorithm, which is not described here. This algorithm finds the closest distance between adjacent chord tones, taking into account different chord sizes and octave displacements.

7. Acknowledgements

This research was funded by a grant from the Canada Council for the Arts and the Natural Sciences and Engineering Research Council of Canada.

2010_30 !2010

The Evolution of Fun: Automatic Level Design through Challenge Modeling

Nathan Sorenson and Philippe Pasquier
School of Interactive Arts and Technology, Simon Fraser University Surrey, 250 - 13450 102 Avenue, Surrey, BC
{nds6, pasquier}@sfu.ca

Abstract. A generative system that creates levels for 2D platformer games is presented. The creation process is driven by generic models of challenge-based fun which are derived from existing theories of game design. These models are used as fitness functions in a genetic algorithm to produce new levels that maximize the amount of player fun, and the results are compared with existing levels from the classic video game Super Mario Bros. This technique is novel among systems for creating video game content as it does not follow a complex rule-based approach but instead generates output from simple and generic high-level descriptions of player enjoyment.

1 Introduction

Existing processes for creating video game levels are time consuming and expensive, and result in environments that are static and cannot be easily adjusted. Clearly, leveraging a creative system that automatically designs levels would enable independent game developers to generate content that would otherwise require the resources of larger companies. If effective, this process would also result in environments that are not static but are instead amenable to endless variety and modification, ideally creating more interesting and enjoyable experiences for game players. Furthermore, a dynamic and unsupervised method of level creation would allow a virtually limitless amount of content which could be produced off-line and distributed to clients as expansions to the base game, or generated on-the-fly, whereby levels could adapt to individual players as the game is being played. Although automated methods for creating content are occasionally seen in existing games [1-3], current approaches follow a bottom-up, rule-based approach. This method requires a designer to embed aesthetic goals into a collection of rules, resulting in systems just as difficult to construct as hand-designed levels [4].
Genetic algorithms instead allow developers to specify desirable level properties in a top-down manner, without relying on the specifics of the underlying implementation. However, any effective fitness function for automated level creation must correctly identify the levels that are "fun." To this end, a model of what precisely constitutes a fun level must be developed. In this paper, two such models are proposed: one based on Csikszentmihalyi's concept of "flow" [5] and the other on the notion of "rhythm groups," as described by Smith et al. [6]. These two models are evaluated by inspecting the levels they produce when employed as fitness functions in a genetic algorithm. These levels are then compared to existing levels that are generally considered to be well-designed. In this case, the automatically generated levels will be compared to four levels selected from the 2D platformer game Super Mario Bros.

Generative systems exhibit creativity when they are able to construct solutions that are not simply parameterized variations of existing solutions. Indeed, as Byrne notes, systems that are designed for a specific domain might overlook "solutions that might be applicable to the current problem but are obscured by context- or domain-specific details" [7]. With this in mind, the models of fun that we propose are defined in terms of general theories of player enjoyment instead of by specific details of 2D platformer level design. Although we choose to evaluate the system in the context of this particular genre, our intent is to produce a generative technique that will be applicable to a wide variety of games.

2 Previous Work

2.1 Evolutionary Algorithms in Creative Systems

Evolutionary algorithms are a popular choice for many creative systems [8-10]. The process of creating an initial population of potential artefacts and iteratively evaluating and breeding new populations corresponds closely to the engagement-reflection model of creativity [11]. Furthermore, many evolutionary approaches, such as co-evolution and genetic programming, are not restricted to a well-defined search space but rather re-define the search space as they operate, which is a desirable characteristic for creative systems to have [12].

2.2 Generative Systems in Video Games

Recent attempts have been made to generate novel games without the aid of a human designer. Cameron Browne's dissertation [13] considers evolving abstract combinatorial games similar to chess and Go. His genetic algorithm uses a fitness function consisting of a weighted sum of 57 separate design criteria gleaned from a wide array of sources, including psychology, subjective aesthetics, and personal correspondence with game designers. As the parsimony of the underlying model was clearly not a priority of his research, it is unclear to what degree this approach could be generalized to other contexts. Togelius and Schmidhuber also use genetic algorithms to generate novel game designs [14]. If a neural net is able to learn to play a particular game well, that design is assigned a high fitness value, since it is assumed to contain meaningful patterns that human players would enjoy mastering. In contrast to Browne's approach, this technique is not necessarily bound to a specific type of game. However, this work is currently at a very preliminary stage, and it remains to be seen whether games conducive to machine learning indeed correspond to those enjoyable for human players. Smith et al.
generate levels for 2D platformer games based on a notion of "rhythm groups" [15], which inspires one of the models presented here. Unlike our system, however, they describe a rule-based system that composes level components together through the use of a generative grammar. This method is crafted specifically for 2D platformer games, whereas our approach seeks greater generality.

2.3 Fun in Video Games

The notion of fun is, without a doubt, incredibly broad, and it is debatable whether the search for a comprehensive definition could ever prove fruitful. Limiting the discussion to the sort of fun characteristic of video games still leaves room for considerable ambiguity. In fact, Hunicke, LeBlanc, and Zubek [16] argue that the term 'fun' in game design must be discarded in favor of more precise terminology. They therefore present eight distinct types of pleasure that can arise from game playing: sensation, fantasy, narrative, challenge, fellowship, discovery, expression, and submission. This is not claimed to be a complete categorization, and indeed, many other distinct forms of fun have been identified [17-19]. Such taxonomies can provide useful terminology, but the limited structuring of their categories and the lack of any theoretical underpinning make it difficult to extract general design principles from them. These categorizations typically identify what types of fun can be observed, but they generally do not say why such things are fun, and therefore cannot offer principled advice as to how designs can evoke a particular type of fun.

2.4 Flow

Many theoretical treatments of fun in games underline the relevance of Csikszentmihalyi's concept of "flow" [5]. Flow refers to a particular state of intense focus, or "optimal experience," that can occur when certain factors are present during a task, such as a sense of being in control, a loss of awareness of self and time, and a close match between the task's difficulty and the individual's skill. This concept has been adapted by Sweetser and Wyeth into a game design framework called "GameFlow" [20]. By mapping the factors that encourage the state of flow to the elements of game design, flow can be used as a model for player enjoyment in video games. Both Koster [21] and Zimmerman and Salen [22] note the relevance of flow to certain game experiences, but neither goes so far as to define fun as equivalent to the state of flow. Certain aspects of the concept are, however, regarded as important for facilitating fun, particularly the requirement that game difficulty should properly correspond to a player's skill. In A Theory of Fun, Koster states that "fun is the act of mastering a problem mentally" [21]. It is the process of learning to overcome the challenges inherent in a game that makes the experience enjoyable, whether it be by developing complex strategies to defeat certain opponents in a strategy game or by acquiring the muscle memory necessary to execute a series of timed button presses in a fighting game. This learning process can be cut short if a game is too easy, as there will not be enough problems to be mastered, or if a game is too difficult, as a player will not be able to overcome the problems at all. Similarly, Zimmerman and Salen identify challenge and frustration as "essential to game pleasure" [22].

3 Contribution

Challenge-based pleasure is, therefore, both a category common to all the considered taxonomies and an important aspect of theoretical conceptualizations of video game pleasure.
The models explored in this paper will therefore focus on this particular notion, and can be considered attempts to make explicit the relationship between challenge and fun. As a simplifying assumption, the models will be developed and verified only within the specific context of the 2D platformer genre, exemplified by Donkey Kong [23] and the Super Mario Bros. [24] series. As defined by Wolf, these are "games in which the primary objective requires movement through a series of levels, by way of running, climbing, jumping, and other means of locomotion" [25]. This game genre is convenient for many reasons. First, it can be argued that the essential form of pleasure drawn from such games is from being challenged; the core game play mechanic of a platformer game is the dexterity-based challenge of navigating safely from platform to platform. Secondly, the challenges presented in such games assume a uniquely explicit form. Whereas challenge in many games emerges from the dynamic unfolding of an extensive rule set or from competition with an artificially intelligent non-player character [16], the challenge in platform games is embodied in the physical arrangement of the platforms or the positioning of enemies that possess only the most rudimentary artificial intelligence. This embodiment allows the nature of the challenge in the game to be visually inspected, simplifying the analysis. The level design of a platformer game is not merely an environment or container for the game play, but also serves as an essential element of the game play itself. In other words, to understand the nature of challenge in a platformer game, one need look no further than the physical layout of the levels.

3.1 Model Design

Following the hypothesis that challenge is fundamentally related to the pleasure of video games, challenge will therefore serve as the models' primary input variable. As is the case with the notion of fun, challenge is complex and multi-faceted. However, due to the straightforward nature of challenge in 2D platformer games, a reasonable formal conception can be offered. Challenge will be determined entirely by the local configuration of platforms. The challenge of a jump between platforms encountered at time step t, c(t), is typically considered to be proportional to the number of potential player trajectories that successfully traverse the gap [26]. The actual measure used is a rough approximation and is described in (1): the Manhattan distance d(p1, p2) between the platforms p1 and p2, minus the sum of the two "landing footprints," fp, of both platforms (shown in Figure 1), plus a constant:

c(t) = d(p1, p2) - (fp(p1) + fp(p2)) + 2 fp_max    (1)

The landing footprint is a measurement of the length of the platform, bounded by the maximum distance a player can jump, fp_max. This measure is important, as there is a much larger margin of error when jumping to a wide platform than to a narrow platform, resulting in a less challenging maneuver. The constant 2 fp_max is added simply to ensure that this difficulty measure never produces a negative value.

Fig. 1. The landing footprint, fp, is a measure of a jump's margin of error.
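A minimal sketch of the challenge measure (1) in Python. The dict-based platform representation and its field names are our assumptions; only the arithmetic comes from the paper.

```python
def landing_footprint(platform, fp_max):
    # fp: the platform's length, bounded by the maximum jumpable distance.
    return min(platform["width"], fp_max)

def challenge(p1, p2, fp_max):
    # c(t) = d(p1, p2) - (fp(p1) + fp(p2)) + 2 * fp_max   (Eq. 1)
    # Manhattan distance between the platforms; wide landing targets
    # reduce difficulty, and the 2 * fp_max term keeps the value
    # non-negative.
    d = abs(p1["x"] - p2["x"]) + abs(p1["y"] - p2["y"])
    fp1 = landing_footprint(p1, fp_max)
    fp2 = landing_footprint(p2, fp_max)
    return d - (fp1 + fp2) + 2 * fp_max

# Example: jumping from a wide platform to a narrow one.
p1 = {"x": 0, "y": 0, "width": 3}
p2 = {"x": 5, "y": 1, "width": 1}
print(challenge(p1, p2, fp_max=4))   # 6 - (3 + 1) + 8 = 10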
To guide the development of the models, four levels from the original Super Mario Bros. [24], shown in Figure 2, are taken to be exemplars of the 2D platformer genre. These levels will provide the concrete, empirical data to which the models must conform. Essentially, these levels will constitute an implicit operational definition of the type of fun that is of interest; in an effort to avoid the difficult task of devising an authoritative definition, and to remain true to the purpose of mathematical modeling, fun will simply be thought of as a variable that is maximized under the particular configuration of these levels. In other words, any model of challenge-based fun must be able to account for the specific patterns of challenge evident in the four selected levels. Ultimately, it is hoped that this sampling will suffice to indicate the validity of the models.

Fig. 2. The four selected Super Mario Bros. levels.

3.2 Anxiety Curves

The levels are analyzed with the described challenge metric to produce a characteristic anxiety curve. The anxiety curve is the value of the player's anxiety, represented by the variable a, over time. These values are attained by integrating the level's challenge value over the duration of the level, with a constant decay factor c_decay applied, as described in (2):

da/dt = c(t) - c_decay    (2)

The resulting curve will therefore exhibit an increased slope when there is a sequence of high-difficulty jumps in a short period of time, but will slowly drop during less challenging segments. The purpose of this function is to highlight the relative dynamics of the challenge over time, as opposed to drawing attention to the actual values of the challenge measurement itself.

The placement of enemies in the Mario levels is certainly important to the challenge dynamics. Though our difficulty metric does not explicitly refer to enemies, we are able to capture this information by internally representing each enemy as a small gap in the level. Essentially, the enemies are considered to be as difficult as a small jump between platforms. Since enemies must generally be avoided or defeated through jumping, this technique serves as a rough approximation for the challenge a player faces when encountering an enemy.

Fig. 3. Super Mario Bros. anxiety curves, charting anxiety, a (vertical axis), over time, t (horizontal axis).

As Figure 3 demonstrates, there are recognizable similarities between the anxiety curves of the four Super Mario Bros. levels. Although a crude challenge measurement is used and the conceptualization of anxiety is primitive, some structure can already be qualitatively identified: each level begins and ends with a phase of decreasing anxiety and has a recognizable period of high anxiety directly preceding the end. Three of the levels exhibit a lower anxiety slope during the beginning half of the level, followed by a higher slope. Finally, all four levels exhibit a sawtooth pattern.

3.3 Flow-based Model

The first model is a naïve interpretation of Csikszentmihalyi's often-cited concept of flow, specifically the portion most often applied to game design: the necessary match between game challenge and player skill. Intuitively, then, the amount of fun had by a player must decrease as the mismatch between skill and challenge increases. This relationship is expressed in (3):

df/dt = -(c_skill - c(t))^2    (3)

Fun is represented by the variable f, and like anxiety, it accumulates over time and is therefore considered differentiable. Skill (c_skill) is assumed to remain constant over the duration of a single level, and can be interpreted as the greatest level of challenge a player is able to effectively overcome. It is clear that to maximize the value of f, c_skill and c(t) must be kept as close as possible, which is precisely what the state of flow requires. Because we are concerned only with maximization of fun, the fact that this model produces negative values is of no consequence.
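In discrete time, the two differential relations (2) and (3) reduce to running sums. The sketch below makes that explicit; treating a level as a plain list of per-jump challenge values is our simplification.

```python
def anxiety_curve(challenges, c_decay):
    # Integrate da/dt = c(t) - c_decay (Eq. 2) one time step at a time.
    a, curve = 0.0, []
    for c in challenges:
        a += c - c_decay
        curve.append(a)
    return curve

def flow_fitness(challenges, c_skill):
    # Accumulate df/dt = -(c_skill - c(t))^2 (Eq. 3). Always
    # non-positive; maximized when challenge tracks skill exactly.
    return -sum((c_skill - c) ** 2 for c in challenges)

level = [2, 3, 8, 1, 1, 9, 2]           # hypothetical challenge sequence
print(anxiety_curve(level, c_decay=2))  # rises on hard jumps, falls after
print(flow_fitness(level, c_skill=4))   # 0 only if every jump equals skill
```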
3.4 Genetic Algorithm

To evaluate this model, a genetic algorithm is used to automatically create a level that maximizes the amount of fun (as defined by the model) experienced during play. Each individual level is encoded as a genotype consisting of a sequence of x, y coordinates. Each coordinate pair specifies the horizontal and vertical pixel offset of a 64 by 64 pixel platform relative to the preceding platform. The advantage of relative position encoding is that the range of these offsets can be limited to the maximum distance a game character can jump. As no playable level will consist of a platform that is unreachable from the previous platform, all valid levels can be represented in this manner (assuming, of course, that levels are linear with no branching paths). As well, all unplayable levels are eliminated from the evolutionary search space with this encoding, greatly improving the performance of the genetic algorithm. Mutation consists of perturbing a coordinate by a normally-distributed random value, and individuals are combined through variable-point crossover to allow for various sizes of genotypes. Since two levels of the same total length could consist of different numbers of platforms, therefore requiring different numbers of x, y pairs in the genotype, a variable-length genotype is necessary. The fitness function is a straightforward implementation of (3), which aims to maximize the amount of fun accumulated. As well, levels are generated to a specific total length and are heavily penalized for deviating from this externally imposed length. This restriction is necessary to prevent the evolutionary search from simply evolving longer and longer levels as a means of accumulating arbitrarily high amounts of fun. The population consists of 200 individuals that are evolved for 10,000 generations. Although efficiency is not presently a concern, running time was on the order of several hours on a mid-range dual-core PC.

A generated level and its corresponding anxiety curve are shown in Figure 4. This level appears to be a chaotic scattering of platforms, and the anxiety curve seems to be of a nearly-constant slope. While it does not correspond to the Super Mario Bros. levels, this shape is to be expected; in accordance with the concept of flow, challenge is kept as closely as possible to a constant value in levels with an anxiety curve of constant slope. Therefore, in its naïve application, flow likely does not serve as an effective model for challenge in video games. This result agrees with the findings of those who believe flow is an inappropriate guideline for game design, such as Juul [27] and Falstein [28].

Fig. 4. Level and anxiety curve for flow-based model.
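The mutation and crossover operators described above are simple to state in code. The sketch below is our reading of that description: the offset bounds, mutation width, and list-of-tuples genotype are illustrative assumptions, not values from the paper.

```python
import random

MAX_DX, MAX_DY = 150, 100   # assumed bounds: the maximum jumpable offset

def mutate(genome, sigma=10.0):
    # Perturb one (dx, dy) platform offset by a normally-distributed
    # value, clamped so every platform stays reachable from the previous.
    i = random.randrange(len(genome))
    dx, dy = genome[i]
    dx = max(-MAX_DX, min(MAX_DX, dx + random.gauss(0.0, sigma)))
    dy = max(-MAX_DY, min(MAX_DY, dy + random.gauss(0.0, sigma)))
    genome[i] = (dx, dy)

def crossover(parent_a, parent_b):
    # Variable-point crossover: cut each parent at an independent point,
    # so offspring length can differ from either parent (hence the
    # variable-length genotypes discussed above).
    i = random.randrange(1, len(parent_a))
    j = random.randrange(1, len(parent_b))
    return parent_a[:i] + parent_b[j:]
```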
3.5 Periodic Model

Clearly, variation in a level's challenge is desired. Indeed, this property is inherent in the notion of "rhythm groups," as described by Smith et al. [6] in their structural analysis of the architecture of platform games. A rhythm group is considered a fundamental unit in platform level design, and consists of a moment of high challenge followed by a moment of low challenge. Rhythm groups can explain the oscillatory anxiety curves seen in the Super Mario Bros. levels introduced in Section 3.1, and in the attempt to encourage such rhythmic variation, a new model is described in (4):

df/dt = m · da/dt    (4)

The behavior of this model is determined by the variable m, which can take the value +1 or -1. When positive, the player will accumulate fun at the same rate that the anxiety increases. This state represents the pleasure gained from being challenged. However, when the anxiety becomes too great, that is, when a > a_upper, where a_upper is some constant threshold, m will become negative. After this point, fun can only be accrued as anxiety falls. When the level of anxiety becomes low enough, that is, a < a_lower, m will again become positive and the player will be ready for new challenges. Whereas c_skill in the flow-based model represents a particular degree of challenge, a_upper and a_lower here specify levels of anxiety (essentially, challenge integrated over time, as described in (2)). This model is likewise used as a fitness function in an evolutionary run, and Figure 5 depicts the resulting level and its anxiety curve.

Fig. 5. Level and anxiety curve for periodic model.

This level qualitatively appears much more desirable than the previous one, with clearly identifiable rhythm groups. The anxiety curve likewise exhibits an oscillatory nature. However, the observed pattern does not possess a unique high-anxiety peak near the end of the level; rather, its cyclical appearance is much more regular than what is observed in the Super Mario Bros. levels.
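For completeness, a sketch of the periodic fitness (4) with the hysteresis on m, under the same list-of-challenges simplification as before; the threshold values in the example are placeholders.

```python
def periodic_fitness(challenges, c_decay, a_lower, a_upper):
    # df/dt = m * da/dt (Eq. 4), with m flipping sign as anxiety a
    # crosses the a_upper / a_lower thresholds (hysteresis).
    a, f, m = 0.0, 0.0, +1
    for c in challenges:
        da = c - c_decay
        a += da
        f += m * da
        if m > 0 and a > a_upper:
            m = -1   # too anxious: fun now accrues as anxiety falls
        elif m < 0 and a < a_lower:
            m = +1   # recovered: ready for new challenges
    return f

print(periodic_fitness([2, 3, 8, 1, 1, 9, 2], 2, a_lower=2, a_upper=8))
```

Unlike the flow-based fitness, this function rewards oscillation: levels score well when anxiety repeatedly builds past a_upper and then relaxes below a_lower, which is exactly the rhythm-group pattern sought above.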
4 Discussion and Future Work

The approach taken in the development and evaluation of this system prompts several observations. First, by employing theoretical accounts of fun in video games, we are able to identify abstract patterns in existing levels: rhythm groups, as described by Smith et al., can be identified in the anxiety curves, and a characteristic dramatic arc can be seen in all the levels. Secondly, we generate levels in a top-down manner by using high-level models as fitness functions in a genetic algorithm. This technique is a promising alternative to the time-consuming trial-and-error approach associated with the creation of rule-based systems. Finally, the analyses of existing levels and the corresponding models were conducted with regard to the dynamics of challenge over time, not in terms of specific details of 2D platformers. This generality allows for the possibility of extending this generative technique to other games and other genres. With these promising initial results, we intend to further explore the utility of this top-down approach for level design. A clear first step would be to extend the genetic algorithm to generate enemies and moving platforms. It would then be interesting to develop challenge metrics for other games and compare the resulting anxiety curves, looking for similarities and differences between game genres. As well, it might prove useful to augment the genetic algorithm with meta-evolutionary techniques, such as evolving the encoding of the levels or the fitness functions. These techniques could further reduce the influence of the human designer in the construction of the content, relegating even more creative responsibility to the underlying system. Although this system does not yet create output comparable to the work of a professional level designer, it can be considered an exploratory first step toward that goal. Much work still needs to be done regarding the formal analysis of existing games and the specification of exactly what variables are important when predicting player enjoyment. Even if a definitive formalism is unlikely, it is hoped that the very process of identifying simple and general models of fun will enable creative systems to enhance the practice of game design.

2010_31 !2010

Quantifying Humorous Lexical Incongruity

Chris Venour, Graeme Ritchie, Chris Mellish
University of Aberdeen

Abstract. Traditional discussions of humorous texts often postulate a notion of incongruity as being central, but there is no real formalisation of this notion. We are exploring one particular type of incongruity, in which a clash between the style or register of lexical items leads to humour. This paper describes our construction of a semantic space in which the distance between words reflects a difference in their style or tone. The model was constructed by computing profiles of words in terms of their frequencies within various corpora, using these features as a multidimensional space into which words can be plotted, and experimenting with various distance metrics to see which measure best approximates differences in tone.

1 Introduction

The study of humour using computational techniques is still at a very early stage, and has mainly consisted of two kinds of project: the computer generation of very small humorous texts [1,2] and the use of text classification to separate humorous texts from non-humorous texts [3]. Little of this work has so far explored what many theories of humour claim is an essential ingredient of humour: incongruity [4,5]1. On the other hand, non-computational humour research fails to construct clear and formal definitions of this concept. Our work seeks to bridge this gap, by creating and implementing a precise model of a simple kind of humorous incongruity. The particular type of textual humour that we are focusing on, sometimes called register-based humour [4], is where the broader stylistic properties of words (in terms of style, social connotation, etc.) within a text are in conflict with each other. We intend to model this phenomenon by finding a semantic distance metric between lexical items, so that the intuition of 'words clashing' can be made precise. The semantic space we envision will provide an objective and quantifiable way of measuring a certain kind of humorous incongruity - a concept which has proven hard to measure or even define. The space we have developed is designed to automatically identify a particular class of jokes, and we plan to use it to generate original jokes of this type.

1 Mihalcea and Strapparava [3] suggest that one of the features used by their classifier, antonymy, is a form of incongruity.

2 Incongruity theory

Incongruity theory is probably the most widely accepted humour doctrine today (and) was born in the seventeenth century when Blaise Pascal wrote 'Nothing produces laughter more than a surprising disproportion between that which one expects and that which one sees' [6]. The idea of incongruity has been variously defined in the literature - so much so that it is not even obvious that all the writers on this subject have exactly the same concept in mind [5] - but few commentaries offer more detail than the vague description left by Pascal.
Although some detailed work has been done describing some of the mechanisms of humorous incongruity - see the two-stage model [7] and the forced reinterpretation model described and extended by [5] - models such as these are still not specified enough to be implemented in a computer program. We hope to make some progress in this regard by creating a precise model of a certain kind of incongruity and implementing it to recognize a class of humorous text. The kind of humorous incongruity we formally model and then test in a computer program involves creating opposition along the dimensions of words.

3 Dimensions and Lexical Jokes

Near-synonyms, words that are close in meaning but not identical, reveal the kinds of subtle differences that can occur between words - nuances of style or semantics which make even words that share the same literal meaning slightly different from each other. For example the words 'bad' and 'wicked' are near-synonyms - both mean 'morally objectionable' - but differ in intensity. Similarly the words 'think' and 'cogitate' are almost synonymous but differ in terms of formality. These distinctions between near-synonyms - the ideas of 'intensity' and 'formality' in the examples above - are what we call dimensions. We believe that humorous incongruity can be created by forming opposition along these and other dimensions. To illustrate this idea, consider the following humorous text, taken from an episode of 'The Simpsons' (Sunday, Cruddy Sunday) in which Wally and Homer have been duped into buying fake Superbowl tickets:

Wally: Oh, how could I fall for fake tickets? Gee, the fellas are gonna be crestfallen.

Instead of saying 'disappointed', Wally uses an outdated, highly literary and formal word, 'crestfallen'. This choice of word smacks of a kind of effete intellectualism, especially in the highly macho context of professional sports, and the result is humorous. In choosing the word 'crestfallen', it is suggested that Wally mistakenly anticipates how 'the fellas' will react - with sadness rather than anger - but he has also chosen a word that:

- is noticeably more formal than the domain made salient by the scene (football)
- has an opposite score on some sort of 'formality' dimension to many of the other words in the passage ('gee', 'fellas', 'gonna')

This kind of incongruity, formed by creating opposition along one or more dimensions, is, we believe, the crux of a subclass of humour we call lexical jokes. Using the idea of dimensions, we aim to automatically distinguish lexical jokes from non-humorous text, and also to generate new lexical jokes. We believe that there is a significant subset of lexical jokes in which the relevant dimensions of opposition have something to do with formality, archaism, literariness, etc.; for brevity, we will allude to this cluster of features as tone.

4 Creating a semantic space

As we do not know how the relevant dimensions are defined, how these dimensions are related, and how they combine to create incongruity, it is not feasible to simply extract ratings for lexical items from existing dictionaries. Instead, we have used the distribution of words within suitable corpora as a way of defining the tone of a word. For example, in Figure 1 the grey cells represent the frequencies of words (rows) in various corpora (columns): the darker the cell, the higher the frequency. The words 'person', 'make' and 'call' display similar frequency count patterns and so might be considered similar in tone.
The pattern for 'personage', however, is quite different, indicating that its tone may be different. More precisely, our proposed model works as follows:

- select corpora which we judge to exhibit different styles or registers
- compute profiles of words in terms of their frequencies within the corpora (sketched in code below)
- use the corpora as dimensions, and the frequencies as values, to form a multidimensional space
- plot words from texts into the space
- try various outlier detection methods to see which one displays the outlier and clustering patterns we anticipate.

This model assumes that word choice is a significant determiner of tone. Syntax and metaphor, for example, may also play a very important role, but these are not considered here. We looked for corpora which we think display differing degrees of formality/archaism/literariness. Besides using our intuition in this regard, we also felt that the age of a work is a strong determiner of how formal, etc. it sounds to modern ears, so we chose works that were written or translated in various time periods. Thus the following corpora were chosen for the first set of experiments: Virgil's The Aeneid (108,677 words), Jane Austen's novels (745,926), the King James version of the bible (852,313), Shakespeare's plays (996,280), Grimm's fairy tales (281,451), Samuel Taylor Coleridge's poetry (101,034), two novels by Henry Fielding (148,337), a collection of common sense statements (2,215,652), Reuter's news articles (1,614,077), a year's worth of New Scientist articles (366,393), a collection of movie reviews (1,298,728) and the written section of the British National Corpus (BNC World Edition) (80 million)2.

2 Frequency counts of a word in the BNC were taken from the CUVPlus dictionary, available at the Oxford Text Archive.

Fig. 1. Using frequency count patterns as 'tonal fingerprints'. Cells in the table represent the frequencies of words (rows) in various corpora (columns).
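As an illustration of the profiling step, the sketch below computes a per-million-words 'tonal fingerprint' for a single word across a set of corpora held as raw text. The tokenisation and data layout are our assumptions; the system described here additionally normalises and weights these counts (Sect. 5.5).

```python
import re
from collections import Counter

def tonal_fingerprint(word, corpora):
    # corpora: dict mapping corpus name -> raw text of that corpus.
    # Returns the word's normalized (per-million-words) frequency in
    # each corpus; each corpus acts as one dimension of the space.
    fingerprint = {}
    for name, text in corpora.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        fingerprint[name] = counts[word] * 1_000_000 / max(len(tokens), 1)
    return fingerprint

corpora = {"austen": "it is a truth universally acknowledged ...",
           "reuters": "shares closed higher on tuesday ..."}
print(tonal_fingerprint("truth", corpora))
```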
5 Automatically identifying an incongruous word

5.1 The development data

Twenty lexical jokes were used to develop the model. All contained exactly one word (shown in bold in the examples below) which we judged to be incongruous with the tone of the other words in the text3.

1. Operator, I would like to make a personage to person call please (The Complete Cartoons of the New Yorker (CCNY), 1973, p.312).
2. Sticks and stones may break my bones but rhetoric will never hurt me (CCNY 1970, p.624).
3. You cannot expect to wield supreme executive power just because some watery tart threw a sword at you (Monty Python and the Holy Grail).
4. Listen serving the customer is merriment enough for me (The Simpsons, Twenty-Two Short Films About Springfield).

Most of the jokes (15/20) are captions taken from cartoons appearing in the New Yorker magazine. Joke #3, however, is taken from a scene in Monty Python and the Holy Grail, and three of the twenty jokes are from different episodes of The Simpsons television show. Thus all the texts - except possibly one whose exact provenance is difficult to determine - are snippets of dialogue that were accompanied by images in their original contexts. Although the visual components enhance the humour of the texts, we believe the texts are self-contained and remain humorous on their own.

3 A more formal test with volunteers other than the authors will be conducted in the future.

5.2 Computing scores

In the tests, stopwords were filtered from a lexical joke, and frequencies of words were computed in the various corpora (and normalized per million words) and treated as features or dimensions of a word. Words were thus regarded as vectors or points in a multi-dimensional space and the distances between them computed. We are interested in finding outliers in the space because, if position in the space is in fact an estimate of tone, the word furthest away from the others is likely to be the word whose tone is incongruous. Ranked lists of words based on their mutual distances (using the different distance metrics described below) were therefore computed. If the word appearing at the top of a list matched the incongruous word according to the gold standard, a score of 2 was awarded. If the incongruous word appeared second in the list, a score of 1 was awarded. Any other result received a score of 0. The baseline is the score that results if we were to randomly rank the words of a text. If a text has 9 content words, the expected score would be 2 * 1/9 (the probability of the incongruous word showing up in the first position of the list) plus 1 * 1/9 (the probability of it showing up second in the list), yielding a total expected score of 0.33 for this text. This computation was performed for each text and the sum of expected scores for the set of lexical jokes was computed to be 9.7 out of a maximum of 40.

5.3 Computing the most distant word in a text using various distance metrics

Different methods of computing distances between words were tried to determine which one was most successful in identifying the incongruous word in a text. Our first set of experiments, performed using the corpora listed above, employed three different distance metrics:

1. Euclidean distance: this distance metric, commonly used in Information Retrieval [8], computes the distance D between points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) in the following way:

D = sqrt( Σ_{i=1..n} (p_i - q_i)^2 )

A word's Euclidean distance from each of the other words in a lexical joke was calculated and the distances added together. This sum was computed for each word and in this way the ranked list was produced. The word at the top of the list had the greatest total distance from the other words and was therefore considered the one most likely to be incongruous.

2. Mahalanobis distance: this distance metric, considered by [8] as one of the two most commonly used distance measures in IR (the other one being Euclidean distance, according to these same authors), is defined as

D^2 = Σ_{r=1..p} Σ_{s=1..p} (x_r - μ_r) v^{rs} (x_s - μ_s)

where x = (x1, x2, ..., xp), μ is the population mean vector, V is the population covariance matrix and v^{rs} is the element in the rth row and sth column of the inverse of V. For each word in a text, the Mahalanobis distance between it and the other words in the text is computed and the ranked list is produced.

3. Cosine distance: another method of estimating the difference in tone between two words, regarded as vectors v and w in our vector space, is to compute the cosine of the angle θ between them:

cos(θ) = (v · w) / (‖v‖ ‖w‖)

Cosine distance is commonly used in vector space modelling and information retrieval [9] and was used here to produce a ranked list of words in the manner described in 1. above.
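The Euclidean and cosine rankings are easy to reproduce; a numpy sketch follows (the Mahalanobis ranking works the same way, using the inverse covariance matrix). The matrix layout - one row per word, one column per corpus - is the profile representation of Section 4; the function names are ours.

```python
import numpy as np

def euclidean_ranking(X):
    # Rank words (rows of X) by their summed Euclidean distance to all
    # other words; index 0 of the result is the most outlying word.
    diffs = X[:, None, :] - X[None, :, :]
    totals = np.sqrt((diffs ** 2).sum(axis=2)).sum(axis=1)
    return np.argsort(-totals)

def cosine_ranking(X):
    # Rank words by summed cosine distance (1 - cosine similarity).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    sims = (X @ X.T) / (norms @ norms.T)
    return np.argsort(-(1 - sims).sum(axis=1))

X = np.array([[90.0, 2.0],    # toy 2-corpus profiles: two 'formal' words
              [85.0, 5.0],
              [3.0, 70.0]])   # and one word with a very different tone
print(euclidean_ranking(X)[0], cosine_ranking(X)[0])   # word 2 both times
```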
5.4 Initial results

Table 1 shows the outcomes of testing on development examples using the set of corpora A (listed in Section 4) and various distance metrics. Predicting the incongruous word in a text using Euclidean distances received a low score of 2 out of a maximum of 40 and proved to be worse than the baseline score. Computing the most outlying word in a text with the Mahalanobis metric yielded a score of 11, which is only slightly better than random, while using cosine distances yielded the best result with a score of 24.

Table 1. Results from first set of testing.

Test no.   Pre-processing   Distance metric   Corpora   Score (out of 40)
1          none             Euclidean         A         2
2          none             Mahalanobis       A         11
3          none             cosine            A         24

5.5 Experimenting with pre-processing

We experimented with two kinds of pre-processing which are familiar in information retrieval:

1. tf-idf: in an effort to weight words according to their informativeness, tf-idf [10] changes a word's frequency by multiplying it by the log of the following ratio: (the total number of documents) / (how many documents the word appears in). This transformation gives a higher weight to words that are rare in a collection of documents, and so are probably more representative of the documents to which they belong. Our model computes frequency counts in corpora rather than documents, however, so the ratio we use to weight words is a variation of the one normally computed in information retrieval.

2. log entropy: when we compute the frequencies of words in the various corpora, the data is stored in a frequency count matrix X where the value of the cell in row i and column j is the normalized frequency count of word i in corpus j. Our second method of pre-processing, which has been found to be very helpful in information retrieval [11], involved computing the log entropy of the columns of matrix X. This amounts to giving more weight to columns (i.e. corpora) that are better at distinguishing rows (i.e. words). Turney [11] describes how to perform this pre-processing.

Tf-idf transformations (Table 2) generated generally worse results. Log entropy pre-processing improved all the results, however, with the best result emerging once again from use of the cosine metric: its score improved from 24 to 32.

Table 2. Results from performing pre-processing.

Test no.   Pre-processing   Distance metric   Corpora   Score (out of 40)
1          tf-idf           Euclidean         A         3
2          tf-idf           Mahalanobis       A         *4/36
3          tf-idf           cosine            A         14
4          log entropy      Euclidean         A         13
5          log entropy      Mahalanobis       A         23
6          log entropy      cosine            A         32

*Octave, the software we are using to compute the Mahalanobis distance, was, for reasons unknown, unable to compute 2 of the test cases. Thus the score is out of 36.
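A sketch of the log-entropy weighting as described above: corpora (columns) whose frequency distribution over words has low entropy are weighted up, and raw counts are log-damped. The normalisation details follow common practice rather than the paper itself, so treat the specifics as assumptions.

```python
import numpy as np

def log_entropy_weight(X):
    # X: words-by-corpora matrix of normalized frequency counts.
    n_words = X.shape[0]
    P = X / np.maximum(X.sum(axis=0, keepdims=True), 1e-12)
    logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
    entropy = -(P * logP).sum(axis=0)            # per-column entropy
    weight = 1.0 - entropy / np.log(n_words)     # low entropy -> high weight
    return np.log1p(X) * weight                  # damped counts, reweighted
```

Applying this transform to the frequency matrix before the cosine ranking of Sect. 5.3 corresponds to the 'log entropy + cosine' configuration that scored best above.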
5.6 Experimenting with different corpora

After achieving a good score predicting incongruous words using log entropy pre-processing and the cosine distance metric, we decided not to vary these methods and to experiment with the set of corpora used to compute frequency counts. In experiment #1, corpus set B was built simply by adding four more corpora to corpus set A: archaic and formal sounding works by the authors Bulfinch, Homer, Keats and Milton. This increased the corpora size by ~600K words but resulted in the score dropping from 32 to 31 out of a maximum of 40. In experiment #2, corpus set C was built by adding another four corpora to corpus set B: Sir Walter Scott's Ivanhoe, a collection of academic science essays written by university students, a corpus of informal blogs, and a corpus of documents about physics. As we see from Table 3, adding this data (~1.5 million words) improved the score from 31 to 35. In corpus set C, archaic and formal sounding literature seemed to be over-represented, and so in experiment #3 a new corpus set D was created by combining Virgil's Aeneid with works by Homer into a single corpus, as they are very similar in tone. Shakespeare's and Coleridge's works were also merged for the same reason, as were the works by Bulfinch and Scott. In this way, fewer columns of the 'tonal fingerprint' consisted of corpora which are similar in tone. Also, works by Jane Austen and by John Keats were removed because they seemed to be relatively less extreme exemplars of formality than the others. These changes to the set of corpora resulted in a score of 37 out of a maximum of 40.

Table 3. Results from using different sets of corpora.

Corpora set   B    C    D
Score         31   35   37

The decisions made in constructing corpus set D, indeed most of the decisions about which corpora to use as foils for estimating tone, are admittedly subjective and intuitive. This seems unavoidable, however, as we are trying to quantify obscure concepts in such an indirect manner. To the degree that our assumption holds - that frequency counts in various corpora can be an estimate of a word's tone - the kind of experimentation and guesswork involved in constructing our semantic space seems valid. Thus, using corpus set D, log entropy pre-processing and cosine distance as our distance metric produced excellent results: 37 out of a possible 40 on the development set, according to our scoring, in identifying the incongruous word in the set of lexical jokes. We found that we were even able to raise that score from 37 to 39/40 (97.5%) by not eliminating stopwords from a lexical joke, i.e. by plotting them, along with content words, into the space. Incongruous words in lexical jokes tend not to be commonplace, and so including more examples of words with 'ordinary' or indistinct tone renders incongruous words more visible, which probably accounts for the small rise in the score.

6 Automatically distinguishing between lexical jokes and regular text

The next step is to determine whether the space can be used to detect lexical jokes within a collection of texts. One way of automating this classification would be to find the most outlying word and to look at how far away it is from the other words in the text. If the distance were to be above a threshold, the program would predict that the text is a lexical joke (as sketched below). This approach was tested on a set of texts consisting of the development set of lexical jokes together with a sample of 'regular' (i.e. non lexical joke) texts: newspaper texts randomly4 selected from the June 5 2009 issue of the Globe and Mail, a Canadian national newspaper. Complete sentences from the newspaper were initially much longer than the lexical joke sentences - the average number of words in the lexical jokes set is 16.1 - so newspaper sentences were truncated after the 17th word.

4 Newspaper sentences containing proper names were rejected in the selection process because names appear haphazardly, making estimation of their tone difficult.

For each text, the most outlying word was determined using the cosine method described above (with log entropy pre-processing) and the average cosine (l) it forms with the other words in the text was computed. Precision is highest when the threshold cosine value is arbitrarily set at 0.425, i.e. when we say that l needs to be less than or equal to 0.425 in order for the text to be considered a lexical joke.
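A sketch of that threshold test, assuming word profiles have already been log-entropy weighted: the most outlying word is the one with the lowest mean cosine to the rest, and the text is flagged as a lexical joke when that mean falls at or below the threshold. The 0.425 default is the value arrived at above.

```python
import numpy as np

def is_lexical_joke(X, threshold=0.425):
    # X: one row per word of the text (stopwords included), one column
    # per corpus, already pre-processed with log-entropy weighting.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    sims = (X @ X.T) / (norms @ norms.T)
    np.fill_diagonal(sims, np.nan)
    mean_cos = np.nanmean(sims, axis=1)   # l: mean cosine per word
    return np.nanmin(mean_cos) <= threshold
```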
From Table 4 we see that 77.8% precision (in detecting jokes from within the set of all the texts processed) and 70% recall result using this threshold. (When pathological cases5 are excluded from the evaluation, the program achieves 10/13 (76.9%) precision and 10/16 (62.5%) recall using this threshold.)

Table 4. Precision and recall when computing averages.

Threshold value   Precision        Recall         F score
<= 0.5            19/26 (73.1%)    19/20 (95%)    82.6
<= 0.425          14/18 (77.8%)    14/20 (70%)    73.7

The semantic space was developed to maximise its score when identifying the incongruous word in a lexical joke, but it has limited success in estimating how incongruous a word is. We believe that differences in tone in lexical jokes are much larger than those in regular text, but the semantic space achieves, at best, only 77.8% precision in reflecting the size of these discrepancies. One reason for this might be that the set of corpora is simply not large enough. When the threshold is set at 0.425, the three newspaper texts (not containing a pathological word) mistakenly classified as lexical jokes are:

- the tide of job losses washing across north america is showing signs of ebbing, feeding hope that...
- yet investors and economists are looking past the grim tallies and focusing on subtle details that suggest...
- both runs were completely sold out and he was so mobbed at the stage door that he...

The most outlying words in these texts (shown in bold) appear only rarely in the set of corpora: the word 'ebbing' appeared in only three corpora, 'tallies' in two and 'mobbed' in only one corpus. None of the other words in the newspaper texts appear in so few corpora, and perhaps these words are considered significantly incongruous not because they are truly esoteric (and clash with more prosaic counterparts) but because the corpus data is simply too sparse.

5 Pathological texts contain words which do not appear in any of the corpora. These words were 'moola', 'tuckered', 'flummery', 'eutrophication' and 'contorts'.

The problem may be more deeply rooted, however. New sentences which no one has ever seen before are constructed every day because writing is creative: when it is interesting and not clichéd it often brings together disparate concepts and words which may never have appeared together before. Perhaps the model is able to identify relatively incongruous words with precision but is less able to gauge how incongruous they are, because distinguishing between innovative word choice and incongruous word choice is currently beyond its reach.

7 Future work

Results look promising, but future work will need to determine how the method performs on unseen lexical joke data. In early experiments, Principal Components Analysis (PCA) was performed on the frequency count data in an attempt to reduce the feature space into a space with fewer (and orthogonal) dimensions, but initial results were disappointing. One reason for this might be that the corpora are too sparse to allow for much redundancy in the features, but further investigations into using PCA and other techniques for reducing the dimensionality of vector spaces (such as Latent Semantic Analysis) will be performed. Finally, experiments into using the vector space to generate original lexical jokes will be conducted.

2010_32 !2010

Defining Creativity: Finding Keywords for Creativity Using Corpus Linguistics Techniques

Anna Jordanous
Creative Systems Lab / Music Informatics Research Centre, School of Informatics, University of Sussex, UK
a.k.jordanous at sussex.ac.uk

Abstract.
A computational system that evaluates creativity needs guidance on what creativity actually is. It is by no means straightforward to provide a computer with a formal definition of creativity; no such definition yet exists, and viewpoints in the creativity literature vary as to what the key components of creativity are considered to be. This work combines several viewpoints into a more general consensus on how we define creativity, using a corpus linguistics approach. 30 academic papers from various academic disciplines were analysed to extract the most frequently used words and their frequencies in the papers. This data was statistically compared with general word usage in written English. The results form a list of words that are significantly more likely to appear when talking about creativity in academic texts. Such words can be considered keywords for creativity, guiding us in uncovering key sub-components of creativity which can be used for computational assessment of creativity.

1 Introduction

How can a computational system perform autonomous evaluation of creativity? A seemingly simple way is to give the system a definition of creativity which it can use to test whether creativity is present, and to what extent [1, 9, 11]. There have been many attempts to capture the nature of creativity in words [Appendix A lists 30 such papers], but there is currently no accepted consensus, and many viewpoints exist which may prioritise different aspects of creativity (this is discussed further in Section 2.1). Identifying what contributes to our intuitive understanding of creativity can guide us towards a more formal definition of the general concept of creativity. If a word is used significantly more often than expected to discuss creativity, then I suggest it is associated with the meaning of creativity. Many such words may be more tightly defined than creativity itself; we can encode these definitions in computational tests and combine these tests to approximate a measurement of creativity. The intention of this approach is to make the goal of automated creativity assessment more manageable by reducing creativity to a set of more tractable sub-components, each of which is considered a key contributory factor towards creativity, recognised across a combination of different viewpoints.

2 Finding Keywords For Creativity

The aim of this work is to find words which are significantly more likely to be used in discussions of creativity across several disciplines. These words can be treated as keywords that highlight key components of creativity. What discussions of creativity should be examined? Written text is simpler to analyse than speech, and there are many sources to choose from. The texts should be of a reasonable length, otherwise they provide only an overview rather than investigating more subtle points which may be significant. This study concentrates on the academic literature discussing creativity, in order to reduce variability in formats, facilitate discovery of key documents for inclusion and allow a measure of the influence of the document (the number of citations). To find words used specifically in creativity literature, the language used in several papers was analysed to extract the frequencies with which individual words were used. These extracted word frequencies were statistically compared with data on how the English language is used in general written form.
2.1 Creativity Corpus: A Selection of Papers on Creativity

The academic literature on the nature of creativity ranges over at least the past 60 years, arguably starting from Guilford's seminal 1950 presentation on what creativity is and how to detect it. Many repeated themes have emerged in the literature as important components of creativity. As an example, the word clouds1 in Figs. 1 and 2 show that the word new is frequently used in definitions of creativity and also in discussions of what creativity is.

1 Generated using software at http://www.wordle.net

Fig. 1. Most frequent words in 23 creativity definitions (excluding common-use words).

Wide variance can be found, though, in what are considered primary contributory factors of creativity. For example, psychometric tests for creativity (such as [12]) focus on problem solving and divergent thinking, rewarding the ability to move away from standard solutions to a problem. In contrast, much recent writing in computational creativity (such as [9, 11]) places emphasis on novelty and value as key attributes.

Fig. 2. Most frequently used words in 30 academic papers on creativity (excluding common English words).

Fig. 3. With creativity and creative removed (as they dominate the image).

Whilst there is some crossover, the differing emphases give a subtly different interpretation of creativity across academic fields. This study considers 30 papers on the nature of creativity, written from a number of different perspectives. This set of papers is referred to in this paper as the creativity corpus2 and is detailed in Appendix A. The 30 papers were selected using criteria such as the paper's influence over future work (particularly measured by number of citations), the year of publication, academic discipline and author(s). To match the diversity of opinions in creativity literature as closely as possible, the set of papers gives viewpoints from many different authors, from psychology to computer science backgrounds, and across time, from 1950 to the current year (2009). Figure 4 shows the distribution of papers by subject, according to journal classification in the academic database Scopus3.

2 A corpus is the set of all related data being analysed (plural: corpora).
3 Scopus classifies some journals under more than one subject area.

Fig. 4. Distribution of subject area of papers over time.

The methodology for this study placed some limitations on what papers could be used. Papers had to be written in English4 and had to be available in a format that plain text could be extracted from (this excluded books or book chapters).

4 All non-British word spellings were amended to British spellings before analysis.

2.2 Data Preparation

For each paper a plain text file was generated, containing the full text of that paper. All journal headers and copyright notices were removed from each paper, as were the author names and affiliations, list of references and acknowledgements. All files were also checked for any non-ascii characters and anomalies that may have arisen during the creation of the text file.

2.3 Extraction of Word Frequencies from Data

R is a statistical programming environment5 that is useful for corpus linguistics analysis. Using R, a word frequency table was constructed from the 30 text files containing the creativity corpus. For each word6 in the text files, the frequency table listed: how many papers that word is used in, and the number of times the word is used in the whole creativity corpus (all papers combined).

5 http://www.r-project.org/
6 A word is defined as a string of letters delimited by spaces or punctuation. A compound term such as "problem-solving" was divided into "problem" and "solving".
2.4 Post Processing of Results

To reduce the size of the frequency table and focus on more important words, all hapaxes were removed (words which only appear once in the whole creativity corpus). Any strings of numbers returned as words in the frequency table were also removed. To filter out words that were not used by many authors, any words which appear in fewer than 5 out of 30 papers were also discarded.

2.5 Analysis of Results

It is not enough to consider purely the word frequencies on their own: a distinction is often made in linguistics [3, 6, 10] between very commonly used words (form or closed class words) and lower frequency words (content or open class words): when used more often than usual in a text, the open class words usually hold the most interesting or specific content [3]. So for this study the most common words overall are not necessarily the most useful; as the results in Table 1 show, the most frequent words overall are usually those expected to be prolific in any written texts. Removing stopwords (very commonly used English words such as "the" or "and") is not sufficient for the purposes of this work: this study focusses on those words which are specifically used more often than expected when discussing creativity, as opposed to other texts. A method for quantifying this usage is discussed in the remainder of this section.

Data on General Language Use: British National Corpus (BNC). The BNC is a collection of texts and transcriptions of speech, from a variety of sources of British English usage. The corpus comprises approximately 100 million words, of which around 89 million words are from written sources and the remainder from transcriptions of speech. This study only uses data on the written sources, excluding all transcriptions of speech, as the creativity corpus is also solely from written sources. The data used in this study was taken from [7]: relative word frequency data from a sample subset of the written part of the BNC. Before using this data, frequencies were extrapolated to estimate absolute values.

Statistical Testing of Word Frequencies. It was expected that there is a relationship between how many times a word is used in the creativity corpus and how many times it is used in general writing: to use statistical terminology, that the two corpora are correlated. As the data in both corpora is ratio-scored (i.e. the data is measured on a quantifiable scale), a Pearson correlation test can be performed on the word frequency counts for each corpus, to test the hypothesis that there is significant positive correlation. If there is significant evidence of correlation, then the words which do not follow the general trend of correlation are of most interest: specifically the words that are used more frequently in the creativity corpus than would be expected given the frequency with which they appear in the BNC. A common way to measure this is to use the log likelihood ratio statistic G2 [3, 6, 8, 10]7:

G2 = 2 Σ_ij o_ij (ln o_ij - ln e_ij)    (1)

where o_ij is the actual observed number of occurrences of a word i in corpus j, and e_ij is the expected number of occurrences of a word i in corpus j (see Eqn. 2):

e_ij = (o_ij + o_ik) × total(j) / (total(j) + total(k))    (2)

where total(j) is the total number of words in corpus j. The G2 value is a measure of how well data in one corpus fits a model distribution based on both corpora. The higher the G2 value, the more that word usage deviates from what is expected given this model. G2 measures the extent to which a word deviates from the model but does not indicate in which corpus it appears more frequently than expected. Therefore a subset of the results was discarded: only those words which appear more frequently than expected in the creativity corpus were retained.

7 An alternative to G2 is the chi-squared test (χ2): see [3, 5, 6, 8, 10] for discussion of why G2 is the more appropriate option for very large corpora.
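Equations (1) and (2) are straightforward to compute per word; the sketch below shows the two-corpus case. The counts in the example are invented for illustration.

```python
import math

def g2(o1, o2, total1, total2):
    # Log-likelihood ratio G2 (Eq. 1) for one word, given its observed
    # counts o1, o2 in two corpora of sizes total1 and total2.
    g = 0.0
    for o, total in ((o1, total1), (o2, total2)):
        e = (o1 + o2) * total / (total1 + total2)   # expected count (Eq. 2)
        if o > 0:
            g += o * (math.log(o) - math.log(e))
    return 2 * g

# A word seen 50 times in a 1M-word creativity corpus but only 5 times
# in an equally sized BNC sample gives G2 of about 42.7, well above the
# p<0.01 critical value of 6.63 quoted in the discussion below.
print(g2(50, 5, 1_000_000, 1_000_000))
```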
The higher the G² value, the more that word usage deviates from what is expected given this model. G² measures the extent to which a word deviates from the model, but does not indicate in which corpus the word appears more frequently than expected. Therefore a subset of the results was discarded: only those words which appear more frequently than expected in the creativity corpus were retained.

7 An alternative to G² is the chi-squared test (χ²): see [3, 5, 6, 8, 10] for discussion of why G² is the more appropriate option for very large corpora.

3 Results

3.1 Raw Frequency Counts

As can be seen in Table 1, and as discussed in Section 2.5, most words which appeared very frequently were common English words, not useful for this study.

Table 1. Most frequently used words in the creativity corpus.

    Word   Count in corpus   Word       Count in corpus   Word        Count in corpus
    of     8052              is         2412              as          1448
    and    4988              that       2372              creativity  1433
    to     4420              creative   1994              are         1294
    in     3939              for        1716              this        1174
    a      3647              be         1561              with        1116

Figure 2 shows the results with "common English words" removed (according to http://www.wordle.com); however, as discussed in Section 2.5, this study's focus is on how words are used in the creativity corpus compared to normal usage, so removing only wordle.com's stopwords is not sufficient for our purposes.

3.2 Using the BNC data

As expected, the creativity corpus and BNC word frequencies are significantly positively correlated, at a 99% level of confidence (p < 0.01). Pearson correlation testing returned a value of +0.716. The results of this study returned 781 words which are significantly more likely to appear in creativity literature than in general written English (at a 99% level of confidence). Table 2 shows the 100 words with the highest G² score.

4 Discussion of Findings

This work has generated a list of words which are significantly associated with academic discussions of what creativity is. The list is ordered by how likely these words are to appear in creativity literature, so the higher they are on the list, the more significantly they are associated with such discussions. While words such as divergent and originality have appeared high on the list, as expected, some interesting results have emerged which are more surprising at first glance; for example, openness is 6th and empirical is 21st. One notable observation is that process, in 9th position with a G² value of 1986.72, is a good deal higher than product, in 409th place with a G² value of 75.38.8 Although on closer inspection the word process has been used in more different contexts than product, there are still surprisingly many discussions about the processes involved in creativity. This result provides intriguing evidence for the product vs. process debate in creativity assessment [1, 9, 11].

8 Both G² values are still well above 6.63, the critical value for significance at p < 0.01.

Table 2. Top 100 words in creativity corpus, sorted by descending signed G²

Some words appear surprisingly high in Table 2, due to unexpectedly low frequencies being recorded in the BNC data. Two examples are because and found. This suggests two possibilities: either a slight weakness in the representativeness of the sample BNC data from [7] (perhaps understandable given the sheer quantity of data in the BNC; no sample can be 100% representative of a larger set of data), or alternatively these words may be used more in academic writing than in everyday speech; see Section 4.1 for further discussion of this.
From inspection, however, such words seem relatively infrequent compared to the large number of words which are recognisably associated with creativity in at least some academic domains.

4.1 Further Exploration of Keywords

Words in Common Academic Usage. It is possible that some words feature highly in the results solely because they are common academic words. Therefore the results list should be compared to common academic words to see if there is evidence of correlation between the two sets of data. If so, this should also be taken into account. Two lists of common words in academic English were found: the Academic Word List (AWL) [2] and the University Word List (UWL) [13]. Both contain groups of words, in order of frequency of usage specifically in academic documents (group 1 holds the most frequent words). Unlike the BNC corpus, the AWL and UWL only provide summary information on academic word usage, with no actual frequency data per word; this limits what statistical testing can be performed. Spearman correlation testing returns a correlation value of -0.236 between the creativity corpus and the AWL, and -0.210 between the creativity corpus and the UWL. Neither correlation value is significant at p < 0.01 (or p < 0.05). As this indicates no significant relationship between the creativity corpus and either academic list, no correction should be made to the keyword results on account of either set of academic data. Poor availability of any other data on academic word usage hinders further investigation of this issue at present.

Context and Semantics. Although the list of keywords holds much of interest in uncovering what is key to creativity, it relies purely on frequency of word usage. The results are not intended to account for the different contexts in which words are used; when analysing large corpora, exploring every word's semantic context would be highly time-consuming. Instead, the frequency results highlight keywords to focus on in the texts and examine in more detail [6, 10]. Categorising the keywords by semantics is non-trivial and "labour-intensive" [4]. Carrying this out empirically would be a significant step in itself and is a fruitful avenue for further work. From inspection of the contexts in which keywords are used, some key categories are suggested in Table 3.

5 Conclusions

For a computational system to be able to perform automated assessment of creativity, it needs some point of reference on what creativity is, yet there is no accepted consensus on the exact definition of creativity. This work empirically derives a set of keywords that combines a variety of viewpoints from different perspectives, for a more universal encapsulation of creativity. Keywords were calculated through corpus analysis of 30 academic papers on the nature of creativity. The likelihood measure G² (Eqn. 1) was used to compare word frequencies in the creativity papers against usage of those words in general written English, as represented by the sub-corpus of the BNC containing written texts (see Section 2.5). This analysis returned 781 words which were statistically more common in the creativity literature sample than expected, given their general usage in written English. Table 2 displays the top 100 results. The list of keywords encapsulates words we commonly use to describe and analyse creativity in academia.
Given their strong association with creativity, they point us towards sub-components of creativity that contribute to our intuitive understanding of what creativity is.

Table 3. Key categories for creativity, generated through examining the keywords

    Category                  Keywords representing this category
    cognitive processes       thinking, primary, conceptual, cognition, perceptual
    originality               innovation, originality, novelty
    the creative individual   personality, motivation, traits, individual, intrinsic, self
    ability                   solving, intelligence, facilitate, fluency, knowledge, IQ
    influences                influences, problem, extrinsic, example, interactions, domain
    divergence                divergent, investigations, fluency, ideas, research, discovery
    autonomy                  unconscious
    discovery                 openness, awareness, search, discovery, fluency, research
    dimensions                dimensions, attributes, factors, criterion
    association               associative, correlation, related, combinations, semantic
    product                   artefacts, artistic, elements, verbal
    value                     motivation, artistic, solving, positive, validation, retention
    study of creativity       empirical, predictions, tests, hypothesis, validation, research
    measures of creativity    scores, scales, empirical, ratings, criterion, measures, tests
    evolution of creativity   developmental, primary, evolutionary, primitive, basis
    replicating creativity    programs, computational, process, heuristics

Many of the keywords in the results can be tested for by a computer more easily than testing for creativity itself. For example:

- Originality: comparing products to other examples in that domain, or to a prototype, to measure similarity
- Ability: depending on the domain, there are usually many standardised tests to measure competence in that domain
- Divergence: measuring variance of products against each other
- Autonomy: quantifying the assistance needed during the creative process
- Value: again this is domain dependent, and there will usually be many tests for value measurement in a particular domain

The results presented in this paper identify key components of creativity through a combination of several viewpoints. These results will be used to guide experiments implementing a computational system that evaluates creativity by testing for the key categories that have been identified. The experiments enable us to determine whether this approach to defining creativity gives a good enough approximation for creativity evaluation, and if so, which combination of tests most closely replicates human assessment of creativity.

Acknowledgements

Nick Collins, Sandra Deshors, Clare Jonas and Luisa Natali all made useful comments during discussions of this work.

2010_33 !2010 Automated Jazz Improvisation

Robert M. Keller1, with contributions by Jon Gillick2, David Morrison3, Kevin Tang4
1 Harvey Mudd College, Claremont, CA, USA
2 Wesleyan University, Middletown, CT, USA
3 University of Illinois, Urbana-Champaign, IL, USA
4 Cornell University, Ithaca, NY, USA
keller@cs.hmc.edu, jrgillick@wesleyan.edu, drmorr0@gmail.com, kt258@cornell.edu

Abstract. I will demonstrate the jazz improvisational capabilities of ImproVisor, a software tool originally intended to help jazz musicians work out solos prior to improvisation. As the name suggests, this tool provides various forms of advice regarding solo construction over chord changes. However, recent additions enable the tool to improvise entire choruses on its own in real-time.
To reduce the overhead of creating grammars, and also to produce solos in specific styles, the tool now has a feature that enables it to learn a grammar for improvisation in a given style from transcribed performances of solos by others. Samples may be found in reference [4].

Acknowledgment

This research was supported by grant 0753306 from the National Science Foundation and a faculty enhancement grant from the Mellon Foundation.

2010_34 !2010 The Painting Fool Teaching Interface

Simon Colton
Department of Computing, Imperial College, London, UK
www.thepaintingfool.com

The Painting Fool is software that we hope will, one day, be taken seriously as a creative painter in its own right. As we are not trained artists, a valid criticism is that we are not equipped to train the software. For this reason, we have developed a teaching interface to The Painting Fool, which enables anyone, including artists and designers, to train the software to generate and paint novel scenes, according to a scheme they specify. In order to specify the nature and rendering of the scene, users must give details on some, or all, of seven screens, some of which employ AI techniques to make the specification process simpler. The screens provide the following functionalities: (i) Images: enables the usage of context free design grammars to generate images. (ii) Annotations: enables the annotation of digital images, via the labelling of user-defined regions. (iii) Segmentations: enables the user to specify the parameters for image segmentation schemes, whereby images are turned into paint regions. (iv) Items: enables the user to hand-draw items for usage in the scenes, and to specify how each exemplar item can be varied for the generation of alternatives. (v) Collections: enables the user to specify a constraint satisfaction problem (CSP) via the manipulation of rectangles. The CSP is abduced from the rectangle shapes, colours and placements, and when solved (either by a constraint solver or an evolutionary process), generates new scenes of rectangles satisfying the user constraints. (vi) Scenes: enables the specification of layers of images, items, segmentations and collections, in addition to substitution schemes. (vii) Pictures: enables the specification of rendering schemes for the layers in scenes. In the demonstration, I will describe the process of training the software via each of the seven screens. I will use two running example picture schemes, namely the PresidENTS series and the Fish Fingers series, exemplars of which are portrayed in Figure 1.

Fig. 1. Exemplar pictures from the PresidENTS and the Fish Fingers series of pictures

2010_35 !2010 Generative Music Systems for Live Performance

Andrew R. Brown, Toby Gifford, and Rene Wooller
Queensland University of Technology, Brisbane, Australia.
{a.brown, t.gifford, r.wooller}@qut.edu.au

Music improvisation continues to be an intriguing area for computational creativity. In this paper we will outline two software systems designed for live music performance, the LEMu (live electronic music) system and the JamBot (improvisatory accompaniment agent). Both systems undertake an analysis of human-created music, generate complementary new music, are designed for interactive use in live performance, and have been tested in numerous live settings. These systems have some degree of creative autonomy; however, we are especially interested in the creative potential of the systems interacting with human performers.
The LEMu software generates transitional material between scores provided in MIDI format. It uses an evolutionary approach to generate materials that provide an appropriate path between musical targets [1]. This musical morphing process is controlled during performance by an interactive nodal graph that allows the performer to select the morphing source and target, as well as the transition speed and parameters. Implementations include the MorphTable [2], where users manipulate blocks on a large surface to control musical morphing transitions. This design suits social interaction and is particularly suited to use by inexperienced users. The JamBot [3] listens to an audio stream and plays along. It consists of rhythmic and harmonic analysis algorithms that build a dynamic model of the music being performed. This model holds multiple probable representations at one time in the Chimera Architecture [4], which can be interpreted in various ways by a generative music algorithm that adds accompaniment in real time. These systems have been designed using a research method we have come to call Generation in Context, which relies on iterations of aesthetic reflection on the generated outcomes to inform the processes of enquiry [5].

2010_36 !2010 Realtime Generation of Harmonic Progressions in Kinetic Engine - Demo

Arne Eigenfeldt1 and Philippe Pasquier2
1 School for the Contemporary Arts, 2 School of Interactive Arts and Technology, Simon Fraser University, Canada
{eigenfel, pasquier}@sfu.ca

Abstract. We present a method for generating harmonic progressions using case-based analysis of existing material that employs a Markov model. Using a unique method for specifying desired harmonic complexity, tension between chord transitions, and a desired bass-line, the user specifies a 3-dimensional vector, which the realtime generative algorithm attempts to match during chord sequence generation. The proposed system thus offers a balance between user-requested material and coherence within the database. The presentation will demonstrate the software running in realtime, allowing users to generate harmonic progressions based upon a database of chord progressions drawn from Pat Metheny, Miles Davis, Wayne Shorter, and Antonio Carlos Jobim. The software is written in MaxMSP, and available at the first author's website (www.sfu.ca/~eigenfel).

2010_37 !2010 The Continuator Strikes Back: a Controllable Bebop Improvisation Generator

François Pachet1
1 Sony CSL-Paris, 6, rue Amyot, 75005, Paris, France
pachet@csl.sony.fr

Abstract. The problem of modeling improvisation has received a lot of attention recently, thanks to progress in machine learning and statistical modeling, and to the increase in computation power of laptops. The Continuator (Pachet, 2003) was the first real-time interactive system to allow users to create musical dialogs using style learning techniques. The Continuator is based on a modeling of musical sequences using Markov chains, a technique that was shown to be well adapted to capturing stylistic musical patterns, notably in the pitch domain. The Continuator had great success in free-form improvisational settings, in which users freely explore musical language created on-the-fly, without additional musical constraints, and it was used with jazz musicians as well as with children (Addessi & Pachet, 2005). However, the Continuator, like most systems using Markovian approaches, is difficult, if not impossible, to control.
This limitation is intrinsic to the greedy, left-to-right nature of Markovian music generation algorithms. Consequently, it has so far been difficult to use these systems in highly constrained musical contexts. We present here a prototype of a fully controllable improvisation generator, based on a new technique that allows the user to control a Markovian generator. We use a combination of combinatorial techniques (constraint satisfaction) with machine learning techniques (supervised classification as described in Pachet, 2009) in a novel way. We illustrate this new approach with a Bebop improvisation generator. Bebop was chosen as it is a particularly "constrained" style, notably harmonically. Our technique can generate improvisations that satisfy three types of constraints: 1) harmonic constraints derived from the rules of Bebop, 2) "side-slips", as a way to extend the boundaries of Markovian generation by producing locally dissonant but semantically equivalent musical material that smoothly comes back to the authorized tonalities, and 3) non-Markovian constraints deduced from the user's gestures.

Keywords: music interaction, virtuosity, doodling.

2010_38 !2010 Software Engineering Rewards for Brainstorming Online (SEREBRO)

D.F. Grove, N. Jorgenson, S. Sen, R. Gamble
University of Tulsa, 800 S. Tucker Drive, Tulsa OK, USA
{dean-grove, noah-jorgenson, sandip, gamble}@utulsa.edu

Abstract. Our multi-faceted tool called SEREBRO (Software Engineering Rewards for Brainstorming Online) is an embodiment of a novel framework for understanding how creativity in software development can be enhanced through technology and reinforcement. SEREBRO is a creativity support tool, available as a Web application, that provides idea management within a social networking environment to capture, connect, and reward user contributions to team-based software engineering problem solving tasks. To form an idea network, topics are created that typically correspond to artifacts needed to achieve specific milestones in the software development process. Team members then perform the activities of brainstorming (initiating) ideas, spinning ideas from current ones by agreeing or disagreeing, pruning threads that are non-productive, and finalizing emerging concepts for the next milestone. Each idea type is represented by a corresponding icon and color in the idea network: brainstorm nodes are blue circles, agree nodes are upright green triangles, disagree nodes are upside-down orange triangles, and finalized nodes are yellow pentagons that have tags associated with contributing ideas. SEREBRO can display threads as a series of posts or in a graphical view of the entire tree for easy navigation. Team members also use SEREBRO for scheduling meetings and announcing progress. Special idea nodes can be used to represent meeting minutes. The meeting mode associates a clock with each idea type and allows multiple users to be credited. Rewards are propagated from leaf nodes to parents to correspond to idea support. They are supplemented when a node is tagged by finalization. These rewards are represented as badges. Reputation scores are accumulated by the direct scoring of ideas by team members. A user's post publicly displays both reward types. The current version, SEREBRO 2.0, is supplemented with software project management components that enhance both the idea network and the reward scheme.
These include uploading files for sharing, version control for changes to the product implementations, a Wiki to document product artifacts, a calendar tool, and a Gantt chart. The website with a video of SEREBRO 1.0, data collections, and a link to SEREBRO 2.0 (to view various idea nets, the wiki, uploaded documents, and any resulting prototype development by the teams), as well as publications, including submissions, can be found at http://www.seat.utulsa.edu/serebro.php. Guest access to SEREBRO is available by email request to gamble@utulsa.edu.

2010_39 !2010 Piano_prosthesis

Michael Young
Music Department, Goldsmiths, University of London, New Cross, London, UK
m.young@gold.ac.uk

Piano_prosthesis presents a would-be live algorithm, a system able to collaborate creatively with a human partner. In performance, the pianist's improvisation is analysed statistically by continuously measuring the mean and standard deviation of 10 features, including pitch, dynamic, onset separation time and 'sustain-ness', within a rolling time period. Whenever these features constitute a 'novel' point in 10-dimensional feature space (by exceeding an arbitrary distance threshold), this point is entered as a marker. This process continues as the improvisation develops, accruing further marker points (usually around 15 are generated in a 10-minute performance). The system expresses its growing knowledge, represented by these multi-dimensional points, in its own musical output. Every new feature point is mapped to an individual input node of a pre-trained neural network, which in turn drives a stochastic synthesizer programmed with a wide repertoire of piano samples and complex musical behaviours. At any given moment in the performance, the current distance from all existing markers is expressed as a commensurate set of outputs from the neural network, generating a merged set of corresponding musical behaviours of appropriate complexity. The identification of new points, and the choice of association between points and network states, is hidden from the performer and can only be ascertained through listening and conjecture (as may well be the case when improvising with a fellow human player). The system intermittently and covertly devises connections between the human music and its own musical capabilities. As the machine learns and 'communicates', the player is invited to reciprocate. Through this quasi-social endeavour a coherent musical structure may emerge as the performance develops in complexity and intimacy. This is a new system that substitutes the on-the-fly network training previously described in detail [1] with Euclidean distance measurements, offering considerable advantages in efficiency. There are a number of sister projects for other instruments, with corresponding sound libraries (oboe, flute, cello). Further explanation and several audio examples of full performances are available on the author's website [2].

2010_4 !2010 Establishing Appreciation in a Creative System

David Norton, Derral Heath, Dan Ventura
Computer Science Department, Brigham Young University
dnorton@byu.edu, dheath@byu.edu, ventura@cs.byu.edu

Abstract. Colton discusses three conditions for attributing creativity to a system: appreciation, imagination, and skill. We describe an original computer system (called DARCI) that is designed to eventually produce images through creative means. We show that DARCI has already started gaining appreciation, and has even demonstrated imagination, while skill will come later in her development.
1 Introduction

While several theoretical frameworks for creativity have been proposed, actually building a system that applies these frameworks is difficult. We are developing an original system designed to implement and integrate concepts proposed by researchers such as Boden, Wiggins, Ritchie, and Colton. Our system, DARCI (Digital ARtist Communicating Intention), will produce images that are not only perceived by humans as creative products, but that are also produced through arguably creative processes. This paper represents our work with only the first component of DARCI, that of learning about the domain of visual art. We will discuss why this is an important step in the creative process in terms of Colton's creative tripod concept [3], describe how DARCI is learning about this domain, and finally demonstrate DARCI's current level of development. Colton discusses three attributes that must be perceived in a system to consider it creative: appreciation, imagination, and skill. In order for DARCI to be appreciative of art, she needs to first acquire some basic understanding of art [3]. For example, in order for DARCI to appreciate an image that is gloomy, she has to first recognize that it is gloomy. To facilitate this, we are teaching DARCI to associate low-level image features with artistic descriptions of the image. Currently, DARCI has learned how to associate 150 different descriptors with images. Furthermore, she can essentially interpret an image by selecting a specific combination of these descriptors for the image in question, thus demonstrating a degree of imagination. This will also facilitate communication with DARCI's audience, enhancing the perception of appreciation and imagination. DARCI cannot yet produce any images and so does not yet demonstrate skill in the sense that Colton prescribes. However, at the end of this paper we will show how DARCI's understanding of the art domain will be instrumental to her production of original images.

2 Image Feature Extraction

Before DARCI can form associations between image features and descriptive words, the appropriate image features for the task must be selected. These need to be low-level features that characterize the various ways that an image can be appreciated. There has been a large amount of research done in the area of image feature extraction. King and Gevers deal with Content Based Image Retrieval (CBIR) [2][6]. CBIR relies heavily on extracting image features which can then be compared and used when searching for images with specific content. CBIR systems look at characteristics such as an image's color, light, texture, and shape. Datta and Li propose several image features that look at these same characteristics to assess the aesthetic quality of images [4][7]. Wang deals with image retrieval specific to emotional semantics [10][9]. The goal is to search for images that have specific emotional qualities such as happy, gloomy, showy, etc. Zujovic tries to classify a painting into one of six different genres: Abstract, Expressionism, Cubism, Impressionism, Pop Art and Realism [11]. All of these researchers have proposed image features that focus on color, light, texture, and shape. Of these image features, we have selected 102 of the more common ones to use in DARCI. As with prior research, our set of image features is broken down into characteristics relating to color, light, texture and shape. Color and light play a significant role in the emotion and meaning conveyed in images.
Colors have often been associated directly with emotions. For example, red can mean anger and frustration, while blue can mean sad and depressed. Likewise with light, a dark image could mean gloomy or scary, while a bright image could denote happiness or enthusiasm. Texture and shape features also play a significant role in the meaning and emotion of an image. For example, a cluttered and busy image could indicate feelings of anxiety or confusion. An image that is blocky and structured could indicate feelings of stability and security. We extract eight color features, four light features, 50 texture features and 40 shape features as follows:

Color & Light:
1. Average Red, Green, and Blue
2. Average Hue, Saturation, and Intensity
3. Unique Hue count (20 buckets)
4. Average Hue, Saturation, and Intensity contrast
5. Dominant hue
6. Percent of image that is the dominant hue

Shape:
1. Geometric Moment
2. Eccentricity
3. Invariant Moment (5x vector)
4. Legendre Moment
5. Zernike Moment
6. Pseudo-Zernike Moment
7. Edge Direction Histogram (30 bins)

Texture:
1. Co-occurrence Matrix (x4 shifts)
   1. Maximum probability
   2. First order element difference moment
   3. First order inverse element difference moment
   4. Entropy
   5. Uniformity
2. Edge Frequency (25x vector)
3. Primitive Length
   1. Short primitive emphasis
   2. Long primitive emphasis
   3. Gray-level uniformity
   4. Primitive length uniformity
   5. Primitive percentage

It is not the purpose of this paper to go into detail about the image features we extracted. These features were selected based on the results of the research previously mentioned.

3 Visuo-Linguistic Association

DARCI forms an appreciation of art by making associations between image features and descriptions of the images. An image can be described and appreciated in many ways: by the subject of the image, by the aesthetic qualities of the image, by the emotions that the image evokes, by associations that can be made with the image, by the meanings found within the image, and possibly others. To teach DARCI how to make associations with such descriptors, we present her with images labeled appropriately. Ideally we would like DARCI to understand images from all of these perspectives. However, because the space of all possible images and their possible descriptive labels is enormous, we have taken measures to reduce the descriptive label space to one that is tractable. Specifically, we have reduced descriptive labels exclusively to delineated lists of adjectives.

3.1 WordNet

We use WordNet's [5] database of adjectives to give us a large, yet finite, set of descriptive labels. Even though our potential labels are restricted, the complete set of WordNet adjectives can allow for images to be described by their emotional effects, most of their aesthetic qualities, many of their possible associations and meanings, and even, to some extent, by their subject. In WordNet, each word belongs to a synset of one or more words that share the same meaning. If a word has multiple meanings, then it can be found in multiple synsets. For example, the word "dark" has eleven meanings, or senses, as an adjective. Each of these senses belongs to a unique synset. The synset for the sense of "dark" that means "stemming from evil characteristics or forces; wicked or dishonorable" also contains senses of the words "black" and "sinister". Our image classification labels actually consist of a unique synset identifier, rather than the adjectives themselves.
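To make the synset labelling concrete, the following sketch shows how adjective senses and unique synset identifiers can be looked up with NLTK's WordNet interface; this is our illustration only, as the paper does not specify its WordNet tooling.

    from nltk.corpus import wordnet as wn

    # All adjective senses of "dark" (WordNet distinguishes head
    # adjectives, 'a', from satellite adjectives, 's').
    for synset in wn.synsets("dark", pos=wn.ADJ):
        # synset.name() is a unique identifier such as 'dark.a.01',
        # usable as a classification label in place of the raw word.
        print(synset.name(), "-", synset.definition())
        print("  members:", synset.lemma_names())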
3.2 Learning Method

In order to make the association between image features and descriptors, we use a series of artificial neural networks trained incrementally with backpropagation. A training instance is defined as the image features for a particular image paired with a single synset label. We create a distinct neural network, with a single output node, for each synset that has a sufficient amount of training data. For the results presented in this paper, that threshold is eight training instances. Enforcing this threshold ensures a minimum amount of training data for each synset. As we incrementally accumulate data, more and more neural networks are created to accommodate the new synsets that pass the threshold. This process ensures that neural networks are not created for synsets that are either too obscure or occur only accidentally. Shen et al. employ a similar approach for handling non-mutually exclusive labels to good effect, using SVMs instead of ANNs [8].
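A minimal sketch of this per-synset scheme, using scikit-learn's MLPClassifier as a stand-in for the paper's backpropagation networks (the threshold constant mirrors the paper; everything else, including the layer size, is an illustrative assumption):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    MIN_INSTANCES = 8  # the paper's threshold for creating a synset's network
    networks = {}      # one binary classifier per synset, e.g. 'dark.a.01'

    def update_synset(synset_id, features, labels):
        """Incrementally train one synset's network on a batch of
        (image features, 0/1 label) training instances."""
        if len(labels) < MIN_INSTANCES and synset_id not in networks:
            return  # below the creation threshold and no network exists yet
        net = networks.setdefault(
            synset_id,
            MLPClassifier(hidden_layer_sizes=(20,), solver="sgd"),
        )
        # partial_fit supports the paper's incremental accumulation of data;
        # the class set must be declared on the first call.
        net.partial_fit(np.asarray(features), np.asarray(labels), classes=[0, 1])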
4 Obtaining Data Instances

To collect training data, we have created a public website for training DARCI [1]. From this website, users are presented with a random image and asked to provide adjectives that describe the image (Figure 1). When users input a word with multiple senses, they are presented with a list of the available senses, along with the WordNet gloss, and asked to select the most appropriate one. We keep track of the results in an SQL database from which we can train the appropriate neural networks. As of this writing, we have obtained close to 6000 data points this way.

Fig. 1. Screenshot of the website used to train DARCI. Below the image are adjectives that the user has entered, as well as a text box for entering new adjectives. To the right of the image are seven adjectives that DARCI has attributed to the image. Image courtesy of Mark Russell.

While this is still only a small fraction of the amount of data we will need, it has proven satisfactory for some adjectives, as we will show. While there are 18,156 adjective synsets in WordNet, it is not necessary for DARCI to learn all of them. In the set of roughly 6,000 data instances we have obtained so far, only 1,176 unique synsets have occurred. Of those unique synsets, almost half have only a single example. There will be many synsets that will never meet our threshold of eight instances, thus making the association task more manageable. The total number of synsets that have at least eight data points (our threshold for creating a neural net) is currently 150. This means that DARCI essentially "knows" 150 synsets at the writing of this paper. Keep in mind that many of those synsets contain several senses, so the number of adjectives DARCI effectively "knows" is actually much higher. As DARCI is currently nascent, this number will continue to grow.

4.1 Amplifying Data

We have been faced with two fundamental problems with regards to training data. First, all of the training data that we have examined so far is exclusively positive training data (i.e. the training data only indicates what an image is, not what it is not). It is very difficult to train ANNs without negative examples as well. The second problem is a paucity of training instances. ANNs require a lot of training data to converge and currently, of the 150 synsets known to DARCI, there are on average just over twenty-three positive data instances per synset. We have employed two methods for obtaining negative data. The first method utilizes the antonym attribute of adjectives in WordNet: anytime an image is labeled with an adjective, we create a negative data point for all antonyms of that adjective. Second, on DARCI's website, we allow users to directly create negative examples for adjectives that DARCI knows. For each image presented to the user, DARCI lists seven adjectives that she associates with the image (Figure 1). The user is then allowed to flag those labels that are not accurate. This creates strictly negative examples. This method also allows DARCI to demonstrate to the user her current interpretation of an image. Using these methods, we have built up more negative data points than positive ones. In order to help compensate for shortages in training data, for each new data instance that is presented to DARCI, a variable number of old data instances belonging to the same synset are reintroduced to the neural net in question. In addition to reintroducing old material, a variable number of prior data instances that do not belong to the same synset, but that are statistically correlated with it, are introduced to the neural net in question. These guessed data instances provide DARCI with more data for each synset than she is in fact receiving, and allow DARCI to take advantage of correlations in labels that are lost by using unique neural nets for each synset. We perform these data expansion strategies on both the positive and negative data instances, and do so in a manner that attempts to balance the amount of negative and positive data that DARCI receives for each synset. The combination of adding negative data instances, recycling old data instances, guessing correlations with other synsets, and using these guesses to balance positive and negative training instances, greatly amplifies the amount of training data presented to DARCI.

5 Interpreting Images

When presented with an image, DARCI takes the output of each synset's neural net given the image features, and treats that output as a score. But DARCI currently knows 150 synsets, so how does she choose which of the synsets to label the image with? The easiest solution would be to either take all synsets with a score above a specific threshold, or take the top n synsets. However, despite our attempts to amplify the data, some synsets continue to be lacking in training instances. The neural networks for these synsets should not be given as much weight in determining the relevance of an adjective for a particular image. Thus, we use Equation 1 to modify each neural network's output value, creating a new score that takes DARCI's confidence about a particular synset into consideration. Here, confidence is not the statistical notion; rather, it is an estimation of how certain DARCI is about a particular synset.

    score = o \cdot (p + n) \cdot \min\left(1, \frac{n}{p}\right)^{\alpha (o - 0.5)}    (1)

Here o is the output of a neural network for a particular synset, p is the number of positive data instances present in the training database, n is the number of negative data instances, and α is a constant that indicates how much effect the "confidence" measure should have; we found α = 5 to be useful. This equation amplifies outputs of synsets with greater support (p + n) and at least as many negative as positive examples (there would be more negative than positive examples in an accurate sample of the real world). It is immediately clear that synsets having no negative examples will have a score of zero, thus preventing overly positive data from tainting the labeling process.
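A small sketch of this scoring rule as reconstructed in Equation 1 above (the symbol α and the exact grouping follow our reading of the formula, which is garbled in the source scan):

    def confidence_score(o, p, n, alpha=5.0):
        """Confidence-weighted score for one synset: o is the network output
        in [0, 1]; p and n are the positive/negative training instance counts."""
        if n == 0:
            return 0.0  # no negative examples: the synset is not trusted
        support = p + n                           # more data -> more confidence
        ratio = min(1.0, n / p) if p else 1.0     # penalise mostly-positive synsets
        return o * support * ratio ** (alpha * (o - 0.5))

    # Illustrative comparison: equal network outputs, different support.
    print(confidence_score(0.9, p=20, n=30))  # well-supported synset
    print(confidence_score(0.9, p=9, n=1))    # sparse, mostly positive data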
DARCI then uses this modified score to make her selection of synset labels, with the added caveat that no two synsets are chosen that belong to the same satellite group of synsets. Satellite groups are groupings of adjective synsets defined in WordNet to share similar meanings. It is a grouping that is looser than the synset grouping itself, but still somewhat constrained. For example, all colors belong to the satellite group "chromatic". This means that DARCI will never label an image with more than one color. We do this in order to enforce a varied selection of labels.

6 Results

Because labeling images with adjectives is subjective, it is difficult to evaluate DARCI's progress. And since DARCI is not yet producing any artefacts, we can't directly assess how the associations she is currently learning will affect those artefacts. Nevertheless, in this section we present the results of a test that we devised to estimate how well DARCI is learning select adjectives, with the caveat that the evaluation is still somewhat subjective. We also demonstrate DARCI's labeling capabilities for a handful of images. Finally, we briefly describe DARCI's ability to select the top images, from our database, that fit a given adjective label. As of this writing, there were 1284 images in our image database and a total of 5891 positive user-provided labels. 3465 of those labels belonged to synsets that passed the requirement of eight minimum labels. There were 150 synsets that passed this requirement, constituting the synsets that we say DARCI knows. Even though the system is designed to update incrementally, we re-ran all of the data from scratch using updated parameters.

6.1 Empirical Results

In order to assess DARCI's ability to associate words with image features, we observed DARCI's neural net outputs for ten select synsets across ten images that were not in our image database. We presented these same images and synsets to online users in the form of a survey. We chose this narrow survey approach for evaluation because the data available for each image in our labeled dataset was scarce. On the survey, users were asked to indicate whether or not each word described each image. They were also given the option to indicate unsure. Across the ten images, each synset received 215 total votes. For every synset, the positive count for each image was normalized by the total number of votes that the image received for the given synset. We then calculated the correlation coefficient between DARCI's neural network output and this normalized positive count.

Table 1. Empirical results over ten synsets across ten images. The gloss is the WordNet definition. The correlation coefficient is between DARCI's neural net outputs and normalized positive votes from humans. The p-value is for the correlation coefficient.

    Synset    Gloss                                                                     Correlation coefficient   p-value
    Scary     provoking fear terror                                                     0.1787                    0.6214
    Dark      devoid of or deficient in light or brightness; shadowed or black          0.7749                    0.0085
    Happy     enjoying or showing or marked by joy or pleasure                          0.0045                    0.9900
    Sad       experiencing or showing sorrow or unhappiness                             0.3727                    0.2888
    Lonely    lacking companions or companionship                                       0.4013                    0.2504
    Wet       covered or soaked with a liquid such as water                             0.3649                    0.2998
    Violent   characterized by violence or bloodshed                                    0.2335                    0.5162
    Sketchy   giving only major points; lacking completeness                            0.4417                    0.2013
    Abstract  not representing or imitating external reality or the objects of nature   0.2711                    0.4486
    Peaceful  not disturbed by strife or turmoil or war                                 0.3715                    0.2905
Table 1 shows the results of this experiment for each synset, along with the accompanying p-value. A high positive correlation and a statistically significant p-value would indicate that DARCI agrees with the majority of those surveyed. The p-values we obtained indicate, unfortunately, that for the most part these results are not statistically significant. However, all of the synsets have a positive correlation, hinting that the system is heading in the right direction and that, had we more data, the results would probably be significant. Of note is the synset "dark", which has the highest correlation coefficient and is statistically significant at p < 0.01. "Happy" is both the least statistically significant and shows essentially no correlation between DARCI's output and the opinions of users. From these results, and acknowledging the small amount of training data we have acquired, we can surmise that DARCI is capable of learning to apply some synsets quite effectively, while other synsets may be impossible for DARCI to learn. More data will be necessary to solidify these conjectures. It is important to note that humans don't always interpret images in the same fashion themselves. For example, the results regarding the synsets for "sketchy", "sad", and "lonely" showed little agreement amongst the human participants. While disagreement amongst humans did not necessarily correlate with DARCI's interpretations, the subjectivity of the problem somewhat absolves DARCI of the necessity for high correlation with common consent among humans. Clearly, other metrics are needed to truly evaluate DARCI.

6.2 Anecdotal Results

We presented DARCI with several images that were not in her database, and observed her descriptive labels for them. Figure 2 shows some of the images and the seven adjectives that DARCI used to describe them. In this figure we see that DARCI did fairly well in describing these four images. Though subjective, a case can be made for describing each image the way DARCI did.

Fig. 2. Images that DARCI has interpreted. Words underneath each image are the adjectives DARCI associated with each image: (a) beautiful, blueish, awe-inspiring, supernatural, reflective, aerodynamic, majestic; (b) grey, bleached, sketchy, supernatural, plain, simple, penciled; (c) orange, hot, supernatural, painted, blotched, abstract, rough; (d) scary, beautiful, violent, dark, red, fiery, supernatural. (a), (b), and (d) courtesy of Shaytu Schwandes. (c) courtesy of William Meire.

One exception would be the adjective "supernatural", which appears in every single image DARCI labels. Until DARCI sees enough negative examples of "supernatural", she will continue learning that all images are "supernatural", because she has mostly seen only positive examples of the word. DARCI's vocabulary, as of now, is 150 adjective synsets, and she has learned some synsets better than others based on two things. First, she has seen more examples of some synsets than of others. Second, some synsets are simply much more difficult to learn. For example, for DARCI to determine whether an image is "dark" or not is much easier than for her to determine whether or not an image is "awesome". "Awesome" is much more subjective and takes more aspects of the image into consideration. DARCI had never seen the images shown in Figure 2 before and so, to analogize with the human process, she had to describe the images based on her own experience. One could argue that DARCI was showing imagination because she came up with appropriate adjectives.
Fig. 3. Representative images that DARCI listed in her top ten images described as (a) peaceful and (b) lonely. These images were not explicitly labeled as such when they first appeared in these lists. (a) courtesy of Bj. de Castro. (b) courtesy of Ahmad Masood.

We designed DARCI so that she could find and display the top ten images she thinks are described by a particular adjective synset, as well as the top ten images she thinks do not represent that particular synset. This gives us a good idea of how well DARCI has learned a particular synset. It is interesting to note that images that have not been explicitly labeled with a particular synset often show up in DARCI's lists. In Figure 3 we see two examples of this with the adjectives "peaceful" and "lonely". DARCI displayed these two images as "peaceful" and "lonely" respectively, even though they had never been explicitly labeled as such. Many would agree that these two images are in fact describable as DARCI categorized them. Again, one could argue that DARCI was showing imagination because she displayed these images on her own. To observe DARCI's image interpreting capabilities, go to her website [1].

7 Discussion and Future Work

In this paper we have outlined and demonstrated the first critical component of DARCI. This component is responsible for forming associations between image features and descriptive words, and represents an aspect of artistic appreciation that is critical for the next steps in DARCI's development. The next component will be responsible for rendering images in an original and aesthetically pleasing way that reflects a series of accompanying adjectives. For example, we may present DARCI with a photograph of a lion and the words majestic and scary. DARCI would then create an artistic rendering of the lion in a way that conveys majestic and scary. If DARCI is able to learn how to render images according to any combination of descriptive words, then the possibility for original and meaningful art becomes apparent. The argument for creativity is strengthened as well. For example, what if one were to commission DARCI to render the photograph of a forest scene in a way that is photographic, abstract, angry, and calm? Who could say what the final image would look like? The commissioner may be attributed with creativity for coming up with such a contradictory set of words, but the greater act of creativity would arguably lie in the hands of DARCI. The rendering component of DARCI will use a genetic algorithm to discover how to render images in a way that reflects the accompanying adjectives. The fitness function for this algorithm will be largely a measure of how closely the phenotype, a rendered image, matches the adjective in question. This measure will be the very output of the adjective's associated neural net described in the body of this paper; it is a measure of her appreciation for her own work. Since DARCI is persistent, this means that the fitness function will be changing as her associative abilities improve. In fact, we intend to introduce some of her own images into the database, thus convolving the associative and productive processes. For this reason, we want DARCI to strengthen her associations while she produces and evaluates her own images. Once the rendering component of DARCI is complete, we will continue to develop her ability to be creative.
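To illustrate the proposed fitness function, the sketch below scores a candidate rendering by the mean output of the networks for the target adjectives, reusing the per-synset networks dictionary from the earlier sketch; the paper describes this design only in outline, so the details here are our assumptions.

    def rendering_fitness(image_features, target_synsets):
        """Fitness of a candidate rendering: mean network output over the
        target adjective synsets, i.e. how strongly the system's own
        associations endorse the rendered image."""
        outputs = [
            networks[s].predict_proba([image_features])[0][1]
            for s in target_synsets
            if s in networks  # only adjectives the system already "knows"
        ]
        return sum(outputs) / len(outputs) if outputs else 0.0

    # e.g. fitness of one phenotype for a "majestic and scary" commission
    # (synset identifiers here are hypothetical):
    # rendering_fitness(features, ["majestic.s.02", "chilling.s.01"])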
We intend to allow DARCI to select the adjectives that drive image creation by some process that takes associative knowledge into consideration. We may form associations between adjectives and nouns/verbs. This would provide a framework for DARCI to choose the subjects to render based on image captions. Finally, we hope to eventually allow DARCI to create images from scratch, prior to rendering, using a cognitive model that would rely heavily on the associative component.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0856089.

2010_40 !2010 A Visual Language for Darwin

Penousal Machado and Henrique Nunes
CISUC, Department of Informatics Engineering, University of Coimbra, 3030 Coimbra, Portugal
machado@dei.uc.pt

Abstract. The main motivation for the research that allowed the creation of the works presented here was the development of a system for the evolution of visual languages. When applied to artistic domains, the products of computational creativity systems tend to be individual artworks. In our approach, search takes place at a higher level of abstraction: using a novel evolutionary engine, we explore a space of context free grammars. Each point of the search space represents a family of shapes following the same production rules. In this exhibit we display instances of the vast set of shapes specified by one of the evolved grammars.

2010_41 !2010 Using Computational Models to Harmonise Melodies

Raymond Whorley, Geraint Wiggins, Christophe Rhodes, and Marcus Pearce
Centre for Cognition, Computation and Culture, Goldsmiths, University of London, New Cross, London SE14 6NW, UK.
Wellcome Laboratory of Neurobiology, University College London, London WC1E 6BT, UK.
{r.whorley,g.wiggins,c.rhodes}@gold.ac.uk, marcus.pearce@ucl.ac.uk

Abstract. The problem we are attempting to solve by computational means is this: given a soprano part, add alto, tenor and bass such that the whole is pleasing to the ear. This is not easy, as there are many rules of harmony to be followed, which have arisen out of composers' common practice. Rather than providing the computer with rules, however, we wish to investigate the process of learning such rules. The idea is to write a program which allows the computer to learn for itself how to harmonise in a particular style, by creating a model of harmony from a corpus of existing music in that style. In our view, however, present techniques are not sufficiently well developed for models to generate stylistically convincing harmonisations (or even consistently competent harmony) from both a subjective and an analytical point of view. Bearing this in mind, our research is concerned with the development of the representational and modelling techniques employed in the construction of statistical models of four-part harmony. Multiple viewpoint systems have been chosen to represent both surface and underlying musical structure, and it is this framework, along with Prediction by Partial Match (PPM), which will be developed during this work. Two versions of the framework have so far been implemented in Lisp. The first is the strictest possible application of multiple viewpoints and PPM, which reduces the four musical sequences (or parts) to a single sequence comprising compound symbols. This means that, given a soprano part, the alto, tenor and bass parts are predicted or generated in a single stage.
The second version allows the lower three parts to be predicted or generated in more than one stage; for example, the bass can be generated first, followed by the alto and tenor together in a second stage of generation. We shall be describing and demonstrating our software, which uses machine learning techniques to construct statistical models of four-part harmony from a corpus of fifty hymn-tune harmonisations. In particular, we shall demonstrate how these models can be used to harmonise a given melody; that is, to generate alto, tenor and bass parts given the soprano part. Output files are quickly and easily converted into MIDI files by a program written in Java, and some example MIDI files will be played.

2010_42 !2010 User-Controlling Expressed Emotions in Music with EDME

Alex Rodríguez Lopez, Antonio Pedro Oliveira, and Amílcar Cardoso
Centre for Informatics and Systems, University of Coimbra, Portugal
lopez@student.dei.uc.pt, apsimoes@student.dei.uc.pt, amilcar@dei.uc.pt
http://www.cisuc.uc.pt

Abstract. Emotion-Driven Music Engine software (EDME) expresses user-defined emotions with music, and works in two stages. The first stage is done offline and consists in emotionally classifying standard MIDI files in two dimensions: valence and arousal. The second stage works in realtime and uses the classified files to produce musical sequences arranged in song patterns. The first stage starts with the segmentation of MIDI files and proceeds to the extraction of features from the obtained segments. Classifiers for each emotional dimension use these features to label the segments, which are then stored in a music base. In the second stage, EDME starts by selecting the segments with emotional characteristics closest to the user-defined emotion. The software then uses a pattern-based approach to arrange the selected segments into song-like structures. Segments are adapted, through transformations and sequencing, in order to match the tempo and pitch characteristics of given song patterns. Each pattern defines the song structure and the harmonic relations between the parts of each structure. The user interface of the application offers three ways to define emotions: selection of discrete emotions from lists of emotions; graphical selection in a valence-arousal bi-dimensional space; or direct definition of valence-arousal values. While playing, EDME responds to input changes by quickly adapting the music to a new user-defined emotion. The user may also customize the music and pattern base. We intend to explore this possibility by challenging attendees to bring their own MIDI files and experiment with the system. With this, we intend to allow a better understanding of the potential of EDME as a composition aid tool, and to get useful insights about further developments.

2010_43 !2010 Swarm Painting Atelier

Paulo Urbano1
1 LabMag, Universidade de Lisboa, Lisboa, Portugal
pub@di.fc.ul.pt

Abstract. The design of coordination mechanisms is considered a vital component for the successful deployment of multi-agent systems in general. The same happens in artificial collective creativity, and in particular in artificial collective paintings, where the coordination model has direct effects on agent behavior and on the collective pattern formation process. Coordination, that is, the way agents interact with each other and how their interactions can be controlled, plays an important role in the "aesthetic value" of the resulting paintings, in spite of its subjective nature.
Direct or indirect communication, centralized or decentralized control, and local versus global information are important issues regarding coordination. We have created a swarm painting tool to explore the territory of collective pattern formation, looking for aesthetically valuable behaviors and interaction forms. We adopted the bottom-up methodology for producing collective behavior, as it is more akin to fragmentation, surprise, and non-predictability, as if it were an unconscious collaboration of collective artists, something similar to a swarm "cadavre exquis", but where we have a much more numerous group of participants, which drop paint while they move. They do not know anything about pattern or style; they have just to decide where to move and which color to drop. We are going to show the artistic pieces made by a swarm painting tool built from collections of decentralized painting agents using just local information, which are coordinated through the mediation of the environment (stigmergy). We will also describe other types of agent coordination based on imitation, where some consensual attributes, like color, orientation or position, will emerge, creating some order in a potential collective chaos. This consensus can die out, randomly or by interaction factors, and new consensual attributes can win, resulting in heterogeneous paintings with interesting patterns, which would be difficult to achieve if made by human hands. We think that our main contribution, besides the creative exploration of new artistic spaces with swarm-art, will be in the sense of showing the possibilities of generating unpredictable and surprising patterns from the interaction of individual behaviors controlled by very simple rules. This interaction between the micro and macro levels in the artistic realm can be the source of new artistic patterns and can also foster imagination and creativity. The Atelier can be reached at: http://www.di.fc.ul.pt/~pub/swarm-atelier.

2010_5 !2010 Automated Collage Generation - With Intent

Anna Krzeczkowska, Jad El-Hage, Simon Colton and Stephen Clark
Department of Computing, Imperial College, London, UK.
University of Cambridge Computer Laboratory, Cambridge, UK.

1 Introduction

One reason why software undertaking creative tasks might be perceived of as uncreative is its lack of overall purpose or intent. In computational creativity projects, while some skillful aspects may be undertaken by the computer, there is usually a person driving the process, by making decisions with regard to the intent of the artefact being produced. Such human intervention can take different forms, for instance via the supplying of background information in scientific discovery tasks, or the making of choices during an evolutionary art session. We are interested in whether a notion of purpose could be projected onto The Painting Fool, an automated painter we hope will one day be taken seriously as a creative artist in its own right [3, 5]. Starting with the maxim that good art makes you think, we have enabled The Painting Fool to produce visual art specifically to invite the viewer to interpret the pieces in the context of the world around them, i.e., to make a point about a current aspect of society. However, we do not prescribe what aspect of modern life it should depict, nor do we describe the art materials (in this case digital images) it should use. Hence, we effectively opt out of specifying a purpose for any individual piece of art.
As described in section 2, the software starts with an instruction to access sources of news articles at regular intervals. Then, via text manipulation, text analysis, image retrieval, image manipulation, scene construction and non-photorealistic rendering techniques, the system produces a collage which depicts a particular news story. This is initial work, and more effort is needed to produce collages of greater aesthetic and semantic value. We present some preliminary results in section 3, including some illustrative examples and feedback from viewers of the collages produced. In section 4, we describe the next stages for this project.

2 Automated Collage Generation
A schematic for the collage generation system is provided in figure 1. At regular intervals, scheduled processes begin a flow of information involving the retrieval of news articles from internet sources (with the news source specified by the scheduled job); the extraction of keywords from the news articles; the retrieval of images using the keywords; and the construction of input files for The Painting Fool. The input files specify which images to annotate and extract colour segments from, how to arrange the segments in an overall collage, and what natural media to simulate when painting the segments to produce the final piece. We provide further details of these processes below, with full details of the overall system available in [7].
Fig. 1. Automated collage generation system overview.

Text Retrieval and Analysis. We enabled the system to access the Guardian News website and Google News search via their APIs. For the Guardian, the API provides access to headlines, for each of which there are a number of associated articles, multimedia files, blogs, forums, etc., and our system extracts the first text-based article from this list. The Google News API produces similar output from multiple news sources, from which we extract text-based news stories from the BBC, The Independent newspaper and the Reuters news service. The system retrieves only English-language headline articles, and we can specify whether the articles should be about World or country-specific issues only. The retrieved articles are cleaned of database information and HTML appendages to produce plain text. Following this, we use a text analysis technique to extract a specified number of keywords from the plain text. The technique is an implementation of the TextRank algorithm [8], which is designed to extract those keywords most indicative of the content of the document. It is based on PageRank [1], the algorithm used by Google to determine the importance of a web page based on the hyperlink structure of the web. The intuition behind PageRank is that important pages will be pointed at by other important pages, a recursive notion of importance which can be assigned numerical values using a simple iterative algorithm. The intuition behind TextRank is similar: important words in the document will be pointed at by other important words. In the context of a document, 'pointed at' is defined in [8] as being in the same context. Hence a graph representing the document can be created, where an edge exists between two words if those words appear in each other's contexts, where a context is just a fixed-size window on either side of the target word. TextRank then runs PageRank over this graph to assign a numerical value to each word and extracts the most important ones.
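The following Python sketch illustrates the TextRank idea described above: build a co-occurrence graph over a sliding window of tokens, then iterate the PageRank update. It is a simplified rendering of the algorithm in [8], not the authors' implementation; the window size, damping factor and iteration count used here are conventional defaults assumed for illustration.

from collections import defaultdict

def textrank_keywords(tokens, window=2, d=0.85, iters=50, top_n=10):
    # Undirected co-occurrence graph: an edge links two words that appear
    # within `window` positions of each other (i.e. in the same context).
    neighbours = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tokens[j] != w:
                neighbours[w].add(tokens[j])
                neighbours[tokens[j]].add(w)
    # Iterate the PageRank update: a word is important if important
    # words point at it.
    score = {w: 1.0 for w in neighbours}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[v] / len(neighbours[v])
                                      for v in neighbours[w])
                 for w in neighbours}
    return sorted(score, key=score.get, reverse=True)[:top_n]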
For our experiments, only nouns were extracted, as these were considered likely to be the most informative, and also the most useful keywords to use for image retrieval. Full details of the keyword extraction implementation are given in [6].

Image Retrieval and Manipulation. The keywords extracted from the news stories are used to retrieve art materials (i.e., digital images) from the internet and local sources. The system has access to the 32,000 images from the Corel library, which have been hand-tagged and can be relied on for images which match the given keywords well. We also wanted to include images retrieved from the Internet, as these add a level of surprise,1 and a more contemporary nature to the retrieved images. We interfaced the collage generation system with both the Google Images and the Flickr APIs. In the former case, the interface is fairly lightweight, given that Google supplies a set of URLs for each keyword, which point to the relevant images. In the latter case, however, a URL must be built from information retrieved from a photo-list, which is a non-trivial process. The three image sources (Corel, Google, Flickr) are queried in a random order, but when either Corel or Flickr returns empty results, Google is queried, as this always supplies images. Note that we discuss experiments comparing Flickr and Google images in section 3.
1 For instance, at the time of writing, querying Flickr for images tagged with the word "Obama" returns an image of a woman body-builder as the first result.
Fig. 2. Example collages produced by the system.
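A minimal sketch of the source-querying logic just described, assuming hypothetical wrapper functions query_corel, query_flickr and query_google for the three APIs (the paper does not show its actual interfaces):

import random

def retrieve_images(keyword, query_corel, query_flickr, query_google):
    # Query the sources in a random order; fall back to Google whenever a
    # source returns no results, since Google always supplies images.
    sources = [query_corel, query_flickr, query_google]
    random.shuffle(sources)
    for query in sources:
        images = query(keyword)
        if images:
            return images
    return query_google(keyword)  # guaranteed fallback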
Scene Construction and Rendering. In the final stage of processing, the retrieved images are assembled as a collage in one of a number of grid-based templates. Then the system employs The Painting Fool's non-photorealistic rendering capabilities [5] to draw/paint the collages with pencils, pastels and paints. In the future, we will use the more sophisticated scene generation techniques described in [4].

3 Initial Results
The first image in figure 2 portrays a typical collage produced by the system. Here, a scheduled process happened to retrieve a Guardian news story about the war in Afghanistan, with the headline `Brown may send more troops to Afghanistan'. From the text, the words afghanistan, brown, forces, troops, nato, british, speech, country, more and afghan were extracted. Images were retrieved from Flickr accordingly, including a picture of a fighter plane, a field of graves, a young woman in an ethnic headdress and an explosion. The rendering style for this was simply to segment each image into 1000 regions and present the images in an overlapping grid with 10 slots. This example hints at the ability of the system to construct a collage which can semantically complement a news story, or even add poignancy. The second collage in figure 2 provides a hint of the possibilities for more interesting and perhaps playful juxtapositions. This was produced in response to a news story on the England versus Australia Ashes test cricket series, which had the headline: `England versus Australia - as it happened!' The images of the Houses of Parliament and a kangaroo in the collage are fairly obvious additions. However, the collage also contains a picture of the famous Fallingwater building built by Frank Lloyd Wright. Upon investigation, we found that the name wright was extracted from the news story (as a member of the England cricket team), and the first Google image returned for that keyword is, of course, an image of Fallingwater.

In order to informally assess the power of the collages to represent the text upon which they are based, we showed a collection of the collages to 11 subjects. We asked them to complete a survey in which they were shown 12 news stories and 5 collages per news story, only one of which, called the master collage, was generated from the news story (the others were generated from similar stories). To generate the 12 master collages, we varied both the number of keywords to be extracted (5 and 10) and the image source (Google and Flickr). Subjects were asked to rank the 5 collages for each news story in terms of relevance, with rank 1 indicating the most relevant and 5 the least relevant. Taking the average of the ranks, we found that the master collage was ranked as (a) the most relevant in 8 of the 12 tests and (b) the second most relevant in the other 4 tests. The most marked difference we noticed was between the collages produced with Google and those produced with Flickr. In particular, the Google collages had an overall rank of 1.82, while the Flickr collages had an overall rank of 2.14. This suggests that image tagging in Flickr is not particularly reliable, with Google returning more relevant images. These results are encouraging, as they demonstrate that even via the abstractions of keyword extraction, image retrieval and non-photorealistic rendering, it is still possible for the collages to have semantic value with respect to the news stories from which they were derived.

4 Discussion and Further Work
The system described above is a prototype, and is presented largely as a proof of principle rather than a finished system. We have presented an illustrative example of how the pipeline of processes can produce a collage with the potential to make viewers engage their mental faculties, in this case about warfare. The value here is not necessarily in the quality of the final artefacts, which are currently a little naïve, but rather in the fact that we had little idea of what would be produced. We argue that, as we did not know what the content of the collages would be, it cannot have been us who provided the intention for them. Given that these collages are based on current events and do have the potential to engage audiences, we can argue that the software provided the intent in this case (perhaps subject to further discussion of intent and purpose in art). In [3], we argue that the perception of how a computational system produces artefacts is as important as the final product. Hence, the fact that the system supplies its own purpose in automated art generation may add extra value to the artworks produced. Having said that, there are a number of improvements to the process we intend to make in order to increase the visual and semantic appeal of the collages. In particular, we plan to make better use of The Painting Fool's scene construction abilities, and to implement scene construction techniques which are aware of the context of the news story being portrayed. For instance, if the text of the news article has a distinctive plot-line, then a linear collage might best portray the narrative of the story, with images juxtaposed in an appropriate order. However, if the major aspect of an article is its mood, then a more abstract collage might best portray this. We also plan to involve text summarisation software to provide titles, wall text and other written materials for the collages.
We hope to show that by stepping back from certain creative responsibilities (described as `climbing the meta-mountain' in [4]), such as specifying the intent for a piece of art, we make it possible to project more creativity onto the collage generation system than if a person were guiding the process. Our long-term goal for The Painting Fool is for it to be accepted as a creative artist in its own right. Being able to operate on a conceptual level is essential for the development of The Painting Fool, hence we will pursue further interactions with text analysis and generation systems in the future. We would like to thank the anonymous reviewers for their useful advice. One reviewer stated that in some scientific theory formation systems, the software is not perceived as uncreative because of a lack of intent. As the engineers of scientific discovery software [2], when running sessions, we always provide intent through our choice of background material and our choices for evaluating the theory constituents. Hence, a critic could potentially argue that the software is not being creative, as it has no purpose of its own. We believe that this is true of most other scientific discovery systems, especially machine-learning-based approaches such as [9], where finding a classifier is the explicit user-supplied intention. The reviewer also compared the collage generation system with Weizenbaum's famous Eliza program. We find it difficult to see the comparison, given that the collage generation system is given no stimulus from a user, whereas Eliza reacts repeatedly and explicitly to user input. A more accurate analogy in the visual arts would be image filtering, where an altered version of the user's stimulus is presented back to them for consideration. It is clear that the notion of intent in software causes healthy disagreements, and perhaps our main contribution here is to have started a fruitful discussion on this topic.

2010_6 !2010 A Step Towards the Evolution of Visual Languages
Penousal Machado and Henrique Nunes
CISUC, Department of Informatics Engineering, University of Coimbra, 3030 Coimbra, Portugal
machado@dei.uc.pt

Abstract. Traditional Evolutionary Art systems allow the evolution of individual artworks. We present a novel Evolutionary Art engine for the evolution of visual languages. In our approach, each individual is a context free grammar that specifies an entire family of shapes following the same production rules. Therefore, search takes place at a higher level of abstraction (families of shapes) than the one typically explored in Evolutionary Art systems (individual shapes). The description of the system gives particular emphasis to the novel aspects of the approach and to the generative potential of the representation.

1 Introduction
Stiny and Gips [1] introduced the concept of Shape Grammars, which "are similar to phrase structure grammars, which were introduced by Chomsky in linguistics. Where phrase structure grammars are defined over an alphabet of symbols and generate one-dimensional strings of symbols, shape grammars are defined over an alphabet of shapes and generate n-dimensional shapes." [1]. Stiny and Gips have successfully built shape grammars that capture the "language" of Frank Lloyd Wright's prairie houses, Mughul Gardens, Palladian plans, etc. Additionally, they were also able to use these grammars to produce new instances of the same language, e.g.
to create new prairie house designs that obey Frank Lloyd Wright's style to the point of being indistinguishable, even to experts, from his original works [2]. Although these grammars are hand-built, and result from a complex process of analysis and formalization of the hidden rules followed in the originals, these results show that: (i) it is possible to capture specific visual languages using a set of production rules; (ii) it is then possible to use this set of rules to automatically generate new objects that belong to the same visual language. The main motivation for the present work is the development of a system for the evolution of novel visual languages. For that purpose we created an evolutionary engine where each individual is a context free grammar, and developed appropriate genetic operators, including several mutation operators and a graph-based crossover. The use of the Context Free Design Grammar (CFDG) language [3] for representation allows the specification of complex families of shapes through a compact set of rules, and has several potential advantages over typical Evolutionary Art (EA) representations. A brief overview of current evolutionary art systems, focusing on representation issues, is presented in Section 2 to allow a better contextualization of the present work, while, in Section 3, we describe the most relevant characteristics of CFDG. In Section 4 we describe our evolutionary engine, and, in Section 5, we present some of the experimental results attained. Finally, in Section 6, we draw some final conclusions and indicate future work.

2 Related Work
A thorough survey of EA systems is beyond the scope of this paper (see, e.g., [4] for an in-depth survey). Here we present a brief overview, focusing on the issues that are of most relevance to the present work, namely the representation scheme and the generation abilities of the systems. The most popular EA approach is inspired by the seminal work of Karl Sims [5]. It uses Genetic Programming (GP) to evolve populations of images. Each genotype is a tree that encodes a LISP-like symbolic expression. The internal nodes of the tree are functions (typically arithmetic, trigonometric and image processing operations) and the leaves are terminals (typically x and y variables and random constants). The rendering of the expression results in a phenotype, e.g. an image. More often than not, user-guided evolution is employed, i.e., the user assigns fitness to the images, thus indirectly determining the survival and mating probabilities of the individuals. The fittest individuals have a higher probability of being selected for the creation of the next population, which is generated through the recombination and mutation of the genetic code of the selected individuals. The use of Genetic Algorithms (GAs) coupled with a fixed- or variable-length string representation is also frequent. In these cases the most common approach is parametric evolution [4]. In other words, the genotype encodes a set of parameters that determine the phenotype. Among other applications, this approach has been used to evolve cartoon faces, fractal shapes and fonts [4]. The use of EC approaches for the evolution of line-based drawings, 3D shapes, L-systems, filters, etc., has also been explored. Although the application area and implementation details vary, most systems can be seen as instances of expression-based or parametric evolution.
As Machado and Cardoso [6] point out, most expression-based EA systems are theoretically able to create any image (see also [7]). Nevertheless, in practice, the image space that is actually explored depends heavily on the particularities of the system (primitives, genetic operators, genotype-phenotype mapping, etc.). In other words, and notwithstanding works such as [8], which describes an approach to iterative stylistic change in EA, most systems have an identifiable signature that naturally emerges from the interactions between their different components. In parametric evolution models, the system signature is even stronger since "...creating a parametric model implicitly creates a set of possible designs or a solution space." [4] Thus, there are strong constraints that limit the search space and define the type of imagery produced by the system. Finally, to the best of our knowledge, there are two reported examples of the use of CFDG in the context of Evolutionary Art. Unfortunately, neither of them allows the evolution of visual languages. As the name indicates, CFDG Mutate [9] only allows the application of mutation operators, which is limiting, and does not handle non-deterministic grammars, which means each individual represents a single shape (see Section 3). Saunders and Grace [10] present a parametric evolution model that evolves the parameters of specific hand-built CFDG grammars. Although this allows some degree of exploration, in essence it has the same shortcomings as other parametric evolution approaches.

3 Context Free
Context Free [11] is a popular open-source application that renders images specified using a simple language entitled Context Free Design Grammar (CFDG). In essence, and although the notation is different from the one used in formal language theory, a CFDG program is a context free grammar, i.e. a 4-tuple (V, Σ, R, S) where: 1. V is a set of non-terminal symbols; 2. Σ is a set of terminal symbols; 3. R is a set of production rules that map from V to (V ∪ Σ)*; 4. S is the initial symbol. In Fig. 1 we present a simple grammar and the image generated by it. Programs are interpreted by starting with S (in this case S = TREE, as defined by the startshape directive) and proceeding with the expansion of the production rules in breadth-first fashion. Pre-defined V symbols call drawing primitives (e.g. CIRCLE draws a circle) while pre-defined Σ symbols produce semantic operations (e.g. size produces a scale change, y moves forward, etc.). Program interpretation terminates when one of the two following criteria is met: (i) there are no V symbols left to expand; (ii) further expansion does not change the image (e.g., although the recursive loop of the grammar presented in Fig. 1 is endless, the set of transformations is contractive [12], so after a few iterations we reach a size smaller than pixel size and, therefore, further expansion will not cause visible differences). This second termination criterion has no parallel in formal language theory, but is similar to the termination criterion used in rendering Iterated Function Systems (IFSs). The grammar depicted in Fig. 1 is deterministic: there is exactly one rule for each V symbol, so its interpretation will always result in the same image.
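The interpretation loop and the contractive termination criterion can be sketched as follows. This is a toy Python model of Context Free's semantics, not its implementation: the rule table maps each non-terminal to the calls its body makes, only a relative scale is tracked, and PIXEL_SIZE is an illustrative sub-pixel threshold.

from collections import deque

PIXEL_SIZE = 0.005  # illustrative threshold: below this, expansion is invisible

def interpret(rules, start, draw):
    # Breadth-first expansion of the start symbol. `rules` maps a
    # non-terminal to a list of (callee, relative_scale) pairs.
    queue = deque([(start, 1.0)])
    while queue:
        symbol, scale = queue.popleft()
        if scale < PIXEL_SIZE:
            continue  # criterion (ii): further expansion cannot change the image
        if symbol not in rules:  # stands in for pre-defined primitives, e.g. CIRCLE
            draw(symbol, scale)
            continue
        for callee, rel_scale in rules[symbol]:
            queue.append((callee, scale * rel_scale))

With rules = {"TREE": [("CIRCLE", 1.0), ("TREE", 0.95)]}, for example, the recursion is endless on paper but the loop terminates once the accumulated scale falls below pixel size, mirroring the contractive behaviour described above.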
To specify languages of shapes we have to resort to non-determinism. In Fig. 2 we present a non-deterministic version of this grammar, with two different production rules for the TREE symbol. When several production rules may be applied, one of them is selected randomly and the expansion of the grammar proceeds. One can control the relative probability of selection by specifying a weight after the V symbol (in this case, 0.8 for the first rule and 0.2 for the second).1
1 The same effect can be attained by making copies of the production rule we wish to use more frequently, so this does not violate the formal-language-theory definition of a context free grammar.

startshape TREE
rule TREE { CIRCLE {} TREEA {size 0.95 y 1.6} }
rule TREEA { CIRCLE {} TREEB {size 0.95 y 1.6} }
rule TREEB { CIRCLE {} TREEC {size 0.95 y 1.6} }
rule TREEC { CIRCLE {} TREED {size 0.95 y 1.6} }
rule TREED { CIRCLE {} TREE {size 0.95 y 1.6 rotate 45} TREE {size 0.95 y 1.6 rotate -45} }

Fig. 1. A deterministic grammar and the tree-like shape generated by it.

startshape TREE
rule TREE 0.80 { CIRCLE {} TREE {size 0.95 y 1.6} }
rule TREE 0.20 { CIRCLE {} TREE {size 0.95 y 1.6 rotate 45} TREE {size 0.95 y 1.6 rotate -45} }

Fig. 2. A non-deterministic version of the grammar presented in Fig. 1 and instances of the family of tree-like shapes generated by it.

4 Evolutionary Context Free Art
In this section we describe our evolutionary engine. For the sake of parsimony we will avoid mentioning implementation details and focus on the key components. An in-depth description is left for a future opportunity.

4.1 Representation
Each genotype is a well-constructed CFDG grammar. Internally, the genotype is represented by a directed graph where each node encapsulates a production rule. For each node, Ni, outgoing edges are created in the following way: 1. let Vi be the set of all V symbols that the production generates; 2. let Mi be the set of all nodes representing production rules that may be triggered by Vi symbols; 3. establish edges from Ni to all Mi nodes. For instance, the grammar of Fig. 1 results in the following edges: TREE → TREEA, TREEA → TREEB, TREEB → TREEC, TREEC → TREED, TREED → TREE; for the grammar of Fig. 2 we would have TREE1 → TREE1, TREE1 → TREE2, TREE2 → TREE1, TREE2 → TREE2, where TREE1 and TREE2 represent the nodes that would be created for the first and second production, respectively. The phenotype is rendered using Context Free. Infinite non-contractive loops may occur. To cope with this problem we specify a maximum amount of time for rendering. If that limit is reached, rendering stops and the current image is considered the phenotype.

4.2 Genetic Operators
The design of genetic operators that are well-suited to the adopted representation is vital for the success of any evolutionary algorithm. In our case the biggest challenge was to design a recombination operator that allows the meaningful exchange of genetic material between individuals. Given the nature of the representation, we developed a graph-based crossover operator based on the one presented by Pereira et al. [13]. In simple terms, this operator, inspired by the standard GP swap-tree crossover, allows the exchange of subgraphs between individuals. Our implementation follows the algorithm described in [13] closely, but we have generalized it to allow the exchange of subgraphs of unequal size. In Fig. 3 we present examples attained through crossover.
Fig. 3. The two leftmost images are the parents; the remaining ones are results of their crossover.
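The edge-construction procedure of Sect. 4.1 is easy to state in code. In the sketch below a genotype is reduced to a mapping from rule identifiers to (head symbol, generated V symbols); this simplified structure is our assumption for illustration, not the authors' data structure.

from collections import defaultdict

def build_edges(productions):
    # productions: rule_id -> (head_symbol, list of V symbols the body generates).
    # Index the rules by the non-terminal they define.
    rules_for = defaultdict(set)
    for rule_id, (head, _) in productions.items():
        rules_for[head].add(rule_id)
    # Connect each node Ni to every node whose production may be triggered
    # by a symbol that Ni generates.
    edges = {rule_id: set() for rule_id in productions}
    for rule_id, (_, body_symbols) in productions.items():
        for v in body_symbols:
            edges[rule_id] |= rules_for[v]
    return edges

For the grammar of Fig. 2, productions = {"TREE1": ("TREE", ["TREE"]), "TREE2": ("TREE", ["TREE", "TREE"])} yields exactly the four edges listed above.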
We use a total of eight mutation operators: Startshape mutate, which randomly selects a new V starting symbol; Add V, which adds a V symbol to a given production rule in a valid random position; Remove V, which removes a V symbol from a given production rule, along with its associated parameters (if any exist); Copy rule, which duplicates a production rule; Remove rule, which removes a given production rule, updating the remaining rules when necessary (if it is the only production rule associated with a given V symbol, production rules that generate that symbol must be updated, which is accomplished by removing the symbol from those rules); and Change, Remove and Add parameter, which, as the names indicate, change, remove or add parameters, i.e. Σ symbols. All operators preserve the validity of the grammar and update the graph accordingly.
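Over the same simplified rule-table representation used in the previous sketch, two of the eight operators might look as follows. This is a hedged illustration, not the authors' code; in particular, the fresh-identifier handling is naive.

import copy
import random

def copy_rule(productions):
    # 'Copy rule': duplicate a randomly chosen production under a fresh id.
    rule_id = random.choice(list(productions))
    productions[rule_id + "_copy"] = copy.deepcopy(productions[rule_id])

def remove_rule(productions, rule_id):
    # 'Remove rule': delete a production; if it was the only rule for its
    # head symbol, strip that symbol from every remaining rule body so the
    # grammar stays valid.
    head, _ = productions.pop(rule_id)
    if all(h != head for h, _ in productions.values()):
        for _, body in productions.values():
            body[:] = [v for v in body if v != head]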
4.3 System Overview and Generation Abilities
In all the experiments presented in this paper we adopted a user-guided evolution approach. Unlike most EA representation schemes, it is feasible to edit CFDG by hand; in fact, Context Free users have already created an impressive collection of shapes and visual languages. As such, it would be possible for the user to directly manipulate the genetic code. Although the ability to read, understand and edit the evolved genotypes is an important advantage, in the experiments presented herein we did not take advantage of this possibility. In standard EA systems the initial population is either random or seeded using examples from previous EA runs. In the present system, given the availability of a wide set of hand-coded CFDG grammars, we have the option of using top-quality grammars to seed the evolutionary runs. The remaining aspects of the system follow standard Evolutionary Computation practices. We use a generational approach, elitism and tournament selection. In the experiments presented in this paper, the population size was 50, the top individual was preserved, and the tournament size was 5. As regards the generative potential of the system, it is trivial to show that it is possible to represent any image. Suppose you want to represent a particular image: for each pixel, use a rule that changes the color to match the pixel's color, draws one square, moves in the direction of the next pixel, and calls the rule for the next pixel. Obviously this would result in an extremely long and mostly useless grammar, but it can be done. Another way of demonstrating the generality of CFDG is the following: considering that an IFS can be specified (compactly) using CFDG, and that Barnsley [12] demonstrated that IFSs can be used to generate any image, the same applies to CFDG. Although these generic representation abilities are theoretically relevant, in practice the main issue is knowing what types of images can be represented compactly. The wide set of imagery produced by Context Free users indicates that it is possible to generate a large number of complex and beautiful shapes with surprisingly small grammars. A more interesting question is which languages of shapes can be expressed using CFDG. Once again, theory and practice can be quite different. From a theoretical standpoint, the set of all images of a given resolution, albeit vast, is finite [7]. As such, considering a fixed resolution, any given shape language is also a finite collection of shapes. Since it is possible to represent any image with CFDG, and since union is a trivial operation for context free languages, it follows that any family of shapes can be represented.2 In practice, defining a language of shapes through enumeration is either infeasible or uninteresting. Thus, from a practical standpoint we can consider the set of all images and the set of all shape languages infinite. It then follows that it is not possible to represent every language of shapes, nor even the set of all recursive languages, since the pumping lemma for context free languages applies. As before, the hand-coded examples of CFDG languages developed by the numerous users of Context Free indicate that, in spite of these limitations, it is possible to create interesting and sophisticated shape languages with compact CFDG grammars.
2 Consider two grammars, A and B, with initial symbols SA and SB. A ∪ B can be attained by: preserving all the production rules of A and B; creating a new initial symbol, SA∪B; and adding the production rules SA∪B → SA and SA∪B → SB.
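The union construction of footnote 2 is mechanical; here is a sketch over the same rule-table representation (the fresh start symbol's name is arbitrary, and rule identifiers are assumed not to clash across the two grammars):

def grammar_union(prods_a, start_a, prods_b, start_b, new_start="S_UNION"):
    # Keep every production of A and B, then add two productions taking the
    # new start symbol to each original start symbol.
    union = {**prods_a, **prods_b}
    union[new_start + "_to_A"] = (new_start, [start_a])
    union[new_start + "_to_B"] = (new_start, [start_b])
    return union, new_start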
5 Experimentation
The main goals of the experiments conducted were to test the ability of the system to: (i) evolve appealing and complex shapes; (ii) cope with hand-coded CFDGs; (iii) evolve families of shapes. The analysis of the experimental results attained by evolutionary art systems, especially user-driven ones, entails a high degree of subjectivity. In our case there is an additional difficulty: each individual encodes a set of images. Considering these difficulties, space restrictions, and the visual nature of the results, we chose to focus on a single evolutionary run, which can be considered typical. To address goal (ii) we initiated the run using 6 hand-coded CFDGs downloaded from the Context Free gallery [11]. Fig. 4 presents examples of images created by these grammars.
Fig. 4. Hand-coded grammars used as the initial population.
Regarding goals (i) and (iii), we already know that it is possible to create stunning imagery and visual languages using CFDG, so the main issue is determining whether it is possible to guide the EC algorithm to promising areas of the search space. The visual diversity of the populations found throughout the run was always high (see Fig. 5).
Fig. 5. Images generated by the first 15 individuals of the 10th population of the run.
Population diversity is generally welcome and has the additional benefit of keeping the user engaged; however, we found that the user was often distracted by the presence of too many interesting alternatives, and unable to keep steady evaluation criteria. Nevertheless, the user was able to guide the EC algorithm with relative ease and to promote convergence whenever this was found necessary. In Fig. 6 we present some of the favorite images produced by individuals of this run.
Fig. 6. Images generated by some of the most valued individuals evolved during the course of the run.
The mutation operators proved valid throughout the run, producing results that are conceptually similar to the effects of mutation in expression-based EA. That is, the effects of mutation range from minor visual alterations to dramatic changes in appearance induced by small changes of the genetic code, with the latter being less frequent [6]. Although such a judgment is subjective, as far as mutation is concerned the system appears to have an adequate degree of plasticity (allowing change) and stability (preventing chaotic behaviour). The effects of the crossover operator appear to depend heavily on the structural similarity of the genotypes and on their size. In general terms, when the parents are unrelated, the visual appearance of each descendant tends to be mostly determined by one of the parents (see Fig. 3). This effect is particularly visible when the genotypes are small and with hand-built grammars (which tend to bear little resemblance to one another). As before, similar findings have been reported for expression-based EA, particularly for non-random initial populations [6]. In Fig. 7 we present instances of the visual languages defined by two of the individuals evolved during the run. The experimental results show that nontrivial and interesting families of shapes were evolved. During evolution the user only has access to one instance of the images an individual generates. This means that the quality, diversity and consistency of the language of shapes generated by the individual are not directly assessed. Arguably, individuals that fail to reliably generate high-quality images will eventually be discarded by evolution, and the user will eventually grow tired of individuals that systematically generate the same image. Nevertheless, this is not the same as directly assessing the language of shapes that each individual defines, and interesting languages may easily be overlooked. This may contribute to a lower diversity of shapes within each family. In spite of this, the ability to create families of shapes is inherent to the system, and the experiments successfully evolved interesting visual languages.

6 Conclusions and Future Work
We presented a novel evolutionary engine that allows the evolution of CFDGs, is able to cope with non-deterministic grammars, and allows their recombination through a graph-based crossover operator. Due to these abilities, it successfully overcomes the limitations of previous EC approaches in which CFDGs are used. When compared with typical expression-based and parametric evolution models, our approach presents several advantages, including the ability to evolve visual languages instead of individual images, to use hand-coded grammars, and to allow the user to edit the genotypes. Although the interpretation of the results is subjective, they provide evidence of the adequacy of the genetic operators and of the generative power and potential of the system. They also indicate that further experimentation is required to fully explore the potential of the approach for the creation of visual languages. Nevertheless, we consider this to be an important step in that direction. In terms of future work, redesigning the user interface, exploring automatic image fitness assignment schemes, and developing approaches to automatically assess a language of shapes in terms of the consistency, diversity and aesthetic qualities of the generated images are our top priorities.

2010_7 !2010 On the Role of Metaphor in Creative Cognition
Bipin Indurkhya
Cognitive Science Laboratory, International Institute of Information Technology, Hyderabad 500032, India
bipin@iiit.ac.in

Abstract. We consider some examples of creativity in a number of diverse cognitive domains, like art, science, mathematics, product development and legal reasoning, to articulate an operational account of creative cognition. We present a model of cognition that explains how metaphor creates new insights into an object or a situation. The model is based on the assumption that cognition invariably leads to a loss of information and that metaphor can recover some of this lost information.
In this model we also contrast the role of traditional analogy (mapping based on an existing conceptualization) with the role of metaphor (destroying existing conceptualizations in order to create new ones).

1 Introduction
Though there have been many approaches to characterizing creativity [17] [25] [31], we start with a simple approach which sees creativity as a process of generating a new perspective on a problem or a situation. We are limiting ourselves to individual creativity here, so the information resulting from this process need only be novel to the cognitive agent, and we do not yet concern ourselves with creativity in a society. Secondly, we do not consider the usefulness of the generated information: it is sufficient for us here that the information be novel to the agent. In fact, if the model presented here is correct, it implies that there cannot be a domain-independent principle or heuristic that would generate only (or largely) useful perspectives. With these assumptions in place, the task we are undertaking is to propose a model that articulates the role of metaphor in the creative process and also explains why metaphor is so effective in generating new perspectives. In this model, we will also compare the role of analogy with the role of metaphor, and argue that the two play complementary roles in creative cognition. The paper is organized as follows. In the next section we present a few examples to illustrate how creative insights are obtained in a few diverse domains. Following this, in Section 3, we present an account in which cognition is seen to necessarily involve the loss of some information, and in which metaphor becomes one of the tools that make it possible to recover some of this lost information. At the end of this section we also compare the role of analogy with the role of metaphor. Finally, in Section 4, we summarize the main points of this paper and mention future research directions.

2 Creativity in Cognition: Some Examples
We start by considering some concrete instances where a new insight or a new perspective was generated. The examples are taken from a number of diverse domains including art, legal interpretation, mathematics, and product development. At the end of this section we present a brief overview of the cognitive mechanisms underlying creativity that have been proposed in past research.

2.1 Creativity in Art
In a recent study, Okada et al. [20] consider the evolution of artistic style and creativity in the works of the Japanese artist Shinji Ogawa over several years. One interesting point in this study is how the artist hit upon an idea that led to a series of works: "[Shinji Ogawa] was a part-time teacher at a vocational-technical school of media art. When he was preparing for a class, he accidentally erased part of a picture on a computer screen by mistakenly pushing a keyboard button. At that moment, he came up with the idea that if something very important and valuable suddenly disappears, a new value may be generated and a new world could be created. With this idea, he tried to create a new movie poster for Roman Holiday by erasing the main actress, Audrey Hepburn, from the original poster. This was the beginning of the artwork series, 'Without You'." [p. 194]. Though the authors chose to interpret this example in terms of analogical modification, it resonates strongly with Piaget's account of how new schemas emerge through sensorimotor interactions with the environment.
The example presented above bears a strong resemblance to Piaget's account of how a child brings a toy to her mouth in order to suck, accidentally notices the bright color of the toy, and starts bringing toys near her face to look at them, eventually generalizing into a schema of 'bringing objects to the face in order to look at them' [21][22]. In Mr. Ogawa's case, he accidentally discovered the operation of 'delete figure from a picture', realized its artistic potential, and a new style of artwork was born. That the discovery was made accidentally is not so relevant for our argument here; what we would like to emphasize is that the discovery resulted from the application of a familiar operation ('delete') to a familiar object, but in a novel way. Interestingly, similar episodes occurred later in Mr. Ogawa's career as well. Okada et al. note: "Mr. Ogawa happened to pick up a postcard at hand with old Western scenery and drew a duplicate building next to an original one. Then he mailed it, as a postcard, to a gallery owner. When he heard from the gallery owner telling him that staff members of the gallery talked highly about his postcard, Mr. Ogawa decided to start a new artwork series, 'Perfect World', in which he duplicates a person or a thing in postcards or photographs of scenery." [p. 195] The operations of 'delete' and 'duplicate' are quite similar. In the framework of Hofstadter [7], one could say that one operation slipped into a neighboring operation to lead to another creative insight. Or one could see it in terms of a Piagetian schema of related operations that are applied to a different class of objects. It is important to underscore the 'different' part here. When handling photographs of famous landmarks, people, etc., we might change the contrast or brightness level, or perhaps apply a red-eye reduction tool, but we do not normally delete or duplicate objects, much less so if the object is the main theme of the photograph. In other words, we could say that the creative insights resulted from applying a set of familiar operations to a set of (also familiar) objects that are not usually associated with those operations. This is the key point that makes metaphor an invaluable tool for generating creative insights, something we will keep reiterating in the rest of the paper.

2.2 Creativity in Legal Interpretation
Even though law is a domain that is normally not associated with creativity (one expects a straightforward application of legal principles, and many judicial scholars frown on any deviation from the literal interpretation of the legal text), in our previous research [8] [10] we have found a number of situations where a new perspective or insight was a key factor in a legal discourse. We briefly present two such examples here. The first example is taken from [8]. In Australia and England, when a married couple divorced, the division of property was determined in large part by the old case law of 'Husband and Wife' and by various Acts. These generally provided for division according to the economic value added to the marital assets. This was plainly unjust where the husband had worked while the wife cared for children and maintained the household. In such situations, the standard decision was, until recently, that the husband would get the lion's share of the property.
However, in an example of productive thinking, in Baumgartner v Baumgartner [(1987) 164 CLR 137] the High Court of Australia introduced a principle from a completely different area of law and held that the work the wife had put into the house meant that she had an equitable interest in it. The husband, though legally the owner of the house, actually held part of it in a 'constructive trust' for his wife. This decision was soon followed by a number of other similar decisions by other courts, and is now the standard approach. This illustrates a novel application of the legal concept of constructive trusts to a set of situations for which it was not originally intended, which resulted in a new way of rendering judgment on them. Similarly, the decision of Lord Denning in the High Trees case [Central London Property Trust Ltd v High Trees House Ltd [1947] KB 130], which modified contract law by introducing another equitable principle, 'promissory estoppel', is another example where a legal concept from a different area was applied to deal with a problem in a domain for which it was not originally intended. (See also [27].) The second example [10] concerns the case of a hot-dog stand operator, who claimed a tax deduction for the kitchen at home where the hot-dogs were prepared [Baie, 74 T.C. 105 (1980)]. One argument made by B. was that her kitchen was a manufacturing facility of the business. The judges remarked: "We find this argument ingenious and appealing, but, unfortunately, insufficient to overcome the unambiguous mandate of the statute." [74 T.C. 110 (1980)]. The point to emphasize here is that the category 'manufacturing facility', which is not normally associated with this situation, was applied to the kitchen where hot dogs are prepared, resulting in a novel perspective. To summarize, we see that the application of a legal concept to a domain or a situation for which it was not originally intended can sometimes result in a new way of looking at the situation, thereby leading to a novel judgment.

2.3 Creativity in Mathematical Reasoning
Consider Georg Cantor's theory of transfinite numbers, in particular his arguments concerning the levels of infinity [1]. Two of his key proofs, namely that 1) the rational numbers have the same cardinality as the natural numbers, and that 2) the real numbers are more numerous than the natural numbers, can now be understood by a high-school student. However, when originally proposed, they were considered very radical. Many leading mathematicians at the time refused to accept his formalization of set theory and its implications for infinite sets. Yet Cantor's insights were derived from applying the operation of making a one-to-one correspondence, which had been well known for finite sets for hundreds of years, to infinite sets. In addition, he used a particular way of arranging infinite numbers in an array and counting them in such a way that two (or more) infinite dimensions can be mapped onto a single dimension of infinity. A somewhat different operation applied to a similarly arranged two-dimensional layout of numbers led to his famous diagonal argument, where he showed that certain sets cannot be put in a one-to-one correspondence with the natural numbers. To emphasize: we see again that the application of familiar operations to a different set of objects resulted in a novel perspective. Indeed, the theorems and proofs discovered by Cantor revealed a whole new aspect of numbers and opened a fresh chapter in mathematical research.
2.4 Creativity in Product Development
Consider a case study described in Schön [24], where a product development team was faced with the problem of figuring out why synthetic-fiber paintbrushes were not performing as well as natural-fiber paintbrushes, and of improving their performance. The members of the team tried many ideas (for instance, they noticed that the natural fibers had frayed ends, and they tried synthetic fibers with frayed ends too) but without success. The breakthrough came when one member of the team suggested that the paintbrush might work as a pump. This idea was initially considered quite shocking, for a paintbrush and a pump were thought to be very dissimilar. Yet, in trying to make sense of the analogy, a new ontology and structure for the paintbrush was created. In this new representation, the paint was sucked into the space between the fibers through capillary action, and when the fibers were pressed against the surface to be painted, the curvature of the fibers caused a difference in pressure that pumped the paint out from the space between the fibers onto the surface. From this new ontology, when the synthetic-fiber and natural-fiber paintbrushes were compared, it was found that the synthetic fibers bent at a sharp angle against the surface, whereas the natural fibers formed a gradual curve. Thus, juxtaposition with pumping caused a new perspective to be created on the process of painting and on the paintbrush itself. There are many other such examples [6] where seeing one familiar object as another familiar object, but one that is not normally associated with the first, led to a new perspective and eventually to solving a difficult problem.

2.5 Cognitive Mechanisms of Creativity
So far we have seen a number of examples where a set of operations or concepts is applied to an object or a situation with which they are not normally associated, resulting in a novel perspective. Perhaps not surprisingly, such mechanisms have been noted and studied in the past by various researchers, under different labels. Here we summarize a few major veins of this research. Making the Familiar Strange: Gordon and his colleagues [6] studied creative problem solving in real-life situations for many years, and found that one way to get a new perspective on the target problem is to look at it in a strange way. The mechanism they proposed is to juxtapose the target problem or object with a completely unrelated object or situation. Displacement of Concepts: Schön [24] emphasized that in order to get a new insight about a concept, it needs to be displaced, that is, put in the context of other, unrelated concepts. He emphasized that the most important step in problem solving is problem setting, that is, how the problem is stated and viewed, and that metaphors play a key role in this step. Bisociation: Koestler [18] coined this term to emphasize that the pattern underlying a creative act is the perception of a situation or an idea in two self-consistent but habitually incompatible frames of reference. Lateral Thinking: Edward de Bono [4] contrasted vertical thinking with lateral thinking. In the former, one starts with some assumptions and explores their implications deeper and deeper. In lateral thinking, the goal is to look at the problem in different ways so that the familiar assumptions one makes about it can be questioned and perhaps a new set of assumptions can be brought in. Estrangement:
Rodari [23] focused on creativity in inventing stories, and proposed many practical methods to stimulate imagination and creativity in children (and in adults). Many of his methods rely on the random juxtaposition of concepts. One mechanism he emphasizes as the first step in creating riddles is estrangement, where you are asked to see the object as if for the first time. In other words, instead of seeing the object in terms of the familiar categories it naturally evokes, you are asked to consciously block this evocation and try to view the object as if it were a strange object you are seeing for the first time. Conceptual Blending: Fauconnier and Turner [5] analyzed how people combine perceptual, experiential and conceptual aspects of different concepts subconsciously to generate new insights. Though each of these approaches has its own peculiarities, they all emphasize that in order to get a new insight about an object or situation, we need to get away from, or break, its existing conceptualization. In this task, viewing the object in terms of (or juxtaposing it with) another, unrelated object can be a key step.

3 An Account of Creativity in Cognition
We saw numerous examples in the last section showing that to get a new perspective on an object or a situation, an approach that often works is to apply operations that are not normally associated with that object, or to see that object as another, unrelated object. Here we propose a model to explain why this process works as it does.

3.1 Cognition and Loss of Information
Here we argue that every act of conceptualization (or cognition) invariably involves some loss of information. When we choose to label an object a 'chair', numerous specific details of the object, like its color, the material it is made of, its shape, etc., are all lost. Of course, we could make our conceptualization of the object more specific (it is a red chair, made of teak, with a high back, and so on), but no matter how detailed the conceptual representation is made, there are always some aspects of the object that are excluded, and it is these excluded aspects that constitute the information lost in the conceptualization. (This precisely is the theme of a short story, Del rigor en la ciencia (On Exactitude in Science), by Jorge Luis Borges and Adolfo Bioy Casares.) Whenever this lost information becomes crucial to solving the problem, the existing representation becomes hopelessly inadequate. In the paintbrush example presented above, the information about the spaces between the brush fibers, etc., was discarded in the then-existing model of painting, so no matter how much and how hard the product development team tried, the problem could not be solved. It was necessary to change the representation, or the conceptualization, of the object.

3.2 Interaction with the Environment Through Actions and Gestalt Projection
If the hypothesis presented above is correct, namely that some information is invariably lost in conceptualization, the next question is how we can recover, at least partially, this information. Here we assume that we do not have a God's-eye view of the world, meaning that we do not have another way to access the object except through the cognitive agent. This may seem a technicality, but it is a very crucial point, so let us elaborate a bit. In a computer simulation or a model, one can posit a very rich and detailed representation of the object, and then show how a conceptualization picks out some aspects of this rich representation while ignoring others.
For example, the rich representation of a chair may include its material, shape, color, weight, and so on, but the conceptualization might include only legs, seat and back. However, for us here, the rich representation is not available, for if it were, it would be just another conceptualization, and there would still be some lost information. In other words, the claim that our conceptual representation of an object does not include all the information about the object is like an existence proof: we can argue that the lost information exists, but we cannot say what it is. So the key question is how we can become aware of this lost information, and how we can recover at least some of it. Piaget's action-oriented approach provides a possible way of addressing this question. Piaget argued that an object is relevant or meaningful to a cognitive agent only insofar as the agent may act on it. Thus, a ball is something that a baby might roll, kick, squeeze, and so on. In this approach, novel aspects of an object may be revealed when the agent carries out new actions on it. Moreover, the actions can be internalized actions, which are called operations in Piaget's framework. In our earlier work [13], we used the term gestalt projection to emphasize that it is not just individual operations, but a network of operations, namely a schema or a gestalt, that is projected onto the internalized object or situation. In other words, a cognitive agent can get more information about an object or a situation by projecting a different gestalt, or a different set of operations, onto it.

3.3 Metaphor: A Tool for Generating New Information about the Environment
So far we have argued that all conceptualization involves some loss of information, and that some of this lost information may be recovered by projecting a different gestalt or a different set of operations onto the object. But this is essentially what a metaphor does! By inviting us to see one object as another, we are forced to project the conceptual organization of the second object (usually referred to as the source) onto the experiences, images, etc. of the first object (usually referred to as the target). Thus, metaphor can be a useful and powerful tool for getting new information about the environment. This is essentially the crux of the arguments made by Turbayne [28]. He argued that though we can understand the world only through some metaphor or other, we enrich our understanding by viewing the world through two different metaphors. What a metaphor does is essentially give us an alternate conceptualization of the target. While this alternative conceptualization also loses some information (as all conceptualizations do), the point is that it loses a different kind of information than was lost in the original conceptualization, and taking the two together we recover some of the lost information. (See also [14] and [24].) In this way, metaphor becomes a potent cognitive tool for generating creative insights. An interesting consequence of this view is that there cannot be any a priori criterion for determining which metaphors will be useful for a particular problem or for achieving a particular goal. If it is the missing information that is the key to solving the problem, then the existing conceptualization is hopelessly inadequate for pointing the way to recovering this information. The metaphor approach presented here is essentially a trial-and-error method that makes no promise to deliver, even in a probabilistic sense. We elaborate on this further below.
3.4 To Analogize or Not To Analogize
If we follow the arguments presented above, analogies, in their traditional sense at least, turn out to be anathema to creativity. The reason is that analogies are based on mapping the structure or attributes of the source to the structure and attributes of the target. So analogy, which relies on the existing conceptualization, will retrieve sources that are similar to the target's existing structure, thereby further strengthening the existing conceptualization of the target. But if the problem could not be solved because of the missing information, then an analogy-based approach will not be very useful. Yet analogy has also been recognized as a key mechanism of creativity [2] [6] [7] [18] [19] [20]. One must distinguish between two modes of analogy here, though. On one hand, analogy refers to "seeing one thing as another", which is essentially the same as how we have characterized metaphor above. The other use of the term analogy refers to the process whereby the structure and the attributes of the source are mapped to the target. It is this latter mechanism that seems contrary to creativity according to the view presented here, and so it needs some further elaboration. The cognitive structures (categories and conceptualizations) that naturally evolve through a cognitive agent's interaction with the environment reflect the priorities of the agent. The information that is retained in the conventional conceptualization is that which has been useful to the agent (or to its ancestors) in the past, and the lost information may not be very relevant. So as long as one stays in the familiar domain (in which the conventional conceptualizations are very useful), and the problem does not require the lost information, reasoning from conventional operations and conceptualizations may be very efficient. Indeed, many of the case studies that show the effectiveness of analogy in creative problem solving either stay within the same domain, or they use a source that is already similar to the target in a way that leads to a successful solution to the problem. However, as soon as the problem becomes different, requiring new information, analogy becomes a hindrance, and the metaphor approach is called for. (See also [6] and [9].) To put this another way, metaphor in the making-the-familiar-strange mode is a cognitively expensive operation, with no a priori guarantee that it will succeed, or of when it will succeed. Therefore, it is used sparingly, and only when other avenues (like reasoning from analogy) have been tried and were not successful.

3.5 Implications for Computational Modeling
The account of creativity and cognition articulated here has a number of implications for computational modeling, and we briefly highlight a few major ones. First of all, traditional approaches based on mapping existing symbolic representations clearly have limitations [3] as far as creativity is concerned. They do capture a certain aspect of creativity in noticing new connections between existing knowledge, and in importing novel hypotheses from the source to the target, but they do not produce a paradigm shift of the Kuhnian kind. In this regard, models based on corpus-based analyses and distributed representations seem more promising [26] [29] [30], but so far they are limited to linguistic metaphors.
Another approach is to model the representation-building process itself, so that new representations can emerge through an interaction of concept networks and low-level object details that are available through the sensory system or through imagination [7] [13]. This comes closest in spirit to the cognitive mechanisms underlying metaphor that we mentioned above in Sec. 2.5, for the creative insights emerge from applying a concept to an object (or a low-level representation of it) that is not habitually associated with it. In our earlier work, we have formalized this process [9] [12] and applied it to model creativity in legal reasoning [10], but clearly much more work remains to be done. Moreover, in real life, a number of different cognitive processes may act in concert to generate a creative insight, the modeling of which may require hybrid architectures [19].

4 Conclusions and Future Research

We have articulated an account of cognition here in which cognition necessarily involves the loss of some information. Creativity essentially lies in recovering some of this lost information, and metaphor plays a fundamental role in this process. This, however, is a cognitively expensive operation. In many situations such a novel perspective is not needed, so other problem-solving methods, including analogy, may be more efficient. Following the ideas outlined in our earlier research [11], it is possible to build a number of computer-based creativity-support systems to reduce the cognitive load on the agent in generating novel ideas and perspectives, or to stimulate their imagination in coming up with more creative ideas. We have demonstrated this point in a storytelling system that was designed and implemented earlier [16]. Currently we are working on designing and implementing another system to retrieve and display pairs of pictures that are based on perceptual similarities but are conceptually very different, in order to stimulate the user's creativity [15].

2010_8 !2010 Some Aspects of Analogical Reasoning in Mathematical Creativity Alison Pease, Markus Guhe, Alan Smaill School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, UK {A.Pease,M.Guhe,A.Smaill}@ed.ac.uk http://homepages.inf.ed.ac.uk/apease/research/index.html

Abstract. Analogical reasoning can shed light on both of the two key processes of creativity - generation and evaluation. Hence, it is a powerful tool for creativity. We illustrate this with three historical case studies of creative mathematical conjectures which were either found or evaluated via analogies. We conclude by describing our ongoing efforts to build computational realisations of these ideas.

1 Introduction

Analogical reasoning is an essential aspect of creativity [2], and computational realisations such as [15; 20] have placed it firmly in the computational creativity arena. However, investigation into analogical reasoning has been largely carried out in the context of problem solving in scientific or everyday domains. In particular, very few historical case studies of analogy (excepting [16; 18]), and no computational representations that we know of, are in the mathematics domain.
There may be features that distinguish mathematics from other domains, such as having a large number of objects as compared to relations, as opposed to the domains typically studied by analogy researchers [18], and recent work in analogy [14] has suggested that current theories of analogical reasoning, such as structure mapping theory, may require some modification if they are to generalise to mathematics. This largely theoretical paper explores roles that analogical reasoning has played in historical episodes of creativity in mathematics. Towards the end we also describe some computational aspects of our work.

2 A marriage of dimensions

Analogies between different geometrical dimensions, especially between two and three dimensions, date back to Babylonian times and have been particularly productive [16, p. 26]. The discovery of the Descartes-Euler conjecture, that for any polyhedron the number of vertices (V) minus the number of edges (E) plus the number of faces (F) is equal to two, is one such example: there are differing accounts of its discovery, but both involve analogy at some level. (We are grateful to Alan Bundy for discussion on the continuity example in section 4. This work was supported by EPSRC grant EP/F035594/1.) Euler's own account of his discovery suggests that analogy to two-dimensional polygons helped to guide his search for a problem: in a letter to Christian Goldbach, Nov 1750, he wrote "there is no doubt that general theorems can be found for them [solids], just as for plane rectilinear figures . . . " (our italics). The simple relationship that V = E for two-dimensional shapes prompted a search for an analogous relationship between edges, faces and vertices in three-dimensional solids. In Polya's reconstruction of this discovery [16, pp. 35-41], he suggested that the analogy was introduced to evaluate, as opposed to generate, the conjecture. He developed a technique for using analogies to evaluate a conjecture: given analogical mappings and conjectures, Polya suggested that we adjust the representation in order to bring the relations closer [16, pp. 42-43]. In the Descartes-Euler example, the re-representation works by noting that vertices are 0D, edges 1D, faces 2D and the polyhedron 3D, and then rewriting both conjectures in order of increasing dimension. In the polygonal case, V = E then becomes V − E + 1 = 1, and in the polyhedral case V − E + F = 2 becomes V − E + F − 1 = 1. These two equations now look much more similar: in both of them the number of dimensions starts at zero on the left hand side of the equation, increases by one, and has alternating signs. The right hand side is the same in both cases. Polya then suggests that since the two relations are very close and the first relation, for polygons, is true, then we have reason to think that the second relation may be true, and is therefore worthy of a serious proof effort.

3 An extremely daring conjecture

The Basel problem is the problem of finding the sum of the reciprocals of the squares of the natural numbers, i.e. finding the exact value of the infinite series 1 + 1/4 + 1/9 + 1/16 + 1/25 + 1/36 + . . .. In Euler's time this was a well known and difficult problem, thus in this example the initial problem already exists. Euler used analogical reasoning to find his conjectured solution π²/6.
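Both of these analogies are easy to check numerically. The following is a minimal Python sketch (ours, not from the paper; the polyhedron counts are the standard values) that verifies V − E + F = 2 for a few solids and shows the partial sums of the Basel series creeping towards π²/6:

import math

# Vertex, edge, face counts for some simple polyhedra (standard values)
polyhedra = {"tetrahedron": (4, 6, 4), "cube": (8, 12, 6), "octahedron": (6, 12, 8)}
for name, (v, e, f) in polyhedra.items():
    print(f"{name}: V - E + F = {v - e + f}")   # 2 in every case

# Partial sums of 1 + 1/4 + 1/9 + ... converge (slowly) to pi^2/6
target = math.pi ** 2 / 6
for n in (10, 100, 10000):
    s = sum(1 / k ** 2 for k in range(1, n + 1))
    print(f"n = {n:>6}: {s:.6f}   (pi^2/6 = {target:.6f})")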
To find this value, Euler rearranged known facts about finite series and polynomials in order to draw an analogy between finite and infinite series, and then applied a rule about finite series to infinite series, thus discovering what is referred to by Polya as an "extremely daring conjecture" [16, p. 18]. Euler then spent years evaluating both this conjecture and his analogous rule.

4 A split in the real continuum

Cauchy's conjecture and proof that "the limit of any convergent series of continuous functions is itself continuous" [3, p. 131] is another example in which a rule for one area is analogously assumed to hold for another area. In this case the source domain is series and the target is limits, and the rule assumed to hold is that "what is true up to the limit is true at the limit". Lakatos [9, p. 128] states that throughout the eighteenth century this rule was assumed to hold, and therefore any proof was deemed unnecessary. However, this example is complicated: Cauchy's claim is generally regarded as obviously false, and the clarification of what was wrong is usually taken to be part of the more rigorous formalisation of the calculus developed by Weierstrass, involving the invention of the concept of uniform convergence. This episode was treated by Lakatos in two different ways. Cauchy claimed that the function defined by pointwise limits of continuous functions must be continuous [3]. In fact, what we take to be counter-examples were already known when Cauchy made his claim, as Lakatos points out in his earlier analysis of the evolution of the ideas involved [9, Appendix 1]. After discussion with Abraham Robinson, Lakatos then saw that there was an alternative analysis. Robinson was the founder of non-standard analysis, which found a way to rehabilitate talk of infinitesimals (for example, positive numbers greater than zero but less than any "standard" real number; see [17], first edition 1966). Lakatos's alternative reading, presented in [10], is that Cauchy's proof was correct, but that his notion of (real) number was different from that adopted by mainstream analysis to this day. In analogy terminology, people who had different conceptions of the source domain were critiquing the target domain which Cauchy developed.

5 Computational considerations

We are exploring these ideas computationally in two ways. Firstly, we are using Lakoff and Núñez's notion of mathematical metaphor [11]. Lakoff and Núñez consider that the different notions of "continuum" outlined in §4 correspond to a discretised Number-Line blend (in the case of the Dedekind-Weierstrass reals); a discretised line resulting from the "Spaces are Sets of Points" metaphor, where all the points on the line are represented (in the case where infinitesimals are present); or a naturally continuous (physical) line [11, p. 288]. This approach provides promising avenues for understanding the relationships between the written representation of the mathematical theories, in this case mostly in natural language, the mathematical structures under consideration, and the geometrical or physical notions that informed the mathematical development. We are using the framework of Information Flow [1] to be more precise about what constitutes metaphors (and blends), by looking at the possible metaphorical relationships in terms of infomorphisms between domains. In [7; 8] we show how Information Flow theory [1] can be used to formalise the basic metaphors for arithmetic that ground the notions in embodied human experience (grounding metaphors).
This gives us a form of implementation of aspects of the theory evolution involved here. We are extending this to Fauconnier and Turner's conceptual blending [4] and Goguen's Unified Concept theory [6]. Secondly, Schwering et al. have developed a mathematically sound framework for analogy making and a symbolic analogy model: heuristic-driven theory projection (HDTP) [19]. Analogies are established via a generalisation of the source and target domains. Anti-unification is used to compare formulae of source and target for structural commonalities; terms with a common generalisation are then associated in the analogical mapping. HDTP matches functions and predicates with the same and different labels, as well as matching formulae with different structure. In particular, one of its features is a mechanism for re-representing a domain in order to build an analogy. We are using this system to generate the domain of basic arithmetic from Lakoff and Núñez's four grounding metaphors.

6 Conclusions

Analogy was used in the first example to find, or generate, a problem for which values could subsequently be conjectured. Also, importantly, it was used to aid evaluation (thus forming an essential tool in McGraw's "central loop of creativity" [13]). This is particularly interesting given that humans are not very good at making such judgements, particularly in historically creative domains, and evaluation is not a generally noted use of analogy. The importance of re-representation in making a more convincing analogy, while ensuring that any preconditions for the re-representation are satisfied, is also clear in all of these examples. In our second case study the original problem, to find an exact value for the sum of the reciprocals of the squares, was invented independently of the analogy which was used to solve it. Euler then tested his application of the rule he used, and this rule itself, rather than the solution to the Basel problem, became the major contribution of Euler's work in this area. The freedom with which one can apply a rule from one domain to another depends on the extent to which the second domain has already been developed. In Euler's time, while the modern mathematical concept of infinity was not developed, infinite series were an established concept, and thus Euler's work with infinite series had to fit with the structure already developed. In other examples, the target domain is much less developed and the analogiser may be able to define the domain such that a desired rule holds. Examples include the operations of addition/subtraction on the reals as analogous to multiplication/division: both are commutative and associative, both have an identity (though a different one) and both admit an inverse operation. Alternatively, it may not be possible to define a target domain such that a particular rule from a source domain holds: in Hamilton's development of quaternions he wanted to develop a new type of number which was analogous to complex numbers but consisted of triples. He was unable to define multiplication on triples, but did discover a way of defining it for quadruples, with i² = j² = k² = ijk = −1. However, his multiplication was non-commutative, although it was still associative and distributive. Another possibility is that an analogiser may actively wish to create a target domain in which a rule from a source domain is broken.
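Hamilton's non-commutativity is easy to make concrete. A minimal sketch (ours; the component formula is the standard Hamilton product, not taken from the text) showing i² = −1, ij = k but ji = −k:

def qmul(a, b):
    """Hamilton product of quaternions represented as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
print(qmul(i, i))              # (-1, 0, 0, 0): i^2 = -1
print(qmul(i, j), qmul(j, i))  # (0, 0, 0, 1) vs (0, 0, 0, -1): ij = k but ji = -k
print(qmul(qmul(i, j), k))     # (-1, 0, 0, 0): ijk = -1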
One example of deliberately breaking a source-domain rule is the development by Martínez [12] of a system in which the traditional rule that (−1)(−1) = +1 is changed to (−1)(−1) = −1, resulting in a new mathematical system. Focusing on analogy as a way of developing new mathematical ideas raises questions about how novel these ideas can be. By definition, an analogy-generated idea must share some sort of similarity with another domain, familiar to the creator. Thus the criterion of novelty, accepted as necessary for creative output, seems to be under threat. In this paper we leave aside such considerations: since we consider the examples we give to be both unambiguously creative and unambiguously based on analogy, it seems that any definition of novelty must not exclude analogy. We intend to address this question in a future paper.

2010_9 !2010 A Fractal Approach Towards Visual Analogy Keith McGreggor, Maithilee Kunda & Ashok Goel Design & Intelligence Laboratory, School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA {keith.mcgreggor, mkunda}@gatech.edu, goel@cc.gatech.edu

Abstract. We present a preliminary computational model of visual analogy that uses fractal image representations that rely only on the grayscale pixel values of input images, and are mathematical abstractions quite rigorously grounded in the theory of fractal image compression. We have applied this model of visual analogy to problems from the Raven's Progressive Matrices intelligence test, and we describe in detail the fractal solution strategy as well as some preliminary results. Finally, we discuss the implications of using these fractal representations for memory recall and analogical reasoning.

1 Cognition, Computation, and Creativity

We may study creative behavior at many levels of aggregation and abstraction, ranging from environmental and genetic to neural and cognitive to social and cultural. Our work on computational creativity focuses on the cognitive level. Although at present there is little agreement in the cognitive sciences about the proper characterization of creativity, nevertheless there is broad and deep consensus that analogy is a core cognitive process of creative behavior, e.g. [1]-[8]. Indeed, some cognitive scientists have argued that analogy is a core process not only of creativity, but of all cognition, including perception [6][9]. Thus, understanding the computational processes of analogy seems critical to understanding and developing computational models of creativity. We may classify computational theories of analogy into two broad categories. In one category are theories that propose general-purpose mechanisms for mapping and transfer of relations from one problem to another, e.g. production systems [10], structure mapping [11][12], schema induction [13], and constraint satisfaction [14]. These mechanism theories typically make few commitments about the contents of knowledge; indeed, their mechanisms are general-purpose because they are agnostic towards knowledge contents. In the other category are computational theories that describe the contents of knowledge that drive analogies in creative tasks such as story understanding [15], diagram understanding [16][17], scientific problem solving [18][19], innovative design [20][21][22], and calligraphy [23]. These content theories of analogy, too, describe computational processes, but their processes are driven by the contents of knowledge (and corresponding vocabularies for capturing the knowledge contents).
The extant content and mechanism computational theories of analogy are nevertheless united in their use of propositional representations: from the perspective of the content theories of analogy, the mechanism theories offer potential substrates for computational implementation. We present a preliminary computational theory of visual analogy that is fundamentally different from both content and mechanism theories because it uses fractal representations instead of propositional ones. Visual analogy is a topic of longstanding interest in computational creativity because of its central role in many creative tasks such as intelligence tests [24]. In particular, in this paper we focus on geometric analogy problems, a type of visual analogy problem concerned solely with shapes, the sizes and locations of the shapes, and the spatial relations among the shapes. Fractal image representations rely only on the grayscale pixel values of input images, and are mathematical abstractions quite rigorously grounded in the theory of fractal image compression [25]. Below, first we describe how fractal representations can be used for computing features and similarity among distinct images, and how these operations can be combined to form a technique for creating visual analogies using purely pictorial inputs. Second, we present initial experimental results from using this method on geometric analogy problems that occur on the standardized intelligence test called the Raven's Progressive Matrices test [26]. Third, we discuss how this fractal model of visual analogy may be considered to be a mechanism for creativity.

2 Fractal Image Representation

Consider the general form of an analogy problem as: A : B :: C : ? In the case of a visual analogy, we can take each of these analogy elements to be a single image. Some unknown transformation T can be said to transform image A into image B, and likewise, some unknown transformation T' transforms image C into the unknown answer image. The central analogy in the problem may then be imagined as requiring that T is analogous to T'. In other words, the answer will be whichever image D yields the most analogous transformation. Using fractal representations, we shall define the most analogous transform T' as that which shares the largest number of fractal features with the original transform T.

2.1 Mathematical Basis

The mathematical derivation of fractal image representation expressly depends upon the notion of real world images, i.e. images that are two dimensional and continuous [25]. A key observation is that all naturally occurring images we perceive appear to have similar, repeating patterns. Another observation is that no matter how closely you examine the real world, you find instances of similar structures and repeating patterns. These observations suggest that it is possible to describe the real world in terms other than those of shapes or traditional graphical elements; in particular, terms which capture the observed similarity and repetition alone. Computationally, determining the fractal representation of an image requires the use of the fractal encoding algorithm. The collage theorem [25] at the heart of the fractal encoding algorithm can be stated concisely: for any particular real world image D, there exists a finite set of affine transformations T which, if applied repeatedly and indefinitely to any other real world image S, will result in the convergence of S into D.
We now shall present the fractal encoding algorithm in detail.

2.2 The Fractal Encoding Algorithm

Given an image D, the fractal encoding algorithm seeks to discover the set of transformations T. The algorithm is considered "fractal" for two reasons: first, the affine transformations chosen are generally contractive, which leads to convergence, and second, the convergence of S into D can be shown to be the mathematical equivalent of considering D to be an attractor [25]. Here are the general steps for encoding an image D in terms of another image S:

1) Decompose D into a set of N smaller images {d1, d2, d3, ..., dn}. These individual images are sets of points.
2) For each image di:
a) Examine the entire source image S for an equivalent image si such that an affine transformation of si will result in di. This affine transformation will be a 3x3 matrix, as the points within si and di under consideration can be represented as the 3-D vector < x, y, c >, where c is the (grayscale) color of the 2-D point < x, y >. Collect all such transforms into a set of candidates C.
b) Select from the set of candidates that transform which most minimally achieves its work, according to some predetermined and consistent metric.
c) Let Ti be the representation of the chosen affine transformation of si into di.
3) The set T = {T1, T2, T3, ..., Tn} is the fractal encoding of the image D.

The decomposition of D into smaller images can be achieved through a variety of methods. In our present implementation, we merely choose to subdivide D in a regular, gridded fashion, typically choosing a grid size of either 8x8 or 32x32 pixels. Alternate decompositions could include irregular subdivisions, partitioning according to some inherent colorimetric basis, or levels of detail.

2.3 Searching and Encoding

The search of the source image S for a matching fragment is exhaustive, in that each possible correspondence si is considered regardless of its prior use in other discovered transforms. Also, for each potential correspondence, each transformation under a restricted set of affine, similitude transformations is considered. Our implementation presently examines each potential under identity (I), horizontal (HF) and vertical (VF) reflections, and 90° (R90), 180° (R180), and 270° (R270) rotational transformations. The metric we employ to evaluate candidate transformations prefers those transformations which induce the least translation, rotation or reflection, and color manipulation. We use this to ensure that areas which are aligned in a Cartesian sense are chosen. We shall revisit this important metric in a later section. Once a transformation has been chosen, we construct a compact representation of it, called a fractal code. A fractal code Ti is a 4-tuple, < (sx, sy), (dx, dy), k, c >, where (sx, sy) is the location of the leftmost and topmost pixel in si; (dx, dy) is the location of the leftmost and topmost pixel in di; k ∈ { I, HF, VF, R90, R180, R270 } indicates which affine transformation is to be used; and c ∈ [ -255, 255 ] indicates the overall color shift to be added uniformly to all elements in the block. Note that the choice of source image S is arbitrary. Indeed, the image D can be fractally encoded in terms of itself, by substituting D for S in the algorithm. Although one might expect that this substitution would result in a trivial encoding (wherein all of the chosen fractal codes correspond to an identity transform), in practice this is not the case, for we want a fractal encoding of D to converge upon D regardless of the chosen initial image.
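To make the algorithm concrete, here is a toy Python sketch of the encoding loop (ours, with simplifying assumptions: same-size source and destination blocks rather than the contractive 2:1 sizing the authors use, image dimensions that are multiples of the block size, and an error-plus-motion tie-break standing in for the authors' "minimal work" metric):

import numpy as np

# The six similitude transforms named in the text
TRANSFORMS = {
    "I":    lambda b: b,
    "HF":   np.fliplr,
    "VF":   np.flipud,
    "R90":  lambda b: np.rot90(b, 1),
    "R180": lambda b: np.rot90(b, 2),
    "R270": lambda b: np.rot90(b, 3),
}

def encode(D, S, block=8):
    """Fractal codes ((sx, sy), (dx, dy), k, c) collaging S into D.
    D and S are grayscale numpy arrays with dimensions that are multiples of block."""
    codes = []
    for dy in range(0, D.shape[0], block):
        for dx in range(0, D.shape[1], block):
            d = D[dy:dy + block, dx:dx + block].astype(int)
            best = None
            for sy in range(0, S.shape[0] - block + 1, block):
                for sx in range(0, S.shape[1] - block + 1, block):
                    s = S[sy:sy + block, sx:sx + block].astype(int)
                    for k, t in TRANSFORMS.items():
                        ts = t(s)
                        c = int(round((d - ts).mean()))        # uniform color shift
                        err = int(np.abs(d - (ts + c)).sum())  # photometric error
                        # stand-in metric: least error, then least translation,
                        # then prefer the identity transform
                        work = (err, abs(dx - sx) + abs(dy - sy), k != "I")
                        if best is None or work < best[0]:
                            best = (work, ((sx, sy), (dx, dy), k, c))
            codes.append(best[1])
    return codes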
To ensure this convergence, the size of the source fragments considered is taken to be twice the dimensional size of the destination fragment, resulting in a contractive affine transform. Similarly, color shifts are made to contract. The fractal encoding algorithm, while computationally expensive in its exhaustive search, transforms a real world image into a much smaller set of fractal codes, which form, in essence, an instruction set for reconstituting the image. This resulting fractal representation of an image D vis-à-vis another given image forms the basis of our investigation into solutions for visual analogy problems.

2.4 Determining Fractal Features

As we have shown, the fractal representation of an image is a set of specific affine, similitude transformations, a set of fractal codes, which compactly describe the geometric alteration and colorization of fragments of the source image that will collage to form the destination image. While it is tempting to treat contiguous subsets of these fractal codes as features, we note that their derivation does not follow strictly Cartesian notions (e.g. adjacent material in the destination might arise from strongly non-adjacent source material). Accordingly, we consider each of these fractal codes independently, and construct candidate fractal features from individual codes. In our present implementation, each fractal code < (sx, sy), (dx, dy), k, c > yields a small set of features, generally formed by constructing subsets of the tuple:

< (sx, sy), (dx, dy), k, c >: a specific feature;
< (dx − sx, dy − sy), k, c >: a position agnostic feature;
< (sx, sy), (dx, dy), c >: an affine transform agnostic feature;
< (sx, sy), (dx, dy), k >: a color agnostic feature;
< k, c >: an affine specific feature;
< c >: a color shift specific feature.

As shown, the features derived may be categorized into either specific or agnostic aspects. Our implementation of the fractal encoding algorithm employs a metric which minimizes the transformative work (either geometric or colorimetric) expressly due to our desire to exploit specific feature matches. With this basis of fractal representation and feature discovery, we now may address the problem of visual analogy.

2.5 A Fractal Process of Visual Analogy

To find analogous transforms, our algorithm first visits memory to retrieve a set of candidate solution images D to form candidate solution pairs of the form < C, D >. For each candidate pair of images, we generate the fractal encoding of the candidate image D in terms of the former image C. As we illustrated earlier, from this encoding we are able to generate a large number of fractal features per transform. We store each transform into a memory system, indexed by and recallable via each associated fractal feature. To determine which of the candidate images has resulted in the most analogous transform to the original problem transform T, we first fractally encode the relationship between the two images A and B. Next, using each fractal feature associated with that encoding, we retrieve from the memory system those transforms previously stored as correlates of that feature (if any). Considering the frequency of the transforms recalled, for all correlated features in the target transform, we then calculate a measure of similarity. This metric reflects similarity as a comparison of the number of fractal features shared between candidate pairs, taken in contrast to the joint number of fractal features found in each pair member [27].
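The feature subsets fall out mechanically from each code, and indexing by them is straightforward. A small sketch (ours; features are tagged, hashable tuples so they can key a dictionary-based memory):

from collections import defaultdict

def features(code):
    """The six feature subsets listed above, as hashable tagged tuples."""
    (sx, sy), (dx, dy), k, c = code
    return {
        ("specific", (sx, sy), (dx, dy), k, c),
        ("position-agnostic", (dx - sx, dy - sy), k, c),
        ("affine-agnostic", (sx, sy), (dx, dy), c),
        ("color-agnostic", (sx, sy), (dx, dy), k),
        ("affine-specific", k, c),
        ("color-specific", c),
    }

# Memory indexes every stored transform by every feature it exhibits,
# so that a feature recalls all transforms sharing it.
memory = defaultdict(set)

def store(transform_id, codes):
    for code in codes:
        for f in features(code):
            memory[f].add(transform_id)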
In our present implementation, the measure of similarity S between the candidate transform T' and the target transform T is calculated using the following formula, also known as the ratio model:

S(T, T') = f(T ∩ T') / [ f(T ∩ T') + α f(T − T') + β f(T' − T) ]

where f(X) is the number of features in the set X. For our initial work, we have chosen values of α = β = 1.0, which, according to [27], results in this simplification:

S(T, T') = f(T ∩ T') / f(T ∪ T')

The final solution is taken as the candidate image from memory that results in the highest measured similarity according to this measure.

3 The Raven's Progressive Matrices Test

In this section, we describe the Raven's Progressive Matrices test and present some preliminary results for a solution algorithm that uses the fractal algorithm for visual analogy described in the previous section. The Raven's Progressive Matrices (RPM) test is a standardized intelligence test that consists of visually presented, geometric-analogy-like problems in which a matrix of geometric figures is presented with one entry missing, and the correct missing entry must be selected from a set of answer choices. Figure 1 shows an example of a problem that is similar to one of the problems in the Standard Progressive Matrices (SPM).

Fig. 1. Example problem similar to one in the Standard Progressive Matrices (SPM) test.

Although the test is supposed to measure only eductive ability, or the ability to extract and understand information from a complex situation [26], the RPM's high level of correlation with other multi-domain intelligence tests has given it a position of centrality in the space of psychometric measures [28], and it is therefore often used as a test of general intelligence. Despite its widespread use, neither the computational nor the cognitive characteristics of the process of solving the RPM are well understood. Hunt gives a theoretical account of the information processing demands of certain problems from the Advanced Progressive Matrices (APM), in which he proposes two qualitatively different solution algorithms: "Gestalt," which uses visual representations and perceptually based operations, and "Analytic," which uses feature-based representations and logical operations [29]. Existing AI systems for problem solving on the RPM, in contrast to Hunt's early work, use propositional representations [30][31]. Carpenter, Just, and Shell describe a computational model that simulates solving RPM problems using propositional representations [30]. Their model is based on the traditional production system architecture, with a long-term memory containing a set of productions and a working memory containing the current state of problem solving (e.g. current goals). Productions are based on the relations among the entities in an RPM problem, for example, the location of the dark component in a row, which might be the top half in the top row of a problem, the bottom half in the bottom row, and so on. Lovett, Forbus, and Usher describe a model that extracts qualitative spatial representations from visually segmented representations of RPM problem inputs and then uses the technique of structure mapping to find solutions [31]. While, as we mentioned earlier, production systems and structure mapping offer alternative mechanisms for implementing content theories of analogies, from the perspective of our work on fractal representations the two are similar in that both use propositional representations.
3.1 Preliminary Results from the Fractal Method of Visual Analogy

A RPM problem can be viewed as a sequence of images (ordered in rows and columns), where some unknown transformation T can be said to transform one image into a corresponding adjacent image. In a typical 2x2 RPM problem, there are four such transformations, as shown in Figure 2. (RPM problems can also have three-by-three matrices, which we do not address in this paper.)

Fig. 2. Illustration of four image transformations implicit in a 2x2 RPM problem matrix.

RPM problems are formulated to suggest that these transformations are pairwise analogous (i.e. the two row transformations are analogous to one another). A 2x2 RPM problem generally has six candidate solutions presented. Following the algorithm for visual analogy described above, we seek to solve an RPM problem by determining which of the candidate solutions yields the most analogous transformations. In particular, we perform the recall and similarity calculation described in Section 2.3 independently over all row and column relationships present in the RPM problem. The final solution is taken as the candidate that results in the highest measured similarity for any such relationships. As an example, we shall use the "arrow" problem shown in Figure 3. Also, while the complete fractal method examines each of the problem's analogies, we shall restrict this detailed discussion to just one of these transformations (T1, as labeled in Figure 2). The initial transformation T1 is the fractal encoding of the transformation from the upper left arrow into the upper right arrow. When encoded using a block size of 32 x 32 pixels, this encoding generates 39 distinct fractal features. Each candidate answer is encoded likewise, from the upper left arrow into the candidate, each resulting in between 27 and 45 distinct features. For the arrow problem, using a 32 x 32 block size, the similarity measures for each answer Ci are:

S(T,C1) = 21 / (21+18+24) ≅ 0.333333
S(T,C2) = 15 / (15+24+30) ≅ 0.217391
S(T,C3) = 16 / (16+23+11) ≅ 0.32
S(T,C4) = 14 / (14+25+29) ≅ 0.205882

The answer with the highest calculated similarity is deemed correct. Therefore, the fractal method chooses answer #1.

Fig. 3. The "arrow" problem, in the format of a 2x2 problem matrix with four answer choices.

4 Implications of Fractal Representations for Analogical Reasoning and Memory Recall

Benoit Mandelbrot coined the term "fractal" from the Latin adjective fractus and its corresponding verb (frangere, which means "to break" into irregular fragments), in response to his observation that shapes previously referred to as "grainy, hydralike, in between, pimply, pocky, ramified, seaweedy, strange, tangled, tortuous, wiggly, wispy, wrinkled, and the like" could be described by a set of compact, rigorous rules for their production [32]. The approach we outline in this paper seeks to define a class of visual representations in a similar, fractal manner, where images are taken as encountered and transformed into a fractal representation, either with regard to other images or to themselves. The very nature of this fractal representation places an immediate emphasis on similarity discovery. These discoveries are used to advantage in solving problems that place equivalent emphasis on visual analogy, such as the Raven's Progressive Matrices test. While the use of fractal representation is important, the emphasis upon visual recall in our solution, afforded by features derived from those representations, bears further discussion.
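The arithmetic behind the scores above is just the ratio model with α = β = 1. A quick Python check (ours; the feature counts are copied from the text as shared, T-only, and candidate-only counts):

def similarity(common, only_t, only_c, alpha=1.0, beta=1.0):
    # Tversky's ratio model: f(T ∩ T') / [f(T ∩ T') + a·f(T − T') + b·f(T' − T)]
    return common / (common + alpha * only_t + beta * only_c)

counts = {1: (21, 18, 24), 2: (15, 24, 30), 3: (16, 23, 11), 4: (14, 25, 29)}
scores = {i: similarity(*c) for i, c in counts.items()}
print(scores)                        # answer 1 scores ~0.333, answer 3 ~0.32
print(max(scores, key=scores.get))   # 1, the answer the fractal method selects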
The importance of individual discovered features presents itself strongly in our calculation of the similarity metric from [27]. The weights α and β may be skewed toward or away from equality in order to favor feature inclusion or exclusion. We note that while a 2x2 RPM problem offers little opportunity for determining feature importance, the class of RPM problems which are 3x3 in nature require it, for the additional row and column provide an in-domain manner in which to report visual evidence for the kinds of features to expect within the eventual successful candidate. Using this additional evidence would allow the weights of feature inclusion (α) and exclusion (β) to become covariant with features. We are actively exploring this as we pursue solutions for 3x3 RPM problems. We take the position that placing candidate transformations into memory, indexed via those discovered fractal features, affords a new method of discovering image similarity. Indeed, this method of solving the RPM, by being reminded of similar transformations, bears close kinship to various methods and theories of case-based reasoning [33], although we exploit recalled transformations toward a significantly different end. That images, encoded either in terms of themselves or other images, may be indexed and retrieved without regard to shape, geometry, or symbol suggests that the fractal representation bears further exploration, not only as regards solutions to problems akin to the RPM, but also for general visual recall and analogy-based creativity.

Acknowledgments. This research has been supported by an NSF grant (IIS Award #0534266) entitled "Multimodal Case-Based Reasoning in Modeling and Design," by ONR through an NDSEG fellowship, and by the NSF GRFP fellowship program.

2011_1 !2011 Automated Collage Generation - With More Intent Michael Cook and Simon Colton Computational Creativity Group Department of Computing, Imperial College, London, UK. ccg.doc.ic.ac.uk

Abstract
The majority of software has no meta-level perception of what it is doing, or what it intends to achieve. Without such higher cognitive functions, we might be disinclined to bestow creativity onto such software. We generalise previous work on collage generation, which attempted to blur the line between the intentionality of the programmer and that of the software in the visual arts. Firstly, we embed the collage generation process into a computational creativity collective, which contains processes and mashups of processes, designed so that the output of one generative system becomes the input of another. Secondly, we analyse the previous approach to collage generation to determine where intentionality arose, leading to experimentation in which we test whether augmented keyword searches can enable the software to exert more intentional control.

Introduction
We are building The Painting Fool software (www.thepaintingfool.com) to one day be taken seriously as a creative artist in its own right. To guide this process, we have engaged in much informal discussion with artists, art students and teachers. One clear opinion almost universally expressed has been that art generation software such as Photoshop is not seen as creative, because it exhibits no intentionality. That is, it does not conceive the artwork it wishes to produce, and hence does not drive the process through aesthetic or production decisions.
Rather, the person using Photoshop is seen as providing all the intentionality, and hence is the sole creative entity, with the software acting as a mere tool. To address this lack of intentionality in The Painting Fool, as described in (Krzeczkowska et al. 2010), we built a collage-generation module able to construct artworks to illustrate newspaper articles. The system worked via internet retrieval of a news article, text extraction of important keywords, internet image retrieval using the keywords, and non-photorealistic rendering of the images. We removed human decision making, so that we had no control over (a) what news story would be chosen for a collage, (b) what keywords would be extracted, (c) what images would be retrieved, or (d) how the collage would be rendered. We used the system to raise the issue of whether it was appropriate to say the software exhibited intentionality in producing the collages. We describe here two extensions to this previous work. In the next section, we show how the collage generator can be seen as a mashup of generative processes, and we give details of a computational creativity collective designed with the hope that, by allowing the output of such processes to become the input of others, more culturally interesting artefacts will be produced. Following this, we return to the question of intentionality, and analyse the results of the existing collage generation system. This highlights the roles that five parties have in adding intentionality to the generation process, and suggests simple improvements which could wrestle back some control for the software. We present the results from some preliminary experimentation with augmented keyword image retrieval, which leads to further discussion about intentionality in software. We conclude by describing avenues for future work.

A Computational Creativity Collective

The software discussed in (Krzeczkowska et al. 2010) is best described in modern computing parlance as a mashup. As described in (Abiteboul, Greenshpan, and Milo 2008), mashups generally take data - usually downloaded from internet sites - and pass them through various linked processes in order to turn raw data into more consumable (amusing/entertaining/informative/interactive) presentations. In contrast, the kinds of creative systems that produce artefacts like musical and literary compositions, works of visual art, scientific hypotheses, etc., are generally run as standalone, offline systems. We hypothesise that the value of the artefacts produced by creative systems, and hence the creativity attributed to them, would be increased if (a) they could place their work into current cultural contexts and (b) they could work in concert, whereby the output from one becomes the input to another, so that there is an increase in the sophistication of the overall system. To test this hypothesis in the long term, we have compiled a collective of processes and mashups employing these processes, which is available at (www.doc.ic.ac.uk/ccg/collective), along with a simple Java API which guides the construction of additional material for the collective. Currently, the collective contains 119 processes mashed up into 57 mashups. The processes follow an interface which means that the output from any process can become the input to any other (whether or not the receiving process can usefully work with the type of input it is given).
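A sketch of the kind of uniform interface that lets any process feed any other (illustrative Python; the collective's actual API is in Java, and the class and method names here are ours):

from typing import Any

class Process:
    """One node in a mashup: consumes the previous output, produces the next input."""
    def run(self, data: Any) -> Any:
        raise NotImplementedError

class ExtractKeywords(Process):
    """Toy keyword extractor: the most frequent longish words in an article."""
    def run(self, data: str) -> list:
        words = [w.strip(".,;:'\"").lower() for w in data.split()]
        longish = [w for w in words if len(w) > 6]
        return sorted(set(longish), key=longish.count, reverse=True)[:10]

def mashup(processes, data):
    # Chain processes so the output of one becomes the input of the next
    for p in processes:
        data = p.run(data)
    return data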
In general, the current processes perform simple text or graphics routines, or they employ a web 2.0 API in order to retrieve data from the internet. Collage generation is represented with a mashup that contains processes which link with the Guardian newspaper API and the Google image retrieval API, in addition to processes which perform text extraction and graphics routines. The other mashups perform similarly. For instance, one mashup uses image retrieval, graphics, and natural language generation techniques to parody business speak. Another mashup uses processes based on the LastFM and RunKeeper APIs, so that a mosaic of album covers can be constructed depending on what music people run the fastest to. Such mosaics are not necessarily the most culturally important artefacts produced by creative systems. However, the ease with which such mashups can be written highlights the potential for creativity when story generation systems feed into poetry generators, the output of which inspires paintings that influence musical compositions, and so on, with each process benefitting from internet downloads to add timely cultural relevance. We discuss our aims and future directions for the collective in the final section below.

Intentionality Analysis and Experimentation

One of the most impressive collages presented in (Krzeczkowska et al. 2010) is given in figure 1 (Figure 1: Collage illustrating the war in Afghanistan). This was produced in response to a news article which covered the war in Afghanistan, published in 2009 in the Guardian newspaper. While not rendered to a particularly high aesthetic standard, the contents are striking, as it contains a bomber plane, an explosion, a family with a baby, a girl in ethnic headgear, and - most poignantly of all - a field of war graves. By mixing images of death and destruction with those of children, it is fair to say that these constituents have helped to produce a collage with a moderate but definite negative bias, which reflected the tone of the original news article. We previously thought that the intentionality driving the collage generation therefore came from four parties: (i) the programmer, by enabling the software to access the left-leaning, largely anti-war Guardian newspaper; (ii) the software, through its processing, the most intelligent aspect of which was to extract keywords using an algorithm based on (Brin and Page 1998); (iii) the writer of the original article, through the expression of his/her opinions in print; and (iv) individual audience members, who have their own opinions forming a context within which the collages are judged. However, upon further inspection, we can add a fifth party to this ensemble. To see this, note first that the keywords extracted from the article were: afghanistan, brown, forces, troops, nato, british, speech, country, more and afghan. We further note that none of these words has a particular negative emotive bias. We therefore must reduce the intentional role of the article writer in the construction of the collage, as it would be possible to write a very upbeat appraisal of the war which contained the same keywords. The images for the collage were retrieved from Flickr using the keywords above. Hence, it is clear that the negative connotations in the collage derive from the way in which Flickr users have tagged the images they have uploaded, presumably by tagging images of cemeteries and explosions with words like 'afghanistan', 'nato' and 'troops'.
Hence, we must attribute some of the intentionality behind the collage construction to the crowd. It is interesting that intent can be dissipated amongst five parties, one of which is a faceless crowd. However, the software is currently the junior partner in this arrangement, and our original motivation was for it to become the major driving force, so that it might be perceived as being more creative. We are currently working on an opinion-forming module for The Painting Fool which will enable it to start with some text, such as a news article or a set of search terms, then download and analyse multiple texts on related subjects, in order to form either a positive or negative opinion about the original text. Once it has formed such an opinion, this will be used to bias the search for images in collage generation, i.e., if the opinion is positive, The Painting Fool will attempt to retrieve appropriately up-beat images, and vice-versa if not. A straightforward method for attempting to influence the emotional nature of retrieved images is to augment the keywords with appropriately positive/negative adjectives. To test the effectiveness of this method, we performed some preliminary experiments. We used augmented keyword searches with Flickr to retrieve images, and then asked people to tag the images with respect to the broad emotion being portrayed. In particular, in the first experiment, 17 participants were each asked to tag 21 images: 7 retrieved from Flickr with the keyword soldier, 7 retrieved with the keywords soldier and sad, and 7 retrieved with the keywords soldier and happy. For each set of keywords, 80 images were retrieved and cached, and these were selected from randomly during the experiment. Subjects were asked to tag each image as either 'happy', 'sad' or 'unsure' (if they were not prepared to assign an emotion to the image). The same experiment was repeated, but with the keyword soldier replaced by dog, and repeated again with the keyword baby. A sample of the images from the dog experiment is given in figure 3, and the distribution of tags for the sets of images in the three experiments is given in figure 2.

Figure 2: Results from the image tagging experiments. Neutral refers to the images retrieved via unaugmented search.
Figure 3: Sample images from the dog image tagging experiment (search terms: dog and happy; dog [neutral]; dog and sad).

We see that the participants' reactions to images were image-type dependent. In particular, people were less sure about the emotions being portrayed in the soldier images than in the other two types of image. Also, the results show a trend for neutral images to be tagged with 'happy' as often as 'unsure' and four times as often as 'sad'. It's likely that this is because people tend to upload pictures of happy babies/dogs/soldiers to Flickr, hence an unaugmented search will result in largely positive images being retrieved. Importantly, it is clear that people were more likely to tag images retrieved with the keyword happy as 'happy' and with the keyword sad as 'sad'. Of the 119 happy images, 65% were tagged with 'happy' and only 10% with 'sad'. Similarly, of the 119 sad images, 47% were tagged with 'sad' and only 22% with 'happy'.
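The augmentation step itself is simple to sketch (illustrative Python; search_images is a hypothetical stand-in for the Flickr API call, not a real function):

def augmented_queries(keywords, opinion):
    """Bias image retrieval by appending an emotive adjective to each keyword."""
    adjective = "happy" if opinion > 0 else "sad"
    return [f"{kw} {adjective}" for kw in keywords]

# augmented_queries(["soldier"], opinion=-1) -> ["soldier sad"]
# Each query would then be passed to an image search, e.g. (hypothetical call):
#   images = [search_images(q, limit=80) for q in augmented_queries(kws, op)]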
Conclusions and Future Work

The Painting Fool already has access to the methods and mashups in the computational creativity collective, including the collage generator mashup, and we plan to explore the creative potential of this. In particular, given that any process in the collective can provide input to any other process, we plan to automate the construction of chains of processes to see what the resulting mashups produce. We also plan to explore more sophisticated ways of using the processes, for instance using global workspace architectures, as explored in (Charnley 2009). Our hope is that the collective will become a resource to which computational creativity researchers can contribute creative systems, in addition to experimenting with those of others. Moreover, we envisage the collective becoming a creative entity in its own right, producing artefacts of real cultural value. The results from the preliminary image tagging experiments described above were encouraging. Hence, in future versions of the software, when The Painting Fool forms an opinion about a subject, then attempts to express that opinion via collages built using augmented keyword searches, we can be fairly confident that the attempt will be successful. However, we will need to undertake further experimentation to determine under which conditions this will be the case. The opinion-forming process will rely on sentiment analysis routines, such as those described in (Pang, Lee, and Vaithyanathan 2002), but The Painting Fool will be trained to subvert popular sentiment on particular topics, or attempt to portray both sides of an argument, if either of these methods might produce artworks with greater impact. Currently, the collages it produces are too literal. Hence, we plan to implement some obfuscation techniques, which will be used to produce pieces which require a level of interpretation by audiences. By implementing such higher level abilities, we hope to endow The Painting Fool with behaviours that will one day be regarded as creative.

Acknowledgements

We would like to thank Anna Krzeczkowska for her original work on the collage generation system, and John Charnley for foundational work on the computational creativity collective. We wish to thank Fai Greeve for comments which led to a reappraisal of the dissipation of intent in the collage generation process, in addition to Sanjay Bilakhia, Michal Cudrnak, Daniel Waddington and Douglas Willcocks for their contributions to the collective. We are grateful to the anonymous reviewers for their useful comments.

2011_10 !2011 Theme-Based Cause-Effect Planning for Multiple-Scene Story Generation Karen Ang, Sherie Yu and Ethel Ong Center for Language Technologies College of Computer Studies, De La Salle University Manila, 1004 Philippines karenang0903@yahoo.com, sherie_yu@yahoo.com, ethel.ong@delasalle.ph

Abstract
Early literacy in children begins through picture drawing and the subsequent sharing of an orally narrated story out of the drawn picture. This is the basis for the Picture Books story generation system, whose motivation is to produce a textual counterpart of the input picture in order for the child to associate words with images. However, stories are composed of sequences of events that occur in a cause-effect loop, and the single-scene input picture approach may lead to a story whose event flow may not match the child's original intended story. In this paper, we present Picture Books 2, which provides an environment for a child to creatively define a sequence of scenes for his input picture, and then uses a theme-based cause-effect planner to generate a fable story narrating the flow of events across the scenes, for children age 6-8 years old.
Introduction

Storytelling is an important aspect of human life. People use stories to share knowledge, experiences and ideas. Research has shown that in their early years, children draw pictures and then tell stories out of these afterwards. This helps develop their literacy skills and creativity, as one form of measuring creative thinking abilities is to assess the articulateness of children in telling stories through drawings (Torrance, 1977). Furthermore, it has been found that children recognize pictures more easily than words (Fields and Spangler 2003). Picture Books (Hong et al 2009) is an existing system that generates fable-form stories for children age 4-6 years old based on a given single-scene input. This input picture contains the basic story elements - background, characters and objects - that are selected by the child from a predefined list of stickers in the system's Picture Editor. The generated stories embody a moral lesson or theme encapsulated in a plot structure that flows from negative to positive, where a child violates a stated lesson, experiences the consequences of such violation, and learns the required value at the end of the story. The themes are randomly selected from a list of pre-defined themes associated with the specified background, while the plot structure follows the classic story pattern presented by Machado (2003) that flows from problem, rising action, resolution, to climax. Although Picture Books showed the potential for computers to exhibit creativity in the form of literary art, there are a number of factors in storytelling that are currently not supported by the system (Ong 2009). Stories are sequences of events or scenes, and the single-scene structure of the system limits the planner in the events that it may generate, which may not necessarily match the original intent of the child whilst defining the scene. This makes the generated story less interesting, as it may not adequately capture the story that was originally conceptualized by the child. Picture Books 2 (PB2) extends the first system (from here on referred to as PB1) by allowing the children, this time age 6-8 years old, to define multiple scenes which serve as the input picture to the story planner. Enabling the children to input several scenes can lead to stories that are longer and have a more complex plot. Computational storytelling can then be used to enhance the creative abilities of children, as they fluently elaborate their stories through connecting sequences of scenes to form a single storyline. Fluency and elaboration are two measures of creative thinking abilities as defined in (Torrance, 1977). Following PB1, PB2 also provides a set of backgrounds that the child can select, and a library of character and object images (called stamps) that can be pasted onto the selected background in order to create the scenes. It uses a theme-based cause-effect planning algorithm to generate the content of stories that still promote moral values, this time set in more adventurous places like the camp or the street, to allow older children to explore the world and learn life's lessons on their own. In order to facilitate the flow of the story from one scene to the next, two types of transitions are identified. The existence transition refers to the presence of a stamp in a particular scene. The movement transition refers to the movement or change in position of a particular stamp between adjacent scenes. Another factor is the addition of traits to the characters.
Using common animal characters in fables, child educators believe that embodying these characters with traits (e.g., a monkey is mischievous, a panda is loyal, and a rabbit is hardworking) would help children to relate to the story better. Riedl and Young (2004) noted that character believability is an essential property of narratives, because the events that occur in the story are motivated by the beliefs, desires and goals of the characters. Thus, each of the characters in PB2 also possesses traits that comprise one of the factors affecting the flow of the story to be generated. The rest of this paper is organized as follows. First, the knowledge base used by the story planner is presented. This is followed by a discussion of PB2's architecture, with emphasis on the planning process. The paper then presents the results of qualitative and quantitative analysis performed by linguists on the generated stories, and ends with a summary of research findings and further work that can be done to improve the system.

Storytelling Ontology

Storytelling relies on a large body of knowledge about the story world, character representation (traits, emotions, behavior), and a causal chain of actions and events. Actions are performed directly by characters, while events occur as a result of performing some actions, the occurrence of another event, or as a naturally occurring phenomenon. The ontology of PB2 stores storytelling knowledge comprising world concepts and events common in a child's everyday life. It contains a network of binary semantic relations patterned after ConceptNet (Liu and Singh 2004), a free, machine-usable lexical and commonsense ontological knowledge representation. The PB2 ontology was then populated with concepts that are suitable for the target users and relevant to the identified themes. Among the 20+ relations in ConceptNet, only the CapableOf, UsedFor, ReceivesAction, EffectOf, HasSubevent, HasProperty and IsA relations were relevant to PB2. Additional semantic relations, namely Feels, CausesConflictOf, LeadsTo, IsTransition, and HasResolution, were defined to support concepts related to scene transitions, character traits, and theme-based planning. Table 1 describes these relations and provides examples defined in PB2. In order to facilitate the flow of the story from one scene to the next, the story planner must be able to identify changes that have occurred between two adjacent scenes. These transitions are classified into two types: stamp (character or object) appearance/disappearance, and movement. Appearance and disappearance is easily determined by checking if a stamp that is present in one scene is still present in the subsequent scene. Concepts related to this type of transition are then modelled in the ontology using the semantic relation IsTransition. For example, eat - IsTransition - disappearance associates the action that can cause an object, such as a marshmallow, to disappear across two adjacent scenes. Similarly, if the marshmallow that was absent in the first scene appears in the second scene, the relation buy - IsTransition - appearance is used to model a possible action necessary for this. To model stamp movements, each background image is divided into 6x6 grids, as shown in Figure 1. Each grid is labelled to track the position of a stamp in the background. A stamp is considered to have moved between two adjacent scenes if its position label in the first scene is different in the next scene.
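Detecting these transitions reduces to comparing stamp sets and grid labels across adjacent scenes. A minimal Python sketch (ours; a scene is modelled as a dict from stamp name to grid label):

def transitions(prev, curr):
    """prev, curr: dicts mapping stamp name -> grid label (one of the 36 cells)."""
    result = {}
    for stamp in set(prev) | set(curr):
        if stamp not in prev:
            result[stamp] = "appearance"
        elif stamp not in curr:
            result[stamp] = "disappearance"
        elif prev[stamp] != curr[stamp]:
            result[stamp] = "movement"
        else:
            result[stamp] = None  # no transition
    return result

# Mirrors the example described below (the flashlight's grid label is illustrative):
# transitions({"Danny": 25, "marshmallow": 25}, {"Danny": 28, "flashlight": 20})
# -> {"Danny": "movement", "marshmallow": "disappearance", "flashlight": "appearance"}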
An example concept for this type of transition is walk - IsTransition - movement. PB2 also recognizes six traits - responsible, honest, brave, helpful, obedient, and persevering. Each character has been assigned three positive and three negative traits.

Relation | Definition | Example
Feels | Denotes the emotional response of the character to an event. | Character - Feels - Sad
CausesConflictOf | Used for selecting a story theme (conflict) based on the character's negative trait. | Brave - CausesConflictOf - Scared
LeadsTo | Used to associate an object to a theme (or conflict). | Flashlight - LeadsTo - Scared
IsTransition | Used to associate an action to a transition. | Eat - IsTransition - Disappearance; Bring - IsTransition - Appearance
HasResolution | Used to determine the appropriate resolution for a conflict. | Scared - HasResolution - Search
CapableOf | Represents an action that a character can execute in the story. | Character - CapableOf - Eat
UsedFor | Represents an activity that can take place in a specified location. | Camp - UsedFor - Camping
ReceivesAction | Relates an action that can be performed on an object. | Marshmallow - ReceivesAction - Eat
EffectOf | Provides a causal chain relationship between two events. | Tired - EffectOf - Sleep
HasSubevent | Specifies an event that can occur before another event. | Eat - HasSubevent - Cook
HasProperty | Specifies an adjective to describe a noun. | Camp - HasProperty - Far
IsA | A generalized concept of a noun. | Marshmallow - IsA - Food
Table 1. Semantic Relations in the Ontology of PB2

Figure 1. Grids in the Camp Background

System Architecture

Picture Books 2 has four main modules, namely, the Story Editor, the Story Planner, the Sentence Planner, and the Story Generator, as depicted in Figure 2.

Figure 2. System Architecture

The Story Editor allows the child to choose from a predefined list of background, character and object stamps. Currently, there are four backgrounds (camp, street, park and classroom), four characters (dog, pig, hippo and rabbit), and 16 objects to choose from. The available objects vary depending on the selected background. Each story can have only one background for all its scenes, one character, and up to four objects. An input picture is required to have a minimum of three scenes, to depict the initial setting, the problem phase, and the resolution phase of the story. An abstract representation of the input picture that is forwarded to the Story Planner is shown in Figure 3. It shows that the input picture contains three scenes. The first scene (scene 0) contains a character stamp named Danny, who appeared in this scene and is at grid 25, and an object stamp named marshmallow, which also appeared in this scene and is also at grid 25. The second scene shows a movement of the character stamp from grid 25 to 28, and the appearance of another object, the flashlight. The marshmallow is assumed to have disappeared as it is no longer in the scene. The value null signifies no transition.

Story Planner

The Story Planner produces a story plan comprising semantic relations retrieved from the ontology. These semantic relations represent the progression of the story through a causal chain of character actions and events that will lead the main character to overcome one of his/her negative traits. The planner works by considering the traits of the character, the objects present in the scenes, and the scene transitions.
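To make the scene-transition detection described above concrete, the following is a minimal sketch of our own (not PB2's actual code); scenes are assumed to be dictionaries mapping stamp names to grid labels on the 6x6 grid, mirroring the abstract representation in Figure 3:

# Sketch of transition detection between two adjacent scenes,
# where each scene maps stamp names to grid labels.
def detect_transitions(prev_scene, next_scene):
    transitions = []
    for stamp, grid in prev_scene.items():
        if stamp not in next_scene:
            transitions.append((stamp, "disappearance"))
        elif next_scene[stamp] != grid:
            transitions.append((stamp, "movement"))
    for stamp in next_scene:
        if stamp not in prev_scene:
            transitions.append((stamp, "appearance"))
    return transitions

# Example mirroring Figure 3: Danny moves from grid 25 to 28,
# the marshmallow disappears, and a flashlight appears.
scene0 = {"Danny": 25, "marshmallow": 25}
scene1 = {"Danny": 28, "flashlight": 30}
print(detect_transitions(scene0, scene1))
# [('Danny', 'movement'), ('marshmallow', 'disappearance'), ('flashlight', 'appearance')]

Each detected transition can then be matched against IsTransition relations in the ontology (e.g., disappearance suggests the action eat) to propose candidate story actions.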
A theme depicting the conflict of the story is selected based on the main character's negative traits (those the character lacks) and the objects present in the conflict (or middle) scene. Candidate conflicts are retrieved from the ontology using the relation CausesConflictOf. Table 2 presents some character trait concepts and their associated conflict concepts. Note that each binary relation means that the absence of the trait concept in the character (e.g., not brave) leads to a story with the stated conflict (scared).

Figure 3. Abstract Story Representation

Concept 1 | Relation | Concept 2
Brave | CausesConflictOf | Scared
Responsible | CausesConflictOf | Lose
Obedient | CausesConflictOf | Disobey
Table 2. CausesConflictOf Relations between Non-Character-Trait and Conflict Concepts

The setting of the story is based on the background and the selected theme. This includes the time when the story takes place and the adjective to be used in describing the background. For instance, given the theme of a character learning to be brave that is set in the camp background, the most likely story will be that the character is scared of the dark, and thus the time should be set to evening. The planner generates the chain of events in the story by taking the background adjective as the root node and finding a path in the ontology to connect this to the identified conflict. Table 3 presents some adjective concepts associated with backgrounds through the HasProperty relation.

Concept 1 | Relation | Concept 2
Camp | HasProperty | Far
Camp | HasProperty | Crowded
Park | HasProperty | Clean
Class | HasProperty | Quiet
Table 3. HasProperty Relations for Backgrounds

Event generation also considers the possible events that may happen given the transitions between scenes. Furthermore, an event can be considered in the story only if the character is capable of performing the associated action and the object required for its performance is present in the scenes. Other events may also require a specific location. Once these preconditions are met, the planner finds a causal chain of events from the root node (background adjective) to the target node (conflict) using EffectOf relations. Table 4 presents EffectOf relations showing the causal links between concepts, starting from the background adjective far, leading to its effect, e.g., tired. Tired, in turn, may necessitate the character to eat. The chain continues until the concept node matching the identified conflict, e.g., scared, has been reached.

Concept 1 | Relation | Concept 2
Tired | EffectOf | Far
Eat | EffectOf | Tired
Sleepy | EffectOf | Eat
Sleep | EffectOf | Sleepy
Hear | EffectOf | Sleep
Scared | EffectOf | Hear
Table 4. Chain of EffectOf Relations

If a path is found, the planner also checks for candidate sub-events that can occur in order to increase the length of the story. Currently, only one sub-event is included in the story plan. Figure 4 illustrates a sample causal chain of events based on the relation EffectOf. The orange-colored nodes denote concepts found through the HasSubevent relation. For example, the sleep concept has the sub-events pray, comb, and brush. The dark-colored nodes are the root node and the target node, representing the background adjective and the theme or conflict of the story, respectively. A possible story path is: "crowded - bump - dizzy - pray - sleep - hear_sound - scared".

Figure 4. Sample Causal Chain of Events

In order to achieve variance in the generated stories based on the same input picture, the planning algorithm uses a simple random selection approach to identify the nodes to be included in the chain of events.
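The randomized chain-building step can be sketched roughly as follows (our illustration, not PB2's actual planner, which also checks character capabilities, object availability and transitions; the EffectOf triples are taken from Table 4 and stored as cause-to-effect links):

import random

# EffectOf triples from Table 4, stored as cause -> possible effects.
# "Tired - EffectOf - Far" means that being far causes tiredness.
effect_of = {
    "far": ["tired"],
    "tired": ["eat"],
    "eat": ["sleepy"],
    "sleepy": ["sleep"],
    "sleep": ["hear"],
    "hear": ["scared"],
}

def find_chain(root, conflict, max_len=10):
    # Randomly walk EffectOf links from the background adjective (root)
    # until the conflict concept is reached or the walk dead-ends.
    chain = [root]
    while chain[-1] != conflict and len(chain) < max_len:
        options = effect_of.get(chain[-1])
        if not options:
            return None  # dead end: no path found
        chain.append(random.choice(options))  # simple random selection
    return chain if chain[-1] == conflict else None

print(find_chain("far", "scared"))
# ['far', 'tired', 'eat', 'sleepy', 'sleep', 'hear', 'scared']

With richer ontologies the random walk can branch, which is what produces variance across generations from the same input picture.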
Future work on PB2 should consider a scoring function to guide the planning process.

Event generation is repeated in order to find a path from the conflict to its possible resolutions. Table 5 presents the relationships between conflict concepts and resolution concepts. For example, if a character is scared, then the resolution phase of the story should involve actions requiring the character to search for the causes that led to his being scared, e.g., what is making the sound in the night?

Concept 1 | Relation | Concept 2
Scared | HasResolution | Search
Lose | HasResolution | Admit
Disobey | HasResolution | Apologize
Table 5. HasResolution Relations between Conflict and Resolution Concepts

The Sentence Planner produces character goals (Uijlings 2006) by aggregating two or more consecutive semantic relations using discourse markers. A character goal represents one sentence in the final story. Figure 5 shows a sample output of character goals for the first scene in the abstract story representation in Figure 3. For each character goal entry, agent is the actor, art(n) is the article to be used, verb is the action to be performed, patient is the receiver of the action, rst:n specifies the type of discourse marker to be used, type signifies whether the sentence will be joined with another sentence, and tense is the verb's tense. Based on feedback from the linguist, children's stories are usually written in the past tense.

Figure 5. Sample Output Character Goals

The Sentence Planner also lexicalizes concepts and generates a set of sentence specifications, which is then forwarded to the Story Generator to produce the surface text with the use of an external surface realizer, simpleNLG (Venour and Reiter 2008).

Results and Analysis

Ten sets of stories were given to the evaluators, comprising two linguists and one storywriter. These evaluators were chosen to provide expert judgement on the quality of the generated stories in terms of linguistic structure, narrative content, and appropriateness for the target audience. No feedback has been solicited from the intended audience themselves at the time of this writing because, as the results below will show, the system still needs major work in its planning algorithm and knowledge representation in order to produce stories that children may truly appreciate. The stories were generated from an ontology that contains 1,002 concepts and 1,442 semantic relations. The lexicon has been populated with 769 terms. The evaluation was performed twice; after the first evaluation, PB2 was revised to address some of the feedback, then ten stories were regenerated to undergo a second evaluation. Four criteria were used: language; coherence and cohesion; characters, objects and background; and content. Each criterion has a set of associated questions that are rated with the following scores: 5 - strongly agree, 4 - agree, 3 - neutral, 2 - disagree, 1 - strongly disagree. Table 6 shows the results.

Criterion | First | Second
Language | 4.01 | 4.43
Coherence and Cohesion | 3.28 | 3.66
Character, Objects and Background | 4.02 | 4.02
Content | 3.64 | 3.76
Overall Rating | 3.74 | 3.96
Table 6. Summary of Quantitative Evaluation

The language criterion deals with the correctness of the sentence structure and the appropriateness of the words used. Also included is the proper usage of articles, pronouns and punctuation marks. Here, PB2 received an average score of 4.43 after the second evaluation, since the English grammar rules and the lexicon used during sentence generation were defined specifically with the target users in mind.
During revision, rules related to the correct usage of pronouns and articles provided by the linguists were implemented. However, there are still cases of incorrect usage, as shown in the examples below, where there is an incorrect usage of an article and a missing article.

Output: She spilled a juice.
Correct: She spilled juice.

Output: She played a game in park.
Correct: She played a game in the park.

The coherence and cohesion criterion is concerned with the transition of events and the flow of the sentences, to evaluate whether the generated story makes sense and is easy to understand. Coherence between sentences can be enhanced through the use of discourse markers (Mann and Thompson 1987). Taylor (2009) provided a list of common discourse markers, and those appropriate for elementary-age children are presented in Table 7.

Used to signal | Transition Word
Addition | Also, again, and, besides
Time | After, before, during, later, now, then
Cause or Reason | Because, since
Effect | Because, hence, so, thus
Direction | Above, behind, below, between, near
Summary | So, thus
Table 7. Common Discourse Markers for Children

PB2 received its lowest average score, 3.66, in this criterion because although the stories contained discourse markers, these are sometimes used inappropriately with respect to the context of the sentence, as shown below.

Output: Danny the dog ate a marshmallow, thus he felt sleepy.
Correct: Danny the dog ate a marshmallow, and thus he felt sleepy.

On the other hand, the absence of discourse markers resulted in the generation of choppy sentences.

Output: He slept in tent. He heard a sound.
Correct: He slept in tent. While he was sleeping, he heard a sound.

Because the planner uses a random selection approach and does not perform reasoning over the resulting path of semantic relations, there are also cases in which the generated story is not logical:

He brought a blue water jug. The camp was very far. He felt tired. He felt thirsty.

The characters, objects and background criterion examines the interplay between the character, object and background elements and the story itself. This includes checking for the incidence of character traits and the moral lesson, and the appropriateness of the objects with respect to the chosen background. The system received an average score of 4.02 in both rounds of evaluation. There are instances wherein objects placed in the input scenes are not included in the generated story. There are also instances where an object, although introduced in the story, does not play any role in it. The excerpt below illustrates this.

[1] It was a fine evening. [2] Danny the dog was in the camp for a trip. [3] He buys a packed marshmallow. [4] The camp is very big. … [5] He sees a shadow. [6] Danny the dog feels scared. [7] He does not know what to do. … [8] Since then, He learns to be brave.

The given story excerpt was generated from an input picture comprising three scenes. In the first scene, a marshmallow object has been included; it is introduced in line [3]. However, it plays no part in the development of the theme, where the main character has to learn to be brave. In fact, aside from line [3], no other text in the story mentions the marshmallow again. The content criterion covers the appropriateness of the story for the target age group, the adequacy of the details provided, as well as the believability of the events in the story.
The evaluators noted that the generated stories follow the basic structure of a children's story. They also found these stories to be quite interesting due to the interplay of the conflict and resolution with the theme of the story, as well as the chain of events.

Conclusion

Picture Books 2 demonstrated that a coherent story with the four basic classic story subplots of Machado (2003) can be generated from a given input picture with at least three scenes. This is achieved by a theme-based cause-effect planner that utilizes an ontology of semantic domain and narrative knowledge, and a sentence planner that utilizes discourse markers to connect two or more events together. The story planner provides a mechanism to control the sequencing of events to adhere to the basic story plot suitable for children's stories, while also allowing for flexibility and variance in the generated stories. This is done by manually populating the semantic ontology with the relevant binary relations and concepts. However, the population must be done with caution: based on the tests conducted, overpopulation can lead to illogical story paths, while underpopulation can prevent stories from being generated at all.

Because the ontology representation makes use of binary relations, logical errors can arise in the resulting stories. An instance of this is the pair of relations "dizzy - EffectOf - see" and "people - ReceivesAction - see", which together mean that if the character sees many people he or she feels dizzy. However, given that there is also a relation "marshmallow - ReceivesAction - see", this resulted in story text where a character feels dizzy because he or she saw a marshmallow, which makes no sense at all.

Even though both PB1 and PB2 utilize a semantic ontology, the story planner of PB1 has a set of predefined planning operators in the form of author goals (Hong et al. 2009; Cua et al. 2010), similar to Minstrel (Turner 1992), that represent high-level tasks to guide the construction process by focusing on the narrative structure of the story. The author goals in turn are divided into two or more character goals (Uijlings 2006), again predefined, and these two types of goals are used to constrain the stories being generated. The ontology is used only to provide the information needed to fill in the attributes of the character goals in a theme-driven story plot template (Ong 2010). This generated stories of good quality, coinciding with the findings of Peinado and Gervás (2006) that "ontology-based stories obtain good results on coherence because the ontology forces explicit links between events". PB2, on the other hand, did away with predefined author goals, while its character goals are generated dynamically based on the semantic relations found in the story path retrieved by the planner.

The development of a reasoning engine that contains rules for checking logical inconsistencies and performs inferencing on the set of candidate story paths should be explored to address the issues of resolving ambiguities and generating logical stories. A more comprehensive model for representing the current state of the world and the changes that have already taken place (such as previous actions of the character, previous events, and changes in the objects that are in the character's possession or in the story world) should also be explored, to enable the system to consistently generate story events that are logical and believable.
Finally, feedback should be solicited from the intended users (the children) to assess the effectiveness and usability of the system in supporting their creative expression, the degree of consistency of the actual story with the target story conceived through the input picture, and the ability of the generated story to capture and retain their attention.

2011_11 !2011 Experimental Results from a Rational Reconstruction of MINSTREL Brandon Tearse, Peter Mawhorter, Michael Mateas, Noah Wardrip-Fruin Department of Computer Science, University of California, Santa Cruz, Santa Cruz, CA 95064 {batman, pmawhorter, michaelm, nwf}@soe.ucsc.edu

Abstract

This paper presents results from a rational reconstruction project that is aimed at exploring the creative potential of Scott Turner's 1993 MINSTREL system. In particular, we investigate the properties of Turner's original system of Transform-Recall-Adapt Methods (TRAMs) and analyze the performance of an alternate TRAM application strategy. In order to estimate the creativity of our reconstructed system under various conditions, we measure both the variance of the output space and the ratio of sensible to nonsensical results (as determined by hand-labeling). Together, these metrics give us insight into the creativity of the algorithm as originally constructed, and allow us to measure the changes that our modifications induce.

Introduction

Scott Turner's 1993 MINSTREL system (Turner 1994) is considered a high-water mark for computational story generation. The system was based on the notion of imaginative recall: the idea that stories can be built by adapting fragments from other stories and jigsawing them together into something new. Turner argued that MINSTREL was a computer simulation of human creativity and that the Transform-Recall-Adapt Methods (TRAMs) were the cornerstone of that creativity, but he never fully evaluated his system. This paper formally evaluates the creativity of MINSTREL's TRAM system. Previous work by Tearse et al. discussed an initial reimplementation of MINSTREL (Tearse, Mateas, and Wardrip-Fruin 2010); in this paper we detail experiments which seek to shed light on the creativity exhibited by TRAMs in MINSTREL REMIXED. We have continued the rational reconstruction efforts on the system and have attempted to measure creativity according to Turner's original definition. Turner states that "Whether or not something is creative depends on the number and quality of its differences from similar works" (Turner 1994), so our goal is to measure the variety and quality of the TRAM system's output relative to its input (the stories that it knows). Specifically, we measure the expected amount of variance between results, and whether they are sensible or nonsensical. Although the diversity of results is not a direct measure of variance relative to similar works, as the diversity increases, so does the likelihood that the system generates more creative results; the diversity of the results (along with their quality) is therefore an appropriate proxy for the creativity of the system. Beyond measuring the creativity of the original system, we also modify the system in order to shed light on the trade-offs it gives rise to between quality and diversity. In running these experiments, we aim to validate the performance of MINSTREL as a creative system relative to Turner's definition of creativity.
Related Work

Rational Reconstruction

Rational reconstruction is done to investigate the inner workings of a system, ideally identifying the differences between implementation details and core processes. Projects such as that of Musen, Gennari, and Wong (1995) have successfully used rational reconstruction to better understand the fundamental concepts of a system. A partial reconstruction of MINSTREL has even been performed (Peinado and Gervás 2006), in which the knowledge representation systems of MINSTREL were recreated in W3C's OWL. While this did a good job of proving that the knowledge representation can be successfully recast, without the full system in place it could not be used to investigate other aspects of MINSTREL.

CBR-Based Storytelling Systems

Case-Based Reasoning (CBR) is a popular approach to building intelligent systems, and it is the basis of MINSTREL's TRAM search system. While most CBR systems try to match input to a library and then adapt the responses into useful results, MINSTREL goes a step further by transforming its query in order to locate matches that are further afield, which in turn increases its creative options. Other systems have used enhanced CBR to develop stories or character actions (Fairclough and Cunningham 2003; Gervás et al. 2005). While these systems are all interesting in their own right, Turner in particular made claims about his system's creative output which can now be investigated.

Creativity

The idea of computational creativity is an area of interest for computer science, in part because it is closely tied to the notion of artificial intelligence. Unfortunately, creativity is difficult to define, and although Boden and Ritchie have both presented guidelines and measurements (Boden 2004; Ritchie 2007), their concepts remain difficult to implement. Given the nature of our specific system and the general agreement between Boden, Ritchie, and Turner on the importance of variety and quality, we decided to measure creativity based on Turner's (1994) suggestion, looking at the variety and quality of the results. By using Turner's original definition, we are also measuring the system by its own standards, so to speak: our results have direct bearing on whether Turner's system should be viewed as a success on his terms. Although measuring the number of possible outputs and the quality of those outputs (by measuring both size and sense-to-nonsense ratios) is useful, we also found it instructive to cluster the results in a manner similar to Smith et al. (2011) in order to get a better picture of actual result differences.

Method

Rational Reconstruction

As a rational reconstruction, the ultimate goal of our project is twofold: recreate MINSTREL in a form that can be used by others, and investigate the design choices that went into MINSTREL in order to learn about the creative potential of the system. To achieve these goals, we have created MINSTREL REMIXED, a Scala codebase that reimplements the functionality of the original MINSTREL. Working from Turner's 1993 dissertation, we have tried to create a faithful reproduction of the original while introducing modularity. This modularity has allowed us to explore alternatives to several of Turner's design choices, thus better characterizing the trade-offs faced by the system.

TRAMs

The TRAM system performs all of the fine-grained editing in MINSTREL REMIXED. TRAMs are small bundles of operations designed to help recall information from the story library.
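As a rough orientation before the detailed description in the next paragraphs, the Transform-Recall-Adapt cycle can be sketched as follows. This is our own simplification, not MINSTREL's or MINSTREL REMIXED's actual code: flat dictionaries stand in for MINSTREL's story graphs, and a single invented transform stands in for the full TRAM set.

import random

# Invented story library: flat dicts standing in for graph fragments.
library = [
    {"actor": "Frances", "act": "fight", "victim": "Ogre", "outcome": "dead"},
    {"actor": "Peach", "act": "drink-potion", "victim": "Peach", "outcome": "hurt"},
]

def generalize_constraint(query):
    # Stand-in for a transform TRAM: drop one constraint at random.
    q = dict(query)
    q.pop(random.choice(list(q)))
    return q

def recall(query):
    # Recall: fragments matching every remaining constraint.
    return [f for f in library if all(f.get(k) == v for k, v in query.items())]

def tram_lookup(query, original=None, depth=0, max_depth=5):
    original = original if original is not None else query
    matches = recall(query)
    if matches:
        # Adapt: overwrite the recalled fragment with the original bindings.
        return {**random.choice(matches), **original}
    if depth >= max_depth:
        return None
    # On failure, transform the query and resubmit, as the original MINSTREL does.
    return tram_lookup(generalize_constraint(query), original, depth + 1, max_depth)

# "Give me a fragment in which John dies" -- may come back as a duel or,
# via the potion story, as John killing himself with a potion.
print(tram_lookup({"actor": "John", "outcome": "dead"}))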
TRAMs are used to return information by giving them a query (a graph containing nodes describing some story fragment, using MINSTREL's graph-based story representation). A TRAM transforms the query, finds matches in the story library, and adapts one of those matches before returning it. Of course, the process of finding matches in the library may require further use of TRAMs. The TRAM system has the task of choosing which TRAMs to use during this recursive querying process. An example of TRAMs in action, using one of Turner's original King Arthur stories, is as follows: a graph is passed in requiring a knight, John, to die by the sword. By transforming this query using a TRAM called GeneralizeConstraint, we might end up with a query in which something dies in a fight. If this then matched a story about another knight, Frances, who kills an Ogre, the TRAM system could replace the Ogre with John the knight and return a fragment about a duel between John and Frances in which John dies. The creative power of TRAMs ultimately comes from their ability to find cases in the story library which are not easily recognizable as applicable.

In the original MINSTREL, queries that fail to return results are transformed and resubmitted. This leads to a random sequence of transformations before a result is located. In MINSTREL REMIXED, however, we have implemented a weighted random TRAM selection scheme. This allows both the original functionality and a more targeted, weight-based TRAM application to be used. The targeted TRAM approach results in fewer alterations being needed to get results, and thus more similarity between the original query and the eventual result. To provide a concrete illustration of the TRAM system and its ability to produce varying results, we can look at other ways for John the knight to die. If we start with a fragment in which John is required to die, then alter it with one TRAM to require an ambiguous health change rather than death, and then with another which replaces John with a generic person, the resulting query matches the story fragment from Figure 1. Generic John's health-changing action matches to Princess Peach, who uses a potion to hurt (rather than kill) herself. Upon adaptation back to the original requirements, the resulting fragment is that John commits suicide, killing himself with a potion (Figure 1 shows the process of generalization, recall, and adaptation from left to right).

Figure 1: An example of the transform and adaptation process.

Measuring Creativity

Many notions of creativity focus on the differences between system input and output. In particular, a system can be considered creative when it produces outputs that are significantly different from the given input (Pease and Winterstein 2001). Of course, the ability to generate a great variety of nonsensical outputs is not particularly impressive: variety must be balanced against quality. Unfortunately, both story quality and "significant difference" are difficult to measure in the realm of stories. In domains that have less dense semantic information, perceived difference between artifacts is usually related to measurable qualities of those artifacts (for example, pitch and duration in music are directly measurable and affect the perception of a piece). Given measurable qualities that relate to difference, a distance function can be used to cluster the output of a generator, and the resulting clustering will characterize how much meaningful variety that generator can produce.
For stories, however, significant difference is difficult to measure, let alone capture in a distance metric between stories. Measurable qualities, like word differences or story length, have unpredictable influences on the difference between two stories (for example, the same story could be told twice with very different overall lengths). More generally, every measurable aspect of a story has some relevance to the story, but changing less relevant aspects of the story can result in insignificant differences. The problem of deciding which details of a story are relevant enough is itself a difficult, unsolved AI problem; creating a computational distance metric over stories that corresponds to human perceptions would be a daunting task.

Besides characterizing the variety of our output, we hope to measure its quality. Ideally, each generated story could be assigned a quality value, and a result set could then be evaluated in terms of the quality of the results it contains. Of course, computationally evaluating story quality is also an open problem, so for both variety and quality we are forced to rely on estimates. To estimate the overall quality of a result set, we hand-label each distinct result in the set as either sensible or nonsensical. By using a binary labelling, we arrive at a relatively concrete metric over our set of results (an integer scale would be subject to more noise during labelling and would require a subjective weighting function to be comparable between tests). To estimate the variety of our result space, we measure the expected variation between a pair of results, and use that as an estimate of the rate at which novel artifacts will be generated. The higher this estimate, the greater the variety of artifacts we expect for a fixed-size result set. And although we do not know exactly how variety among generated artifacts (measured using differences at the symbolic level, e.g., a princess is used instead of a knight) affects the variety of the results in terms of significant differences, we can assume that they are correlated (i.e., the more varied the results in terms of raw symbol differences, the more likely it is that they contain stories that are significantly different).

Figure 2: The story space divided into creative groups.

Figure 2 is a demonstration of our estimation method for result-set variance. Within the entire space (all possible stories), each point represents a particular story (along with minor variants of that story, such as stories that substitute one particular weapon for another in a fight scene). These points in turn are grouped into sets, each of which is composed of stories that are perceptually similar (so significant differences exist between these story groups, but not within them). The actual groupings are unknown (because automatically measuring significant difference between stories isn't yet possible), but we can measure the number of different particular stories (individual points) that our generator can be expected to produce. Although the exact correspondence is unknown, it is clear that the greater the variety of particular stories our system generates, the more story categories it can be expected to create examples of (in this case, if our variance measure were to increase from four to five by adding a random story, there would be a two-in-five chance that the increase would also increase the creativity of the system by including a story from group 3).
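To make the two estimates concrete, the following is a minimal sketch of our own (not the MINSTREL REMIXED codebase) of how expected symbol-level variation and a sense-to-nonsense ratio can be computed over a batch of results represented as flat attribute dictionaries:

from itertools import combinations

def differences(a, b):
    # Count field-level symbolic differences between two results.
    return sum(a.get(k) != b.get(k) for k in set(a) | set(b))

def expected_differences(results):
    # Expected number of differing fields over all pairs of results.
    pairs = list(combinations(results, 2))
    return sum(differences(a, b) for a, b in pairs) / len(pairs)

def sense_ratio(labels):
    # labels: hand-assigned booleans, True = sensible. Assumes at least
    # one nonsensical result, so the denominator is nonzero.
    sensible = sum(labels)
    return sensible / (len(labels) - sensible)

results = [
    {"actor": "Shnookie", "act": "fight", "victim": "knight"},
    {"actor": "Shnookie", "act": "potion", "victim": "Shnookie"},
    {"actor": "Shnookie", "act": "fight", "victim": "dragon"},
]
print(expected_differences(results))            # 5/3, about 1.67
print(sense_ratio([True, True, False]))         # 2.0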
So our measure of story variance is a proxy for the number of creative categories that our system will produce examples of, which in turn (along with our measure of the sensibility of the results) is a measure of the creativity of the system.

Experimental Setup

To measure the creativity of MINSTREL's TRAMs, we performed a variety of experiments, each of which involved a single query. For each experimental condition, we ran five runs of 1000 repeated queries. We then calculated averages for each condition and computed four metrics on each. First, we tallied the number of unique results (by collapsing exact duplicates). Next, we computed the number of sensible versus nonsensical results using a hand-built mapping from results to sensibility that covered all results. Finally, we computed both the probability that a pair of queries under the given experimental conditions would differ in at least one field, and the expected number of differences between such a pair. In addition to these measures, we computed separate sense-to-nonsense ratios (s/ns) for just the unique results. Using these numbers, we are able to characterize the creativity of MINSTREL under our various conditions by comparing them to the baseline.

Figure 3: Our default query.

Each experiment varied one of five underlying variables: the depth limit used to find results (how far TRAMs can recurse), the query supplied, the story library used, the set of TRAMs available, and the weights on the TRAMs in use. For the base case, we used a search depth of five, a default test query, our full set of story libraries, our full set of TRAMs, and uniform TRAM weights. Our default query consisted of an unspecified State node connected to a Goal node, which connected to an Act node and then another State that linked back to the Goal. In terms of links, the first state motivated the goal, which planned the act, which intended the second state, which achieved the goal (Figure 3 shows this structure). In the Act node, we specified "Shnookie the Princess" as the actor, and in the second State node, we specified a type of "Health" and a value of "Dead". Beyond these constraints, the nodes were completely unspecified, so our query corresponded to the instruction: "Tell me a story in which Shnookie the princess kills something." Our story library contained sixteen hand-authored stories, ranging from simple ("PrincessAndPotion", in which a princess drinks a potion and becomes injured) to fairly complex (our largest story contained 26 nodes and included eight different nouns). On average, our stories contain 11.5 nodes (Goals, Acts, States, and Beliefs) and include four nouns. For these tests, we used a limited set of TRAMs that focused on Act nodes (hence the structure of our query). To test the full range of TRAMs, we would need to engage MINSTREL's author-level planning system in order to perform multiple lookups over a single query, which would make it difficult to make statements about the TRAM system in isolation. Given our limited query, we used a total of seven TRAMs: GeneralizeConstraint, GeneralizeActor, GeneralizeRole, LimitedRecall, RecallActs, IntentionSwitch and SimilarOutcomes.

Results

To get a sense of the TRAM system's creativity, we can look at our baseline result. We find that our measure of expected variance is about 7 (6.90), while our measure of quality is near 0.5 (0.544).
In other words, if we were to submit two queries using these parameters, we would expect about seven fields in which the results differ, and on average only half as many sensible results as nonsensical ones. These parameters are not desirable for use in producing stories, because they produce too many nonsensical results (only about 35% (35.2%) of total results were sensible), but for testing the system, the even mix of sense and nonsense allows us to observe how changes promote one or the other. The expectation of seven differences between two random results is encouraging: it indicates that there is variety in the output space. Looking at the total number of stories, we can see that 1000 trials produce an average of about 70 (68.8) unique stories, about 21 (21.0) of which will be sensible. Among unique results, the s/ns ratio is around 0.4 (0.439). The fact that this ratio is higher among total results than among unique results indicates that the sampling of unique results is biased towards sensible ones: sensible stories are repeated more often than nonsensical ones. Our baseline sets a high bar for variance, but exhibits lackluster quality.

Figure 4: Total sensible versus nonsensical results.

Looking at the results as a whole, we can see some significant differences between the various cases. In Figures 4 and 5, our baseline initial query is shown under the heading Control. The D2, D3, D4, and D6 headings are for TRAM depth limits of 2, 3, 4, and 6 respectively (the initial query and all others used depth 5). In terms of the proportion of total results which are sensible, the depth doesn't seem to make a difference. Additionally, depths 3, 4, 5, and 6 are all roughly equivalent in terms of the total number of responses generated, both sensible and nonsensical. Depth 2 does generate significantly fewer results, although the s/ns ratio is still approximately the same as in the other runs. We can hypothesize that although a significant number of possible stories require at least three TRAMs to reach from our test query, the distribution of these depth-3 stories in terms of sense and nonsense is roughly the same as the distribution of results at depth 2. Based on this idea, we graphed the actual distribution of results across TRAM depth for the D6 test (shown in Figure 7, which also shows results for the Weight case). This confirmed our hypothesis: most results have depth 2, many others have depth 1 or 3, but after depth 3 the number of results falls off sharply. Looking at just unique results (Figure 8), we can see an even more marked bias towards lower depths, with extremely few results at depths 5 and 6. Comparing Figures 7 and 8, we see that many of the results at depths 5 and 6 are almost certainly identical to results at lower depths, unless the deeper unique results are repeated much more often; examining the log files in detail confirms that they are not. In terms of the MINSTREL system, then, it appears that some TRAMs may have no effect on the result of a search, possibly because other TRAMs later reverse their effects. Of course, these TRAMs do have an impact on the overall distribution of results, even if they sometimes don't affect individual searches, and the TRAMs are only sometimes ineffectual.
After experimenting with the TRAM depth, we ran an alternate query to explore how much our results might depend on the query. The alternate query had the same graph structure, but stipulated that the goal and motivating state nodes had type "Health", and that the actor of the act was a troll rather than a princess. Additionally, the state node in the alternate query was completely blank.

Figure 5: Unique sensible versus nonsensical results.

As Figures 4 and 5 show, this alternate query generated fewer unique results, but had much better s/ns ratios in both the unique results (0.947) and the total results (3.103). Given our TRAMs and story library, this alternate query is apparently more restrictive, but also makes it easier for the system to find sensible results. As might be expected, this increased story quality comes with a decrease in story variance: Figure 6 shows that the expected number of differences between two results using the alternate query has fallen to about 3.5 (approximately half of the 7 differences expected from the baseline). The query test shows that our example query should not be taken as a good estimate of the average query: there is a lot of variation between queries, and we have no reason to expect that our default query is representative. This test also exhibits the fundamental trade-off of our system: more constrained queries increase the s/ns ratio, but come at the expense of less variation. To overcome this trade-off, the system would have to either cull nonsensical results or simply generate variety that does not include nonsense.

After varying the query, we next tested the effect of library size. We constructed a smaller story library by randomly removing eight stories from our default library. We then re-ran the original query (the results are listed under Story in our graphs). This smaller story set significantly reduced the number of unique stories generated, while at the same time depressing the s/ns ratio. The expected number of differences remained high, but removing stories clearly degrades the system: MINSTREL has trouble finding parsimonious story fragments during TRAM searches, and as a result it more often resorts to nonsensical matches. Interestingly, though, the decrease in the number of unique results was not proportional to the decrease in the number of stories (although the decrease in the number of sensible unique results was). This implies that nonsensical stories are easier to generate than sensible ones.

Our next test was the most promising: we took the two TRAMs that create the most liberal changes (SimilarOutcomes and GeneralizeConstraint) and removed them. The results clearly show how important TRAMs are: the sense-to-nonsense ratio was drastically increased for both total and unique results. At the same time, they show even more clearly that nonsense is the usual price for variation: although the majority of results were now sensible, the variation within these results was reduced; the TRAM case has an expected-differences value of close to 3, compared to the original 7. Effectively, even though these liberal TRAMs often create nonsense, they are also key in creating enough variation to generate creative results. For our final trial, we implemented an alternate search method that we hoped would provide a compromise between the chaos of the more liberal TRAMs and the boring results generated without them. Rather than use purely random TRAM selection during search, we used weighted random selection, and biased the selection towards less liberal TRAMs.
We gave the two liberal TRAMs that had been removed in the TRAM case weights of 2 and 3, while most of the TRAMs got a weight of 5 and two of the more specific TRAMs got weights of 8 and 10. Given these weights, the expected variation was maintained, but the s/ns ratios decreased for both total and unique results. The number of unique results decreased as well. Figure 7 shows that among all results, results that were found after only a single TRAM application increased significantly, at the expense of results that used more than four TRAMs. Interestingly, the distribution of unique results among depths (seen in Figure 8) did not fundamentally change. Essentially, our weighting scheme did not help find new results, but instead biased the total results towards shallow unique results, some of which were nonsensical. This result demonstrates that even minor changes to the search method can negatively impact the results (as opposed to simply favoring either variety or quality). The fact that TRAM weights can significantly impact the result set also suggests that a principled approach to selecting TRAM weights could potentially enhance the system.

Conclusions

To evaluate the creativity of MINSTREL's TRAM system, we adopted formal measures of variety and quality to systematically investigate the effect of MINSTREL's parameters and design choices on the creativity of the system. In addition to measurable creativity, we were pleased to notice that a number of the story fragments showed interesting results (for example, one of our stories involved a group of possessed townsfolk: the princess wants to injure them because they are possessed, but ends up killing them).

Figure 6: The expected number of fields that will differ between two results.
Figure 7: TRAM depths for the D6 and Weight trials.
Figure 8: Unique depths for the D6 and Weight trials.

We feel that our main finding is that there is a demonstrable trade-off between the variety and the quality of the results: conditions that increase the quality of the results come with corresponding decreases in their variety. No single configuration among our tests emerged as superior, but given a preference for either variance or sensibility, we have enough data to recommend a set of parameters. By measuring both variance and quality, we were able to effectively track the creative properties of the system, and to observe both expected and unexpected changes under various conditions. Although our measurement of creativity is not direct, our data-driven approach has allowed us to make very specific statements about the way the system operates, and to experiment and discover nuances of the system that would not be uncovered by a more general creativity assessment.

Future Work

The next step in the evolution of MINSTREL REMIXED is to implement Turner's author-level planning system. Creativity in MINSTREL is not restricted to the TRAM system alone: the way that author-level plans use the TRAM system, and the interplay between various author-level plans, gives rise to sensible creative results (for example, there are author-level plans that check the consistency of the story, which can be helpful when the TRAM system comes up with odd results). Once author-level planning is implemented, a more holistic study of MINSTREL's creativity could be produced.
To do this, more modern measures of creativity, such as cognitively inspired methods (Riedl and Young 2006) and non-automated evaluative frameworks (Ritchie 2007), should be investigated to determine which measures would be applicable to the full system output. We may also revisit our TRAM weighting system. Although the weights that we chose for this experiment resulted in poor performance, armed with metrics for variance and quality we could optimize the TRAM weights. This process would provide further information about how each TRAM contributes to variation and to sensibility.

Acknowledgements

This work was supported in part by the National Science Foundation, grants IIS-0747522 and IIS-1048385. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

2011_12 !2011 Automatic Generation of Emotionally-Targeted Soundtracks Kristine Monteith1, Virginia Francisco2, Tony Martinez1, Pablo Gervás2, Dan Ventura1 kristinemonteith@gmail.com, virginia@fdi.ucm.es, martinez@cs.byu.edu, pgervas@sip.ucm.es, ventura@cs.byu.edu Computer Science Department1, Brigham Young University, Provo, UT 84602, USA; Departamento de Ingeniería del Software e Inteligencia Artificial2, Universidad Complutense de Madrid, Spain

Abstract

Music can be used both to direct and to enhance the impact a story has on its listeners. This work makes use of two creative systems to provide emotionally-targeted musical accompaniment for stories. One system assigns emotional labels to text, and the other generates original musical compositions with targeted emotional content. We use these two programs to generate music to accompany audio readings of fairy tales. Results show that music with targeted emotional content makes the stories significantly more enjoyable to listen to and increases listener perception of emotion in the text.

Introduction

Music has long been an integral aspect of storytelling in various forms of media. Research indicates that soundtracks can be very effective in increasing or manipulating the affective impact of a story. For example, Thayer and Levenson (1983) found that musical soundtracks added to a film about industrial safety could be used to both increase and decrease viewers' electrodermal responses depending on the type of music used. Bullerjahn and Guldenring (1994) similarly found that music could be used both to polarize the emotional response and to impact plot interpretation. Marshall and Cohen (1988) noted significant differences in viewer interpretation of characters in a film depending on the type of accompanying music. Music can even affect the behavior of individuals after hearing a story. For example, Brownell (2002) found that, in several cases, a sung version of a story was more effective at reducing an undesirable target behavior than a read version of the same story.

An interesting question, then, is whether computationally creative systems can be developed to autonomously produce effective accompaniment for various modalities. Dannenberg (1985) presents a system of automatic accompaniment designed to adapt to a live soloist. Lewis (2000) also details a "virtual improvising orchestra" that responds to a performer's musical choices. Similarly, our system is designed to respond to an outside entity when automatically generating music. Our efforts are directed towards providing accompaniment for text instead of a live performer.
This paper combines two creative systems to automatically generate emotionally targeted music to accompany the reading of fairy tales. Results show that emotionally targeted music makes stories significantly more enjoyable and causes them to have a greater emotional impact than music generated without regard to the emotions inherent in the text.

Methodology

In order to provide targeted accompaniment for a given story, each sentence in the text is first labeled with an emotion. For these experiments, selections are assigned labels of love, joy, surprise, anger, sadness, and fear, according to the categories of emotions described by Parrot (2001). Selections can also be labeled as neutral if the system finds no emotions present. A more detailed description of the emotion-labeling system can be found in (Francisco and Hervás 2007). Music is then generated to match the labels assigned by the system. Further details on the process of generating music with targeted emotional content can be found in (Monteith, Martinez, and Ventura 2010).

Generating the actual audio files of the fairy tales with an accompanying soundtrack was done following Algorithm 1. A text corpus is initially segmented at the sentence level (line 1) and each sentence is tagged with an emotion (line 2). Ten musical selections are generated for each possible emotional label and converted from MIDI to WAV format (lines 5-7) using WinAmp.1 In order to produce a spoken version of a given fairy tale, each sentence is converted to an audio file (line 9) using FreeTTS,2 an open-source text-to-speech program. This provides a collection from which musical accompaniments can be selected. Each audio phrase is analyzed to determine its length, and the musical file with matching emotional label that is closest in length to the sentence file is selected as accompaniment (lines 10-11). If all of the generated selections are longer than the audio file, the shortest selection is cut to match the length of the audio file. Since this is often the case, consecutive sentences with the same emotional label are joined before music is assigned (lines 3-4). Sentences labeled as "neutral" are left with no musical accompaniment. Finally, all the sentence audio files and their corresponding targeted accompaniments are concatenated to form a complete audio story (line 12).

1 http://www.winamp.com
2 http://freetts.sourceforge.net

Algorithm 1: Algorithm for automatically generating soundtracks for text. F is the text corpus (e.g. a fairy tale) for which a soundtrack is to be generated.
SoundTrack(F)
 1: Divide F into sentences: S1 to Sm
 2: Assign emotion labels to each sentence: L1 to Lm
 3: S' ← join consecutive sentences in S with matching labels
 4: L' ← join consecutive matching labels in L
 5: for all L'i in L' do
 6:   Generate MIDI selections: Mi1 to Mi10
 7:   Convert to WAV files: Wi1 to Wi10
 8: for all S'i in S' do
 9:   Ai ← generate TTS audio recording from S'i
10:   k ← argmin_j |len(Ai) - len(Wij)|
11:   Ci ← Ai layered over Wik
12: O ← C1 + C2 + ... + Cn
13: return O

Evaluation

Musical accompaniments were generated for each of the following stories: "The Lion and the Mouse," "The Ox and the Frog," "The Princess and the Pea," "The Tortoise and the Hare," and "The Wolf and the Goat."3 For comparison purposes, text-to-speech audio files were generated from the text of each story and left without musical accompaniment (i.e., line 11 of Algorithm 1 becomes simply Ci ← Ai).

3 All audio files used in these experiments are available at http://axon.cs.byu.edu/emotiveMusicGeneration
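The clip-selection step of Algorithm 1 (lines 10-11) reduces to an argmin over clip lengths. The following is a minimal sketch of our own illustrating that step alone, with durations in seconds standing in for the actual generated WAV files:

# Sketch of the accompaniment-matching step (Algorithm 1, lines 10-11).
def pick_accompaniment(sentence_len, clip_lens):
    # Return the index of the generated clip closest in length to the
    # spoken sentence; a longer clip would then be cut to fit.
    return min(range(len(clip_lens)),
               key=lambda j: abs(sentence_len - clip_lens[j]))

# Ten hypothetical "joy" clips for an 8.2-second sentence:
joy_clips = [5.0, 6.5, 7.9, 9.4, 11.0, 12.2, 13.5, 15.1, 16.0, 18.3]
print(pick_accompaniment(8.2, joy_clips))  # 2, i.e. the 7.9-second clip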
Files were also generated in which each sentence was accompanied by music from a randomly selected emotional category, including the possibility of no emotion being selected (i.e., line 10 of Algorithm 1 becomes k = rand(|L'| + 1), and file Wi0 was silence for all i; randomization was set such that k = 0 for approximately one out of three sentences).

Twenty-four subjects were asked to listen to a version of each of the five stories. Subjects were divided into three groups, and versions of the stories were distributed such that each group listened to some stories with no music, some with randomly assigned music, and some with emotionally targeted music. Each version of a given story was played for eight people. After each story, subjects were asked to respond to the following questions on a scale of 1 to 5: "How much did you enjoy listening to the story?", "If music was included, how effectively did the music match the events of the story?", and "Rate the intensity of the emotions (Love, Joy, Surprise, Anger, Sadness, and Fear) that were present in the story." A Cronbach's alpha coefficient (Cronbach 1951) was calculated on the responses of subjects in each group to test for inter-rater reliability. Coefficients for the three groups were α = 0.93, α = 0.87, and α = 0.83. (Values over 0.80 are generally considered indicative of a reasonable level of reliability and, consequently, of a sufficient number of subjects for testing purposes.)

Story | No Music | Random Music | Targeted Music
The Lion and the Mouse | 2.88 | 2.13 | 2.75
The Ox and the Frog | 3.50 | 2.75 | 3.00
The Princess and the Pea | 3.00 | 3.38 | 4.13
The Tortoise and the Hare | 2.75 | 2.75 | 3.88
The Wolf and the Goat | 3.25 | 2.88 | 3.38
Average | 3.08 | 2.78 | 3.43
Table 1: Average responses to the question "How much did you enjoy listening to the story?"

Story | Random Music | Targeted Music
The Lion and the Mouse | 2.88 | 3.38
The Ox and the Frog | 2.13 | 3.25
The Princess and the Pea | 2.50 | 3.88
The Tortoise and the Hare | 2.38 | 3.50
The Wolf and the Goat | 1.75 | 3.25
Average | 2.33 | 3.45
Table 2: Average responses to the question "How effectively did the music match the events of the story?"

Table 1 shows the average ratings for selections in each of the three categories in response to the question "How much did you enjoy listening to the story?" On average, targeted music made the selections significantly more enjoyable and random music made them less so. A Student's t-test reveals the significance level to be p = 0.011 for the difference in these two means. Selections in the "Targeted Music" group were also rated more enjoyable, on average, than selections in the "No Music" group, but the difference in means was not significant. Listeners did rate the version of "The Tortoise and the Hare" with emotionally targeted music as significantly more enjoyable than the "No Music" version (p = 0.001). Table 2 reports the average ratings in response to the question "How effectively did the music match the events of the story?" Not surprisingly, music with targeted emotional content was rated significantly higher in terms of matching the events of the story than randomly generated music (p = 0.003). Table 3 provides the intensity ratings for each of the six emotions considered, averaged over all five stories. Listeners tended to assign higher emotional ratings to selections in the "Random Music" category than they did to selections in the "No Music" category; however, this was not statistically significant.
Average emotional ratings for the selections in the "Targeted Music" category were significantly higher (p = 0.027) than those for selections accompanied by randomly generated music. When directly comparing "Targeted Music" with "No Music", average emotional ratings are again higher for the targeted music, though the difference falls a bit short of statistical significance (p = 0.129).

Emotion | No Music | Random Music | Targeted Music
Love | 1.83 | 1.40 | 1.55
Joy | 2.03 | 2.10 | 2.53
Surprise | 2.63 | 2.50 | 2.75
Anger | 1.48 | 1.60 | 1.55
Sadness | 1.60 | 1.70 | 2.05
Fear | 1.58 | 2.00 | 2.15
Average | 1.85 | 1.88 | 2.10
Table 3: Average intensity of a given emotion for all stories

Emotion | No Music | Random Music | Targeted Music
Love | 1.75 | 1.38 | 1.75
Joy | 2.03 | 2.10 | 2.53
Surprise | 2.67 | 2.88 | 2.75
Anger | 1.56 | 1.50 | 1.56
Sadness | 1.94 | 2.06 | 2.31
Fear | 1.94 | 2.13 | 2.31
Average | 1.98 | 2.01 | 2.20
Table 4: Average intensity of labeled emotions for all stories

Table 4 gives average intensity ratings when only labeled emotions are considered (compare to Table 3). In this analysis, selections in the "Targeted Music" category received higher intensity ratings than selections in both the "No Music" and "Random Music" categories, with both differences very near statistical significance (p = 0.056 and p = 0.066, respectively). Note that the only emotional category in which targeted music does not tie or exceed the other two accompaniment styles in terms of intensity ratings is that of "Surprise." The fact that "Random Music" selections were rated as more surprising than "Targeted Music" selections is not entirely unexpected.

Discussion and Future Work

Regardless of how creatively systems may behave on their own, Csikszentmihalyi (1996) argues that individual actions are insufficient to assign the label of "creative" in and of themselves. As he explains, "...creativity must, in the last analysis, be seen not as something happening within a person but in the relationships within a system." In other words, an individual has to interact with and have an impact on a community in order to be considered truly creative. Adding the ability to label emotions in text allows the generated music to be targeted to a specific project rather than simply existing in a vacuum. In addition to allowing further interaction with the "society" of creative programs, our combination of systems also allows creative works to have a greater impact on humans. Music can have a significant effect on human perception of a story. However, as demonstrated in previous literature and in the results of our study, this impact is most pronounced when the music is well matched to story content. Music generated without regard to the emotional content of the story appears to be less effective both at eliciting emotion and at making a story more enjoyable for listeners.

Future work on this project will involve improving the quality of the generated audio files. Some of the files generated with the text-to-speech program were difficult to understand. A clearer reading, either by a different text-to-speech program or a recording of a human narrator, would likely enhance intelligibility and possibly result in higher enjoyability ratings for the accompanied stories. Future work will also include adding more sophisticated transitions between musical selections in the accompaniment. This may also improve the quality of the final audio files.

Acknowledgments

This material is based upon work that is partially supported by the National Science Foundation under Grant No. IIS-0856089.
2011_13 !2011 A System for Evaluating Novelty in Computer Generated Narratives Rafael Pérez y Pérez, Otoniel Ortiz, Wulfrano Luna, Santiago Negrete, Vicente Castellanos, Eduardo Peñalosa, Rafael Ávila División de Ciencias de la Comunicación y Diseño Universidad Autónoma Metropolitana, Cuajimalpa Av. Constituyentes 1050, México D. F. {rperez, oortiz, wluna, snegrete, vcastellanos, epenalosa, ravila}@correo.cua.uam.mx

Abstract

The outputs of computational creativity systems need to be evaluated in order to gain insight into the creative process. Automatic evaluation is a useful technique in this respect because it allows a large number of tests to be carried out on a system in a uniform and objective way, and produces reports of its behaviour. Furthermore, it provides insights about an essential aspect of the creative process: self-criticism. Novelty, interest and coherence are three main characteristics a creative system must have in order for it to be considered as such. We describe in this paper a system to automatically evaluate novelty in a plot generator for narratives. We discuss its core characteristics and provide some examples.

Introduction

Automatic evaluation is a central topic in computational creativity. Some authors claim that it is impossible to produce computer systems that evaluate their own outputs (Bringsjord and Ferrucci 2000) while other researchers challenge this idea (e.g. Pérez y Pérez & Sharples 2004). Although there have been several discussions and suggestions about how to evaluate the outputs produced by creative computer programs (e.g. Ventura 2008; Colton 2008; Ritchie 2007; Pereira et al. 2005; Pease, Winterstein, and Colton 2001), there is a lack of agreement in the community on how to achieve this goal. We are currently working on plot generation, and as part of our research project we are interested in developing a computer model that evaluates the stories generated by our automatic storyteller. Pérez y Pérez and Sharples suggest that

A computer model might be considered as representing a creative process if it generates knowledge that does not explicitly exist in the original knowledge-base of the system and which is relevant to (i.e. is an important element of) the produced output (Pérez y Pérez and Sharples, 2004).

The authors refer to this type of creativity as computerised creativity (c-creativity). They also claim that a computer-based storyteller must generate narratives that are original, interesting and coherent. Following these authors, in this document we report a system called The Evaluator that automatically evaluates originality aspects of the c-creativity in the narratives produced by our computer model of writing. We assess three characteristics of novelty in the narratives generated by our storyteller: how novel the sequence of actions is; how novel the general structure of the story is; and how novel the use of characters and actions in the story is (we refer to this aspect as repetitive patterns; see below for an explanation). In all cases we compare the plot just produced by the storyteller, from now onwards referred to as the new story, against its knowledge-base. Following the definition of c-creativity, a novel narrative must provide the storyteller with new knowledge that can be used in the future for generating original plots. Thus, we also evaluate how many new knowledge structures are created as a result of the plot-generation process. We combine the results of such evaluations to provide an overall assessment.
In this document we present our first results. We are aware that human evaluation of narratives is far more complex and involves not just novelty but several other characteristics. Nevertheless, there are few implemented systems for automatic evaluation (e.g. Norton, Heath and Ventura 2010). In the same way, our implemented system innovates by considering different dimensions of novelty (c.f. Peinado et al. 2010). The paper is organised as follows. Section 2 explains the general characteristics of the knowledge structures employed in our storyteller and how they are used to evaluate novelty. Section 3 describes how our computer model evaluates narratives. Section 4 provides two examples of narrative evaluation. Section 5 provides the discussion and conclusions of this work.

Knowledge Representation

Our computer-based storyteller employs two files as input to create its knowledge-base: a dictionary of story actions and a set of previous stories. Both files are provided by the user of the system. The dictionary of story-actions includes the names of all actions that can be performed by a character within a narrative, together with a list of preconditions and post conditions for each. The Previous Stories are sequences of story actions that represent well-formed narratives. With this information the system builds its knowledge base. Such a base comprises three knowledge structures: contextual-structures or atoms; the tensional representation; and the concrete representation.

1. Contextual-Structures (also known as atoms). They represent, in terms of emotional links and tensions between characters, potential situations that might happen in the story-world, and have an associated set of possible logical actions to be performed when that situation occurs. For example, an atom might represent the situation or context where a knight meets a hated enemy (the fact that the knight hates the enemy is an example of an emotional link), and it might have associated with it the action "the knight attacks the enemy" as a possible action to be performed. Contextual-structures represent story-world commonsense knowledge. By analogy with Case Based Systems, we can think of our storyteller as a Contextual Based System. Thus, Contextual-structures are the core structures that our storyteller employs during plot generation to progress a story.

2. Story-structure Representation. Our plot-generator has as its basis the classical narrative construction: beginning, conflict, development, climax (or conflict resolution) and ending. We represent this structure employing tensions. Emotional tension is a key element in any short story (see Lehnert 1983 for an early work on this subject). In our storyteller it is assumed that a tension in a short story arises when a character is murdered, when the life of a character is at risk, when the health of a character is at risk (e.g. when a character has been wounded), when a character is made a prisoner, and so on. Each tension has an associated value. Thus, each time an action is executed, the value of the tension accumulated in the tale is updated; this value is stored in a vector called the Tensional Representation. The Tensional Representation records the different values of the tension over time. In this way, the Tensional Representation makes it possible to represent the structure of a story graphically, in terms of the tension produced in it (see examples below). Each previous story has its own Tensional Representation.
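A minimal sketch of how such a Tensional Representation could be accumulated, assuming each action maps to a signed tension delta; the function names and the clamping at zero are illustrative assumptions, not details from the paper:

def tensional_representation(story_actions, tension_delta):
    """Build the vector of accumulated tension values over time.

    story_actions: the story as a list of actions.
    tension_delta: function returning the tension an action adds
    (negative when the action resolves a tension).
    """
    vector, tension = [], 0
    for action in story_actions:
        tension = max(0, tension + tension_delta(action))  # assume tension never drops below zero
        vector.append(tension)
    return vector

The highest value in the resulting vector marks the climax, which is how the structure comparison described below aligns two stories.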
The storyteller employs all this information as a guide to develop an adequate story-structure during plot generation.

3. Concrete-Representation. It is formed by a copy of the dictionary of story-actions and the set of Previous Stories. The system uses this information to break impasses and sometimes to instantiate characters.

In summary, the storyteller uses the following information during plot generation: a dictionary of story-actions, a set of previous stories (a set of sequences of actions), its corresponding set of story structures (Tensional Representations) and several contextual-structures. We are interested in analysing whether the storyteller is able to produce novel material that extends some of this content with useful information. As mentioned earlier, the previous stories are written by humans.1 Previous Stories mirror cultural and social characteristics that end up being encoded within the storyteller's knowledge-base. For example, let us imagine that in all previous stories female characters never perform violent actions; or that all previous stories include a substantial number of violent actions; and so on. We are interested in evaluating whether our storyteller is capable of producing stories that somehow move away from some of those recurrent patterns (stereotypes) found in the previous stories. Thus, the steps to evaluate a narrative are:

1. The storyteller generates a new plot.
2. The Evaluator compares the new plot with all the previous stories. The goal is to see how novel the sequences of actions of the new story are compared to all the previous stories.
3. The Evaluator compares the story structure (the Tensional Representation) of the new plot with the story structure of all previous stories, to measure how novel it is.
4. The Evaluator verifies whether at least one recurrent pattern in the new story is novel compared to those employed in all the previous stories.
5. The new plot is added to the Previous Stories. The Evaluator compares the knowledge-base of the storyteller before and after this operation is performed. The purpose of this is to estimate how many new contextual-structures are added to the knowledge-base as a result of adding the new plot to the set of previous stories.

Description of The Evaluator

The Evaluator is comprised of four modules: 1) Evaluation of Sequences; 2) Evaluation of Story-Structure; 3) Evaluation of Repetitive Patterns; 4) Evaluation of New Contextual-Structures.

1. Evaluation of sequences. This module analyses how novel the sequence of actions that makes up the new story is. To analyse its novelty, the new story is split into pairs of actions. For example, let us imagine that the new story 1 consists of the following sequence of actions: Action 1, Action 2, Action 3, Action 4, and so on (each action includes the characters that participate in it and the action itself). Thus, the system creates the following pairs: [Action 1, Action 2], [Action 2, Action 3], [Action 3, Action 4], and so on. The program takes each pair and tries to find one like it in the Previous Stories. The system also has the option of searching for what we have called a distance pair. Let us imagine that the first pair of actions in the new story is: [Enemy kidnapped Princess, Jaguar Knight rescued Princess]. And that in the Previous Stories we have the following sequence: Enemy kidnapped Princess, Princess insulted Enemy, Jaguar Knight rescued Princess.
As we can observe, although in the Previous Stories the insulting action is located between the kidnapped and rescued actions, the essence of this pair of actions is kept (the antagonist kidnaps the princess and then the hero rescues her). In order to detect these kinds of situations, The Evaluator is able to find pairs of actions in the previous stories that have one, two, or more in-between actions. We refer to the number of in-between actions that separate a pair of actions as the Separation-Distance (SD). That is, in the previous stories there might be a separation distance between the first and the second action that form the pair. For the previous example, the separation distance value is 1.

1 Currently we are testing an Internet application that will allow people around the world to contribute their own previous stories to feed our plot-generator system.

2. Evaluation of the story-structure. The structures of the new story and the previous stories are represented as graphs of tension. The Evaluator compares the structure of the new story against all the previous stories to see how novel it is. The process works as follows. The Evaluator compares point by point (action by action) the difference between the Tensional Representation of the new story and that of the first of the previous stories. The highest peak in the graph represents the climax of the story. Because stories might have different lengths, the system shifts the graphs horizontally in such a way that the climaxes of both stories coincide at the same position on the horizontal axis. If the lengths of the new story and the previous story are different, the system eliminates the extra actions of the longer story. In this way both stories have the same length. The process reports how many points are equal (have the same value of tension) and how many points are dissimilar. The system includes a modifiable parameter, known as Tolerance, which defines when two points are considered equal. Thus, point-A is considered equal to point-B if point-A = point-B ± Tolerance. By default, the tolerance is set to ±20. Then, the system calculates the ratio between the number of dissimilar points and the total number of actions in the story. This number is known as the Story-Structure Novelty (STN). The same process is repeated for all previous stories. Finally, The Evaluator calculates the average of all Story-Structure Novelty values to obtain a final result.

3. Evaluation of repetitive patterns. The Evaluator analyses the previous stories and the new story to obtain information about recurrent patterns. The current version of the system searches for patterns related to: 1) the most regular types of actions within a story; 2) the reincorporation of characters. Regarding the most regular types of actions, we have grouped all items in the dictionary of story-actions into four categories: helpful actions (e.g. A cured B, A rescued B); harmful actions (e.g. A wounded B, A killed B); passionate actions (e.g. A loves B, A hates B); and change of location actions (e.g. A went to the City). The system calculates what percentage of actions in each story belongs to each category; the highest percentage is employed as the reference for comparison. Then, The Evaluator compares the new story against all previous stories to calculate how similar they are.
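A minimal sketch of the pair matching described in module 1 above, assuming stories are lists of action strings; the function names and the percentage-based novelty score are illustrative assumptions:

def pair_occurs(pair, previous_story, max_sd=0):
    """True if `pair` occurs in `previous_story` with at most `max_sd`
    in-between actions (the Separation-Distance)."""
    first, second = pair
    for i, action in enumerate(previous_story):
        # look ahead at most max_sd + 1 positions for the second action
        if action == first and second in previous_story[i + 1 : i + 2 + max_sd]:
            return True
    return False

def sequence_novelty(new_story, previous_stories, max_sd=0):
    """Percentage of consecutive action pairs in the new story that are
    not found in any previous story."""
    pairs = list(zip(new_story, new_story[1:]))
    missing = sum(
        not any(pair_occurs(p, ps, max_sd) for ps in previous_stories)
        for p in pairs
    )
    return 100 * missing / len(pairs)

With max_sd = 1, the [kidnapped, rescued] pair above would be found despite the intervening insult.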
If more than 50% of the previous stories share the same classification, the new story is evaluated as standard; if 25% to 49% of the previous stories share the same classification, the new story is evaluated as innovative; and if less than 25% of the previous stories share the same classification, the new story is evaluated as novel. All percentages can be modified by the user of the system. This is our first approach to automatically identifying the theme of a story.

Regarding the reincorporation of characters, we are interested in analysing whether one or more characters are reincorporated in a story. This is a resource that Johnstone (1999) employs in improvisation and that helps to develop more complex plots (a set of characters are introduced at some point in the story; then, one or more of them are excluded from the plot; later on they reappear without the narrative losing coherence). This is our first approach to measuring the complexity of a narrative in terms of the number of reincorporated characters and the number of actions it takes to reincorporate such characters. We refer to this number of actions as the Distance of Reincorporation (DR). So, if a character is introduced in the story and reappears after 5 actions have been performed, the DR is equal to 5. We consider that characters with higher DR are more difficult to reincorporate without losing coherence in the story than those with lower values. In the same way, we consider that the more characters that are reintroduced in a story without losing coherence, the more complex the story is. Thus, we want to study how novel the use of reincorporated characters in the new story is. The Evaluator calculates three values: novelty in the use of reincorporated characters, novelty in the number of reincorporated characters, and novelty of DR.

The use of reincorporated characters is calculated employing Table 1. The first column indicates the percentage of previous stories that reincorporate characters, the second column indicates whether the new story reincorporates characters, and the third column shows the evaluation assigned to the new story.

Reincorporation in the    Reincorporation in    Evaluation
Previous Stories          the new story
Less than 25%             No                    Standard
Less than 25%             Yes                   Novel
25%-50%                   No                    Standard
25%-50%                   Yes                   Innovative
More than 50%             No                    Below standard
More than 50%             Yes                   Standard

Table 1. Novelty in the use of reincorporated characters.

Then the system obtains the number of reincorporated characters in the new story and calculates the percentage of previous stories that have the same or a higher number of reincorporated characters. We refer to this as the percentage of reincorporated characters. So, the novelty in the number of reincorporated characters = 100 − percentage of reincorporated characters. Likewise, the system calculates the percentage of previous stories whose highest value of DR is equal to or higher than the highest DR in the new story. We refer to this as the percentage of DR. So, the Novelty of DR = 100 − percentage of DR.

4. Evaluation of Novel Contextual-Structures. To perform this process the system requires two knowledge bases: KB1 and KB2. KB1 contains the knowledge structures created from the original file of Previous Stories; KB2 contains the knowledge structures created after the new story is incorporated as part of the Previous Stories. Then, The Evaluator compares both knowledge bases to calculate how many new contextual-structures were included in KB2.
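A compact sketch of the three module-3 measures just described; the thresholds come from Table 1, while the names and list-based inputs are assumptions:

def use_novelty(prev_pct, new_reincorporates):
    """Table 1 lookup: novelty in the use of reincorporated characters.

    prev_pct: percentage of previous stories that reincorporate characters."""
    if prev_pct < 25:
        return "Novel" if new_reincorporates else "Standard"
    if prev_pct <= 50:
        return "Innovative" if new_reincorporates else "Standard"
    return "Standard" if new_reincorporates else "Below standard"

def percentage_novelty(new_value, previous_values):
    """Shared form of the two remaining measures: 100 minus the percentage
    of previous stories whose value equals or exceeds the new story's."""
    pct = 100 * sum(v >= new_value for v in previous_values) / len(previous_values)
    return 100 - pct

percentage_novelty covers both the number of reincorporated characters and the highest DR.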
The system copies the set of new structures into a knowledge base called KB3. That is, KB3 = KB2 − KB1. Then, The Evaluator performs what we refer to as the approximated comparison. Its purpose is to identify and eliminate those structures in KB3 that are alike, to a given percentage (set by the user), to at least one structure in KB1. In this way, KB3 ends up containing only new contextual-structures that are not similar (up to the given percentage) to any structure in KB1. We refer to the final number of structures in KB3 (after performing the approximated comparison) as the KB3-value. The Novelty of the Contextual-Structures (NCS) is defined as the ratio between the KB3-value and the number of new contextual-structures:

NCS = KB3-value / number of new contextual-structures

In this way, we can know how many new contextual-structures are created, and how novel they are with respect to the structures in the original knowledge base KB1.

Examples of Evaluation

We tested our system by evaluating two stories: new story 1 and new story 2. In both cases we employed the same set of Previous Stories, comprised of seven narratives.

Example 1. The new story 1 is the outcome of two storytellers working together as a team (see Pérez y Pérez et al. 2010). For this evaluation we employ the knowledge base of one of the agents.

New story 1.
jaguar knight is introduced in the story
princess is introduced in the story
hunter is introduced in the story
hunter tried to hug and kiss jaguar knight
jaguar knight decided to exile hunter
hunter went back to Texcoco Lake
hunter wounded jaguar knight
princess cured jaguar knight
enemy kidnapped princess
enemy got intensely jealous of princess
enemy attacked princess
jaguar knight looked for and found enemy
jaguar knight had an accident
enemy decided to sacrifice jaguar knight
hunter found by accident jaguar knight
hunter killed jaguar knight
hunter committed suicide

1. Evaluation of sequences. We compared the new story 1 against all seven previous stories. We ran the process with values for the separation distance ranging from zero to four. In no case did we find any pair of actions repeated in the previous stories. This is part of the report generated by The Evaluator:

Report
Separation-distance = 4
Total of Pairs Found in the File of Previous Stories: 0%
Novelty of the Sequences of Actions: 100%

2. Evaluation of Story-Structure Novelty (STN). The system generated the following report:

Tolerance = 20
Story[1] Coincidences: 6  Differences: 7  STN: 54%
Story[2] Coincidences: 3  Differences: 7  STN: 70%
Story[3] Coincidences: 2  Differences: 9  STN: 82%
Story[4] Coincidences: 0  Differences: 6  STN: 100%
Story[5] Coincidences: 2  Differences: 9  STN: 82%
Story[6] Coincidences: 2  Differences: 8  STN: 80%
Story[7] Coincidences: 1  Differences: 8  STN: 89%
Average Story-Structure Novelty: 79%

The structure of the previous story 1 was the most similar to the structure of the new story 1; therefore, it has the lowest STN value, 54%. On the other hand, the structure of the previous story 4 was the most different from the structure of the new story 1; therefore, it had the highest STN, 100%.

Figure 1. Three story-structures.

Figure 1 shows a comparison of the structures of the previous story 6 (PS6), the previous story 7 (PS7) and the new story 1. The comparison only takes place between actions 9 and 16. The three graphs have been aligned in such a way that their climaxes are located at action 15.
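A minimal sketch of the module-2 comparison, assuming Tensional Representations are lists of integers; the zero-padding used for the climax alignment and the end-trimming are assumptions about details the text leaves open:

def story_structure_novelty(new_tr, prev_tr, tolerance=20):
    """Point-by-point STN between two tension vectors.

    Shifts the vectors so their climaxes (highest peaks) coincide, trims
    the longer story's extra actions, then returns the percentage of
    dissimilar points."""
    shift = new_tr.index(max(new_tr)) - prev_tr.index(max(prev_tr))
    if shift > 0:
        prev_tr = [0] * shift + prev_tr    # pad so the climaxes line up
    else:
        new_tr = [0] * -shift + new_tr
    n = min(len(new_tr), len(prev_tr))     # drop the extra actions
    dissimilar = sum(abs(a - b) > tolerance
                     for a, b in zip(new_tr[:n], prev_tr[:n]))
    return 100 * dissimilar / n

def average_stn(new_tr, previous_trs, tolerance=20):
    """Average STN of the new story against all previous stories."""
    return sum(story_structure_novelty(new_tr, p, tolerance)
               for p in previous_trs) / len(previous_trs)

Two points count as equal exactly when they differ by no more than the Tolerance, matching the point-A = point-B ± Tolerance rule above.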
The Evaluator calculated that the average value for the STN was 79%.

3. Evaluation of patterns. Table 2 shows the most regular types of actions employed in each story. For example, 54.55% of actions in the previous story one (PS1) belonged to the classification harmful; 50.00% of actions in the previous story two (PS2) belonged to the classification change of location; and so on. The most regular type of action employed in the new story 1 (NS1) belonged to the classification harmful. That is, this was a violent story. Four of the seven previous stories shared the same classification and had similar percentage values. Therefore, the novelty of the actions used in the new story 1 was classified as standard.

Class of Action       PS1     PS2     PS3     PS4     PS5     PS6     PS7     NS1     NS2
Change of location    9.09%   50.00%  26.67%  50.00%  44.44%  37.50%  40.00%  28.57%  14.29%
Passionate Actions    36.36%  12.50%  26.67%  10.00%  22.22%  25.00%  0.00%   7.15%   14.29%
Harmful Actions       54.55%  25.00%  40.00%  20.00%  22.22%  37.50%  60.00%  57.14%  71.42%
Helpful Actions       0.00%   12.50%  6.67%   20.00%  11.11%  0.00%   0.00%   7.14%   0.00%

Table 2. Most regular types of actions for each story.

Table 3 shows those characters that were reintroduced at least once in any story, together with their corresponding distance of reincorporation.

Character       s1   s2   s3   s4   s5   s6   s7   Sto1  Sto2
Eagle Knight    -    -    4    -    -    -    -    -     -
Hunter          -    -    -    -    -    -    -    8     -
Jaguar Knight   -    -    -    -    -    -    -    4     -
Lady            -    -    11   -    5    -    -    -     -
Prince          -    -    -    5    -    -    -    -     -
Princess        -    -    6    -    -    -    7    -     -

Table 3. Reincorporated characters and their DR.

In four of the seven previous stories it was possible to find reincorporated characters. However, only one of those stories reincorporated more than one character. The new story 1 reincorporated two characters. Furthermore, this story had the second longest distance of reincorporation. Thus, the novelty in the number of reincorporated characters was set to 86% and the novelty of the DR was set to 86%.

4. Evaluation of novel contextual structures. After comparing KB1 and KB2 the system found ten new contextual-structures. For the purpose of comparison, we ran the approximated-comparison process with 19 different percentage values. For reasons of space we only show eight. This is part of the report produced by The Evaluator:

100% SIMILAR: [0]/[10]   0.00%
75%  SIMILAR: [1]/[10]  10.00%
60%  SIMILAR: [5]/[10]  50.00%
35%  SIMILAR: [7]/[10]  70.00%
25%  SIMILAR: [9]/[10]  90.00%
20%  SIMILAR: [10]/[10] 100.00%
15%  SIMILAR: [10]/[10] 100.00%
10%  SIMILAR: [10]/[10] 100.00%

There are no structures that are 100% equal. If we set the system to find structures that are 75% alike, only one of the ten new contextual-structures is equal or equivalent to at least one structure in KB1. Only when we set the percentage of similarity to 60% are half of the new contextual-structures equal or equivalent to at least one structure in KB1. For this exercise we decided to calculate the value of the Novelty of the Contextual-Structures with the percentage of similarity set to 75%. Thus, NCS = 9/10 = 0.90.

In summary, for the new story 1 we got the following values:

Novelty of the Sequences of Actions: 100%
Average Story-Structure Novelty: 79%
Patterns:
  Novelty in the use of regular types of actions: Standard
  Novelty in the use of reincorporated characters: Standard
  Novelty in the number of reincorporated characters: 86%
  Novelty of DR: 86%
Novelty of the Contextual-Structures: 90%

Example 2.
This story was produced by one storyteller.

New story 2.
Jaguar knight is introduced in the story
Enemy is introduced in the story
Enemy got jealous of jaguar knight
Enemy attacked jaguar knight
Jaguar knight fought enemy
Enemy killed jaguar knight
Enemy laugh at enemy
Enemy exiled enemy
Enemy had an accident

1. Evaluation of sequences. As in the case of story 1, we ran the process with values for the Separation-Distance ranging from zero to four. In no case did we find any pair of actions repeated in the previous stories. Thus, the Novelty of the Sequences of Actions is 100%. The report is omitted for space reasons.

2. Evaluation of the story structure. Following the same process as for the new story 1, we got an average value of 65% novelty in the story structure. The report is omitted for space reasons.

3. Evaluation of patterns. Table 2 shows that the most regular type of action employed in the new story 2 (NS2) belonged to the classification harmful. Four of the seven previous stories shared the same classification although, in contrast with all previous stories, the new story 2 had the highest percentage of harmful actions. Nevertheless, the new story 2 was classified as standard. Table 3 shows that the new story 2 did not reincorporate characters. Therefore, it was evaluated as below standard. As a consequence, the novelty in the number of reincorporated characters and the novelty of the DR were set to 0%.

4. Evaluation of novel contextual structures. After comparing KB1 and KB2 the system found seven new contextual-structures. For the purpose of comparison, we ran the approximated-comparison process with 19 different percentage values; for reasons of space we omit the report. There are no structures that are 100% equal. If we set the system to find structures that are 55% alike, only one of the seven new contextual-structures is equal or equivalent to at least one structure in KB1. Only when we set the percentage of similarity to 25% are more than half of the new contextual-structures equal or equivalent to at least one structure in KB1. For this exercise we decided to calculate the value of the Novelty of the Contextual-Structures with the percentage of similarity set to 75%. Thus, NCS = 7/7 = 1 at 75%.

Thus, for the new story 2 we got the following values:

Novelty of the Sequences of Actions: 100%
Average Story-Structure Novelty: 65%
Patterns:
  Novelty in the use of regular types of actions: Standard
  Novelty in the use of reincorporated characters: Below Standard
  Novelty in the number of reincorporated characters: 0%
  Novelty of DR: 0%
Novelty of the Contextual-Structures: 100%

Discussion

This paper reports on the implementation of a computer system to automatically evaluate the novelty aspect of c-creativity. Following Pérez y Pérez and Sharples (2004), c-creativity has to do with the generation of material that is novel with respect to the agent's knowledge base and that, as a consequence, generates new knowledge-structures. These authors distinguish two different types of knowledge: knowledge about the story-structure and knowledge about the content (the sequence of actions). In this work we also consider commonsense or contextual knowledge and what we refer to as patterns knowledge. The sequences of actions in the new stories 1 and 2 are unique with respect to the sequences of actions found in the previous stories. Thus, the storyteller is capable of producing novel sequences of actions. The evaluation of the structural novelty of the two new stories yielded values of 79% and 65%, respectively.
That is, the system is able to diverge from the structures found in the previous stories. The results of our tests also show that new contextual knowledge structures, the core information employed during plot generation, are built as a result of adding the new story to the file of previous stories. Thus, The Evaluator shows that our storyteller is able to generate novel knowledge structures in at least three respects. The results obtained from the analyses of recurrent patterns are not conclusive. We need to carry out more tests to assess whether our system can contribute to measuring some aspects related to the complexity of a story; something similar applies to the automatic detection of the theme of a story. Nevertheless, the statistical information that The Evaluator generates shows that the storyteller is able to generate narratives that display a certain degree of pattern originality.

Automatic evaluation is a key component of the overall assessment of a creative system because it provides unbiased information on the system's behaviour. This feedback also supplies insights that help us improve different aspects of our computer model of creative writing. The system gives an indication of how novel the stories are, which helps us adjust the various parameters of the system to carry out new experiments. In this way, The Evaluator speeds up the experimentation cycle. Finally, we are also interested in comparing the results that The Evaluator generates against human evaluation. Specific creative patterns could be sought, similar to repetition-break (Loewenstein and Heath 2009), to carry out a more specialised evaluation of the knowledge bases.

2011_14 !2011 Fractal Analogies: Preliminary Results from the Raven's Test of Intelligence Keith McGreggor, Maithilee Kunda, and Ashok Goel Design & Intelligence Laboratory, School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA {keith.mcgreggor, mkunda}@gatech.edu, goel@cc.gatech.edu

Abstract

The geometric analogy problems of the Raven's Progressive Matrices tests of intelligence appear to require many of the information-processing elements that form the basis of computational theories of general creativity: imagistic representations and reasoning; pattern detection and abstraction; analogical mapping, transfer and instantiation, and so on. In our method of addressing the test, an image is encoded as fractals, capturing its inherent self-similarity. Herein we present preliminary results from using the fractal technique on all 60 problems from the Standard Progressive Matrices version of the Raven's test.

Psychometrics and Creativity

Psychometrics entails the theory and technique of quantitative measurement of intelligence, including factors such as personality, aptitude, creativity, and academic achievement. We propose that some psychometric tests of intelligence could also be good tests of some aspects of creativity. Consider problems on the Raven's Standard Progressive Matrices test. The task is to pick, as the best match, one of several choices for insertion into the empty element of the matrix.
Addressing this problem appears to engage many of the information-processing elements that form the basis of computational theories of general creativity (e.g., Casakin & Goldschmidt 1999; Clement 1988; 2008; Croft & Thagard 2002; Davies, Nersessian & Goel 2005; Goel 1997; Goldschmidt 2001; Hofstadter 1979, 1995; Holyoak & Thagard 1996; Nersessian & Chandrasekharan 2009; Yaner & Goel 2008): imagistic representations and reasoning; pattern detection and abstraction; analogical mapping, transfer and instantiation, etc. Clement (2008) and Nersessian (2008), for example, describe analogical reasoning using imagistic representations as a fundamental process of creative problem solving in science; Goldschmidt (2001) and Hofstadter & McGraw (1995) make a similar point about visual analogies in design creativity.

The Raven's Progressive Matrices tests (Raven, Raven, & Court 1998) are a collection of standardized intelligence tests that consist of visually presented, geometric analogy problems in which a matrix of geometric figures is presented with one entry missing, and the correct missing entry must be selected from a set of answer choices. The Standard Progressive Matrices (SPM) consists of 60 problems divided into five sets of 12 problems each (sets A, B, C, D & E), roughly increasing in difficulty both within and across sets. As far as we know, all extant computational theories of the Raven's and other similar tests involving geometric analogies rely on the extraction and use of propositional representations. In contrast, at last year's conference on computational creativity, we described a proposal to use fractal encodings to address the Raven's test (McGreggor, Kunda & Goel 2010). Our technique is grounded in the mathematical theory of fractal image compression (Barnsley & Hurd 1992) and of general fractal representations (Mandelbrot 1982).

The main goal of our work is to evaluate whether the Raven's Standard Progressive Matrices test could be solved using purely visual representations, without converting the image inputs into propositional descriptions during any part of the reasoning process. We use fractal representations, which encode transformations between images, as our primary non-propositional representation. Our system operates on inputs that have been scanned directly from a hard copy of the Raven's test and contain the usual rough alignments and pixel-level artifacts. Problem entries are converted to fractal representations, and only relationships among these fractal representations are used to choose the best answer. We stress that at no point are inputs converted into any kind of propositional form (e.g. shapes, colors, lines, edges, or any other visually segmented entity); only the raw RGB pixel values are used.

Fractal Representations and Features

For visual analogy problems of the form A : B :: C : ?, each of these analogy elements is a single image. Some unknown transformation T can be said to transform image A into image B, and likewise, some unknown transformation T′ transforms image C into the unknown answer image. The central analogy in the problem may then be imagined as requiring that T is analogous to T′. Using fractal representations, we shall define the most analogous transform T′ as that which shares the largest number of fractal features with the original transform T. To find analogous transformations for A : B :: C : ?, our algorithm first visits memory to retrieve a set of candidate solution images X to form candidate solution pairs of the form ⟨C, X⟩.
For each candidate pair of images, we generate the fractal encoding of the transformation of candidate image X in terms of image C. From this encoding we generate the fractal features for the transform. We store each transform in a memory system, indexed by and recallable via each associated fractal feature. To determine which candidate image results in the most analogous transform to the original problem transform T, we first fractally encode the relationship between the two images A and B. Next, using each fractal feature associated with that encoding, we retrieve from the memory system those transforms previously stored as correlates of that feature (if any). Considering the frequency of transforms recalled, for all correlated features in the target transform, we then calculate a measure of similarity.

Determining Similarity

The metric we employ reflects similarity as a comparison of the number of fractal features shared between candidate pairs, taken in contrast to the joint number of fractal features found in each pair member (Tversky 1977). In our present implementation, the measure of similarity S between the candidate transform T′ and the target transform T is calculated using the ratio model. This calculation determines the similarity between unique pairs of transforms. However, the Raven's test, even in its simplest form, poses an additional problem in that many such pairs may be formed.

Reconciling Multiple Analogical Relationships

In 2x2 Raven's problems, there are two apparent relationships for which analogical similarity must be calculated: the horizontal relationship and the vertical relationship. Closer examination of such problems, however, reveals two additional relationships which must be shown to hold as well: the two diagonal relationships. Furthermore, not only must the "forward" version of each of these relationships be considered but also the "backward" or inverse version. Therefore, for a 2x2 Raven's problem, we must determine eight separate measures of similarity for each of the possible candidate solutions. The 3x3 matrix problems from the SPM introduce not only more pairs for possible relationships but also the possibility that elements or subelements within the images exhibit periodicity. Predictably, the number of potential analogical relationships blooms. At present, we consider 48 of these relationships concurrently.

Relationship Space and Maximal Similarity

For each candidate solution, we consider the similarity of each potential analogical relationship as a value upon an axis in a large "relationship space." To specify the overall fit of a candidate solution, we construct a vector in this multidimensional relationship space and determine its Euclidean length. The candidate with the longest vector is chosen as the solution to the problem.

Results on the Raven's Test

To create inputs for the fractal algorithm, each page from the SPM test booklet was scanned, and the resulting grayscale images were rotated to roughly correct for page alignment issues. Then, the images were sliced up to create separate image files for each entry in the problem matrix and for each answer choice. These separate images were the inputs to the fractal algorithm for each problem. The fractal algorithm attempted to solve each SPM problem independently, i.e. no information was carried over from problem to problem. Our fractal algorithm obtained a total score of 32 correct out of 60 problems.
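A minimal sketch of the similarity and selection machinery described above, assuming fractal features can be collected into Python sets; Tversky's ratio model is named in the paper, but the weights, function names and data layout here are assumptions:

import math

def ratio_model_similarity(target, candidate, alpha=1.0, beta=1.0):
    """Tversky's ratio model over two sets of fractal features."""
    common = len(target & candidate)
    return common / (common
                     + alpha * len(target - candidate)
                     + beta * len(candidate - target))

def best_candidate(similarities):
    """Pick the answer whose vector in 'relationship space' is longest.

    similarities: dict mapping each candidate answer to the list of
    similarity values, one per analogical relationship considered
    (8 for 2x2 problems, 48 at present for 3x3 problems)."""
    # n-dimensional math.hypot requires Python 3.8+
    return max(similarities, key=lambda c: math.hypot(*similarities[c]))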
Figure 1a illustrates the performance of the algorithm on all 60 problems according to test problem order; Figure 1b shows the performance with problems ordered by difficulty, as determined by normative data (Raven, Raven, & Court 1998). There are three main assessments that can be made following the administration of the SPM to an individual: the total score, which is given simply as the number of correct answers; an estimate of consistency, which is obtained by comparing the given score distribution to the expected distribution for that particular total score; and the percentile range into which the score falls, for a given age and nationality (Raven et al. 1998).

Figure 1. Fractal performance on SPM problems ordered by test ordering (a, top) and difficulty (b, bottom). Correct answers are in bold.

Figure 2. SPM scores ordered by set for fractal algorithm (dark) and human norms for given total score (light).

The score breakdown by set, along with the expected score composition for a total score of 32, are shown in Figure 2. A score is deemed "consistent" if the difference between actual and expected scores for any given set is no more than ±2 (Raven et al. 1998). The score differences for the fractal algorithm on each set were no more than ±1. This score pattern illustrates that the results achieved by the algorithm fall well within typical human consistency norms on the SPM. Using norms from the United States, we find that a total score of 32 corresponds to the 50th percentile for children around 9-10 years old (Raven, Raven, & Court 1998).

Conclusions

As mentioned earlier, at ICCC-10 we presented a proposal to use fractal encodings to address the Raven's test. Here, we have described preliminary results from this work. Many problems on these intelligence tests appear to engage cognitive processes that form the building blocks of human creativity, e.g. visual analogy. Our fractal technique works directly on visual inputs, without any need to extract propositional representations from them. The performance of our program would place it at the 50th percentile for 9-10 year olds. We believe that this technique can be enhanced significantly and we anticipate improved results in the near future.

Fractal representations are analogical representations in that they have a structural correspondence to the images they represent: the collage theorem (Barnsley & Hurd, 1992) provides a rigorous characterization of this structural isomorphism. Similarity and analogy have often been viewed as central to theories of intelligence. Hofstadter (1995), among others, has posited that analogy forms the core of human cognition. Fractal representations add the powerful idea of self-similarity. While the use of fractal representations is central to our technique, the emphasis upon visual recall in our solution, afforded by features derived from those representations, is also important. We take the position that placing candidate transformations into memory, indexed via fractal features, affords a new method of discovering image similarity. That images, encoded either in terms of themselves or of other images, may be indexed and retrieved without regard to shape, geometry, or symbol suggests that the fractal representation bears further exploration, not only as regards solutions to problems akin to the RPM, but also to those of general visual memory and recall.
2011_15 !2011 Computational Creativity Theory: Inspirations behind the FACE and the IDEA models Alison Pease1 and Simon Colton2 1 School of Informatics, University of Edinburgh, UK. 2 Computational Creativity Group, Department of Computing, Imperial College, London, UK. ccg.doc.ic.ac.uk

Abstract

We introduce two descriptive models for evaluating creative software; the FACE model, which describes creative acts performed by software in terms of tuples of generative acts, and the IDEA model, which describes how such creative acts can have an impact upon an audience. We show how these models have been inspired both by ideas in the psychology of creativity and by an analysis of acts of human creativity.

Introduction

The Computational Creativity (CC) community needs concrete measures of evaluation to enable us to make objective, falsifiable claims about progress made from one version of a program to another, or for comparing and contrasting different software systems for the same creative task. There are two notions of evaluation in CC: (i) judgements which determine whether an idea or artefact is valuable, and (ii) judgements to determine whether a system is acting creatively. Our thesis is that ideas from the psychology of creativity can help us to develop such measures, and we demonstrate this via our two models, which form a framework to aid us in the development and evaluation of creative software. Note that while we draw on psychology and examples of human creativity for inspiration, our goal lies squarely within CC - to provide a means of formalising some aspects of computational creativity - and we by no means claim that our models would be appropriate for evaluating human creativity. The main test and value of such models lies in their applicability to CC systems: we defer such application to a sister paper [4].

The FACE model

The FACE model assumes eight kinds of generative acts, in which both processes (p) and artefacts (g) are produced:

F^p: a method for generating framing information
F^g: an item of framing information for A/C/E p/g
A^p: a method for generating aesthetic measures
A^g: an aesthetic measure for process or product
C^p: a method for generating concepts
C^g: a concept
E^p: a method for generating expressions of a concept
E^g: an expression of a concept

Any particular creative episode can be expressed in terms of at least one of these components (it may well be the case that not all of the components will be present). In order to cover as many creative acts as possible, we assume only that there must be something new created for the question of creativity to arise. This could be very small: a brush stroke by an artist, an inference step by a mathematician, a single note in a piece of music. Thus we avoid the thorny issue of where an act of creation starts, and important questions about where on the scale from basic to sophisticated an act must lie, to be judged creative, can be postponed. This position is in line with the argument by Cardoso et al. that "To achieve human levels of computational creativity, we do not necessarily need to start big, at the level of whole poems, songs, stories or paintings; we are more likely to succeed if we are allowed to start small, at the level of simple but creative phrases, fragments and images" [3, p. 17].
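Purely as an illustration (this is not from the paper), the eight generative acts suggest a record type in which any component may be absent; the field names mirror the notation above:

from dataclasses import dataclass
from typing import Optional

@dataclass
class FaceTuple:
    """One creative episode, described by whichever generative acts occurred."""
    F_p: Optional[str] = None  # method for generating framing information
    F_g: Optional[str] = None  # an item of framing information
    A_p: Optional[str] = None  # method for generating aesthetic measures
    A_g: Optional[str] = None  # an aesthetic measure
    C_p: Optional[str] = None  # method for generating concepts
    C_g: Optional[str] = None  # a concept
    E_p: Optional[str] = None  # method for generating expressions
    E_g: Optional[str] = None  # an expression of a concept

# The prime-numbers episode discussed next, reduced to three components:
primes = FaceTuple(
    C_g="the concept of a prime number",
    E_p="the Sieve of Eratosthenes",
    E_g="numbers evaluated as prime or not",
)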
The IDEA model

In order to assess the impact of the creative acts performed by software, we assume an (I)terative (D)evelopment-(E)xecution-(A)ppreciation cycle within which software is engineered and its behaviour is exposed to an audience. The IDEA descriptive model recognises that in some creative tasks, the invention of measures of value forms a part of the creative act. Hence usual models of value are abandoned in favour of describing the impact that creative acts can have. The model introduces some simplifying assumptions about the nature of (i) the software development process, (ii) the background information known in general, known by the programmer, and given to the software, and (iii) the nature of the audience who assess the impact of the creative acts performed by the software. Using these simplifications, the model comprises two branches. The first of these is a descriptive model for the stage of development that software is in, in terms of how close its creative acts are to those performed by the programmer in engineering the software, and how close they are to those that have been performed by community members in the wider context in which the software works. The second branch uses the notion of an ideal audience who can perfectly assess both their personal hedonistic value of a creative act, and the time they are prepared to spend interpreting the act and its results. These measures are used to capture certain notions that are usually associated with impact. In particular, notions such as shock (when audience members tend to take an instant dislike to a creative act), acquired taste (when audience members tend to spend a long time in eventually positively assessing a creative act) and opinion splitting (when a creative act is particularly divisive) are formally developed.

The creation of prime numbers

The many aspects which contributed to the creation and appreciation of the concept of prime number have inspired our development of the FACE model. These include inventing the concept itself and inventing ways of developing new concepts in number theory. Along with the concept definition we find examples, or expressions, of the concept, in the form of numbers which are either known to be prime (eg. 2, 3, 5, 7), known not to be prime (eg. 4, 6, 8, 9),1 or have an unknown status (eg. 2^13466917 + 1), and algorithms for determining whether a given number is prime have been developed, such as the Sieve of Eratosthenes, as well as ways of generating further primes. The concept of primes was embedded into number theory by developing specialisations (eg. Cuban, happy, illegal and lucky primes) and theorems and conjectures (eg. the fundamental theorem of arithmetic); and into the wider field of mathematics as uses for primes have been found in group and ring theory, cryptography, and so on, as well as in its application to other areas of human experience (eg. Messiaen used prime numbers to create ametrical music). In addition to actually developing such concepts, theorems and applications, we can create techniques for developing them, eg. the use of analogical reasoning to create analogous concepts in other areas of mathematics. We also evaluate the concept (primes are considered to be the "basic building blocks" of the natural numbers) and suggest new ways in which to make such a judgement.

1 Whether 1 should be considered prime is still debated today.
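Since the Sieve of Eratosthenes is named above as an expression-generating method (an E^p in the terms of the model), here is a standard sketch of it; the code is illustrative and not taken from the paper:

def sieve_of_eratosthenes(n):
    """Return all primes up to n by crossing out multiples of each prime."""
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]          # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i, prime in enumerate(is_prime) if prime]

# E^g: expressions of the concept, evaluated as prime
assert sieve_of_eratosthenes(30) == [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]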
We break this complex story down into the FACE components below:

FACE Prime numbers
F^p: Ways in which theorems involving primes can be found
F^g: Theorems involving primes, which embed the concept within the field of number theory
A^p: Ways in which we judge the value of concepts in number theory (for instance, their applicability and use in multiple mathematical domains)
A^g: Judgements on the value of the concept of prime number
C^p: Ways of finding new concepts in number theory
C^g: The concept of a prime number
E^p: Algorithms for determining whether a given number is prime or not (for instance the Sieve of Eratosthenes)
E^g: Numbers with an evaluation as to whether they are prime

The creation of upsidedowns

In [3], Cardoso et al. describe The Upsidedowns of Gustav Verbeek. These are panels which tell a story up to a halfway point, the continuation of which then appears almost magically when one turns the panels upside down. We show another example of this type in Figure 1. In terms of our model, starting at the artefact level, we could describe the final piece of art as an expression of the concept, E^g; the concept C^g as the constraint that the picture must make sense when upside down (and fit into the story); the aesthetic A^g as the idea of art having multiple meanings when viewed from multiple perspectives; and the framing F^g as the contextual history of this genre of art, the motivation, justification, etc. At the process level, E^p is the generation of methods for producing expressions of art which have a different meaning when viewed upside down (for example, birds flying in the sky can double as waves in the sea, or a hat on one's head can double as a mouth on one's face); C^p represents methods for generating new perspectives from which the art might make sense (other examples include rotating 90° rather than 180°); methods for generating the aesthetic of art having multiple meanings when viewed from multiple perspectives would be A^p (another example would be the aesthetic of art having multiple meanings when viewed from a single perspective); and F^p would be methods for generating new motivations, justifications, and so on.

Figure 1: A man coming out of the water - turn upside down to see the same man drowning.

Not all of these aspects may be present in a single act of creativity, and they may be performed by different parties. We can express the upsidedowns as follows:

FACE Upsidedowns
F^p: Methods for generating the contextual history of this genre of art
F^g: The contextual history of this genre of art, motivation, justification, etc.
A^p: Methods for generating the idea of art having multiple meanings when viewed from multiple perspectives
A^g: The idea of art having multiple meanings when viewed from multiple perspectives
C^p: Methods for generating new perspectives from which the art might make sense
C^g: The constraint that a picture must make sense when upside down
E^p: Methods for generating expressions of art which have a different meaning when viewed upside down
E^g: Expressions of art which have a different meaning when viewed upside down (see Figure 1)

Lessons from psychology

There is much work on creativity and little cross-fertilization between fields.
In particular, the relationship between CC and the psychology of creativity (in Europe at least) has been oddly disjointed: this is a waste, and our models constitute one attempt to embed various aspects of creativity research from psychology into a CC context. Simonton [22] identifies four perspectives in basic psychological research on the nature of creativity: cognitive psychologists are concerned with the mental operations which underlie the creative process; developmental psychologists investigate the circumstances which contribute to creative growth; differential psychologists focus on individual differences; and social psychologists investigate the sociocultural environments which shape or favour creative activity. Of most relevance to CC are Simonton's first and final categories. The first is the most obvious, and much work in CC is motivated by giving a computational representation of mechanisms which can underlie creative behaviour: these may be cognitive or otherwise, depending on the motivation of the CC researcher. The final category is far less common in CC, although work such as that by Saunders [21] has demonstrated its relevance to the CC community. Our models are thus inspired by ideas in these two areas.

Types of creativity

Despite creativity being notoriously difficult to define, researchers have distinguished different types of it. These often include some sort of conceptual space, which is an area of work defined culturally by a set of rules and approaches into which individuals are socialised as they master the skills of their field. Some individuals will produce work which violates the rules but is considered by the community to be highly creative, and new work produced according to a new set of rules then becomes the standard for that domain. Csikszentmihalyi [5] distinguishes between little-c creativity, which might lie within a cultural paradigm and includes everyday, mundane creativity which is not domain-changing, and big-C creativity, which is a rare, eminent creativity in which a problem is solved or an object created that has a major impact on other people and changes the field. Boden [2] draws a similar distinction in her view of creativity as search within a conceptual space, where exploratory creativity searches within the space, and transformational creativity involves expanding the space by breaking one or more of its defining characteristics. Boden also distinguishes between psychological (P-creativity) and historical (H-creativity) creativity, concerning ideas which are novel with respect to a particular mind, or to the whole of human history. In our FACE and IDEA models we have tried to be sufficiently general so as to capture each of these three types of creativity.

There are some controversies in creativity research regarding how general creativity is. For instance, there is disagreement over whether everyday creativity involves the same processes as eminent creativity [11; 27]. Another common debate (also heard at CC conferences) concerns whether creativity is domain specific [1], and in particular whether it is the same phenomenon in the arts as in science. While most investigators assume that there are a small number of cognitive operations which underlie creativity in diverse domains, Simonton [22] notes that it is also possible that no processes are unique to creativity, or that there are no processes which are present in every instance of creativity. There is also debate on where creativity starts, and on which of the different types is more creative.
We occupy a comfortable vantage point on the fence, attempting to create a framework which is general enough to capture all aspects. Specific claims can then be expressed and tested in terms of our models.

The distinction between process and product

The distinction between a product or idea and the processes used to create it is a common one in psychology. Torrance argues that creativity is a combination of person, process and product, seeing a fine line between studying processes and products (reported in [25]). Maher et al. point out that "A creative process is one in which the process that generates the design is novel, and a creative product is when we evaluate the result of the design process as being novel" [15]. Much work in creative cognition research seeks to understand the mental representations and processes underlying creative thought. Examples of processes include Finke et al.'s Geneplore model [7], in which mental processes such as retrieval, association, transformation, analogical transfer, categorical reduction, and so on, might enter phases of creative invention. Conceptual combination, in which concepts are combined to yield emergent features, is another process thought to be important for creative behaviour, and this process is frequently mentioned in historical accounts of creative accomplishments (see [13; 2]). Analogical reasoning, in which structured knowledge from a familiar domain is applied to a novel or less familiar one, is also thought to be a process with a special link to creativity [10].

There is also justification from historical case studies of creativity for considering a process to be the creative output of interest. For example, consider young Euler's discovery of arithmetic series, in which he and his classmates were told to add up all the numbers between 1 and 100. The other pupils also arrived at the answer 5050, but everyone except Euler laboriously added each of the numbers individually. Euler realised that if the numbers were written in ascending order and then underneath in descending order, the sum of each of the pairs was 101, and there were 100 pairs. Therefore twice the required sum was 10100, and the answer was 5050. As in the example of the Basel problem described below, there are (at least) two interesting creative outputs: the expression 5050 and the process by which the expression is generated, namely the sum to n terms, S_n = n(first + last)/2. This example shows the importance of a generative process for evaluations of creativity - much more so than the Basel problem, since the expression π^2/6 was itself H-creative, whereas Euler's solution 5050 was only P-creative. We can break it down as follows:

FACE Euler Problem
F^p: -
F^g: Embed into mathematics, eg. APs and GPs
A^p: -
A^g: Proof that the solution is correct
C^p: Ways of finding new problems
C^g: Sum the numbers between 1 and 100
E^p: Sum to n terms = n(first + last)/2
E^g: The solution 5050

In the CC literature, the focus is often on the product, and virtually all systems in CC discuss the creation of artefacts rather than processes. One strong motivation behind our development of the FACE model is to emphasise the importance, in a judgement of creativity, of the process by which an artefact is created. We have done this via our distinction between process and product in each of the four aspects of FACE.
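Written out, the pairing argument behind Euler's E^p above is:

\[
S_n = 1 + 2 + \cdots + n, \qquad 2S_n = (1 + n) + (2 + (n-1)) + \cdots + (n + 1) = n(n + 1),
\]

so that \(S_n = \tfrac{n(\mathrm{first} + \mathrm{last})}{2}\); for the classroom problem, \(S_{100} = \tfrac{100 \cdot 101}{2} = 5050\).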
Evaluation in creativity

In the introduction we noted two senses of evaluation which are relevant to creativity: evaluating whether an idea or product is valuable, and evaluating whether a person has been creative. The former, which is judged both internally by a creator and externally by a community, is an essential component of creativity. This notion of evaluation of products as a key part of the creative process was introduced early on by thinkers such as Wallas [26], who presented one of the first models of the creative process, outlining four stages: (i) preparation, (ii) incubation, (iii) illumination and (iv) verification. The final stage, where an idea is consciously verified, elaborated, and applied, may be carried out either by the creator or by the community. Similarly, Parnes [18] introduced evaluation into the creative process in his theory that creativity is a function of knowledge, imagination and evaluation, arguing that the highly creative person must be able to make evaluative judgements about his or her products. This kind of evaluation is also linked to Williams' divergent-productive thinking (described in [16]), in which multiple solutions to a problem are generated and then evaluated to select the best. McGraw and Hofstadter account for evaluation in their two-stage model of guesswork and evaluation [17], and refer to the "necessarily iterative process of guesswork and evaluation as the 'central feedback loop of creativity'" [9, p. 451]. Additionally, Maher, Merrick and Macindoe point out that there are numerous characteristics associated with creative design beyond simply producing novel material, including aesthetic appeal, quality, unexpectedness, uncommonness, peer-recognition, influence, intelligence, learning, and popularity [15]. Thus, the twin processes of generation and evaluation are firmly embedded in notions of creativity. We maintain this distinction in our two complementary models: FACE, which describes generative acts of creativity, and IDEA, which describes ways of evaluating those acts. We further reflect notions of internal and external evaluation of a product via the Aesthetic and Framing aspects of the FACE model. The IDEA model complements these aspects and provides a fine-grained approach to evaluation.

Kreitler and Kreitler [14] distinguish between creators and spectators of creative products and argue that there are two types of creative people: those who create art and those who view it. They have developed a theory of experiencing art consisting of two phases. In the first (perceptual-cognitive) phase, the spectator responds to the work of art; in the second (motivational) phase, the experience is motivated by psychological tensions which exist in the spectator independently of the current experience. These tensions trigger new tensions in response to the work of art and allow the spectator to give meaning to the art: such meaning develops with the gradual unfolding of different aspects of the artwork. This work has formed the inspiration behind our IDEA model and our thoughts on the cognitive effort required to understand a creative artwork.

Theories of creativity

In view of the difficulties of defining this nebulous concept, some psychologists have moved on from considering 'what is creativity?' to considering 'where is creativity?' Csikszentmihalyi [5] answers this with his systems view of creativity.
He sees creativity as an interactive process between three elements: an individual innovator, his or her knowledge about a domain, and a field or community of experts who decide which individuals and products are valued. Of particular interest to us is Csikszentmihalyi's emphasis on the role of the community in which creators operate, and on how it affects creative outcomes. Uzzi and Spiro [24] also emphasise the role of a community in amplifying (or stifling) creativity. They discuss work which traces the history of key innovations and shows that in nearly all cases the creator is embedded in a network of artists or scientists who share ideas and act as each other's critics, creating an atmosphere of cross-fertilisation of ideas. These theories have inspired the social aspects of our model, in particular the framing aspect, in which a creator embeds his or her creation within a field. Attempts at framing may well inspire further innovations in the field. Similarly, our IDEA model is based on theories of how creative work is received. Noting the difficulties experienced by psychologists when trying to define creativity, the fine level of granularity of our models enables us to pinpoint where creativity in a particular act has occurred, without having to answer the question of what creativity is. Gardner [8] holds that an individual must be consistently creative in order to be considered creative. He argues that "the creative individual is a person who regularly solves problems, fashions products or defines new questions in a domain in a way that is initially considered novel but that ultimately becomes accepted in a particular cultural setting" [8, p. 35]. In order to accommodate this criterion, we have designed our FACE model to be used in a cumulative way.

Creative acts and the FACE/IDEA models

We break down two further examples of creativity in order to demonstrate the various aspects of our models. In an attempt to avoid charges of cherry-picking examples which fit our model, we analyse two very different examples in different domains. The first is an example of big-C, transformational, H-creativity in mathematics, and the second concerns little-c, P-creativity in general problem solving. It is difficult at this stage to show generality: we invite the reader to similarly decompose other creative acts in order to develop the FACE and IDEA models. Recall that our models are intended to help us to develop and evaluate creative software, and are inspired by, rather than models of, human creativity.

The Basel problem

Euler's solution of the Basel problem in mathematics is a seminal historical episode of human creativity. This is the problem of finding the sum of the reciprocals of the squares of the natural numbers, i.e. finding the exact value of the infinite series
\[
1 + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} + \frac{1}{25} + \frac{1}{36} + \cdots
\]
In Euler's time this was a difficult and well-known problem. It had been around at least since Pietro Mengoli posed it in 1650, and had eluded efforts by Wallis, Leibniz and the Bernoulli brothers: Sandifer refers to it as "the best known problem of the time" [20, p. 58]. In 1734 Euler found the solution $\frac{\pi^2}{6}$, solving the problem in three different ways (for excellent commentary on this work, see [20, pp. 157-165]). In his third solution, Euler drew an analogy between finite and infinite series, and applied a rule about finite series to infinite series to get what is referred to by Polya as an "extremely daring conjecture" [19, p. 18]. Euler continued to evaluate this conjecture and his analogous rule.
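One of his checks was purely empirical: calculating more and more terms of the series and comparing the partial sums with $\frac{\pi^2}{6}$. A minimal numerical sketch of such a check, in Python (the term counts are illustrative choices, not Euler's actual computations):

```python
import math

# Partial sums of 1 + 1/4 + 1/9 + ..., compared with Euler's claimed value pi^2/6.
target = math.pi ** 2 / 6
for terms in (10, 100, 1000, 10000):
    partial = sum(1.0 / n ** 2 for n in range(1, terms + 1))
    print(f"{terms:>5} terms: {partial:.6f} (error {target - partial:.2e})")
```

The error shrinks roughly like $1/n$, so the partial sums visibly approach $\frac{\pi^2}{6}$.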
He also applied the rule to other infinite series, to form predictions which could then be tested: both series with unknown solutions (e.g. $1 + \frac{1}{16} + \frac{1}{81} + \frac{1}{256} + \cdots = \frac{\pi^4}{90}$) and series where the solution was known (e.g. Leibniz's infinite series $1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \cdots = \frac{\pi}{4}$). In all tests the conjecture and the analogous rule held.

This creative episode involves several different aspects. Firstly, finding the solution to the Basel problem was a major result; this has inspired our $E_g$, the expression of a concept. Of even greater value was Euler's discovery of the analogous rule and, in general, of ways of converting rules about finite series to rules about infinite series, which, once confirmed, could be applied generally; this has inspired $E_p$. Euler's proofs that his solution is consistent and sound have inspired $A_g$. While the modern mathematical concept of infinity was not developed in Euler's time, the concept of infinite series was well established (infinite series have been around at least since Zeno); therefore, his work with infinite series had to fit with the structure already developed. Euler's extended work on the applications and limitations of his analogy, and his independent proofs of the solution, inspire our framing aspect, $F_g$. We summarise these creative acts in terms of our FACE model below.

FACE Basel Problem
- $F_p$: -
- $F_g$: Euler's extended work on the applications and limitations of his analogy; his independent proofs of the solution; general ways in which analogies can be used in mathematics
- $A_p$: -
- $A_g$: Euler's proofs that his solution was sound
- $C_p$: Ways to find problems such as the Basel problem
- $C_g$: The Basel problem (Pietro Mengoli)
- $E_p$: Ways which Euler discovered of converting rules about finite series to rules about infinite series
- $E_g$: The solution $\frac{\pi^2}{6}$

We see that, in describing how Euler solved the Basel problem, authors such as Polya and Sandifer refer to other creative acts in relevant mathematics communities. They also describe the amount of time Euler took, and the effort he went to in order to justify his result and his methods. They use emotive terms such as "daring" to describe the results. In effect, this is all an attempt to quantify and qualify the impact this creative act had on mathematical society. This motivates our emphasis on impact, rather than just value judgement, in the IDEA model. In particular, by highlighting the time taken, Polya and Sandifer indicate the level of development Euler's methods went through as he tried and failed repeatedly to find a solution. They also highlight how difficult people had previously found the Basel problem, hence how different Euler's creative act(s) were from the sum of those coming before him. These aspects of the description of impact motivate the branch of the IDEA model that deals with the stage of development software is in, and the utility of understanding exactly the background knowledge available before, during and after the development and execution of the software. In addition, the emotive descriptions highlight that creative acts are intended to affect people's well-being.
Finally, the description of Euler going to great lengths to justify his methods can be seen as motivation for why, in the IDEA model, we measure the level of cognitive effort that audience members expend in understanding creative acts. Clearly, Euler was trying to convince people to understand his approach and to spend time appreciating the implications it had for mathematics.

Duncker's candle task

Duncker's candle task [6] is a cognitive performance test invented by the Gestalt psychologist Duncker, intended to measure the influence of functional fixedness on a participant's problem-solving capabilities, and it has been used in a variety of experiments concerning creative thinking. (Functional fixedness is a cognitive bias which limits a person to using an object only in the way it is traditionally intended: for instance, a person with a hammer who needs a paperweight may be unable to see how the hammer could be used for such a purpose.) The challenge in the candle task is to fix a lit candle on a wall (a cork board) in such a way that the candle wax won't drip onto the table below, using only a book of matches, a box of thumbtacks and the candle. The solution, to tack the empty box of tacks onto the wall and sit the candle in the box, is difficult to find because people typically see the box as a container for the tacks rather than as a piece of equipment for the task. Some people find partial solutions, such as tacking the candle to the wall or melting some of the candle's wax and using it to stick the candle to the wall. The task is interesting for us since its solution is an example of everyday, little-c, exploratory, P-creativity (other components surrounding the task could be argued to be a different type of creativity; Duncker's invention of the problem, for example, could be seen as H-creative and possibly big-C creative). We show our decomposition of various creative acts connected to this task in terms of our FACE model below. Since we are interested in the everyday, creative problem-solving aspects of this example, we do not consider the IDEA model: there is usually no audience in such contexts.

FACE Duncker's candle task
- $F_p$: -
- $F_g$: An explanation as to how the task was completed
- $A_p$: -
- $A_g$: Evaluation of the solution
- $C_p$: Ways of finding problems like Duncker's candle task
- $C_g$: Duncker's candle task
- $E_p$: Techniques to overcome functional fixedness, e.g. analogical transfer, in which the problem is framed as an analogy, or reorganization of mental categories
- $E_g$: Tack the box onto the wall and sit the candle in it

Further work and conclusions

While our FACE and IDEA models are not broad enough to cover all potentially creative software systems, we believe that they cover enough to guide and describe the first wave of creative systems. These models constitute a starting point, rather than an endpoint, for our thinking about how to evaluate creativity in machines. This project is ongoing, and we expect to evaluate our two models principally by their utility in describing creative systems: we begin this task in a companion paper [4]. Philosophers of science can also help us to evaluate and compare such measures: recent work has suggested criteria which a good theory should satisfy. In their analysis of one hundred recent doctoral dissertations on creativity, Wehner et al. found a "parochial isolation" and no cross-disciplinary ideas (reported in [23]).
They compared the situation to the parable of the blind men who each attempt to understand an elephant by touching a different part, thus each building a very different picture of it. We hope that, by drawing inspiration from work in the psychology of creativity, we will help to contribute to a fruitful multidisciplinary approach.

Acknowledgements

We are very grateful to John Charnley for his thoughts on the FACE and IDEA descriptive models. We would also like to thank our three anonymous reviewers for their helpful comments. This work is supported by EPSRC grants EP/F035594/1 and EP/F036647/1.

2011_16 !2011

Picasso, Pato and Perro: Reconciling Procedure with Creativity
Patrick Summerhays McNally, Computer Science Department, Northwestern University, patrickmcnally2013@u.northwestern.edu
Kristian Hammond, Computer Science Department, Northwestern University, hammond@cs.northwestern.edu

Abstract. This paper presents and details 'Pato and Perro on the Movies,' a system that generates web comics about recently released movies. Information about movies is extracted from the internet, and a series of panels is drawn with dialogue to set up a punchline in the comic's final frame. Definitions of creativity commonly used to examine computational processes are presented and used to examine this system. The system is used to discuss a common critique of creative systems, namely that procedural creation inherently limits the range of potential content produced. This paper argues that procedure and creativity can be reconciled, and that much of the content produced by humans is subject to a similar critique. Finally, we discuss the implications of characterizing many human acts of creation as procedural.

Pato and Perro on the Movies

'Pato and Perro on the Movies' is a web comic written and drawn by a machine. Each comic gives bite-sized impressions of a recently released film and closes with a punchline about one of the characters' mothers. Consider a comic produced by the system in Figure 1. The system's procedure for creating a comic tries to meet these requirements:
- The subject must be a recently released film.
- Both a negative and a positive statement about the film should be presented if at all possible.
- The statement in the second panel should be fertile language for generating a punchline.
- That punchline should aim to be a humorous re-interpretation of language seen in the second panel.

Figure 1: A comic created by Pato and Perro on the Movies.

To accomplish these goals, the system mines rottentomatoes.com, a popular movie review site, for the week's top box office hits as well as bite-sized commentary of both positive and negative sentiment pertaining to these movies (see Figure 2). The bite-sized commentary is provided on each film's Rotten Tomatoes page: a section of each page is devoted to a wall of choice snippets from different reviews. The snippets are marked with a ripe tomato when the sentiment of the snippet is positive and a green splat when the sentiment is negative; this marker is used to determine the valence of a snippet when it is used to create dialogue in the comic. The snippets are also analyzed using WordNet (Fellbaum 1998) for potential humorous meanings. The system's best pick is selected for the second panel, and another comment of opposing sentiment is selected to provide contrast in the first panel, or a flat statement of preference like the one seen in Panel 1 of Figure 1 is used.
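A minimal sketch of this WordNet-based scoring step, in Python with NLTK, assuming the Resnik similarity measure the authors cite below; the target list of "humorous concepts" is a placeholder, since the hand-assembled set actually used by the system is not published:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic  # needs nltk.download('wordnet') and ('wordnet_ic')

# Brown-corpus information-content statistics, required by Resnik similarity.
brown_ic = wordnet_ic.ic('ic-brown.dat')

# Placeholder stand-ins for the system's hand-assembled humorous concepts.
targets = [wn.synsets(w, pos=wn.NOUN)[0] for w in ('donkey', 'monkey')]

def humour_potential(word):
    """Best Resnik similarity between any noun sense of `word` and a target."""
    scores = [sense.res_similarity(t, brown_ic)
              for sense in wn.synsets(word, pos=wn.NOUN)
              for t in targets]
    return max(scores, default=0.0)

# Rank candidate nouns from the review snippets; the top scorer feeds the punchline.
for noun in ('bull', 'film', 'dialogue'):
    print(noun, round(humour_potential(noun), 2))
```

Resnik similarity is the information content of the two senses' least common subsumer, so a high score means the candidate noun sits close to a humorous concept in the WordNet hierarchy, which is the kind of "conflatable meaning" proxy described below.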
Tension is interesting and plays an important role in the build-up to the punchline (Napier 2004), so it is important that the first two panels contain statements of opposing sentiment. The final panel receives the punchline. To craft a punchline from text, the system follows this approach:
- tokenize and tag the text that appears in the second panel of the comic;
- identify verb and noun phrases that occur in the text;
- compute the WordNet distance between the adjectives, verbs and nouns of a phrase and a set of humorous concepts pre-assembled by hand;
- pick the verb or noun phrase with the shortest distance to this target set of humorous concepts;
- form the punchline based on whether the phrase is a verb phrase or a noun phrase.

The system relies on WordNet, a precompiled database of words from which semantic distances between words can be computed. The method used to compute semantic distance was developed by Philip Resnik (Resnik 1995). This distance is used to determine how easily a phrase can be interpreted as something humorous. Put another way, the system uses a word's WordNet semantic similarity with another word as a proxy for how easily those words can be conflated in meaning. For example, in the comic illustrated in Figure 1, a bull is semantically similar to a donkey, which is considered a humorous topic appropriate for a punchline. The system, in evaluating the words in panel two, selected 'raging bull' for this semantic similarity and crafted the punchline. The result of the system's procedure is a comic strip. If a person had assembled this strip we would call it creative. Can we say the same of this system?

Figure 2: On the left, the box office report on rottentomatoes.com. On the right, review snippets from Battlefield LA from the same site.

What is Creativity

Here is a general definition of creativity: the production of something novel and useful. There is a certain amount of consensus around this definition (Mumford 2003), such that any theory or model of creativity needs to address these two tenets (novelty and utility). Although consensus is good, the scope of this definition is quite broad, and this generality comes with limitations. It becomes difficult to discuss how creative processes differ without a more precise definition. For example, how should the difference between a baker modifying a recipe and a baker working from scratch be characterized? This is where the consensus dissolves. Given such a lack of consensus around core concepts, claims of creativity in computation have to be made with a spirit of exploration. Nevertheless, two definitions of creativity have been developed with a strong computational context. Both are useful for discussing the creativity of a machine. They strike out in remarkably different directions: one is old and one is new; one is content-centric and the other is process-centric. Despite their apparent differences, both tell us fundamental things about our ideas of creativity and machines.

One of the earliest definitions used in the computational creativity field was put forth by Newell, Shaw and Simon in 1963. Their four criteria for a creative system are:
1) the solution is novel and useful;
2) the solution demands the rejection of previous ideas;
3) the solution occurs after much persistence;
4) the solution should clarify an initially vague problem.

Understandably, their criteria are framed with the task of searching in mind. Problem-solving systems of the time were designed to strategically search a vast solution space.
So, searching was a common paradigm upon which to base a system. These criteria were intended to judge when a solution found by such a system would have appeared creative had it been developed by a human mind. To this end, Newell, Shaw and Simon developed this list (which was meant to be neither complete nor sufficient) to provide a scaffolding around which to organize an argument that a system exhibited creativity. The first criterion is simply a repetition of the general definition above. The second speaks to a sense that creative things often elicit surprise; their implications are unexpected in some way. The third criterion suggests that creative things are not easily arrived upon. The fourth criterion reflects the sense that creativity should not be obvious. In fact, creativity will often provide an 'aha!' moment, where the solution itself illuminates the structure of the very problem it solves. As with the original definition, there is a tremendous amount of room to interpret these criteria. But they are meant to be interpreted. Each one indicates characteristics considered to be markers of creativity in human acts. Attributing each one to a system that produces content is a way to argue that the system is creative. One potential problem with these criteria is how content-centric they are: they stipulate nothing about the architecture of a system or the procedure making the content.

Fortunately, a second perspective on computational creativity is more process-centric. Margaret Boden has published ideas about different classes of creativity, and these categories pertain more to process than product (Boden 1998). Two important classes of creativity she discusses are exploratory creativity and transformational creativity. These distinguish between creativity that follows a procedure and creativity that defines a procedure. Put another way, Boden tells us that exploratory creativity navigates a defined space while transformational creativity redefines that space. A painter adhering to the process and techniques of watercolor, for example, is different from Picasso establishing cubism. A watercolorist will start in the background and 'block' in their scene, moving from lightest to darkest colors. They will end with the foreground and employ a set of stroke techniques defined by the medium. For this reason a classic watercolor scene has a perspective and texture that is uniform across the medium. There is a recognizable proportion and boundary to watercolor as a space of potential content. Boden would describe the creativity of a painter following the classic watercolor procedure as 'exploratory': the boundaries of the conceptual space are set by the medium, and the procedures with which that space is explored are the stroke techniques and layering strategy. In contrast, any process that redefines an existing conceptual space or establishes a completely new one, like Picasso developing cubism, is an example of transformational creativity. Transformational creativity is more complex. The minds we most celebrate as creative tend to be known for their grand acts of transformational creativity, but there can be minor acts too. If the watercolorist (noticing the way a heavy brush drips and splatters on the paper while traveling to begin the first stroke) decides to employ splatter as a technique, the space of potential content changes; a new technique has been added to the painter's procedure. This modification would also be transformational creativity.
These two characterizations of creativity prove useful in discussing whether or not (or to what degree) a machine exhibits creativity. Attempting to answer such a question forces clarifications about what it means to be creative, which in turn may have important implications for our understanding of human creativity and for the further development of machines capable of entertaining and illuminating our lives. Specifically, these two definitions of creativity can inform our understanding of Pato and Perro and help us to articulate the system's limitations as well as imagine beyond them.

Creativity in Pato and Perro

Pato and Perro produces content that is variable. Decisions are being made by the system as the space of possible constructions is examined. The case can be made that Pato and Perro meets three of the four criteria laid out above by Newell, Shaw and Simon. Surely the comics could be considered novel, since they are new pieces of media. It can be argued that they have use as well, since humor often has a beneficial effect on mood, and each strip conveys some amount of information about a movie. If the system succeeds, the reader will interpret the text in the second panel and then be forced to reinterpret that text after reading the punchline, forcing a humorous rejection of the original meaning; this arguably satisfies the second criterion. The third criterion seems rather subjective. Depending on the number of review snippets available, the system may select a comic it considers best from hundreds of potential candidates. If this isn't enough persistence, more snippets could be retrieved and more potential comics could be evaluated. The final criterion, which stipulates that the comic should clarify an initially vague problem, is perhaps inapplicable here: it is not clear what problem is being clarified.

Now, there is a lot of room for interpretation with these criteria. For example, consider the modern search engine. A results page for a given search term will be novel and useful, because the page will present relevant, recent results catered to the user's language and geographic location. The results may demand the rejection of the user's previous ideas, be those ideas about the cheapest flights or the definitions of words. Certainly a good results page will present the user with varied perspectives on a topic. The results may arrive quickly, but vast amounts of computational power have gone into creating the indices that afford this speed, and these indices are constantly being revised. Finally, the best results set for a given search term is often unclear. The best engine should clarify this through its ranking mechanism; in other words, the best engine will have a strategy for determining why one set of results is superior to another. A search engine appears, from this perspective, to be a creative system. Rather than defend the previous statement, we will simply say that Newell, Shaw and Simon's criteria are open to interpretation. Furthermore, it may not be appropriate to attribute creativity to a system in hindsight, looking only at its solutions and our impressions of those solutions. This means Pato and Perro's claim to creativity based on these criteria is debatable. This is where Boden's conceptions of exploratory and transformational creativity can be of use.
For example, if we examine Pato and Perro's process, it becomes clear that the system exhibits exploratory creativity but lacks transformational creativity, since the system follows rules and has no mechanisms for breaking or amending those rules. But what would it mean for the system to break its own rules? The system has only a general idea of the pieces with which it works. Essentially, there is a plan (the overarching comic structure of contradicting opinions followed by a punchline) and potential chunks of data that could fit into that plan (the review-bites from rottentomatoes.com). In a general sense, the system ranks potential arrangements according to a fitness metric and arrives at the one deemed best. Each review-bite has a sentiment valence and a character length attached to it as the only orienting information the system uses to realize its plan. The most technical piece of the system is used to evaluate each chunk of review for potential humorous meanings. In order to break with its own procedure, the system would need a way to examine all these pieces and the motivations for how they currently fit together. But the system doesn't do this; the process described here is the procedure that Pato and Perro always executes.

The problem for many critics is the static nature of this procedure. There are plenty of procedural art forms, where the artist defines rules and then produces a work by following those rules, but even the art world feels vaguely uncomfortable with these (though they too have trouble putting their fingers on why). There is a sense that creative processes should not be static, that creativity means being capable of stepping outside of one's process and implementing changes: what Boden would call transformational creativity, because it would redefine the space of potential content. This should be vaguely reminiscent of John Searle's thought experiment with the Chinese Room (Searle 1980). There is a strong parallel between following a procedure to create content (what we are calling exploratory creativity) and the Chinese Room producing believable conversation. For a system to exhibit transformational creativity, it would likely need to actually understand the process it was executing. It may be that transformational creativity mirrors in difficulty what Searle called 'strong AI', which begins to suggest why, for a system designed to create content, following rules is so much easier than breaking them or defining new ones. In summary, examining only the content produced by Pato and Perro, it can be argued that the system is creative; but upon further examination, such hindsight assessments may be subject to the same sort of critique put forward for strong AI with the Chinese Room, meaning creativity has something to do with the process as well as the product. In fact, when we actually examine the system's process, it becomes clear that the creativity at work is perhaps the weaker of the two forms defined by Boden.

Discussion

For now, if a machine is to produce compelling content, the details of a compelling production process have to be determined. This means systems are built from their medium up; they are envisioned with their final products in mind and built to execute those products. This requires that the architect understand the nature (the what and why behind interestingness) of the content to be produced.
So the architect of a creative system must understand the potential space of content to be explored and instill in the machine's procedure constraints or heuristics that will keep the machine within the interesting areas of that potential space. For Pato and Perro, the procedure consists of building tension through disagreement in the first two panels and establishing an alternate interpretation for text appearing in the second panel. This implicitly says something interesting about the creative process behind the content: tension heightens effect, and ambiguous meanings provide entertainment. The system doesn't have a macro-level process to examine this like a mind would; nonetheless, these values are implicit in the system's procedure. This often leads to the critique that the creativity happens outside the system. But it isn't clear why proceduralizing an act of creation negates, or rather separates, its creativity. If establishing a process abolishes creativity, many human pursuits generally considered to be creative should be reevaluated. Almost every amateur taking art lessons has not yet been creative. Reporters adhering to the strictures of their form are not being creative. Students following a prescribed essay structure are not being creative. Programmers executing a well-established architectural paradigm are not being creative.

Pato and Perro, as a system, is like the novice painter practicing the process they have learned to paint with watercolors. To repeat, the medium provides a conceptual space, and the learned strokes and techniques are the procedures with which the painter navigates that space. This paper is not claiming that the person painting and the system creating comics are cognitively equivalent, only that if the painter sticks to a defined set of techniques, their creations will be limited to a certain range of possible content, in a fashion similar to that of systems like Pato and Perro. If Pato and Perro could examine its own process and establish new patterns of creation, it would exhibit the transformational creativity of Picasso inventing cubism or of the painter discovering a splatter technique; transformational creativity redefines the procedures and/or the conceptual space.

So why don't these systems exhibit transformational creativity? If the problem seems to be rooted in a sense that a static procedure isn't creative, why not make the procedure dynamic? Why are many 'exploratory' systems being built while so few 'transformational' systems are coming about, particularly if the latter seems to be more powerful or more highly regarded in some way? The list of exploratory systems and their mediums is extensive: Jape builds linguistic puns (Binstead 1994); HAHACRONYM builds acronyms (Stock 2003); AARON builds paintings (McCorduck 1991); ASPERA builds poetry (Gervás 2001); Tale-spin builds stories (Meehan 1981). The explanation is likely that exploratory systems are just easier to build. It is easier to become a watercolorist adept at understood techniques than to become a Picasso capable of inventing new styles. Returning to Searle's Chinese Room, the distinction between following a procedure and understanding the procedure may actually be endemic to people as well as machines. A novice painter could study the stroke techniques of masters for a lifetime and still might never recognize the rules being followed and transform them like Picasso. 'Understanding' the procedure, we argue, is often quite difficult even for humans.
Within their domains, computationally creative systems have instructions and constraints that allow them to explore, but they do not exhibit mastery over their own process. They lack the higher perspective needed to modify their own production behaviors. They don't exhibit Boden's transformational creativity. Nevertheless, these systems' successes suggest that, for the purposes of creating compelling content of a fixed type for an audience, exploratory creativity may be sufficient. Exploratory creativity seems to excel when leveraging a specificity of domain regarding medium or form. But critics might still claim that exploratory creativity will never surprise and delight like transformational creativity. And they would be right. Systems exhibiting exploratory creativity may certainly provide useful, even valuable, material, but the human mind, once it has grasped the system's process, will naturally imagine the boundaries and proportions that limit that process. And this will always feel disappointing. But the fact remains that much of the content we produce is procedural in nature. Exploratory creativity is the most common type exhibited by people day to day. Our most celebrated works are certainly the result of transformational creativity, but our world spins by procedure.

Conclusion: Creative Systems in the World

The Pato and Perro system produces multimedia content that has never been seen before. It does this quickly and at scale. Most importantly, it is not alone. Automated content generation systems like it are beginning to emerge for public and commercial use. Music generation systems are being seen on popular mobile devices like the iPad (Eno 2010). Systems are being built to craft sports narratives (Carr 2009). Investigative journalism is using systems to identify trends, and soon perhaps even to characterize them in language. All of these systems fit the general pattern of exploratory creativity; that is, they have a massive potential space for creation and instructions for how to position, navigate and orient themselves in this space. Any category of content that is produced for a massive audience and adheres to a procedure has the potential to be produced by a machine. But it is unlikely that machines will be defining any new mediums or breaking the rules of old ones. For now it is unreasonable to expect our machines to be Picassos that transcend their procedures, but systems like Pato and Perro, AARON, Jape and many more show us that we can reasonably expect adepts of a specified medium.

Acknowledgements

We would like to acknowledge support from the National Science Foundation (Computational Creativity: Building a model of machine generated humor, Grant Number 0856058).

2011_17 !2011

A Vision of Creative Computation in Music Performance
Roger B. Dannenberg, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA, rbd@cs.cmu.edu

Abstract. Human Computer Music Performance (HCMP) is the integration of computer performers into live popular music. At present, HCMP exists only in very limited forms, due to a lack of understanding of how computer performers might operate in the context of live music and a lack of supporting research and technology. The present work examines recent experimental systems as well as existing music performance practice to envision a future performance practice that involves computers as musicians. The resulting findings provide motivation as well as specific research objectives that will enable new creative practice in music.
Introduction

Sound and music computing research has focused attention mainly in two directions: "high art" and commercial music from the recording industry. Largely ignored is the live performance of popular music, including rock, jazz, and folk music. While the practice of popular music is not currently a hot topic for music technology research, it is arguably the dominant form of live music. In a recent weekly listing of concerts in Pittsburgh, there are 24 "classical" concerts, 1 experimental/electro-acoustic performance, and 98 listings for rock, jazz, open stage, and acoustic music. By examining the features of popular music practice, we find many commonalities across a diverse array of musics, including rock, jazz, folk, music theater, church music, choral music, and others. Live music in all of these popular forms offers a wealth of opportunities for computing and music processing research. I call the integration of computers as performers into popular live music performance practice "Human-Computer Music Performance" (HCMP).

Technologies such as digital audio recording and music synthesizers have changed the musical landscape dramatically over the last few decades. I believe an even more radical and creative musical revolution is in progress, where computers become performers rather than merely instruments. Because this revolution will involve new musical forms and new tasks (not just the automation of known ones), it is difficult to imagine and predict. HCMP will be most interesting when computers exhibit human-level musical performance, but this is such a giant advance over current capabilities and understandings that it offers little guidance for HCMP research in the short term. An alternative is to envision a future of HCMP based on realistic assumptions of machine intelligence. Thus, an important initial step in HCMP research is to imagine how HCMP systems will operate. A clear vision of HCMP will motivate research to make it happen.

This is a position paper that poses a "grand challenge" for creative computing in music. Rather than stating a vague problem with no path to a solution, I hope to make the vision concrete enough to pose specific problems and research directions. I will consider some intermediate steps and approaches toward the ultimate goal. The results so far are not solutions themselves, but objectives that define problems and motivate their solutions. There is an apparent contradiction in saying, on the one hand, that my goal is to develop "new musical forms and new tasks," but on the other hand, that I wish to address existing forms of live popular music. Since interactive computer music, aside from electronic instruments, is rarely used in popular music, there is a huge opportunity for a new and creative synthesis of ideas from which new forms and new tasks will emerge. This synthesis will only occur if musicians are first motivated to integrate more creative computing into their current practice. Thus, my strategy is to bridge the gaps that limit the use of computing in live popular music. New technologies will create opportunities for more radical transformations and artistic innovation. The next section describes different modes of interactive computer music. Then current work in HCMP is presented. "The Vision of HCMP" describes requirements for future systems and makes specific predictions as to how these will be met. The paper ends with "Future Work" and "Conclusions."
Human-Computer Music Performance

Computers have been used in music performance for many years, so before going further, we should discuss HCMP and explain how it differs from current practice in computer music (see Table 1). The most common use of computing in music performance is through computer instruments, typically keyboards. These, and other electronic instruments, are essentially substitutes for traditional instruments and rely upon human musicians for their control and coordination with other musicians. Many composers of interactive contemporary art music use computers to generate music in real time, often in response to live performers. These works typically take advantage of contemporary trends toward atonality and absence of a metrical pulse, which simplifies the problems of machine listening and synchronization. Alternatively, the practice of computer accompaniment solves the synchronization problem by assuming a pre-determined score (music notation) to be played expressively by the performer, while the computer follows the performer in the score and synchronizes an accompaniment. Other solutions to the synchronization of computers and humans include using fixed media with "click tracks" that performers can hear through headphones (with minimal interaction), and conducting systems, where a human essentially taps a click track for the computer.

Table 1. Interactive Music: Major Threads
- Computer Instruments: direct physical interaction with virtual instruments: digital keyboards, drums, etc.
- Interactive Contemporary Art Music: composed interactions, often unconstrained by traditional harmony or rhythm; digital audio effects and transformations of live performance.
- Computer Accompaniment: assumes a traditional score; score following synchronizes the computer to a live performer.
- Fixed Media: many musical styles and formats; live performers synchronize to a fixed recording.
- Conducting Systems: synchronize live computer performance by tapping or gesturing beats; best with "expressive" traditional/classical music.
- HCMP: assumes mostly steady tempo and synchronization to beats, measures, and sections; compatible with improvisation at all levels.

In contrast, HCMP is aimed toward "common practice" music where performers improvise to varying degrees, where tempo is fairly steady, and where the structure of the music may be determined spontaneously during a performance. The "improvisation" assumption says that performers make musical choices ranging from strumming styles and bass lines to full-blown jazz improvisation. Since there is no pre-determined note-level description of the performance, musicians must synchronize on the basis of a more-or-less implied temporal structure of beats, measures, and sections. We also assume that performers can take liberties with the music, adding an introduction, extending a song with an instrumental solo, etc. Often, these changes are determined during the performance, so performers (human or computer) must be flexible enough to recognize and adapt to these changes. As mentioned in the introduction, HCMP addresses the most commonly performed musical styles, including rock, folk, and jazz.

HCMP Systems to Date

Before describing a vision of future systems, let us look at three systems that have already been built. These are two systems for performing with live jazz musicians and one system for electronic display of music notation.
Virtual Orchestras, Virtual Music Players

The first system is a virtual string orchestra that performs with a live jazz band (Dannenberg, 2011). This system uses studio recordings of acoustic instruments (violins, violas, and cellos) that are synchronized to the live band using audio time-stretching software based on PSOLA (Schnell et al. 2000). This produces high-quality output as long as the source sound is a single, periodic tone. Therefore, we recorded each string part separately, resulting in a 20-channel audio file, and we stretch each channel independently to form a variable-tempo, 20-piece orchestra. To synchronize with the live band, a band member taps a foot pedal in time with the music, and some simple outlier rejection and linear regression software processes the data to form an estimate of the current and future beat position as a function of time (this estimation step is sketched in the "Music Performance" section below). The virtual orchestra does not play continuously, but instead has about 10 separate entrances. Most entrances are cued by pressing a key in time with the music, allowing the system to resynchronize if anything goes wrong or if a soloist decides to play some extra measures. The strings are mixed into 8 audio channels, which are played by 8 high-quality speakers arranged spatially to give the sound of a full ensemble playing in a concert hall (see Figure 1). This helps the strings match the 3-dimensional presence of the live musicians.

Figure 1. Jazz band performance with virtual strings (played by speakers in background).

The second system is a somewhat scaled-down version of the first. Rather than carefully recording and editing music for the virtual player, this version uses MIDI. In this particular case, the part would otherwise be played on a MIDI keyboard, so MIDI is not a limitation. This system uses a foot-tapping input for tempo acquisition, and the foot pedal is also used to cue entrances. Sections are played in sequence, but the operator can override the order by clicking a button on a computer screen. The computer part is mostly eighth-note arpeggios over a rock beat, making precise synchronization very important.

Electronic Music Display

One of the problems of HCMP is communication among both computer and human performers. Since musicians often read music during a performance, a computer-based visual display of music provides an interesting potential for a real-time, two-way musical communication channel. To ease the transition from traditional paper, we assume printed music will be digitally photographed or scanned. Our software loads image files and offers some simple editing where users can mark logical page boundaries (normally between systems or staves). Digital music displays are not new, but the concept of a music display as a musical communication device has received very little attention beyond the idea that a conductor could remotely ensure that all musicians are viewing the same music (Connick 2002). One of the areas we have been investigating is the mapping between the "static score", such as printed music, and the "dynamic score", which is represented by recorded music or a live performance. The static score includes instructions to repeat sections, jump back and play the beginning again, etc., whereas the dynamic score is in some sense an instantiation or unfolding of the static score.
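This static-to-dynamic unfolding is easy to make concrete. A minimal sketch in Python, treating a toy static score as a little program and the dynamic score as its execution trace (the data format is illustrative, not an HCMP file format):

```python
# A toy "static score": (section, measures) items plus a repeat construct.
static_score = [
    ("intro", 4),
    ("repeat", 2, [("verse", 8), ("chorus", 8)]),
    ("ending", 4),
]

def unfold(score):
    """Execute the static score, yielding the dynamic score as a trace."""
    for item in score:
        if item[0] == "repeat":
            _, times, body = item
            for _ in range(times):
                yield from unfold(body)  # repeats behave like loops
        else:
            yield item

print(list(unfold(static_score)))
# [('intro', 4), ('verse', 8), ('chorus', 8), ('verse', 8), ('chorus', 8), ('ending', 4)]
```

Note that the section ('chorus', 8) occurs twice in the trace, which is exactly the ambiguity discussed next: a single static location can correspond to several dynamic ones.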
These static and dynamic score concepts are important because, if a musician points to some notation, or if the computer highlights a score location, that location might represent the first, second, or third repetition in terms of the dynamic score. Mechanisms are needed to disambiguate static score locations. Another focus of our work is page layout and page turning. We have shown, for example, that if a musician is unsure about a music structure decision (e.g. whether to repeat a section or move on) and the musician needs to look ahead in the music, then a display must be capable of showing at least three lines (or "systems") of music at once: the current line plus the two alternative destinations. This provides a way to structure page turning and page layout on a dynamic music display.

The Vision of HCMP

To develop a practice of HCMP and to build upon these initial investigations, we need to imagine how humans and computers will interact, what sorts of communication will take place, and what sorts of processing and machine intelligence will be needed. We need a research agenda. To guide this imagining process, we should look at the practice of music performance without computers. From this, we will construct a set of predictions that anticipate characteristics and functions of future HCMP systems. These predictions will serve to guide future investigations and pose challenges for research and development. We can divide HCMP into two main activities: music preparation and music performance.

Music Preparation

An assumption in HCMP is that music is well-structured: there are agreed-upon melodies, chord progressions, bass lines, and musical structure that must be communicated to all performers. If the music performance is always the same, this is trivial, but our assumption is that the structure may change even during the performance. What happens when the vocalist decides to sing the verse again or the bandleader directs the band to skip the drum solo? This relates to the descriptions of static and dynamic scores above. We can think of static scores as sequential computer programs. The score is "executed" by performing one measure after the next. Musical repeats, jumps, and optional endings are program control constructs (loops, gotos, and conditionals). The dynamic score is then a trace of the execution of this program. With this analogy, one can imagine the preparation of a computer performer to be a kind of programming: "When you reach measure 17, if this is the second repeat, then if there is no human bass player, then play this audio." A conventional programming language is certainly not the right way to express these "programs," but it is clear that we will need something more than a conventional audio recorder and editor. Designing interfaces that are both intuitive and expressive for "programming" performances is an important problem.

Predictions: HCMP systems will make the static/dynamic score relationship more explicit. Terminology for specifying a location in the dynamic score in terms of the static score will be formalized. Score location will be indicated not only in terms of measure numbers but also in terms of the static score structure. Techniques for displaying and directing dynamic score location will form a necessary part of the communication between human and computer performers.

"Scores" in popular music performance can range from complete and detailed common music notation (as in "classical" works) to highly abstract descriptions such as lyrics or lists of sections.
Other music representations are also common: drummers often need just the music structure (how many measures in each section) without actual instructions on what to play, and keyboard, bass, and guitar often read from "chord charts" that give chord symbols rather than specific pitches. Prediction: HCMP systems will work with multiple music representations.

Computer-generated music can be based on audio (with time stretching for synchronization), MIDI sequences, or computer composition and improvisation from specified chord progressions. For many musical genres, automatic generation of parts is feasible, as illustrated by programs such as Band-in-a-Box (Gannon 2004). However, there are seemingly infinite varieties of styles and techniques, so there is plenty of room for research in this area. An interesting problem is not just to, say, create a bass line in a given style, but to give the user control over different parameters, or to allow the user to say "I want a bass line like the one in song X," where of course song X has a different key, tempo, and chord progression. This is a kind of musical analogy problem (Hofstadter, 1996): bass line a is to music structure b as bass line c is to music structure d; given b, c, and d, solve for a. Many users will not have the skill, time, or inclination to play the parts themselves or compose the parts note-by-note, so the ability to generate parts automatically is an essential feature. Prediction: HCMP systems will rely on stylistic generation of music according to lead sheets in addition to pre-recorded audio and sequenced MIDI data.

Music notation offers a direct visual and spatial reference to the otherwise ephemeral music performance. As discussed earlier, we envision capturing music notation by camera or scanner (Lobb, Bell, and Bainbridge 2005) as well as using computer-readable notation. For unstructured images, one would like to convert the notation into a machine-readable form, but like OCR, optical music recognition (OMR) is far from perfect, especially for handwritten (or scrawled) lead sheets. Furthermore, some musicians work from lyrics and chord symbols rather than common music notation. It seems essential to develop methods to annotate music images with structural information. In most cases, this annotation of music notation will be the mechanism by which the static score is described and communicated to the computer. Prediction: HCMP systems will extend music notation to specify music structure.

One characteristic of popular music performance addressed by HCMP is the preparation of "scores" before the performance. Unlike most classical music, where the score is carefully prepared by the composer and publisher, popular music is more likely to be arranged and structured by the performing musicians. Decisions to alter the introduction, where and how to end, and whether to repeat sections are common. Prediction: HCMP systems will provide interfaces for specifying arrangements and performance plans.

Having discussed audio, MIDI, and various forms of music notation, it should be obvious that an important function of HCMP systems will be to provide abstractions of music structure and to allow users to integrate and coordinate multiple representations of music. Prediction: A primary function of HCMP systems will be to coordinate multiple media both in preparation for and during live performance.

Music Performance

Once parts are prepared, we need to perform them! The main issues have to do with musical synchronization and communication.
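Synchronization is where the tap-based tempo tracking of the virtual orchestra described earlier comes in: simple outlier rejection plus linear regression over foot-pedal tap times yields an estimate of current and future beat position. A minimal sketch of that estimation step (the tap times and the rejection threshold are made-up illustrations, not values from the actual system):

```python
import numpy as np

# Hypothetical foot-pedal tap times in seconds; 1.58 is a spurious double-tap.
taps = np.array([0.02, 0.51, 1.00, 1.52, 1.58, 2.01, 2.49])

# Crude outlier rejection: drop taps implausibly close to their predecessor.
keep = np.concatenate(([True], np.diff(taps) > 0.25))
taps = taps[keep]
beats = np.arange(len(taps))  # nominal beat index of each kept tap

# Least-squares fit of time = intercept + period * beat_index.
period, intercept = np.polyfit(beats, taps, 1)

def beat_position(t):
    """Estimated (fractional, possibly future) beat position at time t."""
    return (t - intercept) / period

print(f"period {period:.3f} s; predicted beat at t = 3.0 s: {beat_position(3.0):.2f}")
```

Extrapolating the fitted line forward is what allows events such as entrances to be scheduled slightly ahead of time rather than merely in reaction to each tap.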
Indeed, the primary reason that there is no common practice of HCMP today is the difficulty of getting artificial performers to synchronize to live music. There were early attempts to achieve HCMP using tape recorders and other technologies, but these were mostly abandoned. Many street musicians and solo acts today use a very limited form of HCMP in which the performer simply switches on a pre-recorded "backup band" and plays or sings along. This same idea is seen in Karaoke and many TV and theater productions. B-Keeper is a recent system that uses live audio for beat-based synchronization (Robertson and Plumbley 2007). We want to envision how performers and larger groups could function if many of their present limitations were removed.

When musicians perform together, they synchronize at several levels of a time hierarchy. At the lowest level is the beat or pulse of the music. Unfortunately, fast and accurate automatic detection of beats is not a solved problem. Prediction: HCMP systems will use a variety of beat detection systems and integrate information from multiple sources in order to achieve the necessary accuracy and reliability to support computer-based performers.

Another level of time synchronization is the measure (or bar). Typically a group of 2 or 4 beats, measures organize music into chunks. In rock, measures are indicated by the familiar snare drum accents on beats 2 and 4 and chord changes on or slightly before beat 1. Measures are important in music synchronization because sections are aligned with respect to measures. A musician would never say "let's go to section B on the 3rd beat of measure 8." Prediction: HCMP systems will track measure boundaries. As with beats, multiple sensors and modalities will be used to overcome this difficult machine-listening problem.

Finally, music is organized into sections consisting of groups of measures. These sections are typical units of arrangement, such as introductions, choruses, and verses. When a performance plan is changed during the performance, it is usually accomplished by communicating, in effect, "Let's play section B now (or next)." In the case of "now," the section begins at a measure boundary. In the case of "next," the new section begins at the end of the current section. Without these higher-level temporal structures and conventions, synchronization and cues would need to be accurate to the nearest beat, perhaps just a few hundred milliseconds, rather than the 1 to 10 seconds afforded by higher-level structures. Prediction: HCMP systems will be "aware" of measures and higher-level sectional boundaries in order to synchronize to human players. As with measures, multiple sensors and modalities will be used to overcome the machine-listening problem of identifying musical sections.

Two Examples

It is useful to describe sessions with imagined HCMP systems in order to grasp how the overall system might function. I will describe two examples. The first is a rehearsal and private practice with a conventional orchestra. The second is an informal jam session. The orchestra example will focus on music display and practice as opposed to computer performance. In fact, it falls outside the assumptions of popular music, improvisation, and steady tempo, but it is good to show that these restrictions are not always needed. Imagine that ordinary music on paper is available before the first rehearsal. Using a camera, each page is captured as a digital image and wirelessly transferred to a digital tablet.
OCR and OMR do some preliminary analysis of the images to identify titles, rehearsal markings, and staff and measure locations. The captured information is identified using distinctive fonts or colors so that the musician can visually confirm where the automatic recognition is correct and intervene where recognition was in error or missing. Printed music is often on large pages arranged side-by-side on a music stand to minimize page turning. While a folding digital display might be able to reproduce this arrangement, we will assume a smaller display that necessitates more frequent page turning or scrolling (Bell et al. 2005). The semi-automated staff recognition divides the music into sub-page units that can be displayed in sequence. These units of music are automatically arranged from top to bottom on the display. When the player reaches the bottom, the next unit of music overwrites the top of the display, allowing the musician to read ahead. Because of repetitions in the music, the display is not strictly sequential through the pages. Using either automated or manual markup techniques, repeats and other markings can be identified, and the tablet can show the music in the proper dynamic order. After practicing parts at home, the musician takes the tablet to the first orchestra rehearsal. There, the tablet offers a directory of pieces and an index into rehearsal markings and measure numbers so that the musician can quickly jump to any location requested by the conductor. The tablet records audio from the entire rehearsal and tags the audio with score locations based on whatever music is being displayed at that moment. In the rehearsal, music can be advanced by eye tracking, foot pedals, or other sensors. One viable and simple method is a ring on the first finger that can be pressed with the thumb to signal a page turn. Once music has been rehearsed, the tablet might take a more active role in page turning by matching the live music to recordings from previous rehearsals (Dannenberg and Raphael 2006). Back at home, as the musician resumes practice, recordings from rehearsals can be selected, allowing the musician to play along with the sound of the orchestra. Ideally, one might want to remove the part to be practiced from recordings. There are a couple of techniques that might at least suppress the unwanted sound (Han and Raphael 2007, Smaragdis and Mysore 2009). Another practice aid is the ability to speed up or slow down the rehearsal audio using time-stretching techniques. This is in fact already available in the SmartMusic (MakeMusic 2010) commercial practice system, but SmartMusic does not integrate the capture of scores and rehearsal audio. If music tablets communicate, and if their score representations are comparable, then it will be possible for the conductor to direct everyone's attention to a particular location (in practice, much rehearsal time is currently spent directing musicians to particular locations in the score … "please look at the third beat of the fourth measure after letter G … no, the fourth measure … yes, the third beat … yes, where you have an F-sharp …"). Page turning could be made more reliable and automatic by sharing location and confidence estimates among dozens of tablets. The next example is a jam session. Imagine that some friends want to play some songs they have played together before, but a bass player is not available.
To prepare for the session, the leader finds music on the Web consisting of MIDI files or commercial formats such as Guitar Pro (Arobas 2010) or Band-in-a-Box (Gannon 2004). Using an HCMP system, the music is imported and automatically converted into a "lead sheet" representation, which has the music structure and chord names. This may require automatic analysis to derive chords and music structure. The user may then reorganize the music into a performance plan such as "4-bar intro, verse, chorus, verse, ending." Rather than laboriously preparing each song, the band leader might download ready-to-use bass parts for the songs from the Web. These might be posted to sharing sites similar to those currently storing MIDI, guitar tablature, lyrics, and other music data. Or, there might be commercial sites offering virtual musicians and song data for them to play, just as one can now buy ready-to-use clip-art and background music for multimedia productions. At the jam session, the group selects a song to play (informing the computer), and the leader counts off the beginning. The computer joins the performance. The entrance could be synchronized by many mechanisms that include foot pedals, gestures, speech recognition to "hear" the count-off, etc. Once the band is playing, the bass stays in time by synchronizing to beats. Again, there are many possible ways to detect beats, including foot tapping, audio analysis, and gestures detected by vision, inertial sensors, or other techniques. (No current systems can do this well.) As the band rehearses, there may be directions for the computer bass player to adjust the sound, the volume, the style, etc. This will require an interface where musical style can be manipulated. In addition, the computer must generate a bass line from the chord representation. This is a computer composition or improvisation task that could be accomplished offline or in real time. During the performance, the band may decide to play the chorus an extra time (for example). Human performers might signal this by gesture or by shouting "chorus" a measure before the repetition should begin. It seems unlikely that the computer will be able to understand natural human gestures like this, but there are many ways to communicate the information, such as touching the beginning of the chorus on a tablet-based music display (in which case the computer would understand that, when the current section finishes, it should go back to the pointed-to location). A gesture-based interface might accomplish the same task. Future Work Our work to date has focused on identifying the potential applications and functions of HCMP systems by building experimental systems, using them, and speculating how computers might be used in future systems. This paper has described three prototype systems that illustrate some of the musical potential of this work. Based on these prototypes, we have identified a number of interesting problems to pursue in future work. We have made specific predictions about the major characteristics of future systems, establishing a research agenda which we now summarize. An important area for research is preparation of musical scores. While computer-based music notation editors exist, they work at the note and measure level, whereas HCMP systems should allow users to alter existing music in terms of sections (e.g., "play the chorus twice"). Another limitation of existing editors for music is the inability to deal with multiple representations (lyrics, notes, chords, etc.)
and multiple media (notation, audio, MIDI). While specialized editors are still important, we need a way to integrate and coordinate all representations used in the music performance. A second area for research is synchronization. Computers must maintain a representation of at least three levels of the time hierarchy: beats, measures, and sections. A variety of techniques and modalities can be used. Since even humans use both audible and visual cues, it seems that HCMP systems will need to integrate multiple sources of timing information and exchange information between different levels of the timing hierarchy in order to provide reliable synchronization. More work is also needed to design systems that clearly express the relationship between static scores and their unfolding into a linear performance. Problems include naming (how to refer to a specific location in the dynamic score), communicating intentions before and during music performances, and adapting either recorded or generated music to dynamic changes to the score. The actual sound generated by the computer is important. Whether the sound is from recordings or is generated on-the-fly, users will need support to prepare and control this sound. This includes problems of synthesis and sound diffusion. Perhaps the most important musical issue is control. How will humans "ask" the computer player to play in different styles, and how will style be represented and realized in a controllable fashion? Conclusions Human Computer Music Performance is at present more of a dream than a reality. This paper offers a set of research problems based on experience from a few early HCMP systems and thinking hard about how current practice in live music performance can be extended with state-of-the-art computation. Due to the deep musical knowledge and experience needed for musical interaction in popular forms of music, it seems that the solutions lie in careful design that balances human-computer interaction techniques with machine intelligence. The former reduces the need to automate musical intelligence completely, while the latter reduces the burden of direct human control and intervention. It will be very exciting if HCMP reaches its potential to impact thousands or even millions of musicians. Our prototypes illustrate that HCMP does not require any technical breakthroughs to be practiced in a simple form now, but the deeper issues of music generation and music understanding will likely mature over the next decade. Once it is established and widely available to artists, we believe HCMP will inspire and enable new concepts and genres in music that cannot yet be imagined. Acknowledgements Thanks to Ryan Calorus, who implemented our experimental music display, and Nicolas Gold for valuable discussions and pointing out the importance of (re)arrangement by music sections. Dawen Liang and Gus Xia assisted with the most recent performance system. Our first performance system and the music display work were supported by Microsoft Research and the Carnegie Mellon School of Music. Zplane kindly contributed their high-quality audio time-stretching library for our use. Current work is supported by the National Science Foundation under Grant No. 0855958. 2011_18 !2011 Simulating the Everyday Creativity of Readers Brian O'Neill and Mark Riedl School of Interactive Computing Georgia Institute of Technology {boneill, riedl}@cc.gatech.edu Abstract Sense-making is an act of everyday creativity.
Research suggests that comprehending the world is an act of story construction. Story comprehension, the process of modeling the world of a fictional narrative, thus involves creative story construction ability. In this paper, we present an intelligent system that "reads" a story and incrementally builds a model of the story world. Based on psychological theories of story comprehension, our system computationally simulates the everyday creative process of a human reader with a combination of story generation search and strategies for inferring character goals. We describe the work in the context of a Synthetic Audience - a system that assists amateur storywriters by reading the story, analyzing the resultant sense-making models, and providing critique. Introduction Humans exhibit creativity in a vast range of domains such as music, art, dance, and storytelling. On a regular basis, humans exhibit creativity in ways that are overlooked: problem-solving, inference, and, in general, sense-making are all creative acts that humans carry out on a regular, and sometimes unconscious, basis. Sense-making is an act of human cognitive creativity: the construction of a narrative that explains what is happening around us (Bruner 1990; Gerrig 1994). We call this everyday creativity. Consider the following example: It's nighttime; Jesse is standing above Marlow, a gun to his head. The trigger slowly squeezes… The next morning, we see William, Marlow's brother, digging a shallow grave. He drops a body into the hole in the ground. It's Jesse's lifeless body… What happened that night? Who are these characters and what were their goals? These are questions that arise in the mind of the reader. A reader must effectively reconstruct the narrative, and infer concepts and events not explicitly read, causal relationships between events, and the goals and motivations of characters (Graesser et al. 1994). Boden (2009) suggests that not all creativity is high art. If this everyday creativity could be computationally harnessed, what could a system do? Systems could employ human-analogous sense-making processes in order to build a model of the world - the real world or a fictional world observed in a book or movie. This model can be used to simulate human responses to stimuli. With respect to everyday creativity in story comprehension, we could "read" stories or "watch" movies and produce cognitive and affective responses equivalent to those of a human audience. Those responses can, in turn, be used to provide feedback to amateur human creators who need assistance with storytelling, by simulating the responses of a human reader/viewer receiving the work for the first time. With that final concept in mind, this paper describes initial steps towards a "synthetic audience." A synthetic audience aims to assist creators by modeling the cognitive processes of recipients of a creative artifact and providing feedback. Feedback from a synthetic audience could be given to the human creator faster and more frequently than feedback from another human source. The synthetic audience must have sufficiently robust ability in everyday creativity. For the purposes of the synthetic audience, we computationally model human creative sense-making processes in the context of story comprehension, focusing on the ability of a system to reconstruct a narrative from the events it reads. Readers/viewers (we will use the media-agnostic term "audience") are actively engaged in cognition when reading/viewing a narrative (Gerrig 1993).
The audience engages in problem solving from the perspective of story world characters, attempts to resolve (intentionally or unintentionally placed) gaps in the narrative (called ellipses), and forecasts future events. The inference processes applied by the audience provide an explanation for what has been observed, but unlike conventional problem-solving, the results of these processes cannot be declared right or wrong. These inference processes can be exploited by authors to enable many of the more interesting cognitive phenomena of storytelling: ellipses, suspense, surprise, and genre expectations, among others. In this paper, we present a component of the synthetic audience: a model builder that constructs a possible mental model for a human reader. The model builder "reads" the story as it is authored by the human creator. After each read event, it builds or revises its mental model by hypothesizing the goals of the story characters using a number of strategies, and reconstructs the story using a narrative generation planner. The model is then a source of feedback for the larger synthetic audience system. Different model structures are indicative of possible comprehension issues, such as changing character goals, diverging storylines, or unmotivated actions by the characters. In the remainder of this paper, we discuss related work in the psychology of narrative comprehension, story understanding, story generation, and creativity support. We then describe our model builder, in the context of a synthetic audience. Finally, we will show that our model builder produces a cognitively plausible mental model of the story-so-far, improving on approaches that do not model human creative processes. Background Reader Inference While reading a story, readers continuously make inferences about aspects of the story that have not been explicitly stated in order to make sense of the narrative (Graesser et al. 1994). Some inferences can be made with little effort while reading, while other types of inference occur only when the audience has been given time to reason. The former group, described as online inferences, includes: • Superordinate goals - Inferring the overarching goal motivating a particular character's actions. • Causal antecedents - Inferring the causal relationship between the current action and information that appeared previously in the text. Conversely, offline inferences, those made when the audience is afforded time to reason, are as follows: • Subordinate goals - Inferring the lesser goals or plan of action used to achieve the current event or state. • Causal consequences - Inferring the effects of the current action. In particular, the online processes drive the creative search for a narrative explanation of observed events. That is, the inference of a character's superordinate goal is a projection of that character's actions into the future, resulting in the construction of narrative structure explaining why a character has performed observed actions. Likewise, inference of causal antecedents results in the construction of narrative structure that fills in the gaps between observed events in order to explain how particular events came to pass. How does one represent a mental model of a story? Graesser and Franklin (1990) developed QUEST to model the human question-answering process as a theory of sense-making. QUEST was demonstrated in the context of story comprehension (Graesser et al. 1991).
Stories are represented as directed graphs, where nodes represent story events, character goals, and world states. Edges represent relationships such as causality or the formation of goals. Traversing the arcs in a QUEST diagram allows one to answer questions, such as what enabled an event to come to pass, or why a character performed an action. Chains of causally-linked goals and events are called goal hierarchies, in which the last goal in the chain is the superordinate goal, the motivating goal for the entire sequence. Figure 1 shows a QUEST structure for a story. Event nodes E3 and E4, and goal nodes G3 and G4 make up a goal hierarchy. The superordinate goal, node G4, is Initiated (I) by the state node E2. That is, because of E2 the hero has the indicated goal. Causality is shown in QUEST by Consequence (C) arcs between event nodes. In the example, E5 occurred as a consequence of both E3 and E4. Goal hierarchies are formed by chains of Reason (R) arcs, indicating subordinate/superordinate relationships between goals. In the example, Goal G3 is subordinate to Goal G4.
Figure 1. A QUEST structure for a story. Nodes represent states (S), character goals (G), and events (E). 1: Villain wants to be powerful. 2: Villain coerces Hero into agreeing to help. 3: Hero robs a bank. 4: Hero gives money to Villain. 5: Villain bribes the president.
Related Work Story Understanding. Sense-making is an example of everyday creativity, a process that humans use on a daily basis to explain the world around them. Sense-making in the context of stories shares many similarities with story understanding, a process by which a computational system extracts knowledge from a narrative text. Mueller (2002) summarizes many of the approaches taken in story understanding, including the application of known scripts and plans, the inclusion of plot units, and connectionist approaches. Other approaches include generating questions while reading and attempting to answer them with subsequent text (Ram 1994). In general, story understanding systems extract knowledge from complete texts, whereas we infer, through abduction, events and character goals in incomplete stories in order to assist amateurs. Creativity Support. There are two general approaches to computational creativity support. The first are tools that assist creators by providing an appropriate interface for complicated creation processes without using AI (e.g., Skorupski et al. 2007). The second approach leverages AI to form a team made of the human creator and the computational tool. These mixed-initiative approaches (e.g., Si et al. 2008) lead to artifacts that have equal contributions from both the human and the AI. With the synthetic audience agent, we aim to leverage AI, as mixed-initiative tools do, without having explicit involvement in the creation of the product. In our approach, which we describe as "computer-as-audience" (Riedl and O'Neill 2009), the agent provides feedback to a human creator from the perspective of the recipient of a creative artifact. Story Generation. Computational approaches to narrative generation typically address the problem of creating content as either a search problem or an adaptation problem. See Gervás (2009) for a history of story generation research. Our system uses elements of both of these approaches, incorporating a case-based reasoner to infer character goals and a search-based story generator to construct narratives that explain the inferred goals.
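To ground the QUEST representation described above, the sketch below encodes part of the Figure 1 story as typed nodes and arcs and recovers a superordinate goal by following Reason arcs upward. The encoding is our own illustrative assumption, not the original QUEST implementation.

```python
# Node types: S = state, G = goal, E = event.  Arc types: I = Initiate,
# C = Consequence, R = Reason (subordinate goal -> superordinate goal).
nodes = {
    "S1": ("S", "Villain wants to be powerful."),
    "E2": ("E", "Villain coerces Hero into agreeing to help."),
    "G3": ("G", "Hero: rob a bank."),
    "E3": ("E", "Hero robs a bank."),
    "G4": ("G", "Hero: give money to Villain."),
    "E4": ("E", "Hero gives money to Villain."),
    "E5": ("E", "Villain bribes the president."),
}
arcs = [
    ("E2", "G4", "I"),   # E2 initiates the Hero's superordinate goal G4
    ("G3", "G4", "R"),   # G3 is subordinate to G4
    ("E3", "E5", "C"),   # E5 occurs as a consequence of E3 ...
    ("E4", "E5", "C"),   # ... and of E4
]

def superordinate(goal, arcs):
    """Follow Reason arcs upward to the motivating goal of a hierarchy."""
    for src, dst, kind in arcs:
        if kind == "R" and src == goal:
            return superordinate(dst, arcs)
    return goal

print(superordinate("G3", arcs))  # -> G4, as in the example above
```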
We utilize the IPOCL narrative generation algorithm (Riedl and Young 2010), a refinement search approach to constructing novel narrative structures that are both causally coherent and believable, as part of our system. IPOCL requires character actions to be justified both causally and by motivations and intentions. Specifically, it utilizes special data structures called frames of commitment to enforce the constraint that all events must be motivated by some preceding event. That is, every frame, representing a character goal, must be caused by some event as a means of explaining why that character has the goal in question. As a refinement search process, IPOCL works backwards from a goal state, using causal and intentional requirements to guide the selection and instantiation of new events. Figure 2 shows the frames and events of an IPOCL plan, describing the same story as the QUEST diagram in Figure 1. Events are rectangles, while rounded boxes represent frames of commitment. Solid lines between events indicate a causal relationship. Dashed lines indicate that the event was carried out in service of that frame of commitment - as part of the character's attempt at achieving its goal. IPOCL shares similarities with the above psychological theories of narrative comprehension. IPOCL narrative plans can be converted to QUEST structures, and vice versa. Christian and Young (2004) present an algorithm for converting partial-order plans into QUEST structures. This algorithm has since been updated to handle IPOCL plans (Riedl and Young 2010), specifically translating frames of commitment into goal hierarchies. Synthetic Audience The goal of a synthetic audience is to provide an amateur storywriter with feedback from the perspective of a recipient. That is, the synthetic audience "reads" the story as it is being written and produces cognitive and emotive responses based on theories of human story comprehension. A synthetic audience is able to provide feedback faster, and with greater frequency, than a human critic. In order to provide such feedback, it is necessary to model human responses to creative artifacts. When working with stories as they are being written, the system must derive a mental model of the story-in-progress based solely on what has been authored. The synthetic audience has to make inferences about events that are missing from the story, the causal relationships between events, and character goals. These inferences are comparable to human gap-filling and sense-making processes, both of which are carried out during reading comprehension. Thus, the synthetic audience system derives a mental model of the story as it is written, based on recognized human creative processes. A storywriter using the synthetic audience authors states - facts and descriptions - and events by selecting event templates from a list of options. These templates allow the author to fill in the specifics of the state or event, such as people, locations, or objectives. Authors can add states or events at any point in the story, regardless of chronology. The synthetic audience continually re-reads the story as it is authored, constructing a model from the audience's perspective, and uses that model to provide feedback. When there is feedback from the synthetic audience, it is displayed to the author in a non-intrusive manner; the author may respond to the feedback or ignore it. Knowledge Representation The synthetic audience requires knowledge about the semantics of events in order to make inferences.
That is, the synthetic audience requires a domain theory - a description of how the story can change. We use a domain theory composed of STRIPS-like event templates that provide information about the preconditions and effects of any event the human authors. This is the same representation used by IPOCL. Because we employ a narrative planner, we also require every story to have a precondition. The user declares the facts of the initial state, as expository states where it makes sense to do so. If the initial state is incomplete, the Synthetic Audience uses a special operator, Assert, that causes facts to be inferred as true in the initial state, as a last resort to avoid failure. The use of the Assert operator to modify the initial state is equivalent to the technique described by Riedl and Young (2006). Model Builder Algorithm The core of the synthetic audience is the Model Builder. The purpose of the Model Builder is to construct a possible mental model of a human audience, revising this mental model as each event is "read" in chronological order. The model is constructed by hypothesizing the goals of each character in the story, and constructing a narrative that explains what has been read. Once the story is read, the model is used to generate feedback in the form of critique. The Model Builder starts by reading the newest event. If the newest event is not at the end of the story, the Model Builder rewinds the construction process to the latest point before the newly authored event and processes the remainder of the story as if encountered for the first time.
Figure 2. An IPOCL narrative plan corresponding to the QUEST structure from Figure 1.
The search for superordinate character goals drives the Model Building process because of their importance in story comprehension and sense-making (Graesser et al. 1994). The Model Builder uses four strategies to hypothesize the superordinate goal for each character that is actively engaged in the current event. When characters are not actively engaged, previously inferred goals for those characters are retained from earlier iterations. Character Goal Inference Strategies. The model builder uses the following four strategies, in the order given, to infer goals for characters actively engaged in the event. 1. Declared Goals (D). The Declared Goals strategy hypothesizes goals that are explicitly declared in the new event for other characters. For example, if a character states its intention, then that goal is accepted at face value. Likewise, if a character instructs a subordinate character to do something, that goal is accepted for the latter character. 2. Existing Goals (E). The Existing Goals strategy tracks goals that were hypothesized at an earlier point and remain unresolved based on the authored story. This strategy merely tries to place the new event into the hypothesized mental model from the previous iteration. 3. Proposed Goals (P). The Proposed Goals strategy uses a case-based goal recognizer to infer character goals, based on that character's existing goal hierarchies. The recognizer is given a QUEST model containing only the acting character's goal hierarchy, contextual state nodes, and the most recently added event.
The recognizer searches its case library for a QUEST model of a story with a chain of events similar to those in the given event and hierarchy. For the best match, the recognizer returns the goal at the top of the relevant goal hierarchy. 4. Top-of-Hierarchy (T). The final strategy is the Top-of-Hierarchy strategy, which assumes that the most recently authored event is the goal of the characters involved. Top-of-Hierarchy is a last-resort strategy that is tantamount to "wait and see what happens next." The name of the strategy refers to the notion that the new event is the top of a QUEST goal hierarchy. When more than one character is actively engaged in an event, the goal inference process is applied to each character, one at a time, in arbitrary order. The hypothesized goals and authored events are given to the IPOCL planner in order to test the goal hypothesis by generating a narrative sequence that explains the goals. Testing the Hypothesis. Once the model builder has identified the goals of the characters, it tests the hypothesis by generating a narrative that links together all authored events to the hypothesized goals of the characters. This is achieved as follows. First, an instance of an IPOCL plan is created by instantiating every authored event in the QUEST model. Narratives generated during the prior iteration may have events that were generated but not written by the human author; these events are discarded. Temporal constraints enforce chronological ordering of authored events. The model builder constructs a goal situation consisting of newly hypothesized goals for the current character as well as un-realized goals for other characters carried over from prior iterations. Additionally, a frame of commitment is created for each un-realized hypothesized character goal across all characters. A modified IPOCL planner is instructed to satisfy the preconditions for each event and each proposition of the goal situation, and to find a motivating event for each frame. As a refinement search algorithm, IPOCL takes a plan in any stage of completeness, finds a flaw - a reason why the plan is not complete - and resolves it. In the process, other flaws may be created, resulting in an iterative process. In this case, unresolved preconditions are solved by instantiating a new event, reusing an existing event, or by the initial world state. If IPOCL fails to find a plan, or if a plan is found that does not link all events in causal chains terminating with character goals, the current hypothesis is rejected and the model builder tries the next strategy. We modified the IPOCL algorithm as follows. First, we bound the search depth to approximate cognitive limitations. Second, we provide a heuristic that strongly prefers to reuse authored events. Third, we add the Assert operator described above to declare unstated facts to be part of the initial state. Finally, we provide a special event, Decide, which has no preconditions and has the effect of giving a character an intention. Decide is equivalent to the system admitting that it does not know why a character performed an action without failing the search. The Model Builder heuristic highly penalizes the inclusion of Decide events in the narrative, thus relegating its use to a last resort. When the hypothesis is accepted, the plan generated by IPOCL is converted to a QUEST model using the previously mentioned IPOCL-to-QUEST algorithm.
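The control loop just described can be compressed into a short sketch. The event encoding, the recognizer, and the planner below are stand-in stubs for the case-based goal recognizer and the modified IPOCL planner; only the strategy ordering (D, E, P, T) and the accept-the-first-surviving-hypothesis logic follow the description above.

```python
def candidate_goals(actor, event, model, recognizer):
    """Strategies in order: Declared, Existing, Proposed, Top-of-Hierarchy."""
    if actor in event.get("declared", {}):
        yield event["declared"][actor]      # D: goal stated outright
    if actor in model["goals"]:
        yield model["goals"][actor]         # E: carry over an open goal
    proposed = recognizer(model, actor, event)
    if proposed is not None:
        yield proposed                      # P: case-based guess
    yield event["name"]                     # T: the new event itself

def build_model(events, recognizer, planner):
    """Revise per-character goal hypotheses as each event is read.

    recognizer(model, actor, event) -> goal or None (strategy P stub);
    planner(read_events, goals) -> plan or None (IPOCL stand-in: returns
    None to reject a hypothesis, anything else to accept it).
    """
    model = {"goals": {}, "read": []}
    for event in events:
        model["read"].append(event)
        for actor in event["actors"]:
            for goal in candidate_goals(actor, event, model, recognizer):
                plan = planner(model["read"], {**model["goals"], actor: goal})
                if plan is not None:        # hypothesis survived testing
                    model["goals"][actor] = goal
                    break                   # stop trying weaker strategies
    return model

# Demo with trivial stubs: no case library, a planner that accepts anything.
events = [{"name": "order(K, A, get-lamp)", "actors": ["K", "A"],
           "declared": {"A": "king-has-lamp"}}]
print(build_model(events, lambda m, a, e: None,
                  lambda ev, g: object())["goals"])
# -> {'K': 'order(K, A, get-lamp)', 'A': 'king-has-lamp'}
```

Characters with Multiple Goals.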
It is possible for characters to pursue multiple goals throughout the course of a story. If goals hypothesized by the model builder using the P or T strategies are not accepted, then the Model Builder attempts to create a new goal hierarchy with any events that were not linked. The hypothesis is retested using the above technique. Synthetic Audience Feedback The synthetic audience generates feedback based on the QUEST model that resulted from hypothesis testing. Various QUEST structures indicate potential comprehension problems, including: • Diverging storylines - Implied by disjoint causal chains. • Unmotivated goals - Indicated by the need to use Decide events to construct missing Initiates arcs. • Unexplained events or motivations - Indicated by the use of Assert operators to modify the initial state because missing information must be inferred. • Sudden shifts in model - Indicated by a sudden change in goals hypothesized by the Model Builder from the reading of one event to the next. If any potential mental model indicates possible reader comprehension issues, then this is feedback that we would aim to provide to the creator. The mental model constructed by the Model Builder is only one possible model. However, this single model remains useful in the context of a larger synthetic audience system, as it can be instantiated with different domain theories and background knowledge to explore a variety of audiences. Example Consider the following scenario, selected to illustrate the goal inference strategies, in which a human author inputs a story with gaps. The story involves Aladdin, Jasmine, a genie, and a king. For purposes of illustration of the Model Builder, we will assume that the author has declared the characters up front, and provided various additional facts about the story world. The facts include: the genie is trapped in a magic lamp; a dragon possesses the lamp; and the king hates and fears the genie. The author writes the first few events: 1. The King orders Aladdin to retrieve the magic lamp. 2. Aladdin travels from the palace to the mountains. 3. Aladdin gives the magic lamp to the King. The Synthetic Audience "reads" the events in order. The first event, Order, involves multiple characters. It arbitrarily chooses to attempt to infer Aladdin's goals first. Using the Declared Goal (D) strategy, and based on the semantics of the order event, it hypothesizes that the orderee - Aladdin - will adopt the given goal: that the King should have the magic lamp. Next the Model Builder processes the King. Strategies D and E are not applicable as the event does not declare a goal for the King, nor is there a prior hypothesis about his goal. Invoking the Proposed Goal (P) strategy, the case-based goal recognizer hypothesizes that the King's goal could be to kill the Genie. This is because our case-base includes a story in which one character hires another to kill someone he hates. Hypothesis testing produces a plausible narrative in which Aladdin slays the dragon, takes the lamp, and gives it to the King who destroys it, killing the Genie. The resultant narrative is converted to a QUEST model and stored for later reference. The Model Builder processes the second and third events involving Aladdin traveling to the mountains and then giving the lamp to the King. In both cases, the E strategy verifies that this is consistent with the previously hypothesized goal for Aladdin. The Model Builder infers that the dragon lives in the mountains.
In each case, the King's goal is retained from the first iteration because he is not an active character in either of the second or third events. Figure 3(a) shows the QUEST structure of the model constructed after the first three events. Now suppose that the author adds one last event: 4. The King commands the Genie to make Jasmine love him. The Model Builder arbitrarily decides to process the King first. The D strategy does not apply. The model builder attempts the (E) strategy, re-using the King's goal of killing the Genie. However, it cannot find any plausible narrative in which the King commands the Genie as part of a goal hierarchy resulting in the Genie's death. Therefore, the (E) strategy fails. The model builder again tries the (P) strategy. The case-based goal recognizer hypothesizes that the King's goal could be to marry Jasmine, replacing the earlier hypothesized goal. The model builder processes the Genie's involvement in event 4, and using the (D) strategy, determines that the Genie will adopt his given goal: to make Jasmine love the King. Hypothesis testing produces a narrative in which the King falls in love with Jasmine, and sends Aladdin to retrieve the lamp. The Genie, under the influence of the King, casts a love spell on Jasmine. Finally, Jasmine and the King get married. Figure 3(b) shows the resulting QUEST model.
Figure 3. QUEST models of the example story: (a) the model after three events; (b) the model after four events. Numbers correspond to numbered events in the text. Nodes with dashed lines were inferred during narrative reconstruction.
Discussion The Synthetic Audience is a cognitively plausible process for computationally constructing a model of the story-so-far. The system employs the everyday creativity of inference and future prediction. Graesser et al. (1994) are vague about the exact inference process used by human readers, proposing spreading activation; we assert that IPOCL, which reasons over representations that are analogous to QUEST structures, is a plausible substitution. For any domain in which there are gaps in the events that can be observed, it is potentially non-trivial to create a well-formed sense-making model in which the relationships between adjacent and non-adjacent events are found. This is true of observations of the noisy world around us, and also true of stories authored by amateur storywriters. Searching for the connections is a creative act because a complete narrative explanation is created as a by-product. We believe that the approaches taken by the Model Builder are applicable in many domains, so long as the planner and case-based reasoner contain appropriate domain knowledge. Our approach works with stories that have (a) strong causal relationships (e.g., few non-sequiturs or random events), and (b) highly goal-driven character behavior - characters have a few top-level goals that are motivated by prior world events and not arbitrarily adopted. While not appropriate for all genres of story, these properties are common enough in popular mass-consumption stories and video games. Is it enough just to employ a story planner or other search process to fill gaps?
If one were to employ a story generator such as IPOCL after each event read, one would fill gaps between events; this is equivalent to the Model Builder's Top-of-Hierarchy strategy. However, such an approach would miss opportunities. First, any such model would not be representative of human models because the inference of superordinate character goals is one of the foremost online processes of an active reader. There are some events that are rarely ever superordinate goals, and thus rarely ever tops of goal hierarchies. Second, inference of superordinate character goals is a form of future prediction. By looking into the future and tracking back, one can often find connections between seemingly disparate causal chains; well-formed stories frequently tie plotlines together and human readers expect it. Of course, how closely the Model Builder matches human audience performance depends on the case library. Third, and most significantly, the Model Builder constructs the sense-making model incrementally. One could wait until the story is complete, in which case a naïve gap-filler and the Model Builder would likely produce the same result. By reading the story one event at a time and building the model incrementally, the Synthetic Audience can trace changes in the model over the course of the story, thus providing feedback about surprises, suspense, and other cognitive and emotive effects on audiences that result from drastic revisions of the model. We conclude that the Model Builder's character goal inference strategies are critical. The D and P strategies provide superordinate goals that drive the creative explanation process. The E strategy provides continuity. The T strategy is a "catch-all" when all the audience can do is wait and see. The synthetic audience models human everyday creative processes. The synthetic audience reconstructs the narrative, making inferences about event causality and character goals, the same kinds of inferences that human readers make while reading a story. The synthetic audience performs incrementally, revising the model as the story is authored, rather than comprehending only complete stories, as typical story understanding systems do. This model of human everyday creative processes can be applied to recipients of creative artifacts, allowing feedback to be offered to creators from the audience perspective. 2011_19 !2011 Towards MCTS for Creative Domains Cameron Browne Computational Creativity Group Imperial College London 180 Queens Gate, SW7 2RH, UK camb@doc.ic.ac.uk Abstract Monte Carlo Tree Search (MCTS) has recently demonstrated considerable success for computer Go and other difficult AI problems. We present a general MCTS model that extends its application from searching for optimal actions in games and combinatorial optimisation tasks to the search for optimal sequences and embedded subtrees. The primary application of this extended MCTS model will be for creative domains, as it maps naturally to a range of procedural content generation tasks for which Markovian or evolutionary approaches would typically be used. Introduction Ludi is a system for automatically generating and evaluating board games modelled as rule trees (Browne, 2008). New artefacts are created by evolving existing rule trees and measuring the results for quality through self-play.
Although this process proved successful by creating a game of notable quality that is now commercially published (Andres, 2009), it also highlighted some problems with the evolutionary approach for game design: Wastage: Thousands of bad games were generated for every good one. Focus: Creativity only became evident when introns (flawed rules) were allowed to proliferate and breed. Bias: The choice of initial population biased the output, and initial individuals that were not themselves well-formed would likely not produce any playable children at all. Due to the random nature of crossover and mutation, there is no guarantee that the evolutionary process will converge to an optimal result. Might there be a better way? Monte Carlo Tree Search (MCTS) has revolutionised computer Go and is now a cornerstone of the strongest AI players (Coulom 2006). It works by running large numbers of random simulations and systematically building a search tree from the results. It has produced world champion AI players for Go, Hex, and General Game Playing, and unofficial world champions for a number of other games. An attractive feature of MCTS is its generality. It can be applied to almost any domain that can be phrased in terms of states and actions that apply to those states, and has been applied to optimisation tasks other than move planning in games, such as workforce scheduling, power grid control, economic modelling, and so on. MCTS is also: Aheuristic: No heuristic domain knowledge is required. Asymmetric: The search adapts to fit the search space. Convergent: The search converges to optimal solutions. MCTS systematically explores a given search space by preferring high-reward choices while guaranteeing the (eventual) exploration of low-reward options, and only requires a fitness function for completed artefacts to operate. This makes it an attractive proposition for procedural content generation in creative domains; however, such problems tend to be more complex than simple {state, action} pairs. They are typically modelled as sequences, grammars, rule systems, expression trees, and so on, which are outside the scope of the standard MCTS algorithm. We propose a generalisation of the MCTS algorithm and its extension from the search for optimal actions to the search for optimal sequences and subtrees. This should have direct applicability to procedural content generation in game design and other creative domains, where it might augment or even provide an alternative to existing methods for creating new high-quality artefacts. MCTS Figure 1, from Chaslot et al. (2006), shows the four basic steps of the MCTS algorithm. Each node represents a state s and each edge represents an action a that leads to an updated state s'. Each node maintains a record of its estimated value, number of visits, and a list of child actions. The algorithm repeats the following process: starting at the root node R, descend through the tree (choosing the optimal action with each step) until a leaf node L is reached. Then, expand the tree by adding a new node N, complete the game by random simulation, and backpropagate the result up the list of selected nodes.
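These four steps map directly onto code. The sketch below is a generic, single-agent form of the loop over an assumed toy domain (the domain functions at the top are placeholders, not any cited system); adversarial domains would negate the backpropagated result per ply, as discussed later, and selection uses the UCB rule described next.

```python
import math, random

# --- Toy single-agent domain (stand-in for a real game or design task) ---
def legal_actions(state): return ["a", "b"] if len(state) < 6 else []
def apply_action(state, a): return state + a
def is_terminal(state): return len(state) == 6
def reward(state): return 2 * state.count("a") / 6 - 1  # in [-1, 1]

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0
        self.untried = legal_actions(state)

def ucb1(child, parent_visits):
    """Mean value plus exploration term (the UCB rule discussed below)."""
    return (child.value / child.visits
            + math.sqrt(2 * math.log(parent_visits) / child.visits))

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = max(node.children, key=lambda c: ucb1(c, node.visits))
        # 2. Expansion: add one new child if any untried action remains.
        if node.untried:
            a = node.untried.pop(random.randrange(len(node.untried)))
            node.children.append(Node(apply_action(node.state, a), node, a))
            node = node.children[-1]
        # 3. Simulation: random playout from this node to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = apply_action(state, random.choice(legal_actions(state)))
        result = reward(state)
        # 4. Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits, node.value = node.visits + 1, node.value + result
            node = node.parent
    return max(root.children, key=lambda c: c.visits).action

print(mcts(""))  # -> almost certainly "a"
```

UCB The key to the algorithm's success lies in the method it uses to select optimal actions from among lists of those available during tree descent.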
A variation of the Upper Confidence Bounds (UCB) method (Auer et al., 2002) is typically used to select the node that maximises

\[ \bar{X}_i + \sqrt{\frac{2 \ln n}{n_i}} , \]

where Xi is the estimated (mean) value of child i, ni is the number of times child i has been visited, and n is the number of times the node itself has been visited.
Figure 1. The four basic steps of the MCTS algorithm.
UCB provides a good balance between the exploitation of estimated node values and the exploration of the search space, so that even low-value nodes are occasionally exercised to increase the reliability of their value estimates. Kocsis and Szepesvari first proposed the use of UCB in an MCTS setting with their UCT ("UCB applied to Trees") method in 2006, and this is the specific embodiment of MCTS used in most current applications. General MCTS Model The standard application of MCTS is to find optimal moves in zero-sum, two-player games with alternating moves and fixed turn order, such as Go (Gelly et al., 2006). While it has been applied to more general problems, such as general game playing (Bjornsson and Finnsson, 2009) and some combinatorial optimisation problems, the algorithm is specifically adapted for such different domains. We now consider ways to simplify some underlying assumptions to generalise the algorithm and more cleanly separate it from its given application domain. Zero-Sum MCTS is often used to model zero-sum games with a win given a discrete value of +1 and a loss -1 (draws are worth 0). We relax this assumption so that the algorithm works with continuous simulation results in the range [-1..1], where the extremes represent ideals that may never actually be realised. This generalises to domains in which individuals are measured by a fitness function rather than discrete win/lose/draw classifications. Two Players The algorithm often models two adversarial opponents, i.e. players competing for opposing rewards, which has implications for the backpropagation stage. In traditional game search terms, the instigator of the search (MAX) will try to maximise their reward while the opponent (MIN) will try to minimise this reward, resulting in a minimax tree. If a simulated playout yields a result of +1 for the current player, then the value backpropagated through the tree will be negated with each search ply: +1, -1, +1, -1, etc. Multiplayer games (i.e. those with more than two players) can be modelled using a paranoid approach in which each player simply assumes that all other players are acting against them. This removes complicating aspects of coalitions and effectively reduces N-player games to a 2-player model, but can give good results (Sturtevant, 2002). Playout results for, say, three players using the paranoid model would be negated in cycles of three during backpropagation: +1, -1, -1, +1, -1, -1, etc. Single-player games, e.g. solitaire puzzles, are cooperative as there is only one player who will generally not seek to sabotage their own moves. Such games may return boolean results indicating success (puzzle solved) and failure (dead end), or continuous reward functions that indicate distance from a desired perfect solution. The puzzle environment may respond to player moves; if these responses are deterministic then they are simply part of the state update following each action, otherwise if the environment's responses are intelligent and adversarial then the puzzle is actually a 2-player game. Move Order Not all games have alternating moves; some have variable play order or composite multi-part moves.
Consider a hypothetical Go variant in which the player who surrounds an enemy group need not remove all surrounded pieces but may elect which, if any, to remove. If a group of, say, 20 pieces is surrounded, the mover has 2^20 = 1,048,576 possible ways to remove a subset of 0 to 20 pieces. It would be ridiculous to list all of these choices for a given node, so instead a more practical option is to treat these optional removals as a multipart move, i.e. a variable-length sequence of single-piece removals. This approach may be strategically dubious as each sub-move is considered in isolation rather than as part of the greater move, but it is the only practical solution in many cases and usually proves sufficient (Schmidt, 2010). Such multipart moves are another consideration that must be taken into account during the backpropagation stage, as simulation results must then be negated across variable ply numbers to reward/punish the players correctly.
Figure 2. Example solutions for the three search types: action, sequence and subtree.
Generalising MCTS The solution to the limitations described above is relative node ownership. Each node in the search tree is assigned an owner - typically the player to move - and is updated during the backpropagation step according to the simulation result relative to its owner. Instead of returning a single value indicating the result of each simulation, the domain produces a vector of values indicating the result relative to each player. This removes underlying assumptions regarding player number, move order, move length and distinctions between adversarial versus cooperative modes of play, allowing a more general MCTS model that paves the way for the following extensions. Extended MCTS Model We now extend the general MCTS model from searching for optimal actions to searching for optimal sequences and embedded subtrees, such as those typically used in procedural content generation and computational creativity tasks. MCTS Sequence Search Figure 2 (left) shows the result of a standard MCTS search, which is the highest-valued root child action. As the basic operation of the algorithm is to complete a sequence of actions each iteration, it is straightforward to make these sequences the target of the search (Figure 2, middle). This can be achieved simply by keeping at all times a record of the best sequence so far including any random playouts (a pointer to the sequence's tail is sufficient), and using as the reward value for each sequence an estimate of its fitness. For solitaire puzzles this fitness value may involve distance to solution, or for more creative applications such as music generation, the fitness function may involve aesthetic measurement of passages of notes. As per standard MCTS, a sequence is run to completion per iteration, and its value backpropagated through the selected nodes. Each sequence may be completed within the search tree or may cross the tree boundary (as shown in Figure 4), hence it is possible that the best sequence could be at least partially randomly completed. As with action search, the root node is not part of the completed sequence but merely defines the list of possible starting points for the search.
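The bookkeeping this requires is modest, as the following sketch shows; for clarity it replaces the UCB-guided tree descent with purely random completion, so only the record-keeping of the best sequence (including any randomly completed tail) and its fitness is illustrated. The domain hooks and the fitness function are assumptions.

```python
import random

def mcts_sequence_search(root_state, fitness, legal, step, done, iters=1000):
    """Return the best complete action sequence found over all iterations.

    fitness(state) -> float in [-1, 1]; legal/step/done are domain hooks.
    (Tree statistics are omitted; see the MCTS loop sketched earlier.)
    """
    best_seq, best_fit = None, float("-inf")
    for _ in range(iters):
        state, seq = root_state, []
        # In full MCTS the prefix would follow UCB through the tree;
        # here every step is random to keep the bookkeeping visible.
        while not done(state):
            action = random.choice(legal(state))
            seq.append(action)
            state = step(state, action)
        fit = fitness(state)          # reward for backpropagation
        if fit > best_fit:            # keep a record of the best sequence
            best_seq, best_fit = list(seq), fit
    return best_seq, best_fit

# Toy usage: build a 5-letter string maximising the count of "a".
done = lambda s: len(s) == 5
legal = lambda s: list("ab")
step = lambda s, a: s + a
fitness = lambda s: s.count("a") / 5.0
print(mcts_sequence_search("", fitness, legal, step, done))
```

MCTS Subtree Search The second extension of the algorithm - from sequence search to subtree search - is complicated by the polyadic (multi-argument) nature of the problem. Rather than each state s having a single action a applied, each state may now have N actions or arguments that are simultaneously applied.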
For example, the search target may be an expression tree with nodes that contain multiple arguments. These search-target subtrees should not be confused with the MCTS search tree itself; we distinguish between the search tree and the solution subtrees that are embedded within it. The search tree represents the possible solution space while solution subtrees represent actual realisations of those possibilities. As per the standard MCTS approach, a subtree is completed with each iteration, evaluated, and its value backpropagated through the selected nodes. The subtree may be entirely embedded within the search tree or it may cross the search tree boundary at one or more points, in which case each open branch is randomly simulated to completion (Figure 2, right). A record of the best subtree must be kept at all times; a list of its leaf nodes is sufficient to reconstruct the subtree. It is possible that one or more branches of the best subtree could be randomly completed. Polyadic Strategies In order to successfully extend the MCTS method to subtree search, we propose a number of strategies for handling the selection of arguments for polyadic nodes. Dependencies between such arguments - and indeed between nodes and even subtrees - become important in this context. For example, Figures 3 and 4 show two hypothetical games described as rule trees, such as those that might be generated by the Ludi system (Browne, 2008). In "Kill the Knights" players take turns moving one of their pieces in a knight move and win by capturing three enemy pieces. In "Pin the Knights" pieces instead pin enemy pieces that they land upon and a player wins by forming a stack three high. While neither game is a masterpiece, "Pin the Knights" may be the more interesting of the two. There is some tension between the benefit of pinning enemy pieces and the danger of providing height-2 stacks that the opponent might exploit to win the game.
Figure 3. Rule set for "Kill the Knights".
The rule differences between the two games (replace/pin and score/stack) are dependent, as changing either one in isolation will break the game by making it unwinnable; both changes must occur simultaneously for the modified game to work. In terms of these two rules, each game is in a local maximum that can only be escaped by modifying both rules simultaneously. This example demonstrates that superior results can be achieved if related degrees of freedom in the content can be identified and adjusted in tandem. It is not guaranteed that an evolutionary method would ever perform such dependent rule changes simultaneously, whereas MCTS performs a more systematic exploration of the search space due to its well-balanced exploration component, and mechanisms may be added for detecting and exploiting such dependencies. We distinguish between independent and dependent node selection in subtree search, and propose polyadic strategies for each case in the following sections. Independent Node Selection The simplest strategies for optimally completing polyadic subtrees during MCTS search ignore node dependencies, so node choices made in one part of the tree will not affect node choices in other parts of the tree during tree descent or backpropagation. We now present two such strategies. Direct Choice The first case to consider is the most obvious; when presented with a polyadic node, simply choose for each argument the action selected from its available choices by UCB.
Each selection is made independently of the other arguments and all other nodes in the tree. For example, Figure 5 shows a polyadic node pi with three arguments (branches) to be populated with actions. In the direct method, the action for argument a will be selected from the set {a1, a2, ..., an}, the action for argument b will be selected from the set {b1, b2, ..., bn}, and the action for argument c will be selected from the set {c1, c2, ..., cn}. For each iteration of the search, a subtree is completed in this manner, measured for fitness, and the result backpropagated through the selected subtree's nodes. The result of the search will be the completed subtree with the highest reward value.
Figure 4. Rule set for "Pin the Knights".
Embryonic Development Another approach inspired by genetic programming methods for circuit design (Koza et al., 2001) is to maintain a single "embryonic" individual that is modified over the course of the search. An embryonic tree is created, then for each iteration a non-branching sequence is followed through it, simulated to completion and the result backpropagated through the sequence, while unvisited nodes outside the active sequence remain frozen for that iteration. Figure 6 summarises the process for a given polyadic node pe visited along the active sequence during tree descent. One argument is chosen from those available, in this case b, which will be the best action of the worst-performing argument. The result of the search will be a copy of the embryonic tree taken on the iteration at which it achieved its highest estimated value.
Figure 5. Direct approach to argument completion.
Figure 6. Embryonic approach to argument completion.
Dependent Node Selection The following approaches for completing polyadic subtrees preserve node dependencies across the tree. In Johnson-Laird's nomenclature (2002), these node-dependent approaches will be more Lamarckian than Darwinian in nature, as individuals actively improve themselves in a systematic way. This list is not complete, but describes some candidate strategies that we plan to investigate in depth. Compound Arguments Figure 7 shows a method that correlates the available actions in all arguments by compounding them into a single list of all possible combinations. For example, if the action selected by UCB is a3b1c4 then action a3 is chosen for argument a, action b1 is chosen for argument b, and action c4 is chosen for argument c. This approach provides a minimal degree of node dependence by correlating the actions of sibling arguments.
Figure 7. Compound approach to argument completion.
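In code, the compounding step is a cross-product over the sibling arguments' action sets, with statistics kept per combination rather than per argument. The sketch below is our own illustration (the statistics table and value estimates are assumed); note that the candidate list grows as the product of the argument branching factors, which is the price paid for the added dependence.

```python
import itertools, math

def compound_select(arg_actions, stats, total_visits, c=math.sqrt(2)):
    """Pick one action per argument, treating each combination as a unit.

    arg_actions: dict of action lists, e.g. {"a": [...], "b": [...], ...}.
    stats: {combination: (visits, mean_value)} accumulated during search.
    """
    best_combo, best_ucb = None, float("-inf")
    for combo in itertools.product(*arg_actions.values()):
        visits, mean = stats.get(combo, (0, 0.0))
        if visits == 0:                    # explore unvisited combos first
            return dict(zip(arg_actions, combo))
        ucb = mean + c * math.sqrt(math.log(total_visits) / visits)
        if ucb > best_ucb:
            best_combo, best_ucb = combo, ucb
    return dict(zip(arg_actions, best_combo))

# With empty statistics the first combination is chosen for exploration:
acts = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1", "c4"]}
print(compound_select(acts, {}, total_visits=1))
# -> {'a': 'a1', 'b': 'b1', 'c': 'c1'}
```

Boundary Correlation Figure 8 shows a subtree being constructed during a search in progress, with five arguments labelled a to e waiting to be completed along its boundary. Firstly, the most urgent argument is chosen using UCB to select among boundary updates possible from this state, then an action is selected for that argument. Separate UCT statistics are maintained for each possible combination of {argument, action} pairs, as indicated by the arrows in Figure 8. For example, if the actions available to argument a are listed {a1, a2, ..., an} and so on, then statistics will be maintained for combinations a1b1, a1c1, a1d1, a1e1, a1b2, etc. The action with the best average UCB performance over all member combinations, in conjunction with the action's own value, is selected for the chosen argument, the boundary is updated and the process continues.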
Boundary Correlation

Figure 8 shows a subtree being constructed during a search in progress, with five arguments labelled a to e waiting to be completed along its boundary. Firstly, the most urgent argument is chosen using UCB to select among the boundary updates possible from this state, then an action is selected for that argument. Separate UCT statistics are maintained for each possible combination of {argument, action} pairs, as indicated by the arrows in Figure 8. For example, if the actions available to argument a are listed {a1, a2, ..., an} and so on, then statistics will be maintained for combinations a1b1, a1c1, a1d1, a1e1, a1b2, etc. The action with the best average UCB performance over all member combinations, in conjunction with the action's own value, is selected for the chosen argument; the boundary is then updated and the process continues. Once the subtree is completed and evaluated, the backpropagation stage must update not only the values of all visited nodes but also all argument correlations for all boundaries (i.e. rooted subtrees) contained within the solution tree. The result of the search will be the solution tree with the highest reward value.

Figure 8. Boundary correlation of arguments.

Subtree Correlation

Subtree correlation is similar in principle to boundary correlation, except that instead of storing subtree boundaries with associated argument combinations, the stored subtrees are themselves associated. During subtree construction, each node is then completed by the associated subtree with the highest UCB value. This approach is comparable to Contextual MCTS (Rimmel and Teytaud, 2010), which partitions the search space into a number of "tiles". Techniques for transposition tables could benefit this approach (Childs et al., 2008). The result should be that rule subsets found to work well together, even from disparate parts of the search tree, will occur more often in future solutions. Such correspondence is not guaranteed for crossover in evolutionary strategies. The statistics thus accumulated could yield useful insights into good rule combinations, as well as indicating individual rules that are more successful. For example, game inventors may be interested to know which subsets of game rules work harmoniously together and warrant further investigation, even if the actual games produced by the system are not of great interest in themselves.

Tree-to-Sequence Conversion

Other approaches suggested by combinatorial mathematics include the conversion of trees to sequences, which would then allow MCTS sequence search. This may be achieved using the Prüfer sequence of each tree (Prüfer, 1918) or Euler tours of their leaf nodes (Bender and Farach-Colton, 2000).
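For concreteness, a Prüfer encoding for labelled trees can be sketched as follows; the parent-map representation, with the root given the largest label so that it is never removed, is an assumption made for brevity:

```python
import heapq

def prufer_sequence(parent):
    """Prüfer sequence of a labelled tree on nodes 0..n-1.

    `parent` maps each non-root node to its parent; the root is node n-1.
    Repeatedly remove the smallest-labelled leaf, recording its neighbour.
    """
    n = len(parent) + 1
    degree = [0] * n
    for child, p in parent.items():
        degree[child] += 1
        degree[p] += 1
    leaves = [v for v in parent if degree[v] == 1]  # root excluded
    heapq.heapify(leaves)
    seq = []
    while len(seq) < n - 2:
        leaf = heapq.heappop(leaves)
        p = parent[leaf]
        seq.append(p)
        degree[p] -= 1
        if degree[p] == 1 and p in parent:
            heapq.heappush(leaves, p)
    return seq

# e.g. the path 0-1-2-3 rooted at 3:
# prufer_sequence({0: 1, 1: 2, 2: 3}) == [1, 2]
```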
Application Domains

The extended model provides a framework for applying MCTS to PCG in creative domains. The proposed techniques will be tested through their application to creative tasks such as those listed below, which can be divided into two broad categories: sequence-based and tree-based.

Sequence-Based Domains

Artefact creation in sequence-based domains is typically achieved using Markovian approaches, which analyse the recent history of a problem to predict the next step that maximises some reward. Can MCTS-based methods add value to this process by predicting entire sequences?

Word Play

Pseudowords (non-words that sound plausible within a given language) are useful for a variety of tasks: word games, poetry, psycholinguistic experiments, and the creation of unique but memorable usernames, passwords, domain names, CAPTCHAs, and so on. Pseudowords are typically constructed using Markovian methods based on the distribution of letter or syllable combinations within the target language (Keuleers and Brysbaert, 2010). However, the sequential process of word construction maps neatly to the extended MCTS approach.

Music Synthesis

The Continuator (Pachet, 2004) is a system that records passages played by musicians and produces continuations in a similar style, operating in real time using a Markovian model combined probabilistically with a fitness function.

Tree-Based Domains

Artefact creation in tree-based domains is typically achieved using evolutionary methods. We will investigate the use of the extended MCTS model for PCG instead.

Game Design

We are developing a general game system called Mogal (Modular Game Library) in which users - or the AI - may mix and match rule modules to define new games, which can then be measured for quality through self-play. Mogal will be used to compare the performance of extended MCTS subtree search for the generation of new rule trees (and the optimisation of existing ones) against standard evolutionary approaches.

Visual Art

There are many applications of rule-based systems for the automated generation of visual art, including expression trees, L-systems, context-free grammars, and so on, most of which use evolutionary approaches (Romero and Machado, 2008). The extended MCTS model will be applied to a number of visual art creation tasks with known fitness functions, such as the approximation of target images in artistic styles and the generation of ornamental (e.g. Celtic) designs, and its performance compared against current evolutionary methods.

Conclusion

MCTS offers a number of attractive features for AI search but has to date only been applied to particular domain types. We describe ways in which the standard algorithm may be generalised by decoupling the domain from the search, then extended to other domain types expressed as sequences and trees. This opens up the possibility of using MCTS for a range of computational creativity tasks for which Markovian processes or evolutionary strategies are typically used. The extension of MCTS to sequence search is straightforward, but its extension to embedded subtree search is complicated by its polyadic nature; several strategies for tackling this problem are presented. This work represents the first step in our investigation of MCTS for creative domains.

Acknowledgements

This work is supported by EPSRC standard grant EP/I001964. Thanks to Simon Colton, Stephen Tavener, Phillip Rohlfshagen, Greg Schmidt and the anonymous reviewers for useful comments.

2011_2 !2011

Multiobjective Optimization for Meaningful Metrical Poetry

Fahrurrozi Rahman and Ruli Manurung
Faculty of Computer Science, Universitas Indonesia
Depok 16424, Indonesia
rahman10@ui.ac.id, maruli@cs.ui.ac.id

Abstract

This paper reports our experiments to properly handle the multiobjective optimization nature of poetry generation, defined in Manurung (2003) as a stochastic search that seeks to produce a text that simultaneously satisfies the properties of grammaticality, meaningfulness, and poeticness. In particular, we employ the SPEA2 algorithm (Zitzler, Laumanns, and Thiele 2001). Various results show that it consistently outperforms our previous system in its ability to generate a meaningful metrical text according to given semantic and metre specifications, and in some cases it is able to generate the intended text, whereas our previous system fails to do so. However, it is still unable to reach the goal of generating an entire poem. We conclude with suggestions for further work to address this shortcoming.

Introduction

Manurung (2003) presents a model of poetry generation as a stochastic search process that seeks to produce a text that simultaneously satisfies various properties, as well as MCGONAGALL, an implementation of the model. It lays out the representational framework, and defines evaluation functions that independently assess the ability of a text to (i) convey a given semantics whilst (ii) conforming to a given rhythmic pattern.
However, it fails to account for the multiobjective optimization nature of this model of poetry generation: form and function in poetry are highly interdependent, and as such, it is incorrect to optimize for both by maximizing a simple linear combination of the separate evaluation functions. In this paper we report our efforts to properly handle the multiobjective optimization nature of poetry generation as stochastic search. In particular, the Strength Pareto Evolutionary Algorithm 2 (SPEA2; Zitzler, Laumanns, and Thiele 2001), one of the top-performing multiobjective optimization algorithms, is used. We start by introducing the model of poetry as a text that embodies meaningfulness, grammaticality, and poeticness, and a model of its generation as a multiobjective optimization stochastic search process. We then describe MCGONAGALL, an implemented system that adopts this model and uses genetic algorithms (Mitchell 1996) to generate texts that are syntactically well-formed, meet certain prespecified patterns of metre, and broadly convey some given meaning. Finally, we present results of some experiments we conducted given various inputs.

Poetry writing as multiobjective optimization

Despite the vast number of different definitions of poetry, one can argue that a common characteristic is the presence of a strong interaction, or unity, between the form and content of a poem. The diction and grammatical construction of a poem affects the message that it conveys to the reader over and above its obvious denotative meaning (Levin 1962). As such, Manurung (2003) argues that poetry generation is much harder than conventional NLG (natural language generation), which typically operates on the assumption that a text serves as a medium for conveying its semantic content. To account for all this, Manurung proposes a general model of a poem as a natural language artifact that simultaneously satisfies the constraints of grammaticality, meaningfulness, and poeticness. A grammatical poem must be syntactically well-formed. This might seem obvious for natural language texts, but it must be stated explicitly in the context of poetry. Syntactic well-formedness in poetry may well be different from that of ordinary texts (cf. figurative language), but it is still governed by rules. A meaningful poem must intentionally convey some conceptual message that is meaningful under some interpretation. Finally, poeticness states that a poem must exhibit features that distinguish it from non-poetic text. We follow Manurung (2003) in concentrating on the concretely observable aspect of rhythmic patterns, or metre. This characterization suggests a 'classical' account of poetry where adherence to regular patterns in form is essential. This avoids some of the complications of imagery or interpretation that are central to assessing more free forms of verse, and ensures that many of the important aspects of a text can be defined formally. Figure 1 shows a prototypical example of this genre, and one that we will revisit throughout this paper: a limerick by Arthur H. R. Buller, first published in Punch magazine on the 19th December 1923 (Knowles 2009).

Evolving poetry

Poetry generation can be viewed as a specialized instance of natural language generation, or NLG, i.e. the development of computer systems that can produce understandable
texts in a human language, starting from some nonlinguistic representation of a communicative goal as input (Reiter and Dale 2000).

    There was a young lady called Bright
    who could travel much faster than light.
    She set out one day
    in a relative way
    and returned on the previous night.

Figure 1: Arthur Buller's 'relativity' limerick (in the original figure, stressed syllables appear in bold, unstressed syllables in normal type, and syllables extraneous to the underlying metre in italics)

A considerable amount of research has been done in NLG on the so-called "generation gap" problem, where interdependent decisions must be made across various levels of linguistic representation. This problem is exacerbated by the unity of poetry. For example, with regards to metre, every single linguistic decision potentially determines the success of a poem in fitting a regular defined pattern. Poetry generation can be viewed as a state space search problem, where a state in the search space is a possible text with all its underlying representation, from semantics all the way down to phonetics. A goal state satisfies the three constraints of meaningfulness, grammaticality, and poeticness. Such a search space is undoubtedly immense, even (given the recursive structures in natural language) infinite. Stochastic search is a heuristic search strategy that relies on the random traversal of a search space with a bias towards more promising solutions. It has become increasingly popular for solving computationally hard combinatorial problems such as constraint satisfaction, and has been shown to outperform deterministic approaches in a number of domains. The genetic algorithm, or GA, is a particularly well-known instance of such a strategy, and is essentially an iteration of two phases, evaluation and evolution, applied to an ordered set, called the population, of candidate solutions (Mitchell 1996).

Multiobjective optimization

There are two ways that constraints can be implemented in a GA: either invalid solutions are totally excluded from the search space, or they are admitted to the space but with a bias against them. In MCGONAGALL, grammaticality is treated as a prerequisite for meaningfulness and poeticness; that is, grammaticality is implemented as an absolute constraint on the items admitted to the search space, through the design of representation and genetic operators. The remaining constraints of meaningfulness and poeticness are implemented as preferences via the evaluation functions of the genetic algorithm. Given the two constraints to be optimized, poetry generation is thus an instance of multiobjective optimization. The Strength Pareto Evolutionary Algorithm 2 (SPEA2) handles multiobjective optimization. It is based on its predecessor, SPEA, which had weaknesses in fitness assignment, density estimation and archive truncation (Zitzler, Laumanns, and Thiele 2001). SPEA maintains a population and an archive set per iteration. Initially the archive set is empty; at each iteration it receives the nondominated individuals from the previous population and archive. Furthermore, if the size of the archive exceeds a predefined limit, the archive is truncated without destroying its characteristics. SPEA2 overcomes the problems of fitness assignment and density estimation in SPEA by taking account of both dominating and dominated values for each individual. An individual a ∈ X is said to dominate another individual b ∈ X (also written as a ≻ b) if and only if ∀i ∈ {1, ..., n} : fi(a) ≥ fi(b) and ∃j ∈ {1, ..., n} : fj(a) > fj(b), for the objective functions (f1, ..., fn).
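A minimal sketch of this dominance test, with individuals represented directly by their objective-value tuples (maximisation assumed, as in the definition above):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximisation):
    a is at least as good in every objective and strictly better in one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

# e.g. dominates((0.8, 0.6), (0.8, 0.5)) -> True
#      dominates((0.9, 0.4), (0.8, 0.5)) -> False (a trade-off, no dominance)
```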
From the definition above, the strength (dominating) value S(i) of an individual in the population Pt and archive At is defined by:

    S(i) = |{ j | j ∈ Pt + At ∧ i ≻ j }|

where |·| indicates the cardinality of a set and the symbol + denotes multiset union. Based on the strength value, the raw (dominated) value R(i) of an individual is calculated as:

    R(i) = Σ_{j ∈ Pt + At, j ≻ i} S(j)

The smaller the raw value the better: R(i) = 0 means the individual is nondominated. The density is obtained using an adaptation of the k-th nearest neighbour method, where the density estimate is the inverse of the distance to the k-th nearest neighbour:

    D(i) = 1 / (σᵢᵏ + 2)

Here σᵢᵏ denotes the distance from individual i to its k-th nearest neighbour among all individuals j in the population and archive set. As is common, k is set to the square root of the combined population and archive size:

    k = √(|P| + |A|)

Adding the density value to the raw value of an individual yields its fitness value:

    F(i) = D(i) + R(i)

Unlike SPEA, SPEA2 keeps the number of individuals in the archive constant. If the number of nondominated individuals is less than the archive size, the best dominated individuals in the previous archive and population are copied to the archive until the archive is full. If the opposite happens, the individuals in the archive are sorted by their k-th nearest neighbour distances, and the one with the minimum distance is repeatedly removed from the archive. This method prevents boundary solutions from being removed. This archive set will be the solution set when the stopping criterion is satisfied.
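A compact sketch of this fitness assignment over a combined population-plus-archive list of objective vectors; Euclidean distance in objective space and at least two individuals are assumptions made here for brevity:

```python
import math

def dominates(a, b):  # as in the sketch above
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def spea2_fitness(objectives):
    """F(i) = R(i) + D(i); nondominated individuals have R(i) = 0."""
    n = len(objectives)
    k = max(1, int(math.sqrt(n)))
    # strength S(i): how many individuals i dominates
    strength = [sum(dominates(oi, oj) for oj in objectives)
                for oi in objectives]
    fitness = []
    for i, oi in enumerate(objectives):
        # raw value R(i): total strength of the individuals dominating i
        raw = sum(strength[j] for j, oj in enumerate(objectives)
                  if dominates(oj, oi))
        # density D(i) from the distance to the k-th nearest neighbour
        dists = sorted(math.dist(oi, oj)
                       for j, oj in enumerate(objectives) if j != i)
        density = 1.0 / (dists[k - 1] + 2)
        fitness.append(raw + density)
    return fitness
```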
MCGONAGALL: an evolutionary poet

In this section we will briefly describe how MCGONAGALL ensures grammaticality through its linguistic representation and genetic operators, and how it optimizes poeticness and meaningfulness. Firstly, linguistic structures are represented using lexicalized tree-adjoining grammar, or LTAG. These grammars are based on the composition of elementary trees, of which there are two types: initial trees and auxiliary trees. These trees represent minimal linguistic structures in the sense that they account for all and only the arguments of the head of the syntactic constituent they represent, e.g. a sentential structure would contain a verb and all its complements. In LTAG, the derivation tree is a data structure that records the composition of elementary trees using substitution and adjunction. The derived tree, on the other hand, is the phrase structure tree created by performing all the operations specified within the derivation tree; in a sense, the derived tree is the result of generation, whereas the derivation tree is a trace of how it was built up. The derivation tree can therefore be seen as the basic formal object that is constructed during sentence generation from a semantic representation (Joshi 1987), and is the appropriate data structure on which to perform nonmonotonic operations in a stochastic generation framework. Essentially, the LTAG derivation tree forms the genotypic representation of a candidate solution, from which one can compute the phenotypic information of semantic and prosodic features via the derived tree. A simple 'flat' semantic representation (Hobbs 1985) is used. A semantic expression is a set of first order logic literals, which is logically interpreted as a conjunction of all its members. The arguments of these literals represent concepts in the domain such as objects and events, while the functors state relations between these concepts. For example, the representation of the semantics of the sentence "John loves Mary" is {john(j), mary(m), love(l, j, m)}, where l is the event of j, who has the property of 'being John', loving m, who has the property of 'being Mary'. The semantic expression of a tree is the union of the semantic expressions of its constituent elementary trees, with appropriate binding of variables during substitution and adjunction to control predicate-argument structure. Finally, each word is associated with its phonetic spelling, taken from the CMU pronouncing dictionary. Vowels are marked for lexical stress, with 0 for no stress, 1 for primary stress, and 2 for secondary stress. For example, the spelling of 'dictionary' is [D,IH1,K,SH,AH0,N,EH2,R,IY0]. For monosyllables, closed class words (e.g. the) receive no stress, and open class words (e.g. cat) primary stress. Genetic mutation operators that randomly add or delete subtrees of a derivation tree have been introduced to move through the search space. In addition, a subtree swapping operator that randomly swaps two derivation tree subtrees is also implemented. When these subtrees belong to the same derivation tree, it behaves as a mutator; otherwise, it is used as a crossover operator.

    [w,s,w,w,s,w,w,s,b,
     w,s,w,w,s,w,w,s,b,
     w,s,w,w,s,b,
     w,s,w,w,s,b,
     w,s,w,w,s,w,w,s,b]

Figure 2: Target form for a limerick

Evaluating poeticness

In this paper, poeticness is taken to be the well-defined and objectively observable metre, i.e. regular patterns in the rhythm of the lines. Figure 1 shows the metre of Buller's limerick, with stressed syllables in bold type, unstressed syllables in normal type, and syllables extraneous to the underlying metre in italics. The first, second, and fifth lines have the same number of stressed syllables, with a regular pattern of 'beats' at intervals of two unstressed syllables, and likewise for the third and fourth lines. The system tries to maximize the similarity between the target form, a specification of the required metrical constraints, and the candidate form, the stress pattern of a candidate solution. The target form is encoded as a list of target syllables, notated as follows: w ('weak') is an unstressed syllable, s ('strong') is a stressed syllable, and b indicates a linebreak. Figure 2 shows the example target form for a limerick (formatted into lines for readability purposes). The candidate form, a representation of the metre exhibited by a candidate solution, is encoded as a list of candidate syllables obtained from a derived tree by concatenating the phonetic spellings at its leaf nodes. To compute the similarity between the target and candidate forms, we use the minimum edit distance, in which the distance between two strings is the minimal sum of costs of operations (symbol insertion, deletion, and substitution) that transform one string into another. The minimum edit distance can be efficiently computed in a way that produces a pairwise syllable alignment between candidate and target, thus indicating the operations that yield the minimum cost. The operation costs for substitution, insertion, and deletion of syllables have been assigned to reflect our intuitions in perceiving poetic metre. Our candidate forms indicate lexical stress patterns as if the words were pronounced in isolation. Within poetic text, context can affect stress. To compensate for this, the system iterates over the minimum edit distance alignment, detecting certain patterns and adjusting the similarity value. Two types of patterns are implemented: two consecutive deletions, or two consecutive insertions, of non-linebreaks increase the cost by 1; the destressing of a stressed candidate syllable adjacent to a stressed target syllable, or the stressing of an unstressed candidate syllable adjacent to an unstressed target syllable, decreases the cost by 1. The metre evaluation function, Fmetre, takes the value computed by the minimum edit distance algorithm, adjusts it using the context-sensitive compensation scheme, and normalizes it to the interval [0,1].
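A minimal sketch of the underlying edit distance over syllable lists; the unit operation costs are illustrative assumptions rather than the system's actual values, and the context-sensitive compensation pass is omitted:

```python
def metre_distance(target, candidate, sub=1.0, ins=1.0, dele=1.0):
    """Minimum edit distance between target and candidate syllable lists.

    Syllables are symbols such as 'w', 's' and 'b'; standard dynamic
    programming over insertion, deletion and substitution costs.
    """
    m, n = len(target), len(candidate)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if target[i - 1] == candidate[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete a target syllable
                          d[i][j - 1] + ins,       # insert a candidate syllable
                          d[i - 1][j - 1] + cost)  # substitute or match
    return d[m][n]

# e.g. one missing weak syllable costs one deletion:
# metre_distance(list('wswwsb'), list('wswsb')) == 1.0
```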
Evaluating meaningfulness

The approach to meaningfulness is essentially similar to the above approach to poeticness: try to maximize the similarity between the target semantics, a specification of the meaning an optimal solution should convey, and a candidate semantics, the meaning conveyed by a candidate solution. This requires a method for computing the similarity between two semantic expressions. Love (2000) proposes two factors that must be considered: structural similarity and conceptual similarity. Structural similarity measures the degree of isomorphism between two semantic expressions. Conceptual similarity is a measure of relatedness between two concepts (logical literals). Computing a structural similarity mapping between two expressions is an instance of the NP-hard graph isomorphism problem. However, Manurung implemented a greedy algorithm that runs in O(N³), based on Gentner's structure mapping theory (Falkenhainer, Forbus, and Gentner 1989). It takes two sets of logical literals, Starget and Scandidate, and attempts to 'align' the literals. In doing this, it creates various variable bindings and also two sets of unmatched literals that are left over (from Starget and Scandidate). A function Fsem, normalised to [0,1], is then applied to compute a score based on various aspects of the best match that has been achieved; this is based on Love's computational model of similarity (Love 2000).
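The greedy alignment itself can be caricatured as below; literals are (functor, args) tuples, and the matching criterion is deliberately simplified to functor and arity only, ignoring the variable bindings and conceptual similarity used in the real algorithm:

```python
def greedy_align(target, candidate):
    """Greedily pair target literals with compatible candidate literals.

    Returns the matched pairs plus the leftover (unmatched) literals on
    each side, mirroring the sets reported in Tables 3 and 4 below.
    """
    remaining = list(candidate)
    matched, unmatched_target = [], []
    for functor, args in target:
        hit = next((c for c in remaining
                    if c[0] == functor and len(c[1]) == len(args)), None)
        if hit is None:
            unmatched_target.append((functor, args))
        else:
            matched.append(((functor, args), hit))
            remaining.remove(hit)
    return matched, unmatched_target, remaining

# e.g. greedy_align([('lady', ('l',)), ('young', ('l',))],
#                   [('lady', ('x',))])
# -> ([(('lady', ('l',)), ('lady', ('x',)))], [('young', ('l',))], [])
```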
Experiments in multiobjective optimization

For our experiments, we used Buller's 'relativity' limerick shown in Figure 1 as the target to be generated. The semantics and metre of this limerick are encoded and provided to the system as the target semantics and target metre. Furthermore, aside from the task of generating the entire limerick, we experiment with the generation of this limerick on a line-by-line basis. Thus, we have modified the text slightly so that it consists of four complete sentences which our system can generate individually. These four sentences are shown in Table 1, along with the respective semantic targets, relativity1 to relativity4. This modified limerick preserves the metre and syllable count of the original. However, the third and fourth lines from the original limerick have now been merged into one line.

    Line 1: There was a young lady called Bright.
        relativity1: {lady(l), young(l), name(l, b), bright(b)}
    Line 2: She could travel much faster than light.
        relativity2: {travel(t, l), faster(f, t, li), light(li), much(f), can(t)}
    Line 3: She set out one day in a relative way.
        relativity3: {leave(le, l), relative(le), oneday(le)}
    Line 4: She returned on the previous night.
        relativity4: {return(r, l), on(r, n), night(n), previous(n)}

Table 1: Modified limerick consisting of four sentences

The form targets for lines 1, 2, and 4 of the modified limerick are the same:

    limerickline1: [w,s,w,w,s,w,w,s,b]

The form target for line 3 is:

    limerickline2: [w,s,w,w,s,w,s,w,w,s,b]

To summarize, we will be using relativity1, relativity2, and relativity4 as Starget along with limerickline1 as Ftarget to generate lines 1, 2, and 4; relativity3 as Starget and limerickline2 as Ftarget to generate line 3; and finally, the union of relativity1 to relativity4 as Starget along with the limerick pattern in Figure 2 as Ftarget to generate the entire modified limerick.

Experimental setup

For each target, we ran the genetic algorithm using both the SPEA2 multiobjective optimization algorithm and the algorithm used in Manurung (2003), where the fitness score being optimized is simply a linear combination of the semantic and metre evaluation functions, i.e. (Fsem + Fmetre)/2. Moreover, the GA employs proportionate selection, which assigns parents a probability to reproduce that is proportional to their fitness (Bäck, Fogel, and Michalewicz 1997); individuals are sampled from this distribution using stochastic universal sampling, which minimises chance fluctuations in sampling, with an elitist population of 20% of the entire population. All other parameters were also adapted from Manurung (2003): the population size was set to 40, each test was run ten times, each run lasted for 500 iterations, and the mutation operators used, along with their probabilities, were creation (0.5), adjunction (0.3), and deletion (0.2). For crossover, the subtree swapping operator was used. The linguistic resources used, i.e. grammar and lexicon, are unchanged from MCGONAGALL.
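For reference, the setup just described can be summarised as a configuration block; this simply restates the reported parameters and is not code from MCGONAGALL itself:

```python
ga_config = {
    'population_size': 40,
    'iterations_per_run': 500,
    'runs_per_test': 10,
    'selection': 'proportionate, via stochastic universal sampling',
    'elitism_fraction': 0.2,          # elitist share of the population
    'mutation_probabilities': {
        'creation': 0.5,
        'adjunction': 0.3,
        'deletion': 0.2,
    },
    'crossover': 'subtree swapping',
}
```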
Results and discussion

Table 2 shows a statistical summary of the best fitness scores obtained during the various experiments: the minimum, maximum, mean, and standard deviation of the best fitness scores obtained after the last iteration across the ten runs of each test. To further confirm the results obtained on Buller's limerick, we experimented with a different set of target semantics and metre, namely Hilaire Belloc's "The Lion", a poem that is used throughout Manurung (2003). We used two variations: the "half" variation, which makes use of the first two lines of Belloc's poem, and the "full" variation, which uses all four lines. Figure 3 shows, over time, the maximum and average of the best fitness scores across all the GA runs for the generation of limerick line 4. Finally, Tables 3 and 4 show the actual highest-scoring solutions from the various limerick tests, giving the fitness score, text, and semantic mapping of Starget to Scandidate. Note that for all these experimental results, the fitness scores reported for the SPEA2 results are in fact (Fsem + Fmetre)/2. This is merely for the purpose of presenting the results in such a way that it is meaningful to compare them against the linear combination fitness. During the evolution phase of SPEA2, it uses a Pareto-based fitness score based on population domination statistics.

    Test                             Min   Max   Mean  Std.Dev
    Line 1, Linear combination       0.57  0.77  0.60  0.06
    Line 1, SPEA2                    0.69  0.88  0.76  0.08
    Line 2, Linear combination       0.52  0.62  0.56  0.03
    Line 2, SPEA2                    0.56  0.88  0.74  0.12
    Line 3, Linear combination       0.52  0.61  0.53  0.03
    Line 3, SPEA2                    0.60  0.87  0.79  0.10
    Line 4, Linear combination       0.57  0.69  0.65  0.05
    Line 4, SPEA2                    0.59  0.83  0.74  0.07
    Entire, Linear combination       0.42  0.61  0.51  0.05
    Entire, SPEA2                    0.49  0.62  0.55  0.04
    Lion "half", Linear combination  0.45  0.70  0.57  0.07
    Lion "half", SPEA2               0.61  0.71  0.68  0.03
    Lion "full", Linear combination  0.47  0.53  0.51  0.02
    Lion "full", SPEA2               0.56  0.65  0.60  0.03

Table 2: Statistical summary of test results

Figure 3: Fitness scores progression for limerick line 4 (maximum and average best fitness over 500 iterations, for the linear combination and SPEA2 runs)

From all of these experimental results it can be observed that the GA consistently performs better using the SPEA2 algorithm compared to the simple linear combination used in Manurung (2003). This can easily be seen in Figure 3, where the average and best scores using SPEA2 are always higher than the average and best scores using the linear combination. This pattern recurs in the plots for all the other targets. Note that none of the runs achieve a perfect fitness score of 1.0. This is to be expected, given that the target text itself, i.e. Buller's limerick, is suboptimal from a rhythmic perspective. Note, for example, the extraneous leading syllables in lines 2, 4, and 5 of Figure 1, and the fact that open class words such as 'young' and 'called' receive no stress in line 1. All of these incur penalties when measured by Fmetre against the target metre in Figure 2.

    Line 1 (Linear combination), fitness 0.77
        Text: A lady called Bright is on Bright.
        Matched: {name(l, b), lady(l), bright(b)}
        Unmatched: {young(l)}
    Line 1 (SPEA2), fitness 0.88
        Text: There was the young lady called Bright.
        Matched: {name(l, b), lady(l), young(l), bright(b)}
        Unmatched: {}
    Line 2 (Linear combination), fitness 0.62
        Text: It melted her. Light could be small.
        Matched: {can(t), light(li)}
        Unmatched: {travel(t, l), faster(f, t, li), much(f)}
    Line 2 (SPEA2), fitness 0.88
        Text: They could travel much faster than light.
        Matched: {faster(f, t, li), travel(t, l), much(f), light(li), can(t)}
        Unmatched: {}
    Line 3 (Linear combination), fitness 0.61
        Text: An animal left. An animal left.
        Matched: {leave(le, l)}
        Unmatched: {relative(le), oneday(le)}
    Line 3 (SPEA2), fitness 0.87
        Text: They relatively left one day. It survived.
        Matched: {leave(le, l), oneday(le), relative(le)}
        Unmatched: {}
    Line 4 (Linear combination), fitness 0.69
        Text: An animal dwells on a night.
        Matched: {on(r, n), night(n)}
        Unmatched: {return(r, l), previous(n)}
    Line 4 (SPEA2), fitness 0.83
        Text: They left. On the night, she returned.
        Matched: {on(r, n), return(r, l), night(n)}
        Unmatched: {previous(n)}

Table 3: Best found solution for limerick lines 1 to 4

    Entire limerick (Linear combination), fitness 0.61
        Text: A lady resides on a night. A night could be relative. Facts, that could be on Bright, could wander on Bright. A lady resides on the light.
        Matched: {on(r, n), can(t), bright(b), light(li), lady(l), relative(le), night(n)}
        Unmatched: {young(l), name(l, b), travel(t, l), faster(f, t, li), much(f), leave(le, l), oneday(le), return(r, l), previous(n)}
    Entire limerick (SPEA2), fitness 0.62
        Text: She could left much faster than men. There is the young lady, who with men, on the evening, returned, called previous Bright. She one day left much faster than them.
        Matched: {name(l, b), on(r, n), faster(f, t, li), lady(l), young(l), return(r, l), bright(b), night(n), much(f), can(t), oneday(le)}
        Unmatched: {travel(t, l), light(li), leave(le, l), relative(le), previous(n)}

Table 4: Best found solution for entire limerick

Examining the best solutions from Table 3, it can be seen that the linear combination GA satisfies the metre pattern perfectly (modulo the destressing of the open class word 'called' in line 1), at the expense of some unrealized semantics, i.e. unmatched literals in all cases. This would suggest that the linear combination approach hinders the growth of the rarer opportunities to satisfy the target semantics. On the other hand, SPEA2 is able to achieve perfect semantics in all but the last line, though it does occasionally introduce extraneous syllables (lines 2 and 3). Since our system ignores the gender and number of pronouns, the fact that some lines contain they as opposed to she is not penalized. Putting it all together, the result of generating the limerick line-by-line using the linear combination approach in Manurung (2003) is as follows:

    A lady called Bright is on Bright.
    It melted her. Light could be small.
    An animal left. An animal left.
    An animal dwells on a night.

whereas using SPEA2 it is as follows:

    There was the young lady called Bright.
    They could travel much faster than light.
    They relatively left one day. It survived.
    They left. On the night, she returned.

Turning our attention to the generation of the entire limerick, we can see that both approaches still fail to generate a successful poem (Table 4). The same failure occurs for the generation of Belloc's "Lion" poem. This may suggest that some form of discourse modelling is imperative if the approach is expected to generate anything beyond the sentence level. The ability of MCGONAGALL to successfully generate a text is directly determined by the ability of its evaluation functions to discriminate between "good" and "bad" solutions. Since our evaluation functions only measure coverage of propositional semantics, they have no way of discerning between coherent and incoherent texts. When attempting to generate longer multi-sentence texts, such as limericks, this approach appears rather naive. One solution is to construct an evaluation function that accounts for discourse.

Summary and future work

This paper has shown how the multiobjective optimization nature of poetry generation, viewed as a stochastic search that seeks to produce a text that simultaneously satisfies the properties of grammaticality, meaningfulness, and poeticness, needs to be handled by appropriate algorithms, such as the SPEA2 algorithm. Our results show that it consistently outperforms the previous system in its ability to generate a meaningful metrical text according to given semantic and metre specifications. It is quite successful in generating short one-line sentences. Unfortunately, it is still unable to find a solution given the much harder task of generating an entire limerick. As discussed in the previous section, we believe this can be rectified by augmenting the evolutionary algorithm with evaluation functions that account for discourse models.
One idea is to employ Rhetorical Structure Theory (Mann and Thompson 1987), which is quite often used in NLG systems, but rarely in a discriminative model. Another exciting avenue is to explore the possibilities of integrating the work of story generation systems, which explicitly aim to generate narratives, e.g. Gervás et al. (2006).

2011_20 !2011

Evaluating Evaluation: Assessing Progress in Computational Creativity Research

Anna Jordanous
School of Informatics, University of Sussex, Brighton, UK
a.k.jordanous(at)sussex.ac.uk

Abstract

Computational creativity research has produced many computational systems that are described as creative. A comprehensive literature survey reveals that although such systems are labelled as creative, there is a distinct lack of evaluation of the creativity of creative systems. As a research community, we should adopt a more scientific approach to evaluation of the creativity of our systems if we are to progress in understanding creativity and modelling it computationally. A methodology for creativity evaluation should accommodate different manifestations of creativity but also require a clear, definitive statement of the standards used for evaluation. This paper proposes Evaluation Guidelines, a standard but flexible approach to evaluation of the creativity of computational systems, and argues that this approach should be taken up as standard practice in computational creativity research. The approach is outlined and discussed, then illustrated through a comparative evaluation of the creativity of jazz improvisation systems.

Introduction

'[U]nless the motivations and aims of the research are stated and appropriate methodologies and assessment procedures adopted, it is hard for other researchers to appreciate the practical or theoretical significance of the work. This, in turn, hinders ... the comparison of different theories and practical applications ... [and] has encouraged the stagnation of the fields of research involved.' (Pearce, Meredith, and Wiggins 2002)

In 2002 Pearce, Meredith, and Wiggins highlighted a 'methodological malaise' faced by those working with computational music composition systems, due to a lack of methodological standards for the development and evaluation of these systems, causing progress in this research area to 'stagnate'. Computational creativity research is in danger of succumbing to this same malaise. Computational creativity research crosses several disciplinary boundaries. The field is influenced by artificial intelligence, computer science, psychology and the specific creative domains in which we implement systems, such as art, music, reasoning, storytelling, and so forth (Colton 2008; Widmer, Flossmann, and Grachten 2009; León and Gervás 2010; Pérez y Pérez 1999 provide a selection of examples). Currently many implementors of creative systems follow a creative-practitioner-type approach: produce a system, then present it to others, whose critical reaction determines its worth as a creative entity. A creative practitioner's primary aim, however, is to produce creative work, rather than to critically investigate creativity; in general this investigative aim is important in computational creativity research. A comprehensive survey of the literature on computational creativity systems reveals the lack of systematic evaluation of the actual creativity of creative systems post-implementation.
Although the quality of the system output is often subjected to some scientific evaluation, it is rare that the creativity of the creative system is evaluated post-implementation, or even critically commented upon (Peinado and Gervás 2006 and Colton 2008 are notable exceptions). Creativity entails more than just the quality of the output: for example, what about novelty, or variety? Yet these systems are often described as creative systems without appropriate justification for this claim. A critical analysis of current evaluation practice in computational creativity raises issues that highlight a need for a more methodical approach to evaluation to be adopted across the research community. This paper presents Evaluation Guidelines: an evaluative approach that is flexible enough to deal with different types of creativity yet allows practical and objective cross-comparison of different systems to measure progress. The Evaluation Guidelines are presented in Figure 1 and illustrated through a comparative evaluation of the creativity of jazz improvisation systems.

Computational creativity evaluation examined

To see how computational creativity systems are currently evaluated, 75 journal and conference papers were surveyed, with the aim of including all papers presenting a computational system that described that system as being creative. Using the Web of Knowledge and Scopus databases, a literature search was conducted to find all journal papers presenting details of a computational creativity system. Words and phrases such as 'computational creativity', 'creative system', 'creative computation', 'system' and 'creativity' were used as search terms. This set of papers was supplemented with papers from journal special issues on computational creativity (the majority of which had already been identified in the search). Reflecting the current balance of conference/workshop publications to journal publications in computational creativity research, papers from recent Computational Creativity research events were also surveyed. Table 1 outlines the results of this survey.¹

    Paper makes at least a mention of evaluation           77%
    Paper gives details of what evaluation has been done   55%
    Paper contains section(s) on Evaluation                51%
    Paper states evaluation criteria                       69%
    Main aim of evaluation: Creativity                     35%
    Main aim of evaluation: Quality/Accuracy/Other         43%
    Mention of creativity evaluation methodology           27%
    Application of creativity evaluation methodology       24%
    System compared to other systems                       15%
    System compared to systems by other researchers        11%
    Systems evaluated by independent judges                33%

Table 1: Summary of evaluation of the 75 creative systems surveyed

The key finding of this survey is that evaluation of computational creativity is not being performed in a systematic or standard way. Out of 75 computational systems presented as being creative systems, the creativity of a third of these systems was not even discussed when presented to an academic audience in paper format. Half the papers did not contain a section on evaluation. Only a third of systems presented as creative were actually evaluated on how creative they are. Less than a quarter of systems made any practical use of existing creativity evaluation methodologies. Of the 18 papers that applied creativity evaluation methodologies to evaluate their system's creativity, no one methodology emerged as standard across the community.
Colton's creative tripod framework (Colton 2008) was used most often (6 uses), with 4 papers using Ritchie's empirical criteria (Ritchie 2007). No other methodology was used by more than one paper. Occurrences of evaluation being done by people outside the system implementation team were rare, as were any examples of direct comparison between systems, to see if the presented system outperforms existing systems and represents any real research progress in the field.

Why is creativity evaluation not standard practice?

By no means does this paper mean to suggest that computational creativity researchers do not wish to follow scientific practice. On the contrary, in personal communications many have expressed interest in how to evaluate creative systems, with some suggestions offered over the last decade (Ritchie 2007; Colton 2008; Pease, Winterstein, and Colton 2001). A culture is however developing in computational creativity research where it is becoming acceptable not to evaluate the creativity of a creative system in a methodical manner.

¹ Space limitations unfortunately prevent all details being reported here; my thesis contains full survey results (Jordanous forthcoming).

To a certain extent this follows the common practice of creative practitioners: to produce work, then exhibit it to an audience whose reaction (both immediate and longer term) asserts the value of the work, instead of performing retrospective comparative analysis of the creativity of the work. A lack of methodical evaluation can however have a negative effect on research progress (Pearce, Meredith, and Wiggins 2002). Evaluation standards are not easy to define. It is difficult to evaluate creativity, and even more difficult to describe how we evaluate creativity, in human creativity as well as in computational creativity. In fact, even the very definition of creativity is problematic (Plucker, Beghetto, and Dow 2004). It is hard to identify what 'being creative' entails, so there are no benchmarks or ground truths to measure against.

What do we gain from scientific evaluation?

Scientific evaluation is important for computational creativity research, allowing us to compare and contrast progress. Ignoring this evaluation stage deprives us of valuable analytical information about what our creative systems achieve, especially in comparison to other systems.

Existing evaluation frameworks

Ritchie proposes empirical criteria to assess the creativity of a system based on rating the system's products for how typical they are of the intended genre and for the value of the products (Ritchie 2007). Pease, Winterstein, and Colton describe various tests of a creative system's output, input and creative process (Pease, Winterstein, and Colton 2001). Colton offers a creative tripod framework to qualitatively evaluate creativity (Colton 2008). Despite these methods being available, no method has been adopted as standard evaluative practice by the research community. Colton's approach has been the most adopted by authors in the few years it has been available so far (being used to evaluate 6 surveyed systems). It is most often used to describe why a given system should be considered creative, rather than for any comparison between systems. As well as providing a way to evaluate the creativity of a computational system, a key function of a creativity evaluation methodology is whether it enables comparison of systems against other systems, through the level of creativity demonstrated by each system.
In practice, Ritchie's approach is the most frequently adopted quantitative comparison method, being applied to evaluate 4 surveyed systems. Ritchie's proposals acknowledge several theoretical issues but are relatively impractical to use in evaluation. Several implementation decisions are left open, such as how to obtain typicality and value ratings for system products, or how to choose weights and parameter values in the criteria. Ritchie argues this allows freedom in defining creativity in the relevant domain, but offers no guidelines or examples. One other issue is how Ritchie incorporates measures of novelty (a key aspect of creativity) into the criteria. Novelty exists in more ways than whether an artefact replicates a member of the system's inspiring set: the artefacts that guided the construction of the system, or the inspirational material used by the system during the creative process. The criteria do not account for how surprising a product is, for new ways of producing the end product, or for how a product deviates from previous examples (Pease, Winterstein, and Colton 2001; Peinado and Gervás 2006). Also, the inspiring set may not be available for analysis, or the system may not use an inspiring set to generate new products. The set of tests offered by Pease, Winterstein, and Colton (2001) has seen little application (perhaps due to its densely packed presentation of the test formulae). This paper has often been cited, though, and offers a considered analysis of how to evaluate computational creativity. Pease, Winterstein, and Colton admit that their choices of assessment methods are 'somewhat arbitrary' and should be treated as initial suggestions, in the hope of prompting further discussion and suggestions along similar lines. As of the time of writing, this hope has not been realised, either by the authors or by others. Of the authors of Pease, Winterstein, and Colton (2001), only Colton makes subsequent recommendations for creativity evaluation, but these are unrelated to those in Pease, Winterstein, and Colton (2001), which is not even cited in Colton (2008). Although not without flaws, the frameworks mentioned above and other discussions of evaluation do offer useful material for our purposes, such as the way in which the concept of creativity is broken down into constituent components and the suggestion of practical tests to carry out in evaluation. The approach to evaluation suggested in this paper aims to complement and combine the useful parts of what has been suggested so far in previous frameworks.

A reductionist approach to defining creativity

A prevalent definition of computational creativity is: 'The study and support, through computational means and methods, of behaviour exhibited by natural and artificial systems, which would be deemed creative if exhibited by humans' (Wiggins 2006). Whilst this definition is intuitive for us to understand, it reveals little about what creativity actually is. Understanding creativity is a key aim of much computational creativity research, e.g. (Widmer, Flossmann, and Grachten 2009). A more practical approach for detailed evaluation is taken here: that creativity is multi-dimensional, with many factors contributing to the creativity of a creative system (Pease, Winterstein, and Colton 2001; Plucker, Beghetto, and Dow 2004; Ritchie 2007; Colton 2008; Jordanous 2010a; Jennings 2010).
This breaks down the concept of creativity into something more manageable and tangible, as opposed to an overarching, impenetrable concept of 'creativity'.

The need for a standard evaluation approach

A flexible approach to evaluation in this field of research is necessary. By its very nature, creativity manifests itself in a variety of forms, with different creative domains prioritising aspects of creativity differently. For the same reason, though, some standardisation is necessary to avoid the concept of creativity being interpreted too liberally, where any system could be argued to be creative depending on how creativity is defined. This approach requires that the standards used to judge creativity are stated and open to discussion. This paper proposes a standard evaluative approach and demonstrates its application in a case study evaluating the creativity of various jazz improvisation systems. The aim of this approach is to encourage a more scientific approach to computational creativity evaluation, allowing us to identify in what areas we are achieving creative results and what areas we should focus more research attention on.

Standardising our approach to evaluation

Evaluation Guidelines for Computational Creativity

1. Identify key components of creativity that your system needs if it is to be considered creative.
   (a) What does it mean to be creative in a general context, independent of any domain specifics?
   (b) What aspects of creativity are particularly important in the domain your system works in (and conversely, what aspects of creativity are less important in that domain)?
2. Using step 1, clearly state what standards you use to evaluate the creativity of your system.
3. Implement tests that evaluate your creative system under the standards stated in step 2.

Figure 1: Proposed standard for creative systems evaluation

The intention of this approach

This approach aims to examine the creativity of a creative system more systematically; to pinpoint why and in what ways a system can justifiably be said to be creative. The point is to understand in greater detail exactly why a system can be described as creative. The Evaluation Guidelines approach enables us to investigate in what ways a system is being creative and how research is progressing in this area, using an informed, multi-faceted approach that suits the nature of creativity. The Evaluation Guidelines allow comparison between a creative system and other similar systems, by using the same evaluation standards. A clear statement of evaluation criteria makes the evaluation process more transparent and makes the evaluation criteria available to other researchers, avoiding unnecessary duplication of effort. There is a time-specific element here; a creative system is evaluated according to standards at that point in time, where a creative domain is at a certain state, viewed by society in a certain context. These standards may change over time. If similar systems have previously been presented to similar audiences at similar times, however, then the evaluation standards can be reused. Hence detailed comparisons can be made using each standard, to identify areas of progress.

What this approach is not

This is not an attempt to offer a single, all-encompassing definition of creativity, nor a unit of measurement for creativity where one system may score x%.
The Evaluation Guidelines are not intended as a measurement system that finds the most creative system, or that gives a single summative rating of a system's creativity (though people may choose to adopt the approach for these purposes if that is relevant in their domain). Such a scenario is usually impractical for creativity, both human and computational. There is little value in giving a definitive rating of computational creativity, especially as we would be unlikely to encounter such a rating for human creativity. Nor is this an attempt to dissuade researchers from attempting to implement creative systems, or to put obstacles in the way of such researchers such that they are forced to target other goals and justifications for their research rather than the pursuit of making computers creative. It is of course reasonable for computational creativity researchers to aim their work towards better understanding creativity, rather than to implement computational systems that are themselves creative. For example, the pursuit of making the YQX music performance system creative (Widmer, Flossmann, and Grachten 2009) is 'abandoned' in favour of exploring human creativity via their research. However, for those researchers whose intention is to implement a computer system which is creative, the approach outlined in this paper offers a methodological tool to assist progress.

Incorporating previous evaluation frameworks

Depending on how creativity is defined by the researcher(s), previous evaluation frameworks (Ritchie 2007; Colton 2008; Pease, Winterstein, and Colton 2001, and other discussions) may be accommodated if appropriate for the standards by which the system is being evaluated. For example, if skill, appreciation and imagination are identified as some key components of creativity for a creative system, it would be appropriate to use the creative tripod (Colton 2008). The Evaluation Guidelines let the evaluator choose the most appropriate existing evaluation suggestions, without being tied into a fixed definition of creativity that may not apply fully in the domain they work in. At this point no recommendations are made on what tests to include (though this paper later investigates this issue in the context of jazz improvisation systems). What is emphasised here is that for scientific evaluation we must clearly justify claims for the success or otherwise of research achievements. This approach affords such clarity.

Why not just ask humans how creative our systems are?

As computational creativity is often defined as the creativity exhibited by a computational system (Wiggins 2006), experiments can be run with human judges to evaluate the creativity of a system. There is definitely a place for soliciting human opinion in creativity evaluation, not least as a simple way to consider the system's creativity in terms of those creative aspects which are overly complex to define empirically, or which are most sensitive to time and current societal context. The process of running adequate evaluation experiments with human participants, though, takes up a good deal of time and effort. Human opinion is variable; what one person finds creative, another may not (León and Gervás 2010; Jennings 2010). Therefore large numbers of participants may be required, to capture a general consensus of opinion.
In addition to the time and resources necessary to devise and run suitable evaluation experiments with large numbers of people, extra issues such as the procedure of applying for ethics permissions are introduced. There may also be some difficulty in attracting suitable participants, and a cost associated with paying participants. These issues may have adverse effects on the research process, many of which are out of our direct control to resolve. It would be useful if this outlay of research time and effort could be reduced. There are other practical concerns which hinder us from using human judges as the sole source of evaluation of a system. Human evaluators can say whether they think something is creative but can usually give minimal insight into why it is creative. As described above, it is hard to define why something is creative; this is a tacit judgement rather than one we can easily voice. It is useful to have a more informed idea of what makes a system creative, to understand both why a system is creative and what needs to be worked on to make the system more creative. Here one must acknowledge a common problem in computational creativity research: human reticence to accept the concept of computers being creative. On the other hand, researchers keen to embrace computational creativity may be positively influenced towards assigning a computational system more credit for creativity than it perhaps deserves. Hence our ability to evaluate creative systems objectively can be significantly affected once we know (or suspect) we are evaluating a computer rather than a human.

Implementing the Evaluation Guidelines

To illustrate how the Evaluation Guidelines approach works in practice, the approach has been applied to compare and contrast the creativity of four jazz improvisation systems:

• Voyager (Lewis 2000)
• GenJam (Biles 2007)
• EarlyBird (Hodgson 2006)
• My own jazz improvisation system (Jordanous 2010b)

Step 1a: Domain-independent aspects of creativity

To identify common components of creativity that transcend individual domains and that are applicable in all interpretations of creativity, one can look at what we prioritise as most important when we discuss creativity. This can be detected by analysing the language we use to discuss creativity, seeing what words are most prevalent in such discussions. Previous work (Jordanous 2010a) identified 100 words that are most commonly used in academic literature on the nature of creativity, surveying papers across computational creativity, psychology and other disciplines to generalise across different disciplines. This work used the log likelihood ratio (Dunning 1993) to detect which words appear significantly more often in academic papers about creativity, compared to typical use in written English (as represented in the BNC). Developing this work (Jordanous forthcoming), the same methodology was applied to compare a cross-disciplinary set of papers about creativity with a matched set of papers on subjects unrelated to creativity. This produced a list of words more likely to appear in the creativity literature than expected in academic papers. Grouping the results by semantic similarity, 14 key aspects or 'building blocks' of creativity are identified: see Figure 2.

Figure 2: Key components of creativity
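The log likelihood ratio underlying this word analysis can be sketched as follows; this is the standard two-corpus form of the statistic, and the example counts are invented:

```python
import math

def log_likelihood_ratio(count_a, count_b, total_a, total_b):
    """Dunning-style log likelihood for a word's frequency in corpus A
    (creativity literature) versus corpus B (reference corpus)."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    ll = 0.0
    if count_a:
        ll += count_a * math.log(count_a / expected_a)
    if count_b:
        ll += count_b * math.log(count_b / expected_b)
    return 2 * ll

# e.g. 120 occurrences in a 1,000,000-word creativity corpus versus
# 40 in an equally sized reference corpus (invented numbers):
# log_likelihood_ratio(120, 40, 1_000_000, 1_000_000) ≈ 41.9
```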
Step 1b: Aspects of creativity in jazz improvisation
Berliner describes how jazz improvisers need to balance the known and unknown, working simultaneously with thought processes and the subconscious emergence of ideas (Berliner 1994). Berliner examines how jazz improvisers learn from studying those who precede them, then build on that knowledge to develop a unique style. The recent work of Louise Gibbs in jazz education equates 'creative' with 'improvisational' musicianship. She highlights invention and originality as two key components of creative improvisation (Gibbs 2010). To identify important factors in jazz improvisational creativity, 34 participants with a range of musical experience (musical experience: mean 20.2 years, s.d. 14.5; improvising experience: mean 15.1 years, s.d. 14.3) were surveyed (Jordanous forthcoming). The participants were asked to describe what creativity meant to them, in the context of musical improvisation. Their responses were grouped according to the 14 components in Figure 2. Figure 3 summarises the participants' responses.
Figure 3: Relevance of creativity factors to improvisation
All components were mentioned by participants to some degree. Interestingly, some components were occasionally identified as having a negative as well as a positive influence. For example, over-reliance on domain competence was seen as detrimental to creativity, though domain competence was generally considered important. Of the 14 components of creativity in Figure 2, those identified by participants as most relevant for improvisation were:
• Social Interaction and Communication
• Domain Competence
• Intention and Emotional Involvement
Step 2: Definition of jazz improvisation creativity
Drawing upon the results from the above steps, the jazz improvisation systems were evaluated along all fourteen aspects listed in Figure 2, but with the criteria ordered so that those identified as most important were considered first, and with each of the components weighted accordingly.
Step 3: Evaluative tests for systems' creativity
Using the annotated participant data, statements were extracted to illustrate how each component is relevant to improvisation. These statements were used as test statements for each component, to analyse the four jazz improvisation systems, for example:
• How is the system perceived by an audience? (Social Interaction and Communication)
• What musical knowledge does the system have? (Domain Competence)
• Does the system get some reward from doing improvisation? (Intention and Emotional Involvement)
Each system was given a subjective rating out of 10 for each component, as represented in Figure 4.
Figure 4: Evaluating four jazz improvisation systems
The component ratings were then weighted, so that differences in more important components were magnified and differences in less important components reduced. This is pictured in Figure 5.
Figure 5: Weighted evaluation of the systems' creativity
These results show that the Voyager system (Lewis 2000) can in general be considered most creative. Focussing specifically on my own system (Jordanous 2010b): while it performs well in terms of varied experimentation and in generating original results, it could be considered more creative if it were more interactive and if more musical knowledge were used during improvisation rather than random generation.
Future work and evaluation of the approach
The success of this approach can be judged by how closely it replicates creativity evaluations from human judges, so the results of applying the Evaluation Guidelines will now be compared to human evaluations of the same systems.
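The rating-and-weighting scheme of Steps 2 and 3 can be sketched as follows; the component weights and per-system ratings here are illustrative placeholders, not the study's actual data.

```python
# Hypothetical weights: components judged most relevant get larger weights.
weights = {
    "social_interaction_communication": 3.0,
    "domain_competence": 2.5,
    "intention_emotional_involvement": 2.0,
    "variety_divergence_experimentation": 1.0,
    # ... the remaining components of the 14 would follow, with lower weights
}

# Illustrative subjective ratings out of 10 per system (placeholder values).
ratings = {
    "Voyager": {"social_interaction_communication": 9, "domain_competence": 7,
                "intention_emotional_involvement": 6,
                "variety_divergence_experimentation": 7},
    "GenJam":  {"social_interaction_communication": 6, "domain_competence": 8,
                "intention_emotional_involvement": 5,
                "variety_divergence_experimentation": 5},
}

def weighted_score(system):
    """Weighted sum: differences in heavily weighted components dominate."""
    return sum(w * ratings[system].get(c, 0) for c, w in weights.items())

for name in ratings:
    print(name, weighted_score(name))
```

Ordering systems by this weighted sum reproduces the kind of comparison pictured in Figure 5, with the weighting magnifying differences in the components participants rated as most important.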
One reviewer of this paper commented that the Evaluation Guidelines should be applied to more domains if they are to be considered a standard evaluation methodology. I quite agree with this comment; although I am working on more applications, I hope that other researchers will consider adopting the Evaluation Guidelines to evaluate their own creative systems in other domains and share their results and observations.
Concluding remarks
A comparative, scientific evaluation of creativity is essential for progress in computational creativity. Surveying the literature on computational creativity systems, one quickly finds evidence that scientific evaluation of creativity has been neglected. While creative systems are often evaluated with regard to the quality of their output, and described as creative by their authors, in all but a third of cases the creativity of these systems is not evaluated and claims of creativity are left unverified. Often a system may be evaluated in isolation, with no reference to comparable systems. Figure 1 presents the Evaluation Guidelines, a standard but flexible approach to creativity evaluation. To demonstrate the approach, four jazz improvisation systems were comparatively evaluated to see which were more creative and, importantly, in what ways one system was more creative than another. This gave valuable information on how to improve the creativity of my own system (Jordanous 2010b). This paper strongly advocates the adoption of the Evaluation Guidelines as standard practice in computational creativity research, to avoid computational creativity research slipping into a 'methodological malaise' (Pearce, Meredith, and Wiggins 2002).
Acknowledgements
Comments from Nick Collins, Chris Thornton, Chris Kiefer, Gareth White, Jens Streck and the ICCC11 reviewers were very helpful in writing this paper.

2011_21 !2011
No Free Lunch in the Search for Creativity
Dan Ventura
Computer Science Department, Brigham Young University
ventura@cs.byu.edu
Abstract
We consider computational creativity as a search process and give a No Free Lunch result for computational creativity in this context. That is, we show that there is no a priori "best" creative strategy. We discuss some implications of this result and suggest some additional questions to be explored.
Introduction
It seems natural to interpret the creative process, particularly in a computational context, as one of search. This has been done since the early years of thinking about computational creativity (Boden 1992; 1998), and, more recently, Wiggins has suggested a rather more concrete formalization of the idea (2006). Here we take up the idea of creativity as search and ask the question, "Is there a best creative (search) strategy?" Not surprisingly, perhaps, under some reasonable assumptions, we can show that the answer turns out to be an emphatic "No." To show this, we present a simple reformulation of some classical ideas from the search, optimization and machine learning literature, known collectively as the No Free Lunch (NFL) theorems (Wolpert and Macready 1995; Wolpert 1996; Wolpert and Macready 1997). For simplicity, we will limit our discussion to a discrete, finite domain D containing "artefacts" to be discovered. As is typical, we consider the problem of discovering novel, useful artefacts, and here we focus on the discovery process, attributing greater creativity to strategies that make quick discoveries.
This is not unreasonable given that, with enough time, even exhaustive search can discover good artefacts, and these ideas have been formalized elsewhere (Ritchie 2007; Ventura 2008). Indeed, it is often suggested that part of creativity is an aspect of surprise (Boden 1995; Macedo, Coimbra, and Cardoso 2001), and another way to look at rapid discovery is as a surprising result (i.e., if an observer, unaware of the search strategy used, cannot produce the result nearly as quickly [or at all], they are likely to be surprised by the result).
Main Result
To begin with, we will consider the case that there is one best element $a \in D$, which we will call $a^*$. We are interested in how long it will take a particular creative (search) strategy $\pi$ to discover $a^*$ (note that we include heuristics, background knowledge, etc. in the concept of search strategy). In the general case, $\pi$ can be probabilistic, and so the number of steps $j$ required to find $a^*$ should be represented as a probability distribution. Also, since the creator employing the strategy may have experience, exposure to an inspiring set, etc., which we will represent as $I$ as in (Ritchie 2007), this probability distribution can be conditioned on this, and we can write $P^{a^*}_{\pi}(j \mid I)$ to mean the probability, given $I$, that strategy $\pi$ will discover $a^*$ in exactly $j$ steps. Then, we are interested in the cumulative distribution function
$$C^{a^*}_{\pi}(n) = \sum_{j=0}^{n} P^{a^*}_{\pi}(j \mid I)$$
which gives the probability that $\pi$ will discover $a^*$ in $n$ or fewer steps. (We have completely ignored here the structure of the domain $D$ as well as the mechanism of strategy $\pi$; both are abstracted into the probability distribution $P^{a^*}_{\pi}(j \mid I)$.) Ideally, we would like to find a strategy $\pi^*$ such that
$$\forall a \in D, \; a = a^* \implies C^{a^*}_{\pi^*}(n) \approx 1$$
for some small, finite $n \ll |D|$. In other words, we would like to find a strategy that quickly discovers the artefact, no matter which artefact in the domain turns out to be the target $a^*$. Some reflection should suggest that such a strategy is unlikely to exist. However, perhaps we can at least find a strategy $\pi^+$ that dominates all other strategies, so that
$$\forall \pi \, \forall a \in D, \; a = a^* \implies C^{a^*}_{\pi^+}(n) \ge C^{a^*}_{\pi}(n)$$
That is, perhaps there is a strategy that will find the artefact at least as fast as any other strategy. Unfortunately, there is the further complication that, in fact, we do not know $a^*$, so we cannot compute $P^{a^*}_{\pi}(j \mid I)$. Thus, we must sum over all possible $a \in D$, redefining our goal as finding a strategy $\pi^+$ that dominates all other strategies, independent of the artefact $a \in D$ for which we are searching, so that
$$\forall \pi, \; \sum_{a \in D} C^{a}_{\pi^+}(n) P(a) \ge \sum_{a \in D} C^{a}_{\pi}(n) P(a)$$
where $P(a)$ is shorthand for $P(a^* = a)$. Because we have no way of knowing a priori for which artefact we may be looking, we must, in essence, have a strategy that will find any artefact in the domain faster than any other strategy (at least in an expected sense, weighted by the likelihood). Of course, we do not know the likelihood distribution $P(a)$ either, so for now we will assume $P(a)$ is uniform; that is, we will assume that all artefacts in $D$ are equally likely to be the artefact we are seeking. The question now is, given this uniformity assumption, is there a "best" creative strategy? The following theorem says that no such strategy exists.
Theorem 1. For a fixed, finite domain $D$, an integer $0 \le n \le |D|$ and any strategy $\pi$,
$$\sum_{a \in D} C^{a}_{\pi}(n) P(a) = \frac{n}{|D|}$$
Proof:
$$\sum_{a \in D} C^{a}_{\pi}(n) P(a) = \sum_{a \in D} \sum_{j=0}^{n} P^{a}_{\pi}(j \mid I) P(a) = \sum_{a \in D} \sum_{j=0}^{n} P^{a}_{\pi}(j \mid I) \frac{1}{|D|} = \frac{1}{|D|} \sum_{a \in D} \sum_{j=0}^{n} P^{a}_{\pi}(j \mid I) = \frac{1}{|D|} \sum_{j=0}^{n} \sum_{a \in D} P^{a}_{\pi}(j \mid I) = \frac{1}{|D|} \sum_{j=0}^{n} 1 = \frac{n}{|D|}$$
The first equality is by definition; the second is by the assumption of uniformity; the next two are simple algebra; the fifth is because the probability that some $a \in D$ is found is unity; the last is obvious.
What the theorem says is that, in the absence of biasing information about the creative task, the probability of discovering $a^*$ is independent of $\pi$, the search strategy. (The dual version of this says that the expected number of steps required to find $a^*$ is $E^{a^*}_{\pi}[n] = |D|/2$ and, in particular, is independent of $\pi$.) In other words, if we do not know anything about the creativity task, no creative strategy is to be preferred over any other.
Now, let us relax or remove some of our simplifying assumptions and ask if this makes a difference. First, we can consider the possibility that more than one $a \in D$ is desirable; that is, we are searching for any member of a set $A^* \subseteq D$ of desirable artefacts. Since $P^{A^*}_{\pi}(j \mid I) \le \sum_{a \in A^*} P^{a}_{\pi}(j \mid I)$ (with equality if the probabilities are independent), the consequent of Theorem 1 takes the form
$$\sum_{a \in D} C^{A^*}_{\pi}(n) P(a) \le \frac{|A^*| \, n}{|D|}$$
and, notably, is still independent of the choice of $\pi$. Next, we can consider the non-uniform case for $P(a)$. In this case, the consequent in the theorem statement requires an additional integral, taken over the continuous space of possible distributions, and the resulting form is
$$\int_{P(a)} \sum_{a \in D} C^{a}_{\pi}(n) P(a) \, dP(a) = \frac{n}{|D|}$$
In other words, not assuming anything about the probability distribution $P(a)$ has the same effect on our expected success as does assuming $P(a)$ is uniform. Note that these generalizations naturally compose, so that we can make a statement about the distribution-free probability of finding one of multiple desirable artefacts:
$$\int_{P(a)} \sum_{a \in D} C^{A^*}_{\pi}(n) P(a) \, dP(a) \le \frac{|A^*| \, n}{|D|}$$
Finally, we can mention the case of non-stationary $D$ (i.e. the case for transformational search). While we will not say much about this here, we will note that there are variations of the NFL theorems for optimization that treat the case of a changing objective function (Wolpert and Macready 1997), and similar results will likely hold for transformational search.
As an aside, the set $A^*$ can alternatively be thought of in terms of a fitness function $f : D \to [0, 1]$ that measures the desirability of an element of the domain, and a threshold $\theta$, such that $A^* = \{a \in D : f(a) > \theta\}$. Or, we can eliminate the hard constraint and compute with the fitness $f$ more directly. In this case, rather than summing over the different elements for which we might be searching, we integrate over the different fitness values we might find, weighted by the probability of finding a domain element whose fitness is that particular value.
Then, the probability of finding the $a \in D$ with the highest fitness (which we assume is 1), assuming the distribution of fitness values $P(f)$ is uniform, becomes
$$\int_{f=0}^{1} C^{f}_{\pi}(n) P(f) \, df = \frac{n}{|D|}$$
Our most general statement, distribution-free, directly including a fitness function, and making no assumption about the distribution of fitness values, becomes
$$\int_{P(a)} \int_{P(f)} \int_{f=0}^{1} C^{f}_{\pi}(n) P(f) \, df \, dP(f) \, dP(a) = \frac{n}{|D|}$$
Discussion
On the one hand, anyone familiar with NFL-type results will not be surprised that one applies here. Indeed, even the original NFL theorems could be seen as, in some ways, "formalizing the obvious". On the other hand, the result, whether surprising or not, has profound implications for computational creativity and gives us a framework in which to discuss general principles. For example, the probabilistic characterization of creativity-as-search allows a statistical interpretation of many aspects of computational creativity and, in particular, suggests that the optimal (in the Bayesian sense) approach to any creative endeavor (that can be cast as search) is to use the following search strategy:
$$\pi(I) = \arg\max_{a \in D} P(a)$$
However, it is not even clear what knowledge of $P(a)$ means; and, of course, even if we did somehow know $P(a)$, for any interesting domain $D$, explicitly implementing such a search is completely intractable. So, the obvious question is how to approximate the Bayes optimal search. Perhaps it is possible to dynamically choose the search strategy $\pi_a$ for any $a \in D$ such that
$$\forall \pi \, \forall a \in D, \; C^{a}_{\pi_a}(n) \ge C^{a}_{\pi}(n)$$
In other words, we would like a method for biasing our search strategy towards the artefact we are looking for. Short of a priori knowledge of $a$, this at the least requires some meta-knowledge about the creative task in question that can be used to guide the choice of $\pi_a$. We note here the similarity to meta-learning in the field of machine learning and, further, suggest a close tie to the case of transformational search. If the domain $D$ is transformed, becoming $B$, we must assume that $a^*$ has likely changed as well, becoming $b^* \in B \setminus D$ (if this is not the case, how do we explain or justify the domain transformation?). If this is the case, we must have some mechanism for changing our search bias to match, switching from strategy $\pi_{a^*}$ to $\pi_{b^*}$. In the case of machine learning, the NFL result says, crudely, that no learning algorithm is better than any other over all possible learning problems. The standard rejoinder to this result is that, in fact, we don't (and Nature doesn't) care about all possible learning problems, many of which represent "learning" scenarios that are not interesting or do not represent "real-world" scenarios. This dogma is universally accepted in the field of machine learning, and does seem intuitively appropriate. Further, it leads to interesting questions about which problems are the "interesting" ones, how we can tell, and, knowing this, how we can build learning algorithms that are biased toward these types of problems. In our current discussion of computational creativity, the analogous argument would be either that we are not likely to be searching for just any possible $a \in D$ (and thus universal quantification over $a$ is too strong a constraint) or, perhaps, that Nature will favor certain members of $D$ (and thus universal quantification over $P(a)$ is too strong a constraint). This sort of argument, of course, leads to inquiries regarding which members of $D$ might be interesting or which distributions $P(a)$ might represent Nature; however, it is, at least at this point, much less clear that, in fact, we can make such a claim for computational creativity.
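Theorem 1 is also easy to check numerically. The sketch below (an illustration added here, not part of the original argument) pits a deterministic scan against blind random sampling over uniformly drawn targets; both locate the target within n steps with probability approximately n/|D|, exactly as the theorem predicts.

```python
import random

D_SIZE, N_STEPS, TRIALS = 1000, 50, 20000

def random_strategy(target):
    """Examine N_STEPS distinct elements chosen uniformly at random."""
    return target in random.sample(range(D_SIZE), N_STEPS)

def fixed_scan_strategy(target):
    """Deterministically examine elements 0 .. N_STEPS-1 in order."""
    return target < N_STEPS

for strategy in (random_strategy, fixed_scan_strategy):
    # Draw the target a* uniformly from D on each trial, per the theorem.
    hits = sum(strategy(random.randrange(D_SIZE)) for _ in range(TRIALS))
    print(strategy.__name__, hits / TRIALS)  # both approx N_STEPS/D_SIZE = 0.05
```

Any other strategy substituted here, however clever, yields the same success rate as long as the target distribution stays uniform; an advantage only appears once the target distribution is biased and the strategy knows something about that bias.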
Since our result states, essentially, that all search strategies are equally effective over all possible search problems, we are asking whether all search problems are in some way "interesting" and, as a result, whether all search strategies are valuable. If we content ourselves (for the moment) with equating creativity with search strategy, then, in turn, we are asking whether all creative approaches are valuable, or whether some can be shown to be inherently better than others. One might be tempted to claim that for a specific $D$, $a^*$ is fixed and thus that a domain fully specifies a "creativity scenario". This would, indeed, make further analysis somewhat more tractable; however, it is unlikely that such a strong assumption is reasonable (e.g., if $D$ is the set of all possible paintings, creating the "ultimate" painting is not likely to be either temporally or spatially consistent; indeed, the very definition of $D$, of what constitutes a painting, is likely to change over time and very possibly across locales as well). Thus, it is not clear that this question can be answered even for a specific domain $D$, let alone in the more general case, but it is, certainly, an interesting question to consider.
Acknowledgments
This material is based upon work that is partially supported by the National Science Foundation under Grant No. IIS0856089.

2011_22 !2011
Dynamic Inspiring Sets for Sustained Novelty in Poetry Generation
Pablo Gervás
Universidad Complutense de Madrid, c/ Profesor García Santesmases s/n, Madrid, 28040, Spain
pgervas@sip.ucm.es
Abstract
One activity recognised as an interesting instance of creativity is the ability of poets to systematically come up with new poems, irrespective of how many they have already written in the past. Poets who periodically produce a new poem, different from the earlier ones and comparatively valuable, are considered to embody a more interesting type of creativity than poets who have produced only one good poem in their lifetime, or those that produce a succession of poems all built following a standard recipe so obvious that it can almost be described as a template. This paper explores the constraints imposed on computational creative processes by the requirement of novelty with respect to previous outputs. Two issues emerge as fundamental: how to evaluate novelty against a set of artifacts taken as reference, and how to adjust construction procedures so that each successive run leads to significantly different output. A possible modelling of these two issues is proposed in terms of two sets of sample artifacts: a reference set (against which novelty is measured) and a learning set (used to configure the construction procedures). Over this basic model, extended discussion is carried out to draw out interesting insights for the design of computational creative systems.
Introduction
The process of coming up with a novel poem involves a skill for producing something recognisable as a poem and the ability to recognise efforts that lead to results too similar to previous poems (in order to avoid them). Existing automatic poetry generators have mostly focused on modelling the generative skill rather than the historical evaluation function. But even if these two elements were successfully modelled in a single system, the task of modelling how they evolve over time would remain an important challenge for computational creativity.
Both by generating new material and by reading material by others in between acts of creation, human authors modify the frame of reference that they employ to judge their own creations. Additionally, their technique may evolve over time, sometimes through exploration of new possibilities but quite often as a result of a conscious effort to emulate material produced by others that they have liked. Human authors who produce new material by modifying their technique are considered more creative than those that simply obtain different material using the same technique. The present paper reviews a number of existing poetry generators, focusing on their ability to model a generation skill and a validation mechanism for novelty. Two issues emerge as fundamental: how to evaluate novelty against a set of artifacts taken as reference, and how to adjust construction procedures so that each successive run leads to significantly different output. A possible modelling of these two issues is proposed in terms of two sets of sample artifacts: a reference set (against which novelty is measured) and a learning set (used to configure the construction procedures).
Previous Work
This section presents some useful references in terms of computational creativity that provide a basic vocabulary to discuss the phenomena under study, and reviews a number of automated poetry generators.
Computational Creativity
Many efforts over recent years that address the study of creativity from a computational point of view acknowledge as a predecessor the work of Margaret Boden (1990). Boden proposed that artificial intelligence ideas might help to understand creative thought. This idea was taken up by a number of artificial intelligence researchers and gave rise to a research line that attempts to model or reproduce creative thought in computer systems. Some of Boden's ideas have had great influence on later work. One important idea was the distinction between historical and psychological views of creativity. Historical creativity (H-creativity) involves the production of ideas that have not appeared before to anyone else in all human history. Psychological creativity (P-creativity) involves the production by a given person of ideas that have not occurred before to that particular person. This distinction is important because it implies that, unless a computer program is given access to historical data (and generally provided with means for social interactions with other creators), it will only be capable of P-creativity. Wiggins (2006) takes up Boden's idea of creativity as search over conceptual spaces and presents a more detailed theoretical framework that specifies formally the different elements involved (the universe of possible concepts, the rules that define a particular subset of that universe as a conceptual space, the rules for traversing that conceptual space, and the function for evaluating particular points in that space). Wiggins points out that the rules for traversing a conceptual space may lead to elements in the universe but outside the definition of the conceptual space. In fact, definitions of search space and traversal function in a creative setting are not only particular to a given creator and different from those used by others, but also constantly in flux. Ritchie (2007) addresses another important issue in the development of creative programs, that of evaluating when a program can be considered creative.
He does this by outlining a set of empirical criteria to measure the creativity of the program in terms of its output. He makes it very clear that he is restricting his analysis to the questions of what factors are to be observed, and how these might relate to creativity, specifically stating that he does not intend to build a model of creativity. Ritchie's criteria are defined in terms of two observable properties of the results produced by the program: novelty (to what extent is the produced item dissimilar to existing examples of that genre) and quality (to what extent is the produced item a high-quality example of that genre). Another important issue that affects the assessment of creativity in creative programs is the concept of the inspiring set, the set of (usually highly valued) artifacts that the programmer is guided by when designing a creative program. Ritchie's criteria are phrased in terms of: what proportion of the results rates well according to each rating scheme, ratios between various subsets of the results (defined in terms of their ratings), and whether the elements in these sets were already present or not in the inspiring set. Jennings (2008) introduced computationally plausible modelling of the fact that most human creativity takes place with the creator embedded in a broader society of other creators and critics, and that this context significantly affects the creation of new artifacts. To capture the way in which humans react to these constraints, Jennings defines the concept of creative autonomy, which requires that a system be able to evaluate its creations without consulting others, that it be able to adjust how it makes these evaluations without being explicitly told when or how to do so, and that these processes not be purely random. The model he proposes relates the evaluation of a system's creations to its perception of how other members of its social context are likely to evaluate them. Changes in how this evaluation is carried out may be triggered by the need to align personal evaluations with other members of the society, or as a side effect of trying to justify past evaluations. Creative autonomy is therefore argued to emerge out of the interactions with multiple critics and creators, rather than from meditative isolation.
Automatic Poetry Generators
A number of existing automated poetry generators are reviewed, focusing on the basic techniques for text creation that have been used as underlying technologies. The generate & test paradigm of problem solving has been widely applied in poetry generators. Because metric restrictions are reasonably easy to model computationally, very simple generation solutions coupled with an evaluation function for metric constraints are likely to produce acceptable results (given an assumption of poetic licence as regards the content). The WASP system (Gervás, 2000) draws on prior poems and a selection of vocabulary provided by the user to generate a metrically driven recombination of the given vocabulary according to the line patterns extracted from the original poems. The WASP automatic poet used a set of construction heuristics obtained from formal metric constraints to produce a poem from a set of words and a set of line patterns provided by the user. The system followed a generate and test method by randomly producing word sequences that met the formal requirements. Output was impeccable from the point of view of formal metrics, but clumsy from a linguistic point of view, and it made little sense.
An example of poem output by WASP is given below:
Todo lo mudará la edad hermosa.
Marchitará la luz el vuelo helado
del gesto. Se escogió en color airado
con no hacer mudanza por su rosa.
This is a metrically correct cuarteto, a bit stilted from a grammatical point of view, and clearly driven by the underlying choice of vocabulary and patterns, which recall very specific examples of Spanish sixteenth-century classics. The actual meaning ('Beautiful age will alter all. // The frozen flight of the gesture // will make light wilt. // It was chosen in angry colour // on not adapting for its rose.') emerges from the construction process as a surprise. An initial work by Manurung (1999), based on chart generation, focuses on the generation of poetry in English, starting from a semantic representation of the meaning of the desired poem. A very important driving principle in this case is to respect the unity between form and meaning that is considered to provide the aesthetic backbone of real poetry. This implies that poems to be generated must aim for some specific semantic content, however vaguely defined at the start of the composition process. The approach relied on chart generation, taking as input a specification of the target semantics in first-order predicate logic, and a specification of the desired poetic form in terms of metre. Words that subsume the input semantics are chosen from a lexicon, and a chart is produced incrementally to represent the set of possible results. At each stage, the partial solutions are checked semantically to ensure that no sentences incompatible with the original input are produced. Additionally, partial results are checked for compatibility with the desired poetic form. Because the search space is pruned of invalid partial solutions at each stage, the approach is generally efficient. This corresponds to a systematic generate & test approach, trying all possibilities and making sure that no partial constituent is generated twice by the system. It also allows the user to control the input in terms of meaning. This has the advantage of somewhat restricting the probability of obtaining nonsensical output, but it also limits the degree of freedom of the system. The amount of creativity the system can exercise on the semantics of its output is limited. This constitutes a significant restriction on the extent of poetic licence allowed. An example of poem output by Manurung's initial system is given below:
the cat is the cat which is dead
the bread which is gone is the bread
the cat which consumed
the bread is the cat
which gobbled the bread which is gone
This is produced by the system from input specifying a limerick target form and target semantics {cat(c), dead(c), bread(b), gone(b), eat(e,c,b), past(e)}, but disregarding the rhyme scheme. Manurung went on to develop in his PhD thesis (Manurung, 2003) an evolutionary solution for this problem. Evolutionary solutions seem particularly apt to model this process, as they bear certain similarities with the way human authors may explore several possible drafts in parallel, progressively editing them while they are equally valuable, focusing on one of them when it becomes better valued, but returning to others if later modifications prove them more interesting. Manurung's evolutionary solution is demonstrated in MCGONAGALL, a proof-of-concept system for a model of poetry generation as a state space search, solved using evolutionary algorithms.
Manurung tests his system exhaustively, hoping to demonstrate separately its abilities as a form-aware generator (coming up with poems matching a given target form), as a tactical generator (coming up with valid realizations for a target semantics) and as a poetry generator (combining both to come up with poems matching the target form and the target semantics as closely as possible). Results seem to indicate that acceptable output from a purely linguistic point of view is easily achievable for the form-aware generator, achieved with difficulty for the tactical generator, and extremely difficult to achieve for the poetry generator. An example of poem output by MCGONAGALL in its form-aware mode is given below:
They play. An expense is a waist. A lion, he dwells in a dish. He dwells in a skin. A sensitive child, he dwells in a child with a fish.
Manurung's results explain why most automatic poetry generators restrict themselves to operating in form-aware mode. Another important tactic that human authors are known to use is that of reusing ideas, structures, or phrasings from previous work in new results. This is very similar to the AI technique of Case-Based Reasoning (CBR). Some poetry generators have indeed explored the use of this technique as a basic generation mechanism. ASPERA, an evolution of the WASP system (Gervás, 2001), used CBR to build verses for an input sentence by relying on a case base of matched pairs of prose and verse versions of the same sentence. Each case was a set of verses associated with a prose paraphrase of their content. An input sentence was used to query the case base, and the structure of the verses of the best-matching result was adapted into a verse rendition of the input. This constituted a different approach to hardening the degree of poetic licence required to deem the outputs acceptable (the resulting verses should have a certain relation to the input sentence). ASPERA is described as following a classic Retrieve-Reuse-Revise-Retain CBR cycle (Aamodt and Plaza, 1994). The Revise-Retain stages involve carrying out an analysis of any validated poems in order to add the corresponding information to the system data files, to be used in subsequent computations. The ASPERA system requires user supervision to carry out these steps, with a human revising successive outputs to decide which should be retained. An example of poem output by ASPERA is given below:
Ladrará la verdad el viento airado
en tal corazón por una planta dulce
al arbusto que voláis mudo o helado.
It is interesting to observe that this poem ('The angry wind will bark the truth // in such a heart for a sweet plant // to the bush that you fly mute or frozen.') shows certain similarities to the output of the WASP system shown above (the first verse of this terceto matches the structure of the second verse of the cuarteto). This suggests that the same classic verse must have been used to provide the line pattern used by WASP and the prose-verse version of the sentence used by ASPERA. There are also similarities in lexicon and rhymes. In 1984 William Chamberlain published a book of poems called "The Policeman's Beard is Half Constructed" (Chamberlain, 1981). In the preface, Chamberlain claimed that all of the book (but the preface) had been written by a computer program. The program, called RACTER, managed verb conjugation and noun declension, and it could assign certain elements to variables in order to reuse them periodically (which gave an impression of thematic continuity). Although few details are provided regarding the implementation, it is generally assumed that RACTER employed grammar-based generation.
The poems in Chamberlain's book showed a degree of sophistication that many claim would be impossible to obtain using only grammars, and it has been suggested that a savvy combination of grammars and carefully-crafted templates may have been employed, enhanced by heavy filtering of a very large number of results. An example of poem output by RACTER is given below:
More than iron
More than lead
More than gold I need electricity
I need it more than I need lamb or pork or lettuce or cucumber
I need it for my dreams
This poem shows how structural repetition (in this instance, of the constructions 'more than' and 'I need') can be fundamental for aesthetic effect. The use of n-grams to model the probability of certain words following on from others has proven to be another useful technique. An example of poetry generation based on this is the cybernetic poet developed by Ray Kurzweil. RKCP (Ray Kurzweil's Cybernetic Poet, http://www.kurzweilcyberart.com/poetry/rkcp overview.php3) is trained on a selection of poems by an author or authors, and it creates from them a language model of the work of those authors. From this model, RKCP can produce original poems which will have a style similar to that of the authors on which it was trained. The generation process is controlled by a series of additional parameters, for instance the type of stanza employed. RKCP includes an algorithm to avoid generating poems too close to the originals used during its training, and certain algorithms to maintain thematic coherence over a given poem. Over specific examples, it could be seen that the internal coherence of given verses was good, but coherence within sentences that spanned more than one verse was not so impressive. An example of poem output by RKCP (after Lord Byron) is given below:
Oh! did appear
A half-formed tear, a Tear.
By the man of the heart.
This example shows how extreme brevity can help to convey the idea of interesting underlying semantics even when none have actually been involved in the construction process.
Modelling Example-Driven Creation
Attempts at modelling the task of text creation computationally tend to focus on a static representation. The conceptual models used correspond to those of a single act of creation, or the creation of a single text. This section outlines how a model that considers the dynamics involved in a sequence of acts of creation might be described in terms of some classic concepts of computational creativity. With a view to exploring the phenomena that we are interested in, we need to make a few assumptions about the various ingredients that might be involved, and how we are to refer to them. We take up Ritchie's terminology (Ritchie, 2007) to describe the inspiring set, the set of (usually highly valued) artifacts that the programmer is guided by when designing a creative program. In our case, this will be a collection of texts that the program is aware of at the start. However, this set of poems may be used in two different ways. On one hand, it can be used to inform the production mechanism that is used. Some of the possible mechanisms for producing text used by the automatic poetry generators reviewed above rely on a collection of texts either to act as a case base, from which to extract a grammar, or on which to train an n-gram model. We will refer to the set of texts used for this purpose as the learning set. On the other hand, it can be used to inform the evaluation metric used for checking the novelty of results.
In most cases, this will take the form of checking new results against this previous collection to test for P-creativity. We will refer to the set of texts used for this purpose as the reference set.
Instances of Inspiring Sets in Poetry Generators
ASPERA and RKCP clearly rely on a set of poems that they use as an inspiring set. In both cases this inspiring set is used as a learning set. They differ in their approach to dynamic updating of this learning set, and in whether they use it as a reference set as well. ASPERA includes a mention of how the results of the system can be integrated into the creative process. Because this system applies a CBR solution, subsequent results may be added to the case base to provide additional material for later runs. In this case, the set of poems used to create the case base constitutes the learning set, in the sense that it determines what outputs can be constructed. The Revise-Retain stages constitute instances of expanding the learning set. However, one must take into account the fact that this task must be carried out by a human. No mention is made of how ASPERA avoids replicating partly or completely previous solutions. Indeed, in traditional CBR systems replicating previous solutions is considered an advantage rather than a drawback. This may be a disadvantage when they are applied for creative purposes. In this light it seems fair to say that ASPERA does not use a reference set. In the case of RKCP, the selection of poems by a given author from which the language model used by the system is built constitutes an instance of a learning set. RKCP introduces an innovation by allowing the possibility of maintaining separate models for the styles of different authors. In this case, each one of these models comes from a different learning set, and the system can switch from one to another to achieve variety. Yet the set of models remains fixed. RKCP is said to include an algorithm to avoid generating poems too close to the originals used during its training. No details are provided as to how this algorithm operates. It is unclear whether it achieves this by checking against a set of reference texts or by constraining the construction process in some way. Yet it is fair to say that in this instance the learning set is also being used as a reference set. There seems to be no mechanism for ensuring that successive system outputs differ from one another at least as much as they differ from the reference set. The chart approach employed in Manurung (1999) is aimed at ensuring that no partial constituent is generated twice by the system. This can be seen as a different practical solution to the problem of avoiding redundant outputs. The use of a chart structure to store intermediate results is another alternative to comparing each candidate with a store of previously produced items, with no added cost of re-indexing after each production. However, this system applies that solution only to avoid redundancy during exploration towards a specific single output. No mention is made of keeping records of previous outputs to be considered when producing new ones. For this additional purpose, a chart would be impractical.
Dynamic Inspiring Sets as Configuration Resources
The review of existing automatic poetry generators has shown that it is possible to define both the construction process and the validation of output novelty with respect to a set of texts, referred to as the inspiring set.
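To make the two roles of the inspiring set concrete, here is a minimal sketch (all names, texts and thresholds are hypothetical) in which a learning set configures a bigram production mechanism and a reference set gates the novelty of candidate outputs:

```python
import random
from collections import defaultdict

def train_bigrams(learning_set):
    """Learning set -> production mechanism: a bigram successor table."""
    model = defaultdict(list)
    for text in learning_set:
        words = text.split()
        for w1, w2 in zip(words, words[1:]):
            model[w1].append(w2)
    return model

def generate(model, start, length=8):
    """Random walk over the successor table, stopping at dead ends."""
    words = [start]
    while len(words) < length and model[words[-1]]:
        words.append(random.choice(model[words[-1]]))
    return " ".join(words)

def is_novel(candidate, reference_set, threshold=0.5):
    """Reference set -> novelty gate: reject if too close to any known text
    (here, Jaccard similarity over word sets as a crude stand-in)."""
    cand = set(candidate.split())
    for text in reference_set:
        known = set(text.split())
        if len(cand & known) / max(len(cand | known), 1) > threshold:
            return False
    return True

learning_set = ["the rose wilts in the frozen light", "the wind barks the truth"]
reference_set = list(learning_set)  # initially, novelty is judged against the corpus
model = train_bigrams(learning_set)
poem = generate(model, "the")
print(poem, "novel:", is_novel(poem, reference_set))
```

Note that the same collection of texts serves both purposes here, which is precisely the static configuration the paper argues against; the dynamic alternative would grow each set over time.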
The concept of an inspiring set of poems as a means of configuring a poetry generator presents the advantages of corpus-based approaches to natural language processing. Within this paradigm, the performance of a processing system is determined by the corpus on which it was trained. If the system needs to change, this is done by changing the corpus and retraining the system. In computational work carried out so far on automated poetry generators, inspiring sets have always been configured in a static way. A given set of texts is taken as the inspiring set and used as learning set and/or reference set, with smaller or larger numbers of poems produced with no thought to how the act of producing new poems might affect either. Ideally we would want to explore the possibilities of having the creation process itself provide feedback to these sets, with a view to identifying whether such feedback might capture some of the patterns observed in human creativity. If this approach is applied in creative endeavours, it presents the additional problem of requiring some means for managing the progressive evolution of these inspiring sets, including the need to periodically retrain the system. This issue is discussed with respect to the technical solutions outlined earlier. CBR approaches include the means for systematically updating the case base, which constitutes an instance of a learning set. A case base can also be employed as a reference set if the retrieval stage is configured to always select slightly dissimilar cases, forcing the reuse stage to apply heavy adaptation. In a similar vein, n-gram-based approaches provide means for enforcing difference with the inspiring set by controlling the probability of the resulting chains. To model the possibility of progressively extending the inspiring set, a procedure would have to be introduced for periodically retraining the models employed.
The Problem of Managing Inspiring Sets
From the point of view of collecting intuitions on the various challenges involved in modelling human creativity, the issues outlined so far point to a candidate requirement that had not been identified as part of creative systems developed in the past: the need for appropriate management of the set of inspiring artifacts. Where these artifacts are used as a reference set, appropriate management of the reference set would become a must for any system that intends to be aware of whether its results in successive runs are indeed new with respect to previous output. This would correspond to the way in which human creators seem to be aware of the state of the art in their fields, at least to the extent of knowing when solutions being explored are indeed new. This requirement would be tightly related to Boden's concept of P-creativity. The most straightforward solution to apply would be to test every result for similarity against previously known texts. This requires each new result to be compared with an ever-increasing set of previous results. Elaborate means of indexing or clustering the set of reference texts would help reduce the complexity of checking for novelty. However, the improvements obtained in such a way would have to be offset against the difficulties involved in having to re-index the reference set each time a new result is added. From an engineering point of view, it would make sense to resort to indexing the set of reference artifacts.
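One plausible realization of such indexing, sketched below under the assumption that novelty is approximated by word n-gram overlap, is an inverted index from n-grams to reference texts, so that each novelty check touches only the texts that share material with the candidate:

```python
from collections import defaultdict

class ReferenceIndex:
    """Inverted index from word n-grams to reference texts. Checking a
    candidate only visits texts sharing at least one n-gram with it."""

    def __init__(self, n=3):
        self.n = n
        self.texts = []
        self.postings = defaultdict(set)  # n-gram -> ids of texts containing it

    def _ngrams(self, text):
        words = text.split()
        return {tuple(words[i:i + self.n]) for i in range(len(words) - self.n + 1)}

    def add(self, text):
        """The re-indexing cost, paid once per accepted result."""
        tid = len(self.texts)
        self.texts.append(text)
        for gram in self._ngrams(text):
            self.postings[gram].add(tid)

    def max_overlap(self, candidate):
        """Fraction of the candidate's n-grams shared with its closest
        reference text; 0.0 means no reference text shares any n-gram."""
        grams = self._ngrams(candidate)
        counts = defaultdict(int)
        for gram in grams:
            for tid in self.postings[gram]:
                counts[tid] += 1
        return max(counts.values(), default=0) / max(len(grams), 1)
```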
The task of indexing would slow down the production of successive results, but it would speed up the task of exploring the conceptual space in search of new candidate solutions. This would match the intuition that human creators may take some time between the creation of one piece and setting out to produce a new one. During this time a creator may be digesting the results of his last creation, or searching for inspiration for the next. The task of re-indexing the reference set could be identified as part of either or both of these processes. In contrast, during the act of creation itself human creators shift very rapidly through candidate solutions, with no lingering on past material. This would match more closely an engineering model based on prior indexing than one based on systematic comparison. Where the set of already known artifacts is used to inform the construction procedure, different procedures will arise depending on the particular technique employed. In CBR solutions, management of the learning set would involve periodically re-indexing a growing case base. For n-gram-based approaches, management of the learning set would consist of periodically retraining the set of models employed, or possibly more radical periodic rearrangements involving re-clustering of the learning set and training a new set of models based on the resulting set of clusters. With respect to human performance, the periodic maintenance of the learning set would mirror the evolution of an author's technique.
Intuitions for Computational Creativity
The issue of considering successive system results as part of the reference set is fundamental for being able to discern whether any given result is P-creative as defined by Boden. Not many previous creative systems consider this. A notable exception is the MEXICA storytelling system (Pérez y Pérez, 1999), which does include a mechanism for checking its results against previous outputs. The subdivisions of the inspiring set described above could be related to some of the formal elements described in Wiggins' framework (Wiggins, 2006). The learning set of any such system actually determines the conceptual space, and the production mechanism derived from it constitutes the operational equivalent of a traversal function. In a similar vein, the reference set will undoubtedly play a fundamental role in any evaluation function designed to consider novelty, as that of any system aiming to be creative should. The possibilities discussed above of dynamically updating either the learning set or the reference set would capture the intuition that definitions of search space, traversal function and evaluation function may be constantly in flux. The reference set in its various configurations described above plays an important role in the definition or identification of the novelty of the results of a program that aims at being creative, as described by Ritchie (2007). No mention has been made in this paper of the quality of the results, which is their other important observable property. An appropriate evaluation of the quality of results would play a fundamental role in all the processes described above. Only results that have been rated above a given threshold of quality should be considered for updating either the reference set or the learning set. This constitutes a less obvious but quite significant role of whatever evaluation function is used to determine quality in a creative system.
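A quality-gated update cycle of this kind might look as follows, reusing generate and ReferenceIndex from the sketches above; the quality function, cutoffs and attempt limit are stand-ins, not a prescription:

```python
def creative_session(model, reference_index, quality, novelty_cutoff=0.6,
                     quality_threshold=0.7, attempts=100):
    """One 'act of creation': generate until a result is both sufficiently
    novel and sufficiently good; only then feed it back into the reference
    set, so that future novelty checks take it into account."""
    for _ in range(attempts):
        candidate = generate(model, "the")           # production mechanism
        if reference_index.max_overlap(candidate) > novelty_cutoff:
            continue                                  # too close to known work
        if quality(candidate) < quality_threshold:
            continue                                  # novel but not good enough
        reference_index.add(candidate)                # accepted: update reference set
        return candidate
    return None  # the production mechanism may be reaching its limits
```

Extending the same gate to the learning set (retraining the model on accepted results) would implement the feedback loop the paper proposes, with repeated None returns serving as a signal that retraining is due.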
While the most obvious role of the quality function is to select actual system outputs, in doing so it also guides the development of an individual style for the system, through the task of filtering which results get added to the reference or learning set. The choices taken in the past are fed back into system operation and affect choices taken in the future. The dynamic handling of the reference set and the learning set may provide a simple way of implementing in computational systems the kind of social interaction described by Jennings as fundamental for human creativity (Jennings, 2008). Systems defined in terms of dynamic inspiring sets as described in this paper would automatically adjust their novelty evaluation functions just by being exposed to new artifacts as they are added to the reference set. They would also adjust their production mechanisms by exposure to new artifacts as they are added to their learning set. This would allow for easy modelling of the interplay between learning to emulate the work of creators one admires (as driven by its inclusion in the individual's learning set) and trying to steer away from the work one dislikes (by including it in the reference set). Such a model would capture much of the way in which human creators learn their trade.
Conclusions
Computational approaches to creativity in the past have mostly considered static views of the creative process, in the sense that they have focused on a single act of creation rather than on what it is that enables a creator to achieve a sequence of acts of creation, each one producing a different artifact that innovates with respect to previous ones and possibly is the result of a different process of construction or composition. The present paper has explored how such a dynamic view of the creative process may be modelled in terms of two subdivisions of the concept of inspiring set: a learning set from which the production mechanisms are learnt, and a reference set against which candidate results have to be checked to ensure novelty. This approach implies that each individual creative act must be considered in a larger context. From an engineering point of view this presents the problem of how to manage the reference set in a computationally tractable manner. It also introduces the problem of when and how each of these sets is updated, and when they are processed to re-train the production mechanisms or re-index the reference set. As future work it would be very valuable to consider possible ways of implementing some of these ideas in experimental systems. Such systems need not attempt to model full-blown literary generators. Even the simpler task of generating coherent text as output could serve as a test for some of the ideas presented in this paper. The mechanisms for ensuring novelty with respect to a set of source materials apply just as well to simple text as to more literary efforts. The ability to identify when a production mechanism is reaching its limits and begins to produce less novel material can also be tested over such simple set-ups.
Acknowledgments
The research reported in this paper is partially supported by the Ministerio de Educación y Ciencia (TIN2006-14433-C02-01, TIN2009-14659-C03-01), Universidad Complutense de Madrid and Dirección General de Universidades e Investigación de la Comunidad de Madrid (CCG07-UCM/TIC 2803).
2011_23 !2011
Bimba: Sensor Embedded Balls for Creative Sound Generation
Pol Pla and Patricia Maes
Fluid Interfaces Group, MIT Media Lab, Cambridge, MA 02139 USA
{pol,pattie}@media.mit.edu
Abstract
Bimba is an exploration into designing playful interfaces which use embedded sensors to sonify object interactions. The system wirelessly collects data from sensors integrated into foam balls and generates a sound composition extracting the most relevant features of movement. We have designed and built a collection of sensor units, implemented a protocol to support a wireless real-time data network infrastructure, and proposed a sonification mapping for the generated output. The implementation of Bimba informs design possibilities for creative and expressive interfaces, with a particular focus on those generating sound.
Introduction
Physicality has shown potential in facilitating the design of complex interactive systems. Fitzmaurice, Ishii and Buxton introduced the concept of graspable interfaces (1995) - physical objects that interface with computers for more meaningful human-computer interactions. Their work emphasizes the central role that physicality plays in the design of effective interactions. Further work (Ishii et al., 1997; Maynes-Aminzade, 2003) has investigated how users can leverage their acquired intuition about physical laws to better understand tangible systems. In our design, we seek natural and meaningful relationships between physical properties and sounds. While there is not a definitive way to sonify physicality, we are proposing one approach with the aim of generating conversation around the potential of such environments. The intersection between tangible interaction and sound generation has been explored before. Systems like the ReacTable (Jordà et al., 2007) are examples of physical manipulation of sound. In the same vein, there exists a large collection of projects that embed sensors into instruments (Young, 2002; Overholt, 2005) or into wearables (Marrin and Picard, 1998; Paradiso et al., 2000) to control musical output or even enhance stage performances. The system we propose builds on concepts and mappings that we observed in these projects, but it is, at the same time, distinct in two main aspects. Firstly, our system proposes a duality between control and randomization, which makes it different from tangible musical interfaces that attempt to maximize control over the output. Secondly, it is designed to be embedded in everyday objects - as opposed to instruments, the performer herself, or her clothing. There are two projects in particular that have deeply informed our work: Squeezables (Weinberg and Gan 2001) and Squidballs (Bregler et al., 2005). Squeezables are a collection of soft instruments that generate sound in an intimate way. Squidballs, on the other hand, involve a large audience in massive collaborative play. We wanted to create a system that works in both contexts, which is why we implemented Bimba to suit intimate environments as well as shared spaces. Expressiveness is an important factor in musical performance; it helps the artist to better communicate her music to the audience. At the same time, expressivity helps the audience to better appreciate the performance - establishing a connection between the performer's actions and the resulting sounds. We borrowed inspiration from projects that try to emphasize the emotional aspect of a performance when generating or affecting musical outcome, such as Marrin and Picard (1998).
We believe that the physicality present in our system, combined with intuitive mappings, can help convey emotions and contribute to an interesting and informative visual - as well as auditory - experience. As reflected in (André et al., 2009), serendipity can have a great impact on the creative process. We already commented on the affordances of object-based musical systems, but we also believe that such systems offer interesting possibilities for serendipitous exploration and sound randomization. Physical laws can help us design more accessible systems, but at the same time enable users to experiment with uncontrolled environments. In doing so, performers may even find new inspiration in the surprising output that they generate. For example, a performer might achieve unique results by dropping one of these balls down a set of stairs or juggling with a trio of them. Our investigation, Bimba (Catalan slang for ball), builds upon research addressing the four characteristics described above: physicality, scale, expressiveness and serendipity. Bimba is a system for the sonification of everyday objects with embedded sensors.
Technical Aspects of the System
Bimba consists of a set of small wireless, power-autonomous electronic circuits. The current configuration of each circuit includes a microprocessor, a battery, a wireless communication chip and three inputs - namely, a three-axis accelerometer, an air pressure and temperature sensor, and a piezoelectric sensor. The goal of this design was to create an integrated system that accommodates different physical objects collecting, processing and broadcasting a diversity of real-time sensor data. For the exploration discussed here, we incorporated the circuits into foam balls roughly 15 cm in diameter.
Figure 1: One of the balls of the Bimba system and a prototype of the sensor board.
We developed a communication protocol to process multiple packets of data coming from sensors distributed across different objects. Having built each of the boards with an integrated wireless communication module, we created a star topology network. That is, all communications are centralized in a controller node. This controller node then generates sound with the information received. This kind of network enables us not only to generate music from each individual sensor, but also to establish sound relationships between all of the devices connected.
Physicality: Rules, Object Properties and Mappings
In a system such as the one we are describing, it is important to have clear constraints which guide the creative process. The laws of physics, which are universally applicable to the objects that surround us, provide common ground for creation. The specific characteristics of each object add the necessary variability to the system; depending on its shape, weight, composition, and size, an object interacts with the environment differently. To achieve an appropriate balance between variability and predictability, it is necessary to relate physical rules to compositional rules in a meaningful way. For this reason, mappings are an essential component of a successful system. The system discussed here senses three different parameters: acceleration, altitude and impact. By no means are these the only interesting parameters to consider; for future iterations of the system, we will design a customizable circuit that enables addition and replacement of the sensors included. For each of the sensors used in the current version of the system, we propose a set of mappings.
These are described in the following sections. Acceleration. The system reads the aggregated acceleration of the balls. This parameter is strongly related to the sense of motion and momentum. Because movement implies continuous variability, this parameter is especially suitable for controlling the base frequency of the background tone. Altitude. The relative height is particularly well suited to representing the primary sound's pitch because it is intuitively associated with a range from low to high. Impact. Bouncing motion is clearly associated with percussion. In our design, the system triggers a MIDI drum kit sound each time a ball senses contact with another object or surface. Design Criteria and Goals The system design was intended to explore four concepts tied to our research interests: relating serendipity and inspiration, supporting different creative styles, building immersive environments, and enabling collaboration. We believe that these four pillars lay the foundation for creative flow. Serendipitous approach to creativity. As we alluded to in the introduction, designing for serendipity is as important as designing for controlled environments. By taking advantage of environments familiar to the user, Bimba supports playful exploration in a relaxed and intuitive way. Ideally, this would minimize distraction so that the creative experience may also be an immersive one. Supporting different styles of use. Designing tools for creation means providing resources in an unencumbered manner so that users can challenge conventions of the system in new and interesting ways. Research has shown the importance of appealing to different creative styles to push the boundaries of creativity (Resnick and Silverman 2005). For this reason, Bimba aims to support a broad range of approaches to sound generation. Design that impacts the ecology. Successful designs affect the ecologies that they inhabit. Since our system is deeply connected to the environment - by means of the laws of physics - we expect users to modify spaces to stage their own "worlds of music". From intimate use to massive collaboration. Bimba is designed to support the full range between small-scale, intimate interactions and massive collaborations. As previously described, the system can be easily used for personal exploration of sound landscapes. However, because the proposed design is distributed, there are also many instances in which it can be used collaboratively. Discussion and Future Work We talked about the importance of mappings for performers to connect with an interface. Enabling users to create or modify their own mappings may strengthen the bond between the performer and the sound object. The current system supports input from multiple objects, but it does not yet propose mappings reliant upon inputs distributed across a collection of objects. Improving these distributed mappings may encourage higher levels of collaboration. Additionally, if Bimba had an awareness of the material composition of the objects, then the mappings could reflect this. Zoran's Chameleon Guitar (2009) is an inspiring example of using resonance filters to represent different material arrangements. Another future line of research lies in creating a set of modular sensor boards. Such a modular system would enable performers to capture the most important aspects of each object. For example, the physical characteristics of a bouncing ball can be well represented by sensing motion, altitude and impact - but this may not be the case when the interface object is instead a glass bottle.
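As an illustration of the kind of user-editable mapping layer discussed at the start of this section, the following is a minimal sketch (not the authors' implementation) of how a controller node might bind the three sensed parameters to sound parameters; all function names, ranges and thresholds are hypothetical.

```python
# A minimal sketch of a user-editable mapping layer in the controller node.
# All names, ranges and thresholds here are hypothetical stand-ins.

def acceleration_to_base_freq(g):
    # Aggregated acceleration (in g) sweeps the background tone, e.g. 80-400 Hz.
    return 80.0 + 320.0 * min(g / 4.0, 1.0)

def altitude_to_pitch(relative_height):
    # Relative height in [0, 1] maps low-to-high onto a MIDI note range.
    return int(48 + 24 * relative_height)  # C3..C5

def impact_to_drum(piezo_peak, threshold=0.6):
    # A piezo spike above threshold triggers a MIDI drum-kit note.
    return 38 if piezo_peak > threshold else None  # 38 = snare in General MIDI

# The mapping table itself: performers could swap these entries to
# re-bind sensors to sound parameters, as discussed above.
MAPPINGS = {
    "acceleration": acceleration_to_base_freq,
    "altitude": altitude_to_pitch,
    "impact": impact_to_drum,
}

def handle_packet(packet):
    """Apply the current mappings to one sensor packet from a ball."""
    return {name: MAPPINGS[name](value)
            for name, value in packet.items() if name in MAPPINGS}

# Example: one hypothetical packet from one ball.
print(handle_packet({"acceleration": 1.2, "altitude": 0.5, "impact": 0.8}))
```

Because the table is data rather than code, re-binding a sensor to a different sound parameter is a one-line change, which is one plausible way to support the user-defined mappings envisioned above.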
Acknowledgements We would like to thank Joe Paradiso and Laurel Pardue for providing a great collection of inspirational materials as well as the technical guidance that enabled the production of the work reflected in this document. We would also like to acknowledge the contribution of Emily Lovell in editing and organizing the text in this document. Conclusion Bimba serves as a preliminary investigation into turning everyday objects into musical interfaces through the use of embedded sensing. This paper reflects on the benefits of capitalizing on object affordances for creative sound generation. By grounding our exploration in four main design considerations - a serendipitous approach to creativity, supporting different styles of use, design that impacts the ecology, and support for scenarios ranging from intimate use to massive collaboration - we hope to stimulate future dialogue around them. 2011_24 !2011 How the Obscure Features Hypothesis Leads to Innovation Assistant Software Tony McCaffrey Lee Spector Cognitive Psychology Cognitive Science University of Massachusetts Amherst Hampshire College Amherst, MA 01002 USA Amherst, MA 01002 USA amccaffr@psych.umass.edu lspector@hampshire.edu Abstract A new cognitive theory of innovation, the Obscure Features Hypothesis (OFH), states that almost all innovative solutions result from two steps: (1) noticing a rarely noticed or never-before-noticed (i.e., obscure) feature of the problem's elements, and (2) building a solution based on that obscure feature (McCaffrey 2011). Structural properties of the human semantic network make it possible to locate useful obscure features with a high probability. Innovation Assistant (IA) software interactively guides human users to the most likely obscure features for solving the problem at hand. Introduction The OFH articulates a core principle for solving the problems requiring innovation used in psychology experiments (i.e., insight problems) as well as real-world problems in engineering and design. As an example of an insight problem, consider the Two Rings Problem, in which you have to fasten together two weighty steel rings using only a long candle, a match, and a two-inch cube of steel. Melted wax is not strong enough to bond the rings together, so the solution relies on noticing the obscure feature that a wick is a string, which can be extricated from the candle by scraping away the wax on the cube of steel before tying the rings together. Two lines of evidence suggest that all insight problems used in the psychology literature follow the OFH. First, the key features for solving known insight problems are not listed among the relevant objects' common associates in the Association Norms Database (Nelson et al. 1998) and are thus obscure (e.g., string is not listed as a close associate of either candle or wick). Second, Chrysikou (2006) had subjects list features of the key objects used in insight problems and found that the key feature of these objects was listed by only 9% of the subjects and is thus obscure. The OFH opens up a research program to improve innovation based on two questions. What inhibits humans from noticing the obscure? What techniques can help overcome these sources of inhibition? In general, the normal processing of our perceptual, motor, and semantic systems leads us to notice typical features.
Our everyday experience entrains us towards the typical, which is efficient for everyday life but an enemy of innovation. Moving humans from the typical to the obscure seems to be a promising way to improve innovation. Immediately, however, a major challenge presents itself. The number of features of an object (e.g., a candle) is potentially unlimited (e.g., my aunt collects candles, candles are generally smaller than a toaster) and perhaps infinite (e.g., a candle generally weighs less than 100 pounds, weighs less than 101 pounds, etc.: Murphy and Medin, 1985). What techniques can lead humans to obscure features that have a high probability of leading to innovation? Specifically, can a computer program steer human users through the potentially infinite search space to the most promising areas of the human semantic network? McCaffrey (2011) analyzed the semantic networks of all insight problems used in the psychology literature plus a collection of engineering design problems and discovered several important structural properties relevant for innovation. In diagramming a semantic network graph for an object, a box is drawn around the object name and all its commonly noticed features. Analysis shows that the key obscure features for solving the examined problems are just outside the box. The Just Outside the Box Hypothesis (JOTB) posits that the key feature for most innovative solutions is one or two steps outside the box of commonly noticed features. Searching through a semantic network, a computer could help locate those concepts that are at the proper distance from the original concept according to the JOTB. This technique helps narrow the search a bit, but another principle, related to the direction of the search from the original concept, can narrow the search even more. Although there are potentially an infinite number of features for an object, there is a finite taxonomy that characterizes the most prevalent types of features and relations for an object that are known to be useful for innovation. McCaffrey (2011) has developed such a taxonomy, called the Feature Type Taxonomy (FTT), and uses it to characterize the most promising direction to search in from the initial concept. The most promising direction will vary depending on the kind of problem being addressed. For example, an examination of all known insight problems involving concrete objects reveals that the key obscure feature for 68% of them resides in noticing the material composition of the parts (e.g., examining the material make-up of a wick and wax for the Two Rings Problem) (McCaffrey 2011). So, given a concrete-object insight problem, looking at the material category of the parts is a good idea. Other feature/relation types are useful for other innovative tasks. For example, a candle is often present during romantic events such as dinner dates. A comedian might take this information and create a funny situation by placing a candle in an unromantic event such as divorce court proceedings. In other words, the kind of innovative problem stipulates which feature types have the highest probability of being useful (e.g., Material or Events). For every type of problem articulated, we can create a probability distribution that measures the likely usefulness of any given feature type. Together, the JOTB and the probability distributions formed from the FTT help specify the best distance and direction to search in from the initial concept for a given problem.
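A minimal sketch of how the JOTB distance constraint and an FTT-style distribution over feature types might be combined in a single search procedure follows. The toy semantic network, the "box" of common features, and the weights are illustrative stand-ins, not data from the paper (beyond the reported 68% figure for material features in concrete-object insight problems).

```python
# A minimal sketch, under simplifying assumptions, of combining the JOTB
# (search 1-2 steps outside the box of commonly noticed features) with an
# FTT-style probability distribution over feature types.
from collections import deque

# edges: concept -> [(neighbor, feature_type)]; a hypothetical toy network.
NETWORK = {
    "candle": [("wax", "material"), ("wick", "part"), ("dinner date", "event")],
    "wick": [("string", "material"), ("flame", "event")],
    "wax": [("paraffin", "material")],
}

# Commonly noticed features: the "box" around the object.
BOX = {"candle", "wax", "wick"}

# Hypothetical FTT distribution for concrete-object insight problems
# (the paper reports material composition dominates this problem class).
FTT_WEIGHT = {"material": 0.68, "part": 0.12, "event": 0.05}

def just_outside_the_box(start, max_depth=2):
    """Collect concepts 1-2 steps outside the box, scored by feature type."""
    scored, frontier, seen = [], deque([(start, 0)]), {start}
    while frontier:
        node, depth = frontier.popleft()
        for neighbor, ftype in NETWORK.get(node, []):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if neighbor not in BOX and depth + 1 <= max_depth:
                scored.append((FTT_WEIGHT.get(ftype, 0.01), neighbor, ftype))
            frontier.append((neighbor, depth + 1))
    return sorted(scored, reverse=True)

# "string", a material feature just outside the box, ranks at the top.
print(just_outside_the_box("candle"))
```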
The JOTB operationalizes the idea of innovation standing between order and randomness (Watts 1999). A concept inside the box is too strongly connected to the initial concept (i.e., too much order) while a concept many links away is too weakly connected (i.e., too much randomness). A concept just outside the box is on the boundary between order and randomness, which seems to be ideal for innovation. Innovation Assistant Software We created a software prototype, called an Innovation Assistant (IA), which guides human users to the most promising obscure features based on problem type. Our overall goal is to create a human-machine interaction that is more innovative than a human working in isolation. Whereas traditional AI programs attempt to solve problems completely on their own, the IA guides humans to notice the obscure features that we tend to overlook. The human user must then construct the solution based on the key obscure feature. Currently, the prototype is instantiated with the probability distribution of the FTT for concrete-object insight problems. Further, at present it executes only a technique to help users notice the material make-up of an object's parts; but other techniques for noticing the obscure members of other feature/relation types are ready for implementation. Evidence: Lab and Real-World Problems We tested the IA prototype on the eight insight problems used on human subjects in the experiments of McCaffrey (2011). Because the material composition of the parts is central to solving these problems, the IA asks the user to enter this information (e.g., "Can a wick be decomposed further into parts?" and "What material is a wick made of?"). While the IA successfully leads users to the key obscure features for these problems, on three of the problems the IA exceeded its goal and actually located the key feature and how it could be used to solve the problem. For example, for the Two Rings Problem, after the user enters the goal "fasten rings," enters the candle's parts as "wick" and "wax," and answers the IA's question about what material a wick is made of, the IA is able to generate how the string can solve the problem. The IA suggests: "a candle's wick is made of string, which might be able to tie the rings." Basically, the synonyms of the goal verb "fasten" are intersected with the verbs related to the potential actions of the known objects and parts. In this case, the verb "tie" is a synonym of the goal verb "fasten" so the IA can make the connection and solve the problem. In sum, the IA is designed to help users notice useful obscure features that are often overlooked. However, the IA can periodically go above this standard and actually solve the problem—if the user has not already solved it.
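The following is a minimal sketch of this verb-intersection step, with a small hand-coded stand-in for a lexical resource such as WordNet (which the paper's future-work section identifies as its embodiment of the semantic network); the lexicons and function names are hypothetical.

```python
# A minimal sketch of the verb-intersection step described above, with a
# hand-coded stand-in for a lexical resource. The tiny lexicons are
# illustrative only.

SYNONYMS = {
    "fasten": {"tie", "bind", "attach", "secure"},
}

# Potential actions afforded by known objects/parts and their materials.
AFFORDED_ACTIONS = {
    "wick": {"burn", "tie"},      # a wick is made of string; string can tie
    "wax": {"melt", "coat"},
    "steel cube": {"scrape", "weigh down"},
}

def connect_goal_to_objects(goal_verb):
    """Intersect goal-verb synonyms with each object's afforded actions."""
    goal_set = SYNONYMS.get(goal_verb, set()) | {goal_verb}
    suggestions = []
    for obj, actions in AFFORDED_ACTIONS.items():
        for verb in goal_set & actions:
            suggestions.append(f"a {obj} might be able to {verb} the rings")
    return suggestions

# For the Two Rings Problem the goal verb is "fasten"; "tie" is shared
# with the wick's afforded actions, yielding the candle-wick suggestion.
print(connect_goal_to_objects("fasten"))
```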
Most real-world problems require noticing features other than the material make-up of an object's parts, so we will present how several of the other techniques have helped with real-world problems. First, a company challenged us to create an idea to detect roadside bombs. Most proposed solutions focus on either detecting the bomb itself through its metal or disrupting the electronics of the bomb. After shifting focus from the bomb's physical make-up to the events that the bomb is involved in (as suggested by the probability distribution for real-world engineering problems), a technique guided the user to articulate a hierarchy of these events and their sub-events (e.g., building the bomb, transporting the bomb, burying the bomb, and detonating the bomb). This hierarchy led to an idea for a mechanical method that could reliably detect the displacement of the dirt above a buried bomb rather than trying to detect the bomb itself. No further information will be disclosed at this time because this is an ongoing open military problem. The technique performed its job by shifting attention to an obscure but promising feature type. Second, another company gave us the unsolved problem of adhering a coating to Teflon. After switching focus to the goal, a technique helped users to reveal the common assumptions of the verb adhere, including (1) direct contact between (2) two surfaces using a (3) chemical process. Questioning these assumptions led to the novel and workable idea of sticking something "through" Teflon to a magnetic surface beneath the Teflon (i.e., a sandwich of three surfaces in which the coating indirectly sticks to the Teflon due to its attraction to the magnetic surface). Of course, the coating would need to possess the proper make-up in order to induce sticking. Again, the technique shifted our focus to an obscure way to adhere things together. Third, at the beginning of the BP oil spill in late April 2010, we worked with the goal "contain oil" and a technique helped suggest a pipe wide enough to enclose the spill area. After making this "pipe" out of plastic, several engineers verified the workability of this plastic "curtain" that reached from the ocean floor to its surface and would guide the spilled oil to the top in a confined space where it could be dealt with. We submitted our idea to the BP disaster website, as did other engineers. Fourth, two candle companies verified the novelty of our candle designs, which were created using a technique that helped us notice that candles are built out of one piece of wax and are motionless when they burn. We created novel candle designs consisting of multiple pieces of wax as well as designs in which the entire candle moves based on its own dynamics. These designs are under consideration by two candle companies so will not be disclosed. Finally, the U.S. Post Office solicited the public for money-saving ideas. Working with the goal "deliver mail," a technique helped us to consider places to "pick up mail." The option of drive-through windows led to the idea of car commuters signing up to pick up their mail at the drive-through window where they get their daily coffee. Train commuters could pick up their mail at the subway stations they frequent every weekday. We added more logistics (e.g., banks of post office boxes at key pick-up locations), but estimates indicate that this method could eliminate up to one-third of daily home deliveries without delaying the reception of this mail. The Next Phases Given the promise of the IA prototype on insight problems and of other techniques on real-world problems, we are moving forward in two directions. First, after implementing our other techniques, we will conduct controlled experiments with engineering and design students to further test the IA's effectiveness for improving innovation. Students will be given a design task. Students in a control group will complete the design problem on their own. Students in an experimental group will use the IA to help complete the design problem. Judges will rate the creativity and workability of the designs. We hypothesize that the IA users will produce designs deemed by the judges to be more creative than the designs produced by the control group.
Second, we will record all the runs of the interactions between human users and the IA program in order to try to detect patterns in the use of the program that led to successful innovation, as well as interactional patterns common to unsuccessful use. To all these run traces, we will apply a genetic programming (GP) system (Koza, 1994). Note that GP systems have been shown to be effective on a variety of difficult problems, including pattern detection (Koza, 1994). Specifically, when all the parameters to explore are known, GP systems are competitive with humans on design problems thought to require innovation (Spector, 2008). A GP system is therefore a good method for evolving and detecting patterns in the traces of the IA runs. Further, the IA runs involve accessing the human semantic network as embodied by WordNet (Miller 1995). A GP system will be used to detect patterns in the semantic network searches that led to an innovative (i.e., novel and useful) connection. Using this method, we may discover other structural properties of the semantic network that are important for innovation. Each structural property we uncover will permit us to craft a new heuristic to exploit the semantic network for the benefit of innovation. All our current research focuses on the first step of the OFH (i.e., helping humans notice obscure features) and ignores the second step (i.e., building a solution based on an obscure feature). Because solutions can become intricate causal networks, human cognitive limitations may inhibit us from piecing together the solution once we have noticed all the relevant features. Future research will address this possible limitation to human innovation. Conclusion Initial evidence from laboratory insight problems and a collection of real-world design problems points to the potential for the IA approach to improve human innovation. Testing of the IA in controlled experiments will lead to scientific evidence measuring the effectiveness of the IA as well as the truth of the OFH, its underlying principle. Discovery of new semantic network properties relevant to innovation will lead to new heuristics that improve innovation. In sum, the IA approach is a promising one for creating software that can improve human innovation based on the psychologically plausible theory called the OFH. 2011_25 !2011 Scuddle: Generating Movement Catalysts for Computer-Aided Choreography Kristin Carlson, Thecla Schiphorst and Philippe Pasquier kca59@sfu.ca, thecla@sfu.ca, pasquier@sfu.ca School of Interactive Arts and Technology, Simon Fraser University, Canada Abstract We present a tool for computer-aided choreography titled Scuddle. Scuddle uses a Genetic Algorithm to generate movement catalysts for contemporary choreography. The use of movement catalysts challenges choreographers to distance themselves from habits to better explore creative movement. Scuddle was designed as a method for both facilitating creativity and supporting active reflection on creativity through awareness of the decision making process. The system is successful in generating complex catalysts that result in creative approaches to movement and support articulation of decisions. We detail the design motivation, implementation, analysis and results of a qualitative evaluation by choreographers. Introduction Choreography is a creative practice based in extensive embodied knowledge and physical exploration.
Intuitive decisions often drive composition in order to craft a work through practice-based expertise (Blom, 1982; Humphrey, 2003). Many choreographers self-impose limitations that require creative solutions in order to distance themselves from habitual decisions. These limitations may be seen as catalysts that fuel exploration and new solutions. The reflective distance created when using a catalyst can sometimes provide additional awareness of the decision making process, allowing a deeper look at intuitive decisions. We are interested in investigating the process of making creative choreographic decisions to move towards the modeling of creative decisions in computational choreography. To initiate our discussion of decisions we divide choreographic process into three stages: 1) the investigation of movement itself as source material, 2) the development of movement material into phrases or sections and 3) the composition of the movement phrases into a final structure. Literature on choreographic process often focuses more on the creation of movement material (stage 1) and sequencing of material (stage 2) than the crafting of material into a finished work (stage 3). However, many computer-based attempts at creating choreography limit focus to the sequencing of predefined movement material (stage 2) (Calvert, 1991; Lapointe, 2005; Nakazawa, 2009; Soga, 2006; Yu, 2003). Movement sequencing (stage 2) can be the most systematic stage of choreography, hence the easiest to model computationally. The select focus on algorithmic sequencing of codified movement can also reduce creative possibilities in composition instead of supporting them. We present a case study in computer-aided creativity to focus on the generative element of creative movement exploration (stage 1). Scuddle, our digital tool, has been designed to create incomplete movement data that is used as a catalyst for movement material (stage 1) when executed by a choreographer. The use of movement catalysts problematizes the usual process for creating movement material while allowing for interpretation from a unique perspective. Incomplete movement data consists of a 2-dimensional body position, the height for execution and four Laban effort qualities. The body positions support inhibition of habitual movement patterns through the use of Bartenieff Fundamentals. Movement catalysts are generated by a Genetic Algorithm to create diverse combinations of movement data (Russell and Norvig, 2010). The use of a genetic algorithm to create movement catalysts triggers exploratory creativity in the choreographer (Boden, 1998). By problematizing the process for creating movement material, we hope to encourage verbal articulation to study the decisions choreographers make through their creative process. Related Work The concept of creative catalysts has been used extensively by artists throughout history. It is often used to explore ideas in new ways and to push the self beyond known answers. Merce Cunningham used the I Ching and chance procedures as ways of exploring new movement ideas. Several systems have been designed to computationally support choreographic process through a combination of choreographer input and artificial intelligence techniques. While many of these systems could function as catalysts (Cunningham also used DanceForms as a catalyst), they were designed for different goals. Early systems explored innovative approaches to creative movement material through limited movement data.
One system uses algorithms to create body outlines for interpretation by a choreographer while providing the body position with spatial directions and orientation (Lansdown, 1978; Gray, 1984). Menosky created an interactive silhouette that could be altered by the choreographer's touch of a body part, through reconfiguration of the effort position or a library of suggested positions (Gray, 1984). Bradford (1995) used AI techniques to facilitate dance improvisation through spatial direction and orientation to generate rules for guiding dance quality and movement generation. These approaches all focused on the creation of creative movement material (stage 1). One system that focuses on all three stages of the choreographic process, allowing complex movement to be designed and viewed with a high level of detail, is DanceForms (formerly LifeForms). DanceForms (Calvert et al., 1991) is a graphic animation program for designing and visualizing dance movement based solely on user input or library selection. The system is timeline-based and allows the choreographer to design sequences and timings of movement. DanceForms supports choreography of multiple figures, spatial patterns and orientation. Merce Cunningham used DanceForms to design movement on avatars, transposing the movement decisions onto live dancers. This process allowed him to explore movement options that he may not have otherwise considered while facilitating his use of chance operations (Schiphorst, 1993). Yu and Johnson (2003) explored autonomous sequence generation through the use of a Swarm technique within DanceForms on the project titled Tour, Jete, Pirouette. Systems that address sequence and composition (stages 2 and 3) focus on the linear arrangement of movement to create formulaic phrases. Web3D Composer creates sequences of ballet movements based on a predefined library of movement material. Through an interactive process the user selects movements from a pool of possibilities, which shift based on structural ballet syntax. This process allows the choreographer to select movements based on the possibilities presented by the system and presents nearly complete graphic movement information (Soga, 2006). The Dancing Genome Project (Lapointe 2005; Lapointe and Epoque, 2005) developed a genetic programming model to explore sequences of movement in performance. The movement material was created by gathering motion capture data and using it to create a ‘mutated' sequence that is performed by virtual avatars while the original sequence is performed by live dancers. Dancing (Nakazawa, 2009) used a series of music-related parameters, stage-use rules and a predefined library of traditional movements to generate Waltz choreography using a Genetic Algorithm. This system generates syntactically correct movements in a complete choreography as ASCII symbols. Currently available contemporary systems that address the creation of movement material (stage 1) include DanceForms, Dance Evolution and Scuddle. Dance Evolution animates avatars by teaching them to dance to music through the use of an interactive evolutionary algorithm. Movement is generated by analyzing a rhythm and using it to control the energy an avatar uses to execute a position (Dubbin, 2010). Scuddle generates unique movement catalysts through the use of a genetic algorithm. The choreographer is provided with specific guidelines for execution that are controlled by the system yet require the choreographer's creativity for individual interpretation.
This is the only current system that is designed specifically as a catalyst for creative movement material. These systems are compared to evaluate the quantity of data given to the dancer, how the data is given to the dancer and the stage of the choreographic process that is addressed. Five systems focus on the sequencing of movement material while one focuses on the design of movement material and one focuses on the animation of movement (Table 1). Six systems focus on creating computer-generated choreography by giving the dancer complete movement data while Scuddle focuses on generating incomplete movement data for creating computer-aided movement material (Shedel, 2009; Hagendoorn, 2008).

Table 1. Comparison of Related Computer-Aided Choreographic Systems

System | Stage of Choreographic Process | Movement Generation (stage 1) | Sequence Generation (stage 2, ~3) | Final Selection Method | Representation of Choreographic Data | Precision of Movement Description
DanceForms (LifeForms) | Movement, Sequence, Choreography | User or Library | User | User | Multiple Figures, Space and Orientation in 3D | High
Tour, Jete, Pirouette | Sequence | User or Library | Swarm Technique | User | Multiple Figures, Space and Orientation in 3D | High
Web3D Composer | Sequence | Library | Interactive Possibilities | User | Single Figure, Space in 3D | High
Dancing Genome | Sequence | User / Motion Capture | Genetic Algorithm | Fitness Function | Single Figure, Orientation in 3D | Medium
Dancing | Sequence, Choreography | Library | Genetic Algorithm and Music | Fitness Function | Two Figures, Space, Orientation in ASCII | Medium
Dance Evolution | Movement / Animation | Neural Net and Music | In Order of Creation | User | Multiple Figures, Orientation in 3D | Medium
Scuddle | Movement | Genetic Algorithm | In Order of Creation | Fitness Function | Single Figure in 2D | Low
Pre-1990 Systems | Movement | User or System | N/A | User | Shapes, Silhouettes, Minimal | Low

Design Motivation In an attempt at computationally modeling improvised theatre, Magerko (2008) recognized the need to better understand the active creative process. Scuddle was designed to begin to address this issue through the author's direct experience of choreographic process and the desire to understand how intuitive movement decisions are made. As the knowledge behind intuitive decisions is deeply embodied and often could be considered tacit knowledge, an active approach to exploring process is needed. One observation of compositional practice is that both movement material and compositional structures develop into habit and can then facilitate a personal set of instructions for creation. Another observation is that movement material and compositional structure are often intricately entwined. In order to attempt a disruption of habits while still facilitating the entwinement of movement material with the compositional process, the concept arose to use specifically developed incomplete movement data. The incompleteness of data facilitates ‘open' exploration, enabling multiple solutions to be generated from an ‘incomplete' movement catalyst. Laban Efforts and Bartenieff Fundamentals The design of incomplete movement data is based on studies in movement patterns and effort qualities by Laban and Bartenieff (Laban, 1947; Hackney, 1998). Rudolf Laban developed a method of categorization to analyze, notate and create movement. One property of movement that Laban explores is ‘effort', the quality used to execute a movement.
He emphasized that every movement possesses effort qualities as forerunners of the movement execution. He describes four quality components (see Figure 1.A): weight (light to strong), time (sudden to sustained), space (direct to indirect) and flow (bound to free). For example, ‘Movements performed with a high degree of bound flow reveal the readiness of the moving person to stop at any moment in order to readjust the effort if it proves to be wrong, or endangers success. In movements done with fluent flow, a total lack of control or abandon becomes visible, in which the ability to stop is considered inessential' (Laban, 1947). Scuddle uses all effort quality components as ‘instructions' for executing a position. The combinations of qualities are designed to create interesting yet complex physical patterns for the body to execute.
Figure 1.A Rudolf Laban's Effort Quality Graph (Newlove and Dalby, 2003) and Figure 1.B Bartenieff Separation of Bodily Planes (Hackney, 1998).
Bartenieff Fundamentals are a further development of Laban's research into the moving body (Hackney, 1998). Bartenieff uses anatomical body planes to deconstruct movement into categories such as pathways of movement, movement patterning, spatial intent and core support. The body planes (see Figure 1.B) - sagittal, coronal and transverse - help to illustrate movement patterns. Examples include homologous positions (same limb positions for one side of the transverse plane), homo-lateral positions (same limb positions for one side of the sagittal plane) and contra-lateral positions (same limb position for one opposing limb on each side of the sagittal plane). Additional movement pathways include distal positions (all limbs fully extended) and medial positions (all limbs fully contracted). Bartenieff Principles are used in Scuddle to explore and inhibit habitual movement patterns. To create complex catalysts, emphasis on inhibiting habitual movements is designed through the use of asymmetry and complex variations between joint angles on a position (Birkhoff, 1956). Genetic Algorithm A Genetic Algorithm is used to evolve movement catalysts. This allows the system to control fundamental components that problematize the dancer's process of generating movement. Genetic Algorithms are typically used to explore a wider range of potential solutions than other search algorithms can (Holland, 1992). Initially a large population of random individuals is generated and given a score for their fitness against the prescribed goals for success. This initial population is then subjected to an iterative cycle of selection and breeding. Once a cycle is complete the new population is judged on its fitness once again, and the process continues for a fixed number of iterations or until a certain fitness threshold is reached (Floreano and Mattiussi 2008; Russell and Norvig 2010). System Description A movement catalyst consists of movement data that is graphically represented as a 2-dimensional figure with text for height and effort quality instruction (see Figure 3). The 2D figure represents body position through the use of straight lines as limb positions with curves to suggest torso positions. This allows the 3-dimensional orientation and limb position to be determined by the choreographer. The interface has five button options that have the functions of Start (to run the algorithm), Watch (to view the 6 catalysts in order), Pause (to pause the playback), Back (to view the previous catalysts) and Clear (to erase the values to re-start the algorithm cleanly).
Still images of the generated catalysts are saved every time the algorithm is run. The system begins by generating an initial population of 200 random ‘catalysts'. Body positions are designed to allow unlimited possibilities for positions in eight major joints: the shoulders, elbows, hips and knees. Positions are initially generated by calculating random angles between 0-360 degrees for each joint to alter the configuration of the position's limbs. Effort qualities are randomly generated as 1 or 2 (for fighting or indulging, as later explained) and height as a random level from low to high. Therefore, a catalyst is composed of 13 values: 8 joint angles, 1 height level and 4 effort qualities. An example of the values from Figure 2, showing the joint angles, height and effort qualities, is: [340, 220, 240, 310, 110, 40, 240, 320, Mid-Low, 2, 1, 2, 1]. Fitness Function A rule-based system is used to evaluate the fitness of each movement catalyst. We have developed heuristic rules based on movement patterns discussed in Bartenieff Fundamentals and the author's expertise in contemporary dance practice to inhibit traditional habits when creating movement. The fitness function evaluates each catalyst component separately (body position, height, effort qualities and Bartenieff) and then calculates the overall score. To compare the catalyst components we map each value separately. Each of the 8 joint angles is weighted based on its location within quadrants. For example, angles between 0-90 degrees are placed in one quadrant and 90-180 degrees in another. The orientations of quadrants are based on their location from the center of the body (see Figure 2). This weighting is designed to lower the score for fully outstretched or contracted limbs by placing all joint angles on diagonals that score 1, creating an overall body position score of 8 (1 x 8 joints). For example, the bent arms in Figure 2 have scores as follows: the left shoulder is 340 degrees, which is mapped to 4, and the left elbow is 220 degrees, which is mapped to 1. The sum of these mappings gives Figure 2 a body position score of 14.
Figure 2. Weighting of Quadrants for Body Position.
Height is the level at which the body position is to be executed. These values are used to emphasize more unstable positions such as balancing in crouches and on the toes (see Table 2).
Table 2. Height Weighting
Jump (High): 2
Raised (Mid-High): 3
Stand (Middle): 1
Crouch (Mid-Low): 3
Floor (Low): 2
Effort Qualities refer to the effort used to execute a body position and height. Fighting efforts are direct, strong, sudden and bound. Indulging efforts are indirect, light, sustained and free. A combination of four fighting or four indulging efforts results in modifying the sum of the position and height by -60%. Combinations of two fighting and two indulging efforts modify the sum of the position and height by +20%. Three fighting efforts and one indulging effort, or three indulging efforts and one fighting effort, modify the sum of the position and height by +40%. Symmetry of body position is analyzed as movement patterns (based in Bartenieff Fundamentals). Contra-lateral motions explore the diagonals made across the body. In homologous motions the relationship of the top half of the body is compared to the lower half. Homo-lateral motions compare the limb position of one side of the body. All limbs fully extended are considered distal and all limbs fully contracted medial.
Table 3. Bartenieff Modifiers
Contralateral: +30%
Homologous: -40%
Homolateral: -50%
Distal: -60%
Medial: -50%
To address habit inhibition, heuristic rules are designed to favor contra-lateral motion (asymmetry) while hindering homologous and homo-lateral motion (a tendency of codified dance techniques). See Table 3 for the assigned modifier that is applied. The fitness for a movement catalyst is calculated as the sum of body position and height, modified based on the combination of Laban effort qualities and Bartenieff movement patterns. See Figure 3 for an example of mappings and fitness score. The equation for the score is:

Fitness(mc) = (BP_mc + Height_mc) × (1.0 + Bartenieff_mc + Laban_mc)

Figure 3. Example of Scoring for Fitness Function, for the catalyst [340, 220, 240, 310, 110, 40, 240, 320, Mid-Low, 2, 1, 2, 1] (body position, height, effort qualities).

Selection, Cross Over and Mutation We select 20 percent of the movement catalyst population by Roulette Wheel to be parents for the next generation of offspring. The Roulette Wheel process selects individuals with likelihood proportional to their fitness. Two individuals at a time, chosen from the pool of parents, are bred through two-point cross over. The breeding takes place by selecting two random placeholders from the two individuals' values and switching the values between placeholders (see Table 4). The offspring are added into the new pool of individuals. The breeding process continues until the population has grown back to the original size. Once the size of the population has regenerated, ten percent of the individuals are randomly selected to mutate. The mutation occurs by choosing a random placeholder in the values of the individual and generating a new value for that place (see Table 5).

Table 4. Example of Cross Over
Individual 1: [4, 1, 2, ||1, 1, 2, 1, 2||, 3, 2, 1, 2, 2]
Individual 2: [1, 1, 2, ||2, 4, 2, 3, 1||, 3, 1, 1, 2, 2]
New Indiv 1: [4, 1, 2, ||2, 4, 2, 3, 1||, 3, 2, 1, 2, 2]
New Indiv 2: [1, 1, 2, ||1, 1, 2, 1, 2||, 3, 1, 1, 2, 2]

Table 5. Example of Mutation
Individual 1: [4, 1, 2, 1, 1, 2, 1, ||4||, 3, 2, 1, 2, 2]
Mutated 1: [4, 1, 2, 1, 1, 2, 1, ||2||, 3, 2, 1, 2, 2]

The cycle of Selection, Cross Over and Mutation repeats until the termination criterion has been fulfilled. This has been set at 6 generations to retain diversity in the population. For the final selection of individuals, Roulette Wheel selection is used to choose 5 individuals from the population to be presented in sequence to the choreographer. The system is available online at: http://www.metacreation.net/kcarlson/Scuddle/applet/
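Pulling the pieces above together, the following is a minimal sketch (not the authors' implementation) of the evolutionary loop under the published parameters: a population of 200, 20% roulette-wheel parent selection, two-point crossover, 10% mutation, 6 generations and 5 final catalysts. The quadrant and Bartenieff scoring functions are simplified stand-ins for the rules described above.

```python
import random

# A catalyst is 13 values: 8 joint angles (degrees), 1 height level and
# 4 effort qualities (1 = fighting, 2 = indulging), as described above.
HEIGHT_WEIGHT = {"Jump": 2, "Raised": 3, "Stand": 1, "Crouch": 3, "Floor": 2}

def random_catalyst():
    return ([random.uniform(0, 360) for _ in range(8)]
            + [random.choice(list(HEIGHT_WEIGHT))]
            + [random.choice([1, 2]) for _ in range(4)])

def quadrant_score(angle):
    # Stand-in for the quadrant weighting of Figure 2 (scores 1-4 per joint).
    return 1 + int(angle % 360) // 90

def fitness(mc):
    bp = sum(quadrant_score(a) for a in mc[:8])
    height = HEIGHT_WEIGHT[mc[8]]
    fighting = mc[9:].count(1)
    # 4:0 or 0:4 efforts -> -60%, 2:2 -> +20%, 3:1 or 1:3 -> +40% (paper).
    laban = {0: -0.6, 4: -0.6, 2: 0.2, 1: 0.4, 3: 0.4}[fighting]
    bartenieff = 0.3  # stand-in: e.g. +30% for a contra-lateral position
    return (bp + height) * (1.0 + bartenieff + laban)

def two_point_crossover(a, b):
    i, j = sorted(random.sample(range(13), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def evolve(pop_size=200, generations=6):
    pop = [random_catalyst() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(mc) for mc in pop]
        # Roulette-wheel selection of 20% of the population as parents.
        parents = random.choices(pop, weights=scores, k=pop_size // 5)
        children = [list(p) for p in parents]
        while len(children) < pop_size:
            children.extend(two_point_crossover(*random.sample(parents, 2)))
        pop = children[:pop_size]
        # Mutate 10% of individuals at one random position.
        for mc in random.sample(pop, pop_size // 10):
            i = random.randrange(13)
            mc[i] = random_catalyst()[i]
    scores = [fitness(mc) for mc in pop]
    return random.choices(pop, weights=scores, k=5)  # catalysts shown to user

for catalyst in evolve():
    print(catalyst)
```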
Results A pilot study was performed with 7 choreographers using participant-observation methods followed by open-ended interviews. Five choreographers were given the system on a laptop to generate their set of movement catalysts and two were given printed copies of a generated set. Choreographers using laptops were given instructions to generate catalysts, and all participants were asked to explore the movement catalysts on themselves. After time was spent exploring and reflecting on movement, they were asked to pair up and take the roles of dancer and choreographer. The study found five main results when choreographers used Scuddle: 1) the process of using Scuddle prompted comparison to their usual creative process (5 choreographers); 2) there was a heightened awareness of personal habits when a habit was explicitly addressed by Scuddle (5); 3) choreographers tended to re-examine their approach to structuring movement when using Scuddle (4); 4) movement was initiated in non-habitual and creative ways (7); 5) the experience could be articulated verbally to facilitate further study into creative cognition (4). Choreographers felt that working with the movement catalysts was very different from their typical processes. Five choreographers stated that movement is often generated from a concept, through improvisation, making creative decisions based on what feels ‘right' or ‘interesting' internally. Participant 2 stated ‘I usually start with a concept but this time I started with pure movement and I still made the movement meaningful to me.' Participant 5 discussed ‘a heavy reliance on the body's survival skills' that took time to explore before reflection could occur. This was noticed by 3 others, though it was dependent on how exuberant the choreographer was in execution. Choreographers found a heightened awareness of particular habits when the system directly addressed them, especially in relation to body symmetry and balance. For example, participant 3 stated that the system ‘forces me to think of my arms at all times, which I never do' and participant 2 that ‘it is weird for my body but actually feels really interesting it makes me be really asymmetrical'. Participant 1 found ‘with the legs I wanted to revert back to what I was comfortable with, but the arms I could really do something interesting with'. Decisions to structure movement based on the catalysts varied and required a re-examination of each participant's personal approach. Participants 3, 4, 5 and 6 read the components from top to bottom, in order of height, body position and effort qualities, and attempted execution in that order. However, participant 1 selected height and the effort qualities first, and attempted to fit the position into these components second. When confused by a movement catalyst she changed her perspective to a bird's eye view, stating that ‘it was most important to find out what I think this is and then shift it to or adjust it for my body'. Participants 1 and 2 both tended to attach different effort qualities to different parts of the body, for example Time as Sustained to the legs with Weight as Strong to the arms. Participant 4 would focus on Weight and Time when executing a movement catalyst and assumed that Space and Flow would emerge automatically. Participant 2 looked for the similarities and differences between two catalysts and attempted to execute them consecutively. All choreographers initiated movement in non-habitual and creative ways. Participant 3 stated that ‘It pulls me out of my body at first, but it doesn't feel bad.' Participant 2 stated ‘This is not a narrative but makes me connect the dots in an interesting way.' Participant 6 stated that Scuddle ‘gives you these very specific guidelines, but being creative people we interpret them in our own way. It's a very valuable tool and gives an interesting angle to work from.' Participant 1 thought Scuddle would be useful ‘to get out of a rut or the habits you go back to.' Participant 4 felt ‘disjointed now physically but I am interested and would want to explore more artistically.' Choreographers found they could better articulate their experience verbally with the technical perspective of Scuddle. Participant 3 said ‘yes, this helps me to verbalize my decisions' and participant 2 stated ‘I am talking about it more technically as opposed to making decisions that feel right'. Conclusion and Future Work This paper details a system that generates catalysts to challenge choreographers in making creative movement choices.
Our results illustrate that the use of Scuddle prompted: comparison to choreographers' usual creative processes, heightened awareness of personal habits when explicitly addressed by Scuddle, re-examination of their approach to structuring movement, non-habitual and creative movement choices, and an experience that could be articulated verbally with the added technical perspective. From these initial results, we deduce that Scuddle guides the choreographer to explore creative movement while supporting articulation of creative decisions. This analytical approach to developing creative movement material separates the decision making process into concrete events that can be identified and verbalized. The articulation of these events facilitates a deeper exploration into the creative decision making process. This approach provides insight into the process of making creative choreographic decisions, moving towards the modeling of creative decisions in computational choreography. We believe this tool will be useful to researchers of dance and technology while contributing to the exploration of creative decisions in computer-based choreography. Future work includes a comparative study to examine the effect our heuristic rules have on the choreographer's creative movement choices. This study will be performed by providing choreographers with movement catalysts that use the current rule settings, catalysts that use the opposite values to the current settings and catalysts that are generated randomly, without the fitness function. We also plan to develop Scuddle to be customizable. This will provide choreographers with control over modifiers, adjusting for personal habits. We will implement machine learning for the choreographer to indicate positions they like or dislike. An extended study of Scuddle will be performed using comparative analysis to document the choreographic decisions made using the current heuristic rules and individual choreographers' custom rules. 2011_26 !2011 Towards Knowledge-Oriented Creativity Support in Game Design Adam M. Smith and Michael Mateas Expressive Intelligence Studio University of California, Santa Cruz {amsmith,michaelm}@soe.ucsc.edu Abstract This article reports on a work-in-progress system designed to support game designers in gaining knowledge about the implications of their design ideas on observable gameplay. Utilizing a convenient pattern language, the system gathers and organizes evidence of the instantiation of many gameplay patterns, resulting in insight. Introduction In game design, practices such as prototyping and playtesting are integral parts of the iterative, exploratory process used to achieve the innovative gameplay sought by creative game designers (Fullerton 2008). These practices reveal concrete details about game design spaces, allowing designers to refine their personal store of design knowledge. This design knowledge is used to engineer the complete, polished products we recognize as popular games, but it most often comes from experience with crude or incomplete game artifacts. In this paper, we describe a work-in-progress system based on the theory of rational curiosity (Smith and Mateas 2011). This theory suggests that, in order to support creativity in game design, systems should directly support designers in gaining design knowledge.
This contrasts with Yeap's desideratum of ideation (2010): that a support system should generate new ideas on its own. Quickly extracting useful feedback from existing ideas, we claim, is an underappreciated bottleneck in the creative design process. In game design, knowledge-oriented creativity systems should systematically expose the relation between the concrete details in a game's definition, such as its mechanics and level design, and the implication of these details on gameplay. Building on the LUDOCORE logical game engine (Smith, Nelson, and Mateas 2010), our support system is targeted at early-stage computational gameplay prototypes (functioning models of a game that permit a designer to ask and answer specific design questions). LUDOCORE models capture focused situations in gameplay, including any available knowledge about the ideal player in addition to the game's mechanics. By using a logic programming representation, the system is able to exploit model-finding techniques to automatically solve for gameplay traces which exhibit properties that a designer has requested via a query. Knowledge gained from machine playtesting with LUDOCORE can be validated with human playtesting using the interactive, graphical features of BIPED (Smith, Nelson, and Mateas 2009), a process which often inspires new formal queries to pose in iterative machine playtesting. Using these tools in the larger game design process requires an external, creative agent to spot interesting patterns in gameplay traces and translate these patterns into a language the logical reasoning tools can understand in subsequent exploration. If LUDOCORE is about getting design feedback from prototypes, but it requires a designer to first specify formal queries, can we assist the designer by translating her high-level interests into such queries and informatively aggregating the results? Such a straightforward process could dramatically speed up the rate at which a designer learns about her designs, improving her ability to appreciate artifacts - appreciation being one leg of Colton's creative tripod of perceived creativity (2008). In this paper, we report on a system capable of collecting and organizing evidence for a space of gameplay patterns which are described in a designer-friendly language. After reviewing our example game, we describe how a preliminary experiment with our support tool using simple, hand-authored patterns has resulted in design insight. DrillBot 6000 in LUDOCORE Our support system works using a LUDOCORE model as input. Our examples will use the game DrillBot 6000 (the example game that comes with BIPED). A screenshot of DrillBot is shown in Figure 1. In the game, the player controls a mining robot that must explore underground caverns, drilling out rocks and treasures. Actions such as mining rocks and moving upwards cost the robot energy that can only be recovered by refueling at the base. The logic program that defines the game model declares events that may occur (such as mining a rock, moving to a space, and trading or refueling) and elements of state that change over time (such as the robot's position, energy level, and the presence of the various rocks). Additionally, the definition contains assertions about the configuration of the game world (including the existence and linkage of caverns and the treasure property of some of the rocks). Performing either human or machine playtesting with DrillBot produces symbolic gameplay traces.
Simple traces log the actions (events) selected by the player at each logical timepoint. Often, however, understanding an interesting property of play requires understanding the context of a particular sequence of player actions with respect to both the dynamic state of the game and its static configuration. We modified LUDOCORE to produce complete execution traces: records of every logical fact that is true in the game world, in both the static and dynamic sense. Such complete traces represent an accurate view of the knowledge available to the designer when she is looking for patterns during playtesting, but they are very tedious to explore manually. Where a simple trace may state that the event mine(dino_bones) happened at timepoint 22, a complete trace will assert that mining is a player-selectable game event, that the event was possible at that time and others and was mutually exclusive with the two available movement events, that dino_bones is a rock with the treasure property, and that it is located in the cavern designated i, which is linked to caverns g and h. If there is something interesting to be said about mining this rock, it is likely to involve some of these contextual details. Using LUDOCORE's query language (based on logical integrity constraints), it is possible to ask for gameplay traces that illustrate how a player might navigate the robot down, drill out dino_bones, and return it to the base without ever letting its energy level drop below 6. The code for this query is small (just four lines); however, writing it requires careful reasoning about the scope of variable quantification and domain restriction as well as reasoning through double negation. A Language of Gameplay Properties To ease the definition of gameplay patterns that may be of interest to a designer, we created a new language inside of Prolog (the syntax also used to define LUDOCORE games). Pattern definitions are declarations of what evidence must be present in (or absent from) a complete execution trace to detect an instantiation of that pattern. An example pair of patterns is shown in Figure 2. Syntax The <- (or is-detected-when) operator binds the name of a pattern (which might be parameterized by logic variables) to its requirements. Requirements can refer to the presence of elements in a trace, such as that a game includes some event, that the event happens, or that some element of game state holds at some time. All LUDOCORE games share the general concepts of events and state, but many interesting patterns will make reference to game-specific concepts (such as the action of mining or a particular rock in DrillBot). The primitive construct can be used to require (or forbid, using the \+ operator) any element of a trace, whether it is game-specific or not. To afford exploration of interesting patterns by seeing where they co-occur with other patterns and how their presence affects the conditional presence of other interesting patterns, requirements can also constrain the presence of any other pattern (using the pattern construct). A final construct of the language, when, can be used to describe additional constraints not present in the trace. A common use for this construct is to assert that two pattern variables should never be equal, or that (if they are timepoints) the enclosing pattern should only be detected when the values of the variables are strictly ordered. Evidence Sets Patterns in this language can be automatically translated into the more tedious query language supported by LUDOCORE.
So far, we have only explored a fixed database of pre-collected traces. When asked to show evidence for the presence of a given pattern, our system finds all possible sets of evidence that, due to their presence in a trace, permit the detection of a pattern using some instantiation of its pattern variables. Given a library of patterns, the system will produce a table of pattern names with concrete symbols substituted for variables, scored by the number of distinct evidence sets which support each pattern. Given this table, a designer can then ask the system to display the detailed evidence sets for a particular instantiation. In many cases, it is the deeper examination of these evidence sets which suggests the definition of a new, composite pattern. It is possible to use the compiled form of the pattern detector as a query in LUDOCORE. Thus, the designer can use machine playtesting to directly search for more evidence of known patterns or ask about the existence of any possible traces that realize a freshly conjectured pattern.
Figure 1. A screenshot of gameplay in the DrillBot 6000 model. Black circles indicate game elements that our system automatically identified as ignored by players. The yellow token d2 is a non-valuable rock in a dead-end cavern, and the space f is a linked cavern which provides no apparent navigation benefits.
Exploring Ignored Moves in DrillBot To make our discussion of patterns and evidence sets more concrete, we will now consider the results of using our system to explore ignored moves in DrillBot. Figure 2 shows two pattern definitions in our library. The sometimes(E) pattern captures the idea that some game event (bound to its pattern variable) happens at least once in a given trace. Building on this, the ignored_move(E) pattern describes the situation where an event that is supported by the game's rules is dynamically available to the player (possible) at least once in a trace while the player is never observed selecting that action. Running our system with DrillBot and these patterns yielded a report which described several instantiations of the ignored move pattern. The most commonly ignored moves involved the mining event, particularly for non-treasure rocks at leaves of the map's navigation graph (such as the rock d2 indicated in Figure 1). A less common (but more interesting) instantiation of the ignored move pattern involved the up_to(f) event. What is so special about this move? It turns out the f cavern is an emergent dead-end when players leave the c0 rock above it un-mined (because the robot cannot move into non-empty caverns). The nearby e cavern, despite being filled with rocks at the start of the game, is more often chosen by players as (1) it is filled with treasured rocks, (2) it is more connected to other caverns than f, and (3) it provides an equal-length path to the deeper parts of the map in comparison with the ignored f cavern. Before having the system draw our attention to the f cavern's properties, we were previously unaware of this type of emergent dead-end in DrillBot's level design. In an iterative design process, we might intentionally create several such emergent dead-ends or even use a compiled pattern detector for these localized situations in conjunction with LUDOCORE's "structural query" feature to automatically solve for new level designs which include this pattern.
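The following is an illustrative re-expression of these two patterns, in Python rather than the paper's Prolog-based language, over a hypothetical encoding of a complete trace as lists of possible and happens facts; the trace data shown is invented for the example.

```python
# An illustrative stand-in (in Python, not the paper's Prolog-based language)
# for detecting the sometimes and ignored_move patterns over a complete
# execution trace. The trace encoding is hypothetical.

def sometimes(trace, event):
    """The event happens at least once in the trace."""
    return any(e == event for e, _t in trace["happens"])

def ignored_move(trace):
    """Events possible at some timepoint but never selected by the player."""
    possible = {e for e, _t in trace["possible"]}
    happened = {e for e, _t in trace["happens"]}
    return possible - happened

# A fragment of a complete trace from a DrillBot session: (event, timepoint).
trace = {
    "possible": [("mine(d2)", 4), ("up_to(f)", 9), ("mine(dino_bones)", 22)],
    "happens":  [("mine(dino_bones)", 22)],
}

print(ignored_move(trace))                    # {'mine(d2)', 'up_to(f)'}
print(sometimes(trace, "mine(dino_bones)"))   # True
```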
Future Work Because the theory of rational curiosity applies equally to humans and machines, it suggests that we expand this creativity support system in two directions: further supporting human creativity, and creating a software component that can be used in developing automated game design systems that are themselves creative. Towards both of these goals, we would like to eliminate the need to directly formulate even these high-level pattern descriptors. Instead, we believe machine learning techniques can be adapted to translate a collection of manually assembled evidence sets into a most-likely pattern definition, which can then be used to collect and organize additional evidence sets or form a part of a higher-level pattern. Conclusion Motivated by the theory of rational curiosity, this project has explored the idea that creativity support tools in game design should directly support gaining design knowledge. The system realized thus far has demonstrated the ability, in an automated manner, to direct a designer's attention to concrete instantiations of patterns of interest and to suggest subsequent patterns for future exploration. Acknowledgements This work was supported in part by the National Science Foundation, grant IIS-1048385. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. 2011_27 !2011 Interpretation-driven Visual Association Kazjon Grace & Rob Saunders Faculty of Architecture, Design and Planning University of Sydney kazjon@arch.usyd.edu.au, rob.saunders@sydney.edu.au John Gero Krasnow Institute for Advanced Study George Mason University john@johngero.com Abstract In this paper we outline ongoing research into a computational model of association based on the reinterpretation of a source object to fit the target. We describe the structure of the model and the concepts from which it arises. Preliminary results of visual associations made by the system in a simple shape domain are presented. We also discuss a planned application of our model to the analysis of a real-world creative design. Introduction Association is the construction of a mapping between source and target objects. This fundamental cognitive ability underlies analogical reasoning, metaphorical imagery and other creative processes based on constructing abstract similarities. This paper presents a computational model that focuses on the construction and reinterpretation of representations during association, a particularly important process for computationally modelling creative analogy-making (French 2002; Kokinov 1998). Results of applying an implementation of the computational model to simple visual association problems are given, and the application of the system to more complex problems is discussed. Association is composed of three subprocesses: representation of the source and target objects; matching between the representations; and construction of a mapping around that match. These processes cannot be modelled serially: representation must occur in parallel with matching and mapping (Kokinov 1998). This contrasts with association as typically modelled in computational analogy-making (French 2002), where the concepts and/or the relationships between them are fixed. We have developed a model of association that focuses on the iterative interaction between the search for mappings and the construction of representations, an interaction that we call interpretation-driven search.
The system's ability to reinterpret objects extends its mapping capability beyond matching identical features present in the provided representations. Our system's interpretation process guides, and is guided by, the ongoing mapping process. New interpretations are discovered through the search for mappings. Interpretation provides a capability akin to Copycat's ‘conceptual slippage', except that there is no predefined list of conceptual equivalencies. The following sections describe the computational model with reference to an implementation for simple visual problems. We also explore the application of the system to more complex visual problems in a design domain. Interpretation-driven Association The model described in this paper can be decomposed into three interacting systems (Figure 1). Perception is the system that describes objects it encounters; mapping is the system that relates those objects; and interpretation is the system that changes the descriptions of the objects. Figure 1: The structure of the interpretation-driven association system, showing how the representations produced by the Perception system are iteratively searched for mappings and changed through interpretation. Perception In the implementation presented here, objects are vector images composed of polygonal shapes. The perception system detects shapes and describes them using the contour of their outlines. The system's representations are constructed from these detected shapes and from relationships built between them, both typological and topological. Shapes are categorised into concepts, which are groups of shapes with similar outlines. A shape that has a representation unlike previously learned ones will generate a new conceptual category, and future shapes judged sufficiently similar will be added to that category. We model these constructive behaviours in perception because the less the authors of a system are involved in its specific representations, the stronger the claim that can be made about the autonomy of its associations (Hofstadter and Mitchell 1994). This autonomy is a necessary precursor to any claim that the system itself is capable of acting creatively. The set of shape features for each object is translated into a graph-based representation where nodes represent shapes and edges represent relationships between those shapes. Typological relationships are based on the similarity of the conceptual categories the shapes are placed into. Topological relationships are based on geometric relationships within the object, including: proximity, scale, orientation, bearing, overlap, containment, shared vertices and shared edges. To support the matching and mapping processes, relationships between shapes are expressed relatively, e.g., size(A) = 0.5∗size(B). The result is a graph of shapes and the relationships between them for each object; these graphs are then searched for mappings. Mapping The mapping process searches the source and target graphs for sub-graph mappings with an overlapping set of relationships between the two graphs. For example, a mapping between two pairs of shapes where both pairs share orientation would be successful, even if those shapes were connected by other relationships that did not match.
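As a hedged sketch (not the authors' implementation), the graph representation just described might look like the following in Python, with shapes as nodes carrying a conceptual category and relative relationships as typed edges; the relationship names and values are illustrative assumptions.

# Each object is a small graph: nodes are detected shapes (with the
# conceptual category they were placed into), edges are typed,
# relative relationships between shape pairs.
source = {
    "nodes": {"A": {"concept": "c1"}, "B": {"concept": "c2"}},
    "edges": [
        # relative relationship: size(A) = 0.5 * size(B)
        ("A", "B", {"type": "relative_size", "value": 0.5}),
        ("A", "B", {"type": "proximity", "value": "proximal"}),
    ],
}

def edges_match(src_edge, tgt_edge):
    # Two edges match when they carry the same relationship type and
    # value; non-matching relationships on the same pair are ignored.
    return (src_edge[2]["type"] == tgt_edge[2]["type"]
            and src_edge[2]["value"] == tgt_edge[2]["value"])

print(edges_match(source["edges"][1],
                  ("X", "Y", {"type": "proximity", "value": "proximal"})))  # True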
As the relationships are stored as relative values, mappings can be made between quite different groups of shapes without applying any kind of interpretation to the representations. Interpretation changes the object representations, broadening the possible mappings beyond finding identical relationships. The interpretation system alters the representations of source and target, which in turn alters the search space for the mapping process. Interpretation Interpretation in association is changing representations by taking a different perspective on one or both of the objects being associated. Interpretation is defined for the purposes of this system as inducing an equivalency in meaning between one type of representation in the source and another in the target. An interpretation states that a relationship in the source graph is to be treated as a match with a different relationship in the target graph. The interpretation process takes an unsuccessful mapping under the current interpretation and, if a coherent substitution of one relationship in the source for another in the target would produce a better mapping, suggests that substitution as an interpretation. Interpretations produced in this fashion are then evaluated against the current interpretation based on how many nodes they could add to a mapping if they were adopted. If a new interpretation compares favourably, it becomes the default way to view the objects and directs the mapping process accordingly. This process allows the system to make associations that are based not on identical patterns of relationships in the source and target, but on identical structures of relationships that may semantically be very different. Figure 2 shows an association made by our prototype system. Fig. 2 A and B show the visual representations of the source and target objects, while C and D show the graph representations constructed by our perception system. Both objects contain five shapes; the shapes in the target all fall into the same concept (they have identical outlines), while the shapes in the source are only similar, so a different conceptual category is created for each. Many relationships connect these shapes, but we highlight several pertinent relationships with thick dashed lines in C and D. The lines connecting the two graphs show a mapping that was found by the system using the interpretation ‘being proximal in the source domain is the same as sharing a vertex in the target domain'. This interpretation was constructed and applied by the system during search. The system is designed to find many different associations for any problem; this is just one possible mapping with one possible interpretation. Figure 2: An example association problem: A and B are the visual representations given to the system, C and D are the graph representations constructed. Applying Interpretation-driven Association Preliminary association results, like those presented in Figure 2, demonstrate that the model has the capacity for finding non-obvious associations between groups of shapes. The works of a particular creator or of a particular school often share stylistic elements: recurring features within or between designs that are all variations of the style or theme of the work.
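The interpretation step can be sketched as follows, reusing the edge encoding from the previous sketch. This is a deliberate simplification (it scores candidate substitutions by matched edge pairs rather than by the nodes a mapping would gain), so it illustrates the idea rather than reproducing the authors' algorithm.

from itertools import product

def interpretation_gain(source, target, src_type, tgt_type):
    # Count edge pairs that would match if relationships of src_type
    # in the source were treated as equivalent to tgt_type in the
    # target (e.g., proximity ~ shared_vertex).
    gain = 0
    for s, t in product(source["edges"], target["edges"]):
        if s[2]["type"] == src_type and t[2]["type"] == tgt_type:
            gain += 1
    return gain

def propose_interpretation(source, target):
    # Suggest the relationship substitution with the largest gain; it
    # would then be adopted only if it beats the current interpretation.
    src_types = {e[2]["type"] for e in source["edges"]}
    tgt_types = {e[2]["type"] for e in target["edges"]}
    return max(product(src_types, tgt_types),
               key=lambda p: interpretation_gain(source, target, *p))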
In this ongoing research project we aim to determine whether our system can connect similar design elements with mappings that demonstrate their common style. An example of a real-world design that contains such a recurring stylistic element is Frank Lloyd Wright's 1921 ‘Hollyhock House'. This California residence has a stone roof lined by distinctive stone friezes bearing a pattern of rectilinear shapes, seen in Figure 3a. The design makes strong use of geometric shapes throughout, but several other details make direct reference to the iconic frieze-work feature. Fig. 3b shows one of the custom dining chairs designed for the House, with a wood-carved back that calls to mind the design in Fig. 3a. There is also a stained-glass window design (Fig. 3c) that is clearly inspired by the frieze, with similar proportions and isometric projections of cubes representing the square blocks of the original design. Figure 3: a) The iconic Hollyhock House frieze, b) a variation of the frieze pattern on a dining-room chair back, c) a stylistically related pattern on a window. Given a simplified vector representation of each of these designs, we will test whether the interpretation-driven association system presented here is able to find associations between these variations of a visual theme, e.g., finding associations between the chair-back design and the stonework that inspired it. We expect our system to construct interpretations that equate the shapes of the frieze with those of the stained-glass window. In doing so it will have generated an association that encodes part of the designer's visual style. The system as it exists currently, with its closed-shape-based perception system, is well suited to working with ornamental features and other details found in a variety of domains including architecture, industrial design, textiles, iconography and graphic design. Initial tests with alternative perceptual systems, e.g., based on SURF and SIFT descriptions of shapes (Bay et al. 2008; Rowe 1999), have shown that it is also possible to work with photographs, although more work is required to identify the most salient features detected for the construction of graph representations. Discussion This paper presents a model of association based on the principle of re-interpreting two objects so that there is a new relationship between them. We demonstrate a proof-of-concept implementation capable of making non-obvious associations between groups of shapes. We intend to apply this prototype to vector-image representations of real-world creative design artefacts to see if it is capable of finding common stylistic elements, both within a design, as with the Hollyhock example, and between different designs. Cha and Gero (1999) describe style in design, using a formal grammar, as a set of relationships by which a hierarchy of visual elements is composed. Sets of shapes with consistent relationships between them form low-level patterns, and relationships between patterns form higher-level visual structures. The system presented here will be extended to support the construction of similarly sophisticated associations between patterns by allowing higher-level concepts to be formed from groups of existing shape elements. These meta-concepts will be treated as ‘super-nodes' in the object graphs: composed of a number of other concepts, but able to be related to as a single unit. This would remove the requirement for ordinality in associations (i.e., that
five features in the target must always map to five features in the source). The ability to treat groups of concepts with particular relationships between them as a single entity relaxes the restrictions on possible mappings and opens up new kinds of associations. This meta-concept formation could be implemented using algorithms for finding cliques in graphs (Moon and Moser 1965) and by learning from previously known groups. By adding the ability to construct hierarchies of thematic or stylistic features, interpretation-driven association will be able to construct mappings that relate complex creative works. Our aim is to determine whether our system can build associations that demonstrate commonalities of style and structure between creative works. The system described here implements one simple form of interpretation: induced equivalencies between relationships. Many other forms of re-interpretation are possible in our model, such as changing the definitions of shape elements, excluding or focussing on different elements and relationships within the representations, or applying a variety of transformations to the objects or their representations. Our model is extensible to multiple forms of interpretation, and the kind presented here is just one example. We have demonstrated the feasibility of our interpretation-based model of association. Our system re-represents objects in parallel with the search for mappings between those objects. Our system constructs its own representations using conceptual categories that have been developed through its experiences. This system can produce associations based on interpretations of objects in simple visual domains. Research into applying this model to more complex domains is ongoing. 2011_28 !2011 (Missing) Concept Discovery in Heterogeneous Information Networks Tobias Kötter and Michael R. Berthold Nycomed Chair for Bioinformatics and Information Mining University of Konstanz, Germany Tobias.Koetter@uni-konstanz.de Abstract This paper proposes a new approach to extract existing (or detect missing) concepts from a loosely integrated collection of information units by means of concept graph detection. Once the concepts have been extracted they can be used to create a higher-level representation of the data. Concept graphs further allow the discovery of missing concepts, which might lead to new insights by connecting seemingly unrelated information units. Introduction The amount of data researchers have access to increases at a breathtaking pace. The available data stems from heterogeneous sources in diverse domains, with varying semantics and of varying quality. It is a big challenge to integrate and reason over such an amount of data. However, by integrating data from diverse domains one might discover relations that span multiple domains, leading to new insights and thus a better understanding of complex systems. In this paper we use a network-based approach to integrate data from diverse domains of varying quality. The network consists of vertices that represent information units such as objects, ideas or emotions, whereas edges represent the relations between these information units. Once the data has been merged into a unifying model it needs to be analyzed. In this paper we propose concept graphs as an approach to extract semantical information from loosely integrated information fragments. Concept graphs allow for the detection of existing concepts, which can be used to create an abstraction of the underlying data.
By providing a higher-level view on the data, the user might get a better insight into the integrated data and discover new relations across diverse domains that have been hidden in the noise of the integrated data. Concept graphs also allow for the detection of domain bridging concepts (Kötter, Thiel, and Berthold 2010) that connect information units from various domains. Domain bridging concepts might support creative thinking by connecting seemingly unrelated information units from diverse domains. Another advantage of concept graphs is that they enable the detection of information units that share common properties but to which no concept has been assigned yet. This might lead to the discovery of concepts that are missing in the data or to the detection of new concepts. The rest of the paper is organized as follows: in the next chapter we briefly review Bisociative Information Networks, which we use for the integration of heterogeneous data sources from diverse domains. Subsequently we introduce concept graphs and describe their detection. We then discuss the discovery of concept graphs in a real-world data set and show some example graphs. Finally we draw conclusions from our discussion and give an outlook on future work. Bisociative Information Networks Bisociative Information Networks (BisoNets) (Berthold et al. 2008) provide a framework for the integration not only of semantically meaningful information but also of loosely coupled information fragments from heterogeneous data sources. The term bisociation (Koestler 1964) was coined by Arthur Koestler in 1964 to indicate the "...joining of unrelated, often conflicting information in a new way...". BisoNets are based on a k-partite graph structure, whereby the most trivial partitioning would consist of two partitions (k = 2), with the first vertex set representing units of information and the second set representing the relations among information units. By representing relations as vertices, BisoNets support the modeling of relationships among any number of members. However, the role of a vertex is not fixed in the data. Depending on the point of view, a vertex can represent an information unit or a relation describing the connection between units of information. Members of a relation are connected by an edge with the vertex describing the relation they share. One example is the representation of documents and authors, where documents as well as authors are represented as vertices. Depending on the point of view, a document might play the role of the relation describing authorship, or might be a member in the relation of documents written by the same author. The unified modeling of information units and relations as vertices has many advantages, e.g., both support the assignment of attributes such as different labels. However, these attributes do not carry any semantic information. Edges can further be marked as directed, to explicitly model relationships that are only valid in one direction. Vertices can also be assigned to partitions to distinguish between different domains such as biology, chemistry, etc. Since relations are assigned a weight that describes the reliability of the connection, BisoNets, in contrast to ontologies, semantic networks or topic maps, support the integration not only of facts but also of pieces of evidence. Thus units of information and their relations can be extracted from various information sources such as existing databases, ontologies or semantical networks.
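As a hedged illustration of this bipartite modeling, the documents-and-authors example might be encoded as follows in Python; the attribute names, partitions and weights are assumptions made for illustration, not part of any actual BisoNet API.

# Every vertex can act as an information unit or as a relation;
# edges connect members to the vertex describing their relation,
# and weights express the reliability of each connection.
vertices = {
    "alice":  {"partition": "people"},
    "bob":    {"partition": "people"},
    "doc_42": {"partition": "documents"},  # doubles as the authorship relation
}
edges = [
    ("alice", "doc_42", {"weight": 1.0}),  # alice authored doc_42
    ("bob",   "doc_42", {"weight": 0.8}),  # weaker evidence for bob
]

def members(relation, edges):
    # All vertices participating in the given relation vertex.
    return {u for u, v, _ in edges if v == relation}

print(members("doc_42", edges))  # {'alice', 'bob'} (set order may vary)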
Semistructured and noisy data, such as literature or biological experiments, can also be integrated in order to provide a much richer and broader description of the information units. By applying different mining algorithms to the same information source, diverse relations and units of information can be extracted, where each mining algorithm represents an alternative view that might highlight a different aspect of the same data. BisoNets focus on the information units and their relations alone, without storing all the more detailed data underneath the pieces of information. However, vertices do reference the detailed data they stem from. This allows BisoNets to integrate huge amounts of data and still be able to show the data from which a vertex originates. Concept Graphs Once all the data has been integrated, it has to be analyzed in order to find valuable information. We propose a new method to extract semantical information from the loosely integrated collection of information units by means of concept graph detection. A concept graph represents a concept, which stands for a mental symbol. A concept consists of information units, which refer not only to materialized objects but also to ideas, activities or events, together with their shared aspects, which represent the properties the information units share. In philosophy and psychology, the information units are also known as the extension of a concept, which consists of the things to which the concept applies. The aspects are known as the intension of a concept, consisting of the idea or the properties of the concept. An example would be a concept representing birds, with specific birds such as eagles or sparrows as information units, which in turn are related to their common aspects such as feather, wing, and beak. In addition to the information units and their shared aspects, a concept graph might also contain the symbolic representation of the concept itself. This symbolic representation can be used to generate an abstract view on the data, since it represents all members of the corresponding concept graph. An example of a concept graph that represents the concept of flightless birds is depicted in Figure 1. It consists of the two information units Ostrich and Weka and their shared aspects wing and feather. The graph also contains the symbolic representation of the flightless bird concept, which can be used as an abstract representation of this particular concept graph. Preliminaries As mentioned above, a concept graph contains information units which are similar in that they share some aspects. (Figure 1: Example concept graph.) In BisoNets, the aspects of an information unit are represented by its direct neighbors. The more neighbors two information units share, the more similar they are. This leads to the representation of a concept graph as a dense subgraph in a BisoNet, consisting of two disjoint and fully connected vertex sets. Here the first vertex set represents the information units and the second vertex set the aspects that are shared by all information units of the concept graph. Thus a perfect concept graph would form a complete bipartite graph, as depicted in Figure 1, with the information units as the first partition and the aspects (together with the concept) as the second partition. An imperfect concept graph also contains relations among the vertices within a partition and thus does not form a perfect bipartite (sub)graph. However, such imprecise concept graphs are of prime interest, of course.
Once a dense subgraph has been detected, it needs to be analyzed in order to distinguish between the information unit set and the aspect set. We have developed heuristics to detect the two set types for directed and undirected networks. Both heuristics are based on the assumption that information units are described by their neighbors in the network. The heuristics for directed networks are additionally based on the assumption that information units point to their aspects; hence, in a directed network a relation consists of an information unit as source vertex and an aspect as target vertex. The heuristics to identify the different vertex types are based on the following definitions. Let B(V, E) be the (un)directed BisoNet that contains all information, with V representing the vertices and E ⊆ V × V the edges. C(V_A, V_I, E′) ⊆ B defines the concept graph C in the BisoNet B, where V_A ⊆ V represents the aspect set and V_I ⊆ V the information unit set of the concept graph C, with V_A ∩ V_I = ∅. E′ ⊆ E is the set of edges that fully connects the two vertex sets of the concept graph, so that V_A × V_I ⊆ E′. Let N(v) = {u ∈ V : {v, u} ∈ E} be the neighbors of the vertex v ∈ V in the BisoNet B, while N⁺(v) = {u ∈ V : (v, u) ∈ E} denotes its target neighbors and N⁻(v) = {u ∈ V : (u, v) ∈ E} its source neighbors. The neighbors within the concept graph C of a vertex v ∈ V_A ∪ V_I are denoted by N_C(v) = {u ∈ V_A ∪ V_I : {v, u} ∈ E′}, while N_C⁺(v) = {u ∈ V_A ∪ V_I : (v, u) ∈ E′} denotes its target neighbors and N_C⁻(v) = {u ∈ V_A ∪ V_I : (u, v) ∈ E′} its source neighbors. Information unit set The information units form the first of the two disjoint vertex sets of the concept graph. The heuristic that denotes the probability of a vertex set V′ ⊆ V being the information unit set is a function i(V′) → [0, 1]. In an undirected network, i(V′) is defined as the product, over each vertex in V′, of the ratio of its neighbors inside the concept graph to all of its neighbors: i(V′) = ∏_{v ∈ V′} |N_C(v)| / |N(v)|. In a directed network, the heuristic is defined analogously over target neighbors: i(V′) = ∏_{v ∈ V′} |N_C⁺(v)| / |N⁺(v)|. The information unit set V_I ⊆ V is the vertex set of the concept graph that maximizes the function i(V′). Aspect set The aspect set is the second vertex set of the concept graph; it describes the information units of the concept graph. Each aspect on its own might be related to other vertices as well, but the set of aspects as a whole is shared only by the information units of the concept graph. The members of the aspect set might differ greatly in the number of their relations to vertices outside of the concept graph, depending on their level of detail: more abstract aspects, such as animal, are likely to have more neighbors outside of the concept graph than more detailed aspects, such as bird. The heuristic that denotes the probability of a vertex set V′ ⊆ V belonging to the aspect set is a function a(V′) → [0, 1]. In an undirected network, a(V′) is defined as the complement of the information unit heuristic: a(V′) = 1 − ∏_{v ∈ V′} |N_C(v)| / |N(v)| = 1 − i(V′). In a directed network, the heuristic is defined as the product, over each vertex in V′, of the ratio of its source neighbors inside the concept graph to all of its source neighbors: a(V′) = ∏_{v ∈ V′} |N_C⁻(v)| / |N⁻(v)|.
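A hedged Python sketch of the two undirected heuristics, reusing the weighted edge-list encoding from the BisoNet sketch above (the encoding itself is an assumption, not the authors' implementation):

from math import prod

def neighbors(v, edges):
    # Neighbors of v in an undirected edge list of (u, v, attrs) triples.
    return ({b for a, b, _ in edges if a == v}
            | {a for a, b, _ in edges if b == v})

def i_score(candidate, concept_edges, all_edges):
    # i(V') = product over v in V' of |N_C(v)| / |N(v)|.
    # Assumes every candidate vertex has at least one neighbor overall.
    return prod(len(neighbors(v, concept_edges)) / len(neighbors(v, all_edges))
                for v in candidate)

def a_score(candidate, concept_edges, all_edges):
    # a(V') = 1 - i(V') in the undirected case.
    return 1 - i_score(candidate, concept_edges, all_edges)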
The aspect set V_A ⊆ V is the vertex set of the concept graph that maximizes the function a(V′). Concepts The concept is a member of the aspect set V_A. A concept differs from the other members of the aspect set in that it should be related only to the information units within the concept graph. Hence a perfect concept has no relations to vertices outside of the concept graph, and can thus be used to represent the concept graph. The heuristic that denotes the probability of a vertex v ∈ V_A being the concept that can represent a concept graph C is a function c(v) → [0, 1], whereby 1 denotes a perfect concept. For an undirected network, the heuristic is defined as the ratio of the neighbors inside the concept graph to all neighbors: c(v) = |N_C(v)| / |N(v)|. In a directed network, the heuristic considers the corresponding ratio of source neighbors: c(v) = |N_C⁻(v)| / |N⁻(v)|. The concept that can represent the concept graph is the vertex v from the aspect set V_A with the highest value of c(v). Depending on a user-given threshold, we are also able to detect concept graphs without a concept: a concept graph lacks a concept if the concept value c(v) of every vertex in its aspect set is below the given threshold. This might be an indication of an unknown relation among information units that has not been discovered yet and to which no concept has been assigned. Detection In this paper we use a frequent item set mining algorithm (Agrawal and Srikant 1994) to detect concept graphs in BisoNets. By using frequent item set algorithms we are able to detect concept graphs of different sizes and specificity. Frequent item set mining was developed for market basket analysis, in order to find sets of products that are frequently bought together. It operates on a transaction database in which each record consists of a transaction identifier and the products that were bought together in that transaction. Represented as a graph, overlapping transactions form a complete bipartite graph, which is the basis of our concept graphs. In order to apply frequent item set mining algorithms to find concept graphs in BisoNets, we have to convert the network representation into a transaction database. Therefore, for each vertex in the BisoNet, we create an entry in the transaction database with the vertex as the identifier and its direct neighbors as the products. Once the database has been created, we can apply frequent item set mining algorithms to detect vertices that share some neighbors. Frequent item set mining algorithms allow the selection of a minimum support, which defines the minimum number of transactions that must contain a given item set for it to count as frequent. They also allow a minimum size to be set for the item set itself, in order to discard all item sets that contain fewer items than the given threshold. By setting these two thresholds we are able to define the minimum size of a concept graph. Since we want to find concept graphs of different specificity, we need an additional threshold that takes the general overlap of the transactions into account. To achieve this we used an adaptation of the Eclat algorithm (Zaki et al. 1997) called Jaccard Item Set Mining (JIM) (Segond and Borgelt 2011). JIM uses the Jaccard index (Jaccard 1901) as an additional threshold for pruning the frequent item sets. For two arbitrary sets A and B the Jaccard index is defined as j(A, B) = |A ∩ B| / |A ∪ B|. Obviously, j(A, B) is 1 if the sets coincide (i.e.,
A = B) and 0 if they are disjoint (i.e., A ∩ B = ∅). By setting the threshold for the JIM algorithm between 0 and 1, we are able to detect concept graphs of different specificity. By setting the threshold to 1, only those vertices that share all of their neighbors are retained by the algorithm. This results in the detection of more specific concept graphs, which contain either information units or aspects that belong exclusively to the detected concept graph. Relaxing the threshold to a smaller value results in the detection of more general concept graphs, where the information units share some but not all of their aspects. Varying thresholds might lead to the detection of overlapping concept graphs, which can be used to create a hierarchy among the concepts. Application The 2008/09 Wikipedia Selection for schools1 (Schools Wikipedia) is a free, hand-checked, non-commercial selection of the English Wikipedia2 funded by SOS Children's Villages. It has been created with the intention of building a child-safe encyclopedia. It has about 5500 articles and is about the size of a twenty-volume encyclopedia (34,000 images and 20 million words). The encyclopedia contains 154 subjects, which are grouped into 16 main subjects such as countries, religion and science. The network has been created from the Schools Wikipedia version of October 2008. Each article is represented by a vertex, and the subjects are represented by domains. Every article is assigned to one or more domains, depending on its assigned subjects. Hyperlinks are represented by directed links, with the article that contains the hyperlink as source and the referenced article as target vertex. 1 http://schools-wikipedia.org/ 2 http://en.wikipedia.org This example data set and its representation as a hyperlink graph have been chosen since they can be validated manually by reading the Schools Wikipedia articles and inspecting their hyperlinks. Results This section illustrates concept graphs discovered in the Schools Wikipedia dataset using the JIM algorithm. The concept graphs consist of the discovered item sets, which form the first vertex set, and the corresponding root vertices of the transactions, which build the second vertex set. Once we have discovered both vertex sets and determined their types, we can display them as a graph. The following graphs display the information units as triangular vertices. Aspects and the concept are both represented by square vertices, with the concept distinguished by a box around its label. Figure 2 depicts two different bird categories which were extracted from the animal section of the Schools Wikipedia dataset. Both graphs depict the aspects and the concept in their center and the information units in the surrounding circle. The first concept graph (Figure 2a) represents the group of waders. Waders are long-legged wading birds such as herons, flamingos and plovers. The concept graph also contains terns and gulls, even though they are only distantly related to waders; however, Schools Wikipedia states that studies in 2004 showed that some gene sequences of terns indicate a close relationship between terns and the Thinocori, some species of aberrant waders. Reptiles are included in the graph since most of the larger waders eat reptiles. The second concept graph (Figure 2b) represents the bird of prey group. Birds of prey, or raptors, hunt for food on the wing. The graph includes all the different subfamilies, such as eagle, hawk, kite, osprey and falcon.
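To illustrate the detection machinery described earlier (converting the network into a transaction database and pruning by Jaccard index), here is a hedged Python sketch over a simplified, unweighted edge list; the naive pairwise search stands in for the actual Eclat/JIM algorithms.

from collections import defaultdict
from itertools import combinations

def to_transactions(edges):
    # One transaction per vertex: its identifier mapped to the set of
    # its direct neighbors (the 'products').
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def jaccard(a, b):
    # j(A, B) = |A intersect B| / |A union B|.
    return len(a & b) / len(a | b) if (a | b) else 1.0

def shared_concepts(transactions, min_jaccard=0.5, min_support=2):
    # Naive stand-in for JIM: vertex pairs whose neighbor sets overlap
    # strongly enough to suggest a concept graph.
    for u, v in combinations(sorted(transactions), 2):
        shared = transactions[u] & transactions[v]
        if (len(shared) >= min_support
                and jaccard(transactions[u], transactions[v]) >= min_jaccard):
            yield u, v, shared

edges = [("ostrich", "wing"), ("ostrich", "feather"),
         ("weka", "wing"), ("weka", "feather")]
for u, v, shared in shared_concepts(to_transactions(edges)):
    print(u, v, sorted(shared))
# Prints both sides of the bipartite concept graph:
# feather wing ['ostrich', 'weka']
# ostrich weka ['feather', 'wing']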
It also includes some of the birds' prey, such as chicken or crows. The common cuckoo is not a bird of prey but is included in the concept graph since, as stated in its article in Schools Wikipedia, it looks like a small bird of prey in flight. The animal examples benefit from the structure of the Schools Wikipedia pages in the animal section: they all contain an information box with the Kingdom, Phylum, etc. of the animal. Nevertheless, this demonstrates that our method is able to discover ontologies if they are available in the integrated data. Furthermore, the examples demonstrate the capability of the method to detect specific categories, such as waders or birds of prey, even though they are not part of the ontology structure in Schools Wikipedia. In contrast to the animal graphs, the next concept graphs contain more aspects than information units; therefore the layout of the vertices is changed: the information units are depicted in the center, whereas the aspects and the concepts form the outer circle. Figure 2: Concept graphs from the animal section. Figure 3 stems from the math section of the Schools Wikipedia data set and demonstrates the ability to detect specific concepts based only on shared properties, without an integrated ontology. The first concept graph (Figure 3a) represents the concept of elementary arithmetic, grouping the main operations of elementary arithmetic such as addition, subtraction, multiplication and division. It also contains the vertex for abacus, since some abacuses, such as the Chinese suanpan, can be used to perform all of the mentioned elementary arithmetic operations. The graph also contains the vertex for the elementary algebra concept, which extends elementary arithmetic by introducing symbols in addition to numbers; this is described in the following paragraph. The second concept graph (Figure 3b) groups some of the main laws of elementary algebra, such as commutativity and associativity. Distributive law and symbol are not part of the concept graph since they are not explicitly explained in Schools Wikipedia and therefore not linked in the article. This is a limitation of the data used, not of the method itself, and it is why we want to incorporate more information about each article in the next version of the network, such as information from the full text of the articles obtained using text mining methods. Figure 3: Concept graphs from the mathematical section. Both math examples contain some common vertices that belong to more general concepts, such as mathematics, arithmetic and algebra, which could be used to generate a hierarchy of the mathematical section of the Schools Wikipedia data set. Figure 4 depicts two concept graphs from the physics domain of the Schools Wikipedia data set and demonstrates the detection of domain-crossing concepts. The graphs do not contain previously unknown relations, but they cross several domains, such as the domains for physics, astronomy, history, chemistry and people. Like the animal examples, these examples benefit to a certain extent from standardized information boxes in Schools Wikipedia. The first concept graph (Figure 4a) represents the concept graph for quantum field theory. It groups information units from the astronomy, physics and people domains with the domain for history. The second concept graph (Figure 4b) refers to the wave-particle duality concept, which combines the domains for physics, astronomy and people with the chemistry domain.
Figure 4: Domain bridging concept graphs from the physics section. Conclusion and Future Work In this paper we have discussed a new approach to detect existing (or missing) concepts in a loosely integrated collection of information fragments, which leads to a deeper insight into the underlying data. We have discussed concept graphs as a way to discover conceptual information in BisoNets. Concept graphs allow for the abstraction of the data by detecting existing concepts, leading to a better overview of the integrated data. They further support the detection of missing concepts by discovering information units that share certain aspects but have no concept, which might be a hint of a previously unknown and potentially novel concept. This approach can also be expanded to detect domain bridging concepts (Kötter, Thiel, and Berthold 2010), which might support creative thinking by connecting information units from diverse domains. Since BisoNets store the domain a vertex stems from, we can use this information to find concept graphs that contain information units from diverse domains. In addition to the discovery of concept graphs, we plan to identify overlapping concept graphs, which can be used to create a hierarchy among the detected concepts using methods from formal concept analysis (Wille 1982). The hierarchy, ranging from the most specific to the most general concepts, can be created by detecting more specific concept graphs that are included in more general concept graphs. The different levels of concept graphs can be detected by varying the threshold of the discussed Jaccard Item Set Mining algorithm. Acknowledgments The work presented in this paper was supported by a European Commission grant under the 7th Framework Programme FP7-ICT-2007-C FET-Open, project no. BISON-211898. 2011_29 !2011 Creative Cognition in Choreography David Kirsh Cognitive Science Department University of California, San Diego La Jolla, CA 92093, USA kirsh@ucsd.edu Abstract Contemporary choreography offers a window onto creative processes that rely on harnessing the power of sensory systems. Dancers use their bodies as things to think with, and their sensory systems as engines to simulate ideas non-propositionally. We report here on an initial analysis of data collected in a lengthy ethnographic study of the making of a dance by a major choreographer, and show how translating between different sensory modalities can help dancers and choreographers to be more creative. Introduction The design process of ‘making' a modern choreographic work offers insight into two creative processes much in need of understanding. 1. Distributed creativity: the mechanisms by which team members harness resources to interactively invent new concepts and elements, and then structure things into a coherent product. 2. Embodied cognition: the mechanisms by which creative subjects think non-propositionally, using parts of their own sensory systems as simulation systems and, in the case of dancers, using their own (and others') bodies as active tools for physical sketching. The close study of both of these processes bears directly on the goal of developing new theoretical models of creativity. It relocates creativity from a within-the-mind process to a more socio-technical process involving resources and other people, and it recognizes the importance that bodies and sensori-motor systems, both non-verbal and perhaps sub-rational elements, play in creative cognition.
In this paper we consider only the second process: the role of embodied cognition in creativity. Why choreography? Usually, creative processes fall short of their potential because variance in ideas is not managed well. The generative phase of creation is closed down too early, or it runs dry of its own accord (1). The choreographer observed in this study (henceforth WM) has developed techniques for keeping the process open longer and for maintaining substantial variance among the dancers despite the urge for groupthink and convergent behavior (2). He has also developed techniques for exploiting the coding language of sensory systems, both his own and his dancers', to create new movement ideas. WM is a remarkably successful choreographer, and his track record raises some obvious questions about his creative process. In particular, how does WM help his dancers to: • break their personal signature? Each dancer has a standard repertoire of moves and styles of moving. How can they be pushed beyond their personal repertoire? • be creative for longer periods, staying in a creative phase at full intensity for longer? Dancers can be creative in bursts that issue in phrases lasting 20 or 30 seconds. What can a choreographer do to lengthen a dancer's period of creativity from 20 or 30 seconds to 60 or even 70 seconds? • sustain long-term creativity? Typical brainstorming sessions can be successful for a few hours, or occasionally for a day. What methods can keep a dancer at near-peak levels for weeks at a time? • prevent premature crystallization? Creativity requires a period of openness, followed by a winnowing and narrowing of options. The danger is that ideas that seem good will be accepted before newer, even more radical ideas are proposed. How does the choreographer strike the right balance between keeping a process open and closing it? Methodology To study these and other fundamental questions about creative cognition we pursued a mixed methodology of close ethnographic observation, experimental study and computational analysis. To understand the choreographic process we videotaped all scheduled interactions between choreographer and dancers during the time they worked together, over thirty work days, to create a new dance that premiered at a major dance venue in London a week after its completion. Five high-definition video cameras were placed on the studio walls, and, whenever possible, two standard video cameras were placed on the ceiling. Written notes about the process were taken in real time. During the first three-week phase of ‘making', fifteen students took notes; during the second phase a single experienced ethnographer took notes. The choreographer was interviewed on digital video for between forty and sixty minutes each morning and night on most days. The dancers were also interviewed: at the end of each rehearsal, four dancers were selected and interviewed for thirty minutes each. Our aim with the dancers was to have them reflect on specific elements of that day's rehearsal. Whenever possible we had them verbally describe their experience during the day and then show us through movement what they meant. We also reviewed all notebooks. Coding: To code the video we used ELAN, a free software system developed by the Max Planck Institute for Psycholinguistics. ELAN was designed for studying gesture and small-scale interactions. We developed our coding system iteratively.
On the basis of interviews and common sense, we started out with a vocabulary for obvious communicative phenomena: for example, WM talking to one, two, three … all the dancers; WM gesturing; WM making certain non-linguistic sounds. We included other gross actions related to directing movement: touching or positioning dancers, WM showing the movements he wants, the use of props such as projected images or shared photos, and joint attention. As our collection of instances of these phenomena grew, we compared them for differences and began defining new coding predicates to differentiate or qualify them. For instance, when we looked more closely at sonifications (sounds WM would make to help communicate the shape, emphasis or dynamics of a phrase), we became interested in the relationship between the onset time of sonification and gesture, and then in the relationship between gestural form and a sonification's sound pattern. In another case, we became interested in a phenomenon that dancers call marking. From interviews we learned about this practice and how to recognize it; then, through further interviews, close study of video, and having dancers mark for specific purposes, we began to look for behavioral indicators of different types of marking (3). We found, for instance, that marking is very different when its goal is to coordinate grips in duets than when its goal is to help a dancer consolidate a movement just taught. The longer we work on our corpus, the more our coding scheme grows and specializes. One Type of Dance Creativity: One specific problem WM sets for himself, as reported in interview, is to create dance where human bodies move in ways never before seen. In the past, he derived inspiration from studying motor disorders such as ataxia, and from observing the way the heart and other organs move when revealed in open-heart surgery. But most often, he relies on a collection of techniques for harnessing sensory simulators, recruiting the power of embodied cognition based in the senses and in the elasticity of the body. It is these techniques we consider here. Embodied Creativity From earlier work on this topic (4), we discovered that dancer and choreographer regularly use their bodies as things to think with. They spend much of their time thinking non-propositionally. When trying to create new movement forms they use their bodies as a cognitive medium, much the way a graphic artist uses drawing as a cognitive medium or a violinist uses the sound emanating from his violin as a cognitive medium. Just as an artist or musician develops a close coupling with their tools (pencil and paper for the artist, violin for the violinist), so a dancer must have a tight control relation between body-as-tool and body-as-display-medium. Embodiment bears on dance the way instruments bear on artistic or musical product. Change the instrument and you may change the form or style of the output. So too in dance: changing the body-as-tool, say by making parts of it rigid or spasmodic, leads to a change in the form and style of the dance. This places the mechanics of the body front and center in the generation of dancerly movement. Additionally, from earlier work (3,4) we found that both choreographer and dancer rely on imagery in the visual, somato-sensory, tactile, and motor systems to create novel movement. The choreographer explicitly gives his dancers tasks that require them to shift between modalities.
For instance, he might ask them to imagine that their bones are made of firm rubber, or to imagine the feeling of being attacked. Their task is to translate those feelings into movements. One reason to see this process of simulating in one sensory modality and then translating to another modality as embodied cognition is that it relies on each modality having its own way of coding input, and its own ‘concepts'. Although embodied cognition, as a scientific expression, has different meanings (5,6), a common element across most versions is that cognitive processes are grounded in modality-specific brain systems. The way we acquired concepts through sight, sound, touch, and so on continues to affect our understanding of those concepts, long after they have been abstracted from specific senses. The idea of running is abstract, but we ground our understanding of that idea in the physical activity of running which we experienced when running. Embodied cognition, then, can be understood as a form of computation, distinct from familiar symbol manipulation or connectionist computation, wherein parts of the body, or parts of a sensory system, are harnessed to simulate some process. By simulating that process, a subject understands it. For instance, the mirror neuron system is sometimes cited as an example of embodied cognition because it is thought to explain how a subject can imbue the actions that someone else performs with meaning. By personally simulating, in their own motor or visual cortex, the planning and other processes related to executing those actions themselves (7), they understand what it is like to perform that action. Thus, when subjects see another person pouring a cup of tea, their brains respond by activating many of the same parts of cortex as would be activated were they pouring tea themselves. Some psychologists have argued that fainter versions of these same activations occur whenever a subject understands a sentence about pouring tea, and that this activation is what grounds much linguistic understanding (8). In dance, the tenets of embodied cognition may explain how dancers invent ‘dancerly' movements. Often WM will task the dancers with ‘solving' a choreographic problem. An example problem is to imagine what it is like to have a rigid rod connected to your shoulder, where the rod is pushed and pulled. To solve this problem a dancer works with a partner some distance away. That partner is notionally holding the rod and moving it. The dancer then generates mental imagery associated with the movement of the rod. Most of this imagery will be about the somatic or kinesthetic feelings of being pushed and pulled. The pattern of somatic or kinesthetic priming that these images create serves to bias the next somatic or kinesthetic images in the dancer's imagination. The priming defines a weighting function over somatic or kinesthetic image continuations. It is obvious that without a body or neural system capable of image continuations there would be no causal basis for priming, and hence no image continuations: it would be impossible to link or translate a given somatic state into motor movement continuations. No body, no motor movement continuations. The upshot is that a dancer's capacity to relate somatic or kinesthetic images to motor dispositions can be used to help him or her create interesting movements and also judge their aesthetic quality.
By interpreting their movement through the lens of one or more sensory modalities other than movement control per se, they are able to judge whether the movement looks right visually, feels right somatically and kinesthetically, or captures a sound well. This form of cognition is both embodied and non-propositional (9,10). Here is another choreographic problem that may clarify the method. A timeworn choreographic task is to ask a dancer to ‘paint' a contour, say Manhattan's skyline. Dancers would never use their hands alone as the paintbrush; that is too simple and boring. Instead, they use different body parts. For example, they might start with their elbow, continue the contour line with their head, then move to their hip or foot. This process involves several modalities, because the visual modality is required to imagine the contour, and if the dancer has feelings attached to parts of the contour then whatever modality is tied to emotional feeling will be a factor too. For instance, a dancer may believe that people have jumped off the Empire State Building, so he or she may have a special feeling about that part of the contour. As the different parts of the body trace the different parts of the contour, the feeling in one of these modalities (somato-sensory, visual, emotional) can be used to judge the movement's aesthetic virtue. The creative process here is: generate in one modality, map to another, test in a third. This suggests that there are two distinct types of embodied cognition at play: • using the body as a medium to think in: dancers don't think in words; they think physically, through their bodily form; • using sensory systems as non-propositional systems to think in: dancers don't think in words or propositions, but in visual, tactile or somato-sensory forms. Although this makes it sound as if embodied cognition is a continuous or analog process relying on a body's elasticity or a sensory system's simulation capacity, it is important to recognize that these thinking processes are still representational. They are representational, but they are so tied to the properties of the underlying medium (muscle, tendon and bone, body control mechanisms, sensory modalities and sensory simulators) that the cost of embodying the representation is significant. The cost of creating a representation or simulation, of sustaining it and of transforming it depends on the cost structure of the neural system implementing the representation. Properties of the underlying neural system show through. Accordingly, for a given person it might be easier to run imagery in his or her visual system than in the somato-sensory system; for another person it might be easier to run somato-sensory imagery. It is also likely that differences in the ease of generating modality-specific imagery will depend on the content of the image. It is easy to visually imagine entering a sphere through a small opening, but harder to imagine what it might feel like to enter a self-healing sphere, where you use your hands to open a hole and then step in and seal it up. This idea, that the cost structure of cognition changes with the medium of cognition, is central to our approach to creativity. In non-propositional systems, where the structures to be created are not interpreted as being true or false, an ‘idea' can be shifted around by moving it from code to code, system to system, each system making it possible to discover different things.
Thus, in architecture, a domain rife with image-based representations, an idea that starts as a sketch on paper, where certain issues are worked out, may be transformed when the architect tries to model the sketch in three dimensions in foamcore or wood. Each medium teaches the architect something different. Sometimes you have to encode an idea in a different form or medium to appreciate its strengths or weaknesses most clearly. The same applies to music. A piece of music that sounds one way when played on a violin may sound quite another way when played on a tuba. Each instrument may stimulate the composer to notice new aspects of his original ‘germ' idea, to derive new associations, or to ‘infer' new ideas. Each encoding is situated in a different energy landscape of closeness. Ironically, the special power of embodied thinking in dance, then, is the power of representation everywhere. If an ‘idea' can be encoded in one representational system easily, or worked out easily there, it can then be translated into another representational system where it might have been difficult to discover initially. Once encoded in that new representational system, though, it has a form that carries new possibilities and makes it easier to discover new connections. A problem stated in geometry may be hard to solve in classical geometric representations, but once translated into an algebraic representation it may be easy; once solved algebraically, it can be translated back to geometry. This is the huge power of representational systems. Each representational system operates with its own metric of inferential distance: two ideas that are close in one may be distant in another, and vice versa. A graphical account of this basic idea is shown in Figure 1, an energy landscape showing the attractor space of a game like Scrabble. In a simple experiment designed to test the value of moving between different representation systems, we compared the performance of subjects who were allowed to move Scrabble tiles with that of subjects whose tiles were fixed. Because of differences in the way people manipulate letters mentally and letters (tiles) physically, they are likely to stumble on good sequences by simple physical rearrangement that would be hard for them to find mentally. Mental rearrangements follow least-energy paths in a lexical or phonological landscape, while physical rearrangement is sensitive to how easy or hard it is to move the tiles. The state spaces are different, and hence the trajectories through those spaces will be different. Figure 1. Energy landscape for phonological and lexical search. Thus, because ‘letters' that are physically close are often different from those that are phonologically or lexically close, it is probable that physically moving tiles will occasionally stimulate new ideas. During the movement phase there will be moments when phonologically implausible sequences are visually present and therefore considered momentarily. This potentially increases the number of combinations reviewed. In a like way, bodies, sensory systems and artifacts each constrain different energy landscapes of possibilities. The trick is to know how to harness their comparative virtues. Sensory simulators To explore the idea that sensory simulation can be used as a filter on goodness, we need some definitions. Sensory systems operate with a sensory code. The code need not be symbolic; it need only be able to encode different states. It may be an analog code.
Having a code makes it possible to talk of a sensory system having an expressive power - its full state space - and to talk about trajectories through this state space. Assume, further, that sensory states can be classified into equivalence classes, such as those associated with a smell, taste, visual shape, body feeling, or movement; and also that there are contingency tables specifying probability measures between these equivalence classes. Regularities in experience have trained our sensory systems to ‘expect' certain pathways. These pathways become primed whenever states that lead to them are activated. Because our senses encode different aspects of the world, each is informative, and contains bits of information the others do not. Hence each sensory system supports different priming pathways. Events that seem ‘natural' or obvious in one sensory system may seem unnatural or completely unobvious in another. We can think of this by analogy with numerical representational systems. To decide whether the number 30,163 is divisible-by-7 takes some computation. In base 7, however, 30,163 is represented as 153,640, and here it is completely obvious that it is divisible-by-7, just as it is obvious that 97,230 in base 10 is divisible-by-10. It is transparent. See (10, 11). In the somato-sensory system, a dancer may immediately recognize graceful movements. What feels graceful, however, may not always look graceful, since the encoding of a movement in the visual system is so different from its somato-sensory encoding. This is even more obvious when we consider impossible movements. What the motor system deems impossible may be quite different from what the visual system deems impossible. One potentially interesting consequence of this account is that it explains how humans can think non-propositionally. They think in their sensory systems. They simulate outcomes, and they control the simulation process in non-propositional thought much the way that they control propositional thought by controlling auditory images of linguistic elements. Moreover, because of the different encoding properties of sensory systems, dancers are able to reach ‘conclusions' in some sensory systems that are hard to reach in others. It is sometimes easier to think in one modality than another. I believe that when a dancer visualizes an object - say a reptile slithering around a chair - and then transforms the visual experience into a movement, they are first trying to draw creative insight from a visual solution before moving to a bodily solution. They visually imagine themselves slithering before feeling themselves moving and then finally moving. They transform between sensory media.

Multi-modal translation

The choreographer relies heavily on this sort of modality translation to stimulate movement ideas in his dancers. He does this in two ways. First, he personally uses a broad range of modalities to communicate with his dancers - modalities to direct or guide them. Second, he assigns them ‘choreographic' tasks that require imagining scenarios or processes and then translating these into interesting movement. We have already discussed sonification as a vehicle for shaping movement. We observed WM sometimes ‘saying' things like "Yah ooh ehh" to communicate the shape of a movement. He used sound to shape dynamic form or perhaps to communicate feeling or attitude.
The choreographer also uses tactile and kinesthetic imagery as a creative stimulus, either by touching the dancers and then asking them to draw tactile or kinesthetic inferences from the dynamics of his touch, or by speaking to them, and assigning each a cognitive task that requires them to recruit their tactile and kinesthetic imagery abilities. This is an instance of the second method, the general technique of inventing new shape through cross-modality problem solving (12). Here is another example of that.

Figure 2. This is the bell shape a dancer told us she was imagining herself to be moving. She said it was very heavy.

In one task we observed, a dancer conjured the visual image of a massive bell gonging. She then transformed that moving image into a new structure, the kinetic feel of moving body parts as if those parts were connected to the heavy bell, or perhaps the feeling of rocking the bell. See Figure 2, where we show a snapshot of a video we annotated based on an interview with a dancer. The dancer seems to be comparing the feel of the body movement to the visual or perhaps conceptual structure of wrapping one's hands and legs around a heavy bell and moving it, honoring its inertia. This is interesting for what it shows us about using visualization to unleash individual creativity.

Conclusion

I have briefly reviewed some methods a noted choreographer uses with his contemporary dance company to reliably generate novel dance phrases. Choreography is a revealing domain to study creativity because the process often lasts over many weeks and requires both choreographer and dancer to generate countless candidate ideas, then select and refine them. We found by careful ethnographic analysis that WM relies heavily on modality translation as a generative technique. He often assigns his dancers tasks that require them to imagine what something feels like kinesthetically, or to imagine what something would look like, or smells like, or feels like in an emotional sense, then to translate this to movement. At other times he communicates with his dancers non-linguistically and relies on their ability to translate his gestures, touches, sounds or sights to movement-relevant forms. I argued that this is a successful method for creativity because it harnesses the power of multiple representation systems.

Acknowledgements

I gratefully acknowledge the help of Dafne Muntanyola with the organization of this study and her thoughtful comments throughout. Funding for this work is from the NSF CreativeIT program, grant IIS-1002736.

2011_3 !2011 Autonomously Creating Quality Images David Norton, Derrall Heath and Dan Ventura Computer Science Department Brigham Young University Provo, UT 84602 USA dnorton@byu.edu, dheath@byu.edu, ventura@cs.byu.edu

Abstract

Creativity is an important part of human intelligence, and it is difficult to quantify (or even qualify) creativity in an intelligent system. Recently it has been suggested that quality, novelty, and typicality are essential properties of a creative system. We describe and demonstrate a computational system (called DARCI) that is designed to eventually produce images in a creative manner. In this paper, we focus on quality and show, through experimentation and statistical analysis, that DARCI is beginning to be able to produce images with quality comparable to those produced by humans.
Introduction

DARCI (Digital Artist Communicating Intention) is a computer system designed to eventually create visual art in order to convey intention and meaning to the viewer. Currently, DARCI can automatically render a given image to match an accompanying list of adjectives. This ability is the foundation of a visual language for DARCI to communicate with an audience—an important element of creative expression in the visual arts. DARCI is part of ongoing research that is exploring the perception of creativity in an artificial system. Measuring creativity both quantitatively and qualitatively is a difficult challenge. Ritchie describes quality, novelty, and typicality as being essential in ascribing creativity to a system (2007). Ritchie defines quality as the extent to which the artefact is a high-quality example of its genre. In this paper, we focus on quality, and show that DARCI is beginning to be able to produce quality artefacts comparable to human artists given the same resources. DARCI's design has two main components: the image appreciation component, and the image creation component. The image appreciation component is designed to allow DARCI to learn to evaluate its own artwork according to various descriptive words. This ability to assess these qualities in an image guides the image creation component. The image creation component uses evolutionary mechanisms to create artefacts, and the appreciation component serves as part of the fitness function. We briefly describe the main components of DARCI and how they work together to produce artefacts. We then present several images that DARCI has created and describe an experiment in which we compare DARCI's images with ones made by humans. Finally, we discuss how the results show that DARCI is becoming comparable to humans in producing quality artefacts.

Image Appreciation

It has been argued that the ability to appreciate and evaluate its own artefacts is necessary for a system to be considered creative (Colton 2008). In order for DARCI to appreciate art, it must first acquire some basic understanding of art. For example, in order for DARCI to appreciate an image that is dark and gloomy, DARCI must first understand the concepts dark and gloomy. To do this, DARCI must learn to associate images with artistic descriptions.

Image Features

Before DARCI can form associations between images and descriptive words, appropriate image features for the task must be extracted from the image. Significant research has been done in the area of image feature extraction (Gevers and Smeulders 2000; Datta et al. 2006; Li and Chen 2009; Wang, Yu, and Jiang 2006; King; Wang and He 2008), and we have culled 102 image features from this. These are low-level features that can be coarsely classified as treating one of the following image characteristics: color, light, texture, and shape.

Artistic Descriptions

As an initial step, the artistic descriptions that DARCI can learn are limited to lists of adjectives. We use WordNet's (Fellbaum 1998) database of adjectives to give us a large, yet finite, set of descriptive labels. In WordNet, each word belongs to a synset of one or more words that share the same meaning. If a word has multiple meanings, then it can be found in multiple synsets. To collect training data, we have created a public website for training DARCI (http://darci.cs.byu.edu). From this website, users are presented with a random image and asked to provide adjectives that describe the image.
Additionally, for each image presented to the user, DARCI lists seven adjectives that it associates with the image. The user is allowed to flag those labels that are not accurate. This creates strictly negative examples of those synsets, which is important for learning. Another program for creatively generating visual art, NodeBox, is also dependent on semantic networks such as WordNet. The NodeBox project takes the use of semantic networks even further by using a more elaborate database they created called "Perception" (De Smedt, De Bleser, and Nijs 2010). However, unlike DARCI, NodeBox does not have a strong learning component. In the future, we hope to expand DARCI by using more sophisticated semantic networks, perhaps even "Perception" itself.

Learning Method

In order to make the association between image features and synsets, we use a collection of artificial neural networks (ANNs) that we call appreciation networks. There is an appreciation network for each synset that has a sufficient amount of training data. As we incrementally accumulate more data, new neural networks can be dynamically added to the collection to accommodate the new synsets. Currently, there are 211 appreciation networks. This means that DARCI essentially "knows" 211 synsets. For more details on our learning method, image features, and use of synsets, the reader is referred to earlier work describing DARCI (Norton, Heath, and Ventura 2010).

Image Creation

DARCI uses an evolutionary mechanism to render images according to given synsets, and this mechanism operates in two modes. The initial mode, which we call practice mode, operates by exploring the space of image filters that will render any image according to a single specific synset. For this mode, DARCI creates and maintains a separate gene pool for each synset that the system knows. The second mode, called commission mode, operates by exploring the space of image filters that will render a specific image according to a specified list of synsets. There is no restriction on synset combinations; in fact, incoherent combinations can produce unexpected and interesting results, as we will demonstrate later. For commission mode, users prescribe the image and list of synsets that they wish DARCI to render—in other words, they "commission" DARCI. For each commission, DARCI creates a unique gene pool that terminates once the commission is complete. For both modes, the evolutionary mechanism functions as follows. The genotypes that comprise each gene pool are lists of filters, and their accompanying parameters, for processing an image. Many of these filters are similar to those found in Adobe Photoshop and other image editing software. Others come from a series of 1000 filters Simon Colton discovered using his own evolutionary mechanism (Colton et al. 2010). Colton's set of filters, called Filter Feast, is divided into categories of aesthetic effect that were discovered by exploring combinations of very basic filters within a tree structure. We have treated Colton's filters as if each category were a unique filter with a single parameter that specifies the specific filter within the category to use. Figure 1 gives an example of a genotype and its effect on a sample image. There are a total of sixty-one traditional filters that we selected for DARCI to use and a total of thirty-one categories of filters from Filter Feast, making ninety-two filters available for each genotype.
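To make the genotype structure concrete, the following is a minimal Python sketch of a genotype as an ordered list of (filter name, parameter list) pairs applied in sequence. The filter names, their toy implementations, and the flat-list pixel representation are hypothetical stand-ins; the paper does not publish DARCI's actual filter code.

import random

# Registry of available filters: name -> (function, number of parameters).
# Real DARCI has ninety-two filters; two invented stand-ins are shown here.
def ripple(pixels, params):
    # Placeholder "filter": shifts pixel values by a parameter-scaled amount.
    return [min(1.0, max(0.0, p + 0.1 * params[0])) for p in pixels]

def weave(pixels, params):
    # Placeholder "filter": blends each pixel with its neighbour.
    w = params[0]
    return [(1 - w) * p + w * pixels[(i + 1) % len(pixels)]
            for i, p in enumerate(pixels)]

FILTERS = {"Ripple": (ripple, 2), "Weave": (weave, 1)}

def random_genotype(min_len=2, max_len=4):
    """Build a genotype: an ordered list of (filter name, parameter list)."""
    genotype = []
    for _ in range(random.randint(min_len, max_len)):
        name = random.choice(list(FILTERS))
        _, n_params = FILTERS[name]
        genotype.append((name, [random.random() for _ in range(n_params)]))
    return genotype

def render(genotype, source_pixels):
    """Phenotype = the image after applying each filter in sequence."""
    image = list(source_pixels)
    for name, params in genotype:
        image = FILTERS[name][0](image, params)
    return image

if __name__ == "__main__":
    g = random_genotype()
    print(g)
    print(render(g, [0.2, 0.5, 0.8, 0.1]))

The initial genotype length of 2 to 4 filters matches the parameter settings reported in Table 1 below.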
We selected traditional filters that were easily accessible, diverse, fast, and that didn't incorporate alpha values (since our feature extraction techniques cannot yet process alpha values).

Figure 1: Sample genotype (list of image filters with parameters) and its effect on an image. "Ripple" and "Weave" are the names of two (of ninety-two) possible filters.

The fitness function for the evolutionary mechanism can be expressed by the following equation:

    Fitness(g) = λ_A·A(g) + λ_I·I(g)    (1)

where g is an image artefact and A : G → [0, 1] and I : G → [0, 1] are two metrics: appreciation and interest. These compute a real-valued score for an image artefact (here, G represents the set of all image artefacts). λ_A + λ_I = 1, and for now, λ_A = λ_I = 0.5. Both metrics used in the fitness function are applied to the phenotype (the image that results when each genotype is applied to a source image). The fitness of every phenotype within a generation of the evolutionary mechanism is determined using the same source image, but the source image used from generation to generation depends upon which mode the system uses. In commission mode, the source image is the same from generation to generation, while in practice mode the source image for each generation is randomly selected from DARCI's growing image database. The appreciation metric A is computed as the (weighted) sum of the output(s) of the appropriate appreciation network(s), producing a single (normalized) value:

    A(g) = Σ_{w∈C} α_w·net_w(g)    (2)

where C is the set of synsets to be portrayed, net_w(·) is the output of the appreciation network for synset w, Σ_w α_w = 1, and α_w = 1/|C| (though this can, of course, be changed to weight synsets unequally). The interest metric I penalizes phenotypes that are either too different from the source image, or are too similar. This metric is useful for producing images that meet our definition of imaginative; however, the interest metric is currently too simplistic to do more than prevent extreme cases. The metric begins by tallying the number, n, of image analysis features that have similar values between the two images (i.e. that fall within a specified distance of each other). This can be expressed with the following equation:

    n = Σ_i ⌈0.3 − |F_i^S − F_i^P|⌉    (3)

where F_i^S represents feature i of the source image and F_i^P represents feature i of the phenotype. Note that all features are normalized to the range [0...1], so the ceiling function above returns either 0 or 1. The value 0.3 was chosen empirically. The interest metric is calculated using n as follows:

    I(g) = 1 − { (τ_d − n) / τ_d          if n < τ_d
               { (n − τ_s) / (|F| − τ_s)   if n > τ_s    (4)
               { 0                         if τ_d ≤ n ≤ τ_s

τ_d and τ_s are constants that correspond to the thresholds for determining, respectively, when a phenotype is too different from or too similar to the source image. The values τ_d = 20 and τ_s = 57 were used here. |F| is the total number of features analyzed, in our case 102.

    Number of Sub-Populations     8
    Size of Sub-Populations       15
    Crossover Rate                0.4
    Filter Mutation Rate          0.03
    Parameter Mutation Rate       0.1
    Migration Rate                0.2
    Migration Frequency           0.1
    Tournament Selection Rate     0.75
    Initial Genotype Length       2 to 4 filters

Table 1: Parameters used for the evolutionary mechanism.

Fitness-based tournament selection determines those genotypes that propagate to the next generation and those genotypes that participate in crossover. One-point "cut and splice" crossover is used to allow for variable-length offspring.
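Before turning to the genetic operators, equations (1)–(4) can be summarized in a short sketch. The Python below is a reading aid under stated assumptions (102 features normalised to [0, 1], appreciation networks stubbed out as plain callables), not DARCI's published implementation.

import math

NUM_FEATURES = 102
TAU_D, TAU_S = 20, 57          # too-different / too-similar thresholds

def appreciation(phenotype_features, synsets, networks):
    # Equation (2): A(g) = sum over w in C of alpha_w * net_w(g), alpha_w = 1/|C|.
    alpha = 1.0 / len(synsets)
    return sum(alpha * networks[w](phenotype_features) for w in synsets)

def interest(source_features, phenotype_features):
    # Equation (3): count features within 0.3 of the source image's value.
    n = sum(math.ceil(0.3 - abs(s - p))
            for s, p in zip(source_features, phenotype_features))
    # Equation (4): penalise phenotypes too different or too similar.
    if n < TAU_D:
        return 1.0 - (TAU_D - n) / TAU_D
    if n > TAU_S:
        return 1.0 - (n - TAU_S) / (NUM_FEATURES - TAU_S)
    return 1.0

def fitness(source_features, phenotype_features, synsets, networks,
            lam_a=0.5, lam_i=0.5):
    # Equation (1): weighted sum of appreciation and interest.
    return (lam_a * appreciation(phenotype_features, synsets, networks)
            + lam_i * interest(source_features, phenotype_features))

if __name__ == "__main__":
    nets = {"fiery": lambda feats: 0.8}   # stub appreciation network
    src = [0.5] * NUM_FEATURES
    phen = [0.5 + 0.002 * i for i in range(NUM_FEATURES)]
    print(fitness(src, phen, ["fiery"], nets))   # penalised as too similar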
Crossover is accomplished in two stages: the first occurs at the filter level, so that the two genomes swap an integer number of filters; the second occurs at the parameter level, so that filters on either side of the cut point swap an integer number of parameters. By necessity, parameter list length is preserved for each filter. Table 1 shows the parameter settings used. Mutation also occurs at two levels. Filter mutation is a wholesale change of filter (discrete values), while parameter mutation is a change in parameter values for a filter (continuous values). When filter mutation occurs, either a single filter within a genotype changes or a new filter is added. When a parameter mutation occurs, anywhere from one to all of the parameters for a single filter in a genotype are changed. The degree of this change, ∆f_i, for each parameter, i, is determined by one of the following two equations, chosen randomly with equal probability:

    ∆f_i = (1 − f_i) · rand(0, ((|f| + 1) − |∆f|) / |f|)    (5)

    ∆f_i = −f_i · rand(0, ((|f| + 1) − |∆f|) / |f|)    (6)

Here, |f| is the total number of parameters in the mutating filter, |∆f| is the number of changing parameters in the mutating filter, and rand(x, y) is a function that uniformly selects a real value between x and y. Because there are potentially many ideal filter configurations for modeling any given synset, we have implemented sub-populations within each gene pool. This allows the evolutionary mechanism to converge to multiple solutions, all of which could be different and valid. The migration frequency controls the probability that a migration will occur at a given epoch, while the migration rate refers to the percentage of each sub-population that migrates. Migrating genomes are selected uniformly at random, with the exception that the most fit genotype per sub-population is not allowed to migrate. Migration destinations are also selected uniformly at random, except that sub-population size balancing is enforced. Practice gene pools are initialized with random genotypes, while commission gene pools are initialized with the most fit genotypes from the practice gene pools corresponding to the requested synsets. This allows commissions to become more efficient as DARCI practices known synsets. It also provides a mechanism for balancing permanence (artist memory) with growth (artistic progression).

Methods and Results

The evaluation of artefacts is very subjective, making an evaluation of DARCI non-trivial. Furthermore, the quality of the artefacts that DARCI produces can be judged based on two distinct criteria: how well the artefacts portray the synsets dictated by a commission, and how well the artefacts demonstrate artistic skill. Depending on the synsets in question, the first criterion can be considered less subjective than the second. For example, if the synset blue, as in the color blue, were chosen, the degree to which an artefact possesses the color blue could be measured quite objectively. As less simple/concrete synsets are applied, this criterion becomes increasingly subjective; however, we argue that it will never be more subjective than a general assessment of artistic merit. For this reason, we have chosen to focus on the first criterion of quality and relegate the second criterion to an interesting side note in this paper. Despite focusing on the first criterion of quality, we want to eventually move in the direction of artistic analysis of DARCI's artefacts.
Thus, we have selected three synsets that, while dictating some expected traits within an image, also prescribe subjective features within an image. The synsets we have selected are "fiery" as in like or suggestive of fire, "happy" as in enjoying or showing or marked by joy or pleasure, and "lonely" as in lacking companions or companionship. These synsets are well represented in DARCI's database and are distinct in meaning. Because there is always a subjective component in determining whether an image can be described by a given adjective, the most objective way that we can evaluate such quality is through a combination of many personal opinions. For this reason, we designed a survey in which people rank DARCI's artefacts, alongside several other artefacts, with respect to how well the images reflect particular adjectives. For this survey we selected three images on which to test DARCI's rendering of the aforementioned synsets. The images we selected are shown in Figure 2. We chose photographs in order to accentuate the impact of the non-photorealistic rendering tools available to DARCI.

(a) Image A (b) Image B (c) Image C
Figure 2: The three source images used to evaluate the quality of DARCI's artefacts.

The three photographs explore the light vs. dark, chromatic vs. monochromatic, and close vs. distant spectrums. For each photograph and for each synset we commissioned DARCI to produce an image that portrays the synset; we also commissioned DARCI to produce a variation of each image that portrays all three synsets simultaneously in order to demonstrate the effect of combining synsets with disjointed meaning (this results in a total of 3 × 3 + 3 = 12 images). For comparison, we collected three additional sets of 12 homologous images: a set chosen by us from a collection of images created by DARCI, a set commissioned to human artists, and a set chosen by us from a collection of randomly generated images. For the set created by DARCI, we allowed DARCI to practice the three synsets for eight hours apiece, and then gave the system sixteen hours to complete each commission. For every commission, DARCI chose the single image with the highest fitness as the result of the commission. In addition, for each commission, DARCI saved the top five unique images (those with the highest fitness) encountered within each sub-population, for a total of forty images. From these, we chose the single image we thought best portrayed the commission target synset(s). (We made this selection with no knowledge of DARCI's fitness values for the 40 images, and, in particular, we did not know which of the images DARCI ranked highest and selected as the result of its commission.) This image we selected represents a close cooperation between DARCI and DARCI's programmers—or, looked at another way, the use of DARCI as a tool rather than as an autonomous agent. A third set of images was created by human volunteer artists, who were restricted to a toolset similar to that used by DARCI (i.e. image filters) and were skilled with programs (e.g. Photoshop) using this toolset. Each image in the final set was chosen from a set of 40 randomly generated images, each of which was generated using 1–8 of the same filters available to DARCI. In order to ensure a reasonable image, and to provide a point of comparison between random filter generation and DARCI's evolutionary mechanism, we chose the one image (out of 40) that we thought best portrayed the synset in question.
In summary, we acquired four images for every synset-image combination. One was DARCI's most fit artefact (DARCI), one was our choice out of DARCI's top artefacts (Coop), one was produced by a human (Human), and one was our choice out of randomly filtered images (Best Random). Representative examples of some of the twelve synset-image combinations can be found in Figures 4-7.

Figure 3: The average ranking for each synset across images A, B, and C for each of the four artefact sources: DARCI, Human, Coop, and Best Random. "triple" refers to the artefacts rendered with all three synsets. These results were obtained from 42 volunteers. Lower rank is better.

                  Human    DARCI    Coop     Best Random
    Average Rank  2.4067   2.7282   2.3194   2.5456

Table 2: The average ranking of the four artefact sources across all image-synset combinations. These results were obtained from 42 volunteers. Lower rank is better.

In the online survey, volunteers were instructed to rank the four images for each synset-image combination according to how well they portrayed the synset(s) in question. In addition, we asked the volunteers to indicate which images they liked regardless of adjective compatibility. This additional question was added to stress to the volunteers the fact that the ranking was to be independent of personal preference for the images. We obtained a total of forty-two survey responses. The results of this survey are encapsulated in Figure 3 and Table 2. Figure 3 shows the average ranking for each synset across all three images for each of the four artefact sources just summarized: DARCI, Human, Coop, and Best Random (the lower the rank, the better). Table 2 shows the average ranking of each of the four artefact sources across all synsets and images. Table 3 shows which pairs of datapoints in the aforementioned figures are statistically significant—such pairs are denoted with an asterisk.

                        fiery    happy    lonely   triple   all synsets
    Human/DARCI         0.501    *        1.000    *        *
    Human/Coop          *        *        0.155    *        0.217
    Human/Best Random   0.286    *        0.326    0.339    0.0540
    DARCI/Coop          *        *        0.132    *        *
    DARCI/Best Random   0.691    *        0.298    *        *
    Coop/Best Random    *        1.000    0.640    *        *

    * p-value < 0.01

Table 3: Results of a t-test comparing all binary combinations of image sources for each synset. The "all synsets" column refers to Table 2. The other columns refer to Figure 3.

(a) DARCI (b) Best Random (c) Coop (d) Human
Figure 4: Image A rendered "fiery"—ranked left-to-right from most to least fiery.

(a) Human (b) Coop (c) DARCI (d) Best Random
Figure 5: Image A rendered "happy"—ranked left-to-right from most to least happy.

Discussion

By looking at Table 2, we see that DARCI functioning autonomously does perform the worst of the artefact sources, but not dramatically so. Furthermore, Table 3 indicates that human performance was not distinguishable, in the statistical significance sense, from the performance of DARCI in cooperation with humans nor from the performance of humans choosing the best random image. These results suggest that, overall, volunteers are not strongly preferring one artefact source over another. When looking at performance over individual synsets (Figure 3), we see that a more distinct preference is given to certain artefact sources over others. But, even in these cases, the source given preference varies from synset to synset. Looking at Figure 3, the clearest distinction between sources is between the human and autonomous DARCI when rendering "happy" images. In this case humans clearly outperform DARCI.
However, in the case of "fiery" images, DARCI performs statistically the same as humans. When in cooperation with humans, DARCI significantly outperforms solo humans in both "fiery" images and images combining all three synsets. In the case of "lonely" images, none of the artefact sources perform statistically different from one another. Volunteers prefer human creations for "happy" images and they prefer DARCI-human collaborations for both "fiery" images and images combining "fiery", "happy", and "lonely". If we look even more specifically at the individual synset-image pairs, we find that all artefact sources are top ranked for some of the pairings. Autonomous DARCI is top ranked for "fiery" image A and "lonely" image C; the best-of-random source is top ranked for "happy" image C, "lonely" image A, and "triple" image A; DARCI in cooperation with humans is ranked top for "fiery" image B, "fiery" image C, and "triple" image B; humans creating solo are ranked top for "happy" image A, "happy" image C, "lonely" image B, and "triple" image C. The rankings for the most substantial successes of each artefact source are shown in Figures 4-7.

(a) Best Random (b) Coop (c) Human (d) DARCI
Figure 6: Image B rendered "happy"—ranked left-to-right from most to least happy.

(a) Coop (b) Human (c) DARCI (d) Best Random
Figure 7: Image B rendered "fiery", "happy", and "lonely"—ranked left-to-right from most to least fiery, happy, and lonely.

While DARCI's solo artefacts often rank on par with human artefacts, the best random artefacts do as well. Furthermore, these partially random artefacts are sometimes ranked better than DARCI's. If these were totally randomly generated artefacts, then this would be an area of concern. It turns out, however, that given the number of random images from which we selected, it is fairly common to encounter at least one image that (at least to some extent) satisfies the demands of the synset in question. Taking into account Ritchie's proposal that the proportion of high quality artefacts produced should be correlated with creativity (Ritchie 2007), and observing DARCI's top forty artefacts, it becomes clear that DARCI is accomplishing something better than random image generation. Figure 8 shows the 40 images DARCI chose to save while rendering image A as "fiery", while, for comparison, Figure 9 shows the 40 random images generated for the same task. While we did not empirically determine the proportion of images in these sets that are "fiery", it is apparent that significantly more images are "fiery" in Figure 8 than in Figure 9.

Conclusions

If we assume that the human artists commissioned to produce artefacts for this research did indeed produce renderings that portray the synsets, then we conclude that, given the same toolset, DARCI can also produce renderings that portray them. This is a compulsory assumption since, by the nature of art, the only way DARCI can be evaluated as an artist is in comparison to other (human) artists. While on the whole, at this point, people tend to favor human solo works over DARCI's solo works, the differences are not substantial or consistent enough to warrant a different conclusion. Furthermore, the collaboration between DARCI and humans was frequently favored over human solo artefacts. This indicates the potential for DARCI to be used as a tool to augment the creative process of human artists. Only three synsets were tested in this experiment.
However, these synsets are representative of the meaning that we want DARCI to be able to incorporate into artefacts to facilitate visual communication with an audience.

Figure 8: The top five "fiery" renderings for all eight sub-populations discovered by DARCI for image A (not ordered by fitness).

DARCI has sufficient data, ergo sufficient appreciation, to perform similarly on many more synsets. We are currently updating DARCI so that the system can perform commissions online while using any known synsets. This will allow us to further observe DARCI's capacity for rendering. In future work regarding the evaluation of DARCI, we will be exploring Ritchie's other criteria for creativity: namely novelty and typicality. In addition, we will explore the artistic side of quality, rather than the strictly pragmatic one explored in this research (i.e. the degree to which synsets were incorporated into the artefacts).

Acknowledgments

Warm thanks to Simon Colton for providing us with Filter Feast image filters that were included in DARCI's toolset. This material is based upon work supported by the National Science Foundation under Grant No. IIS-0856089.

2011_30 !2011 Steps Toward the AIR Toolkit: An Approach to Modeling Social Identity Phenomena in Computational Media D. Fox Harrell, Ph.D., Greg Vargas, Rebecca Perry Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge, MA 02139 USA {fox.harrell, gvargas, rebperry}@mit.edu

Abstract

The Advanced Identity Representation (AIR) Project is a new interdisciplinary approach to the problem of designing identity technologies to enable imaginative self-representations for users by implementing dynamic social identity models grounded in computing and cognitive science. AIR Project research develops models of social computational identity (e.g., characters, avatars, and social networking profiles) to enable user representations that dynamically change in response to context and use, and to implement an identity modeling toolkit for constructing cross-application self-representations. This paper reports on the developing AIR Toolkit's support for modeling social identity phenomena in which single users deploy multiple self-representations (avatars, characters, or profiles) for different purposes.

Introduction

Computational media have transformed the creation and representation of human identities. Understanding identity representation as both a creative and a computational act can inform development of technologies to enhance how identities are enacted as social and technical practices, particularly in videogames and social networks. Human-centered computing researchers have tended to focus on issues such as user and task analyses, cooperation, and usability, e.g. in (Muramatsu and Ackerman 1998; Suchman 1987). In contrast, humanists and social scientists have often investigated identity-inflected issues such as power, class, stigma, racism, sexism, and related themes (Nakamura 2002, 2008; Nelson and Tu 2001; Waggoner 2009) - exposing identity as a dynamic, creative feat of self and social construction. Games studies scholar Zach Waggoner (2009) describes identity creation as an unfolding process of self-representation that takes place in the creative liminal space between the user and the videogame avatar - between the embodied materiality of the player and the imagination. Social scientist Sherry Turkle's (2004) studies of membership in multiple communities have revealed that users often experience a sense of "cycling through" different selves.
Expression of multiple selves is intrinsic to everyday human creativity. Indeed, in his seminal work Erving Goffman (1959) described a negotiation between the socially constructed, public performance of the self, and the desired inner self - a complex, creative social and imaginative act. Informed by such perspectives, we take the view here that creation and maintenance of computational identities is, in part, an active creative feat of imaginative cognition. Furthermore, social categories are often aspects of identity that are reified in computational systems. Hence, we focus on a cognitive science perspective on categorization that highlights its imaginative nature and basis in cognitive mechanisms for metaphorical and metonymic mapping. The Advanced Identity Representation (AIR) Project consists of developing new technologies informed by categorization and classification theories from cognitive science and sociology (Harrell 2009). We are developing a toolkit that can take data-structures for characters in games or profiles in social networks and use them to model social phenomena such as presenting oneself differently to different groups, becoming a member of a group, or passing as a member of another group. This is accomplished through performing operations such as finding analogically matching profile/character data-structures, adapting them to different social categories, forming new categories based on analogical relationships between individuals, revealing or simulating stereotypical categories at the data-structural level, and more. Hence, we address a computationally reified, reductive form of identity, but do so: (1) as a critical technical practice (Agre 1997) aware of the aspects of identity that are not computational, and (2) recognizing that this reduction has already taken place "in the wild" as users have built identities already encoded as data-structures.

Theoretical Framework

Technical Components of a Sociodata Ecology

Computational identity systems, e.g., social networking profiles, online accounts, and avatars/characters, are implemented using a limited and often overlapping set of components.

Figure 1: Shared technical underpinnings of computational identity applications.

There are two important motivations for describing these components: (1) identifying an appropriate level of abstraction for analyzing the technical side of computational representations comparatively across different types of applications, and (2) identifying components that can be analyzed both in terms of how they appear visually and how they are implemented algorithmically and data-structurally. Figure 1 describes the six components that comprise the majority of widely used computational identity technologies (Harrell 2009). This paper focuses on support for components at levels 4 and 5 (statistical/numerical representation and formal annotation). These underpinnings exist in a sociodata ecology (Harrell 2010), wherein technical infrastructure, data-structures and algorithms, and code are looked at as they relate to issues such as embodied experiences, subjective interpretations, power relationships, and cultural values.

Cognitive Model of Computational Identity

The AIR Project approach begins with the basic cognitive building blocks of identity upon which social identity categories are built. Cognitive scientists have proposed that human conceptual categories form "idealized cognitive models" (ICMs) upon which categories of objects in the world are built (Lakoff 1987).
Social networking sites explicitly group users into categories called "friends," while games may group users into categories called elves or half-orcs. These categories may also manifest implicitly; for example, Eric Gilbert and Karrie Karahalios's (2009) metric for "tie strength" determines "friendliness" between users evidenced through use of the system. Yet, most computational user categorizations invoke much less robust models. Technical infrastructures may implement (often incorrect) stigmatizing identity classification models (Bowker and Star 1999; Goguen 1997); indeed, some games feature data-structures instantiated with values where some races/genders are less intelligent than others. Cognitive science theory is presented below to provide models that can help explain how users project their identities onto their computational surrogates (Gee 2003).

Cognitive Categorization

The AIR approach is influenced by the prototype theory of Eleanor Rosch and work in categorization by George Lakoff (1987). Lakoff describes a metonymy/metaphor-based account of how imaginative extensions of "prototype effects" result in several phenomena of social identity categorization that have proven useful for the AIR Project:
• Representatives (prototypes): "best example" members of categories;
• Stereotypes: normal, but often misleading, category expectations;
• Ideals: culturally valued categories even if not typically encountered; and
• Salient Examples: memorable examples used to understand/create categories.
Since the AIR Project technology involves techniques to formalize and implement ICMs as computational data-structures, identity phenomena become amenable to algorithmic manipulation and experimentation.

Conceptual Blending and Multiple Selves

Learning scientist James Gee's concepts of the real, virtual, and projective identities in games provide a useful starting point for thinking about how embodied identity experiences and values in the real world intersect with the affordances and semiotic values of computational representations (Gee 2003). For Gee, player representations as projected identities manifest the ways that real player values are reconciled with values understood as associated with avatars. The AIR Project approach emphasizes projected identity (Corneliussen and Rettberg 2008). Using cognitive science terminology, this can be seen as metaphorically mapping ICMs (mental spaces) that humans have of themselves onto characters, or, to use terminology from Gilles Fauconnier and Mark Turner's (2002) conceptual blending theory, as selectively projecting aspects from conceptualizations of both a real identity and a virtual identity into a blended identity. Examples of blended identities include the venerable notion of double-consciousness, the dual awareness of a person from a marginalized or oppressed group's self-conception and the social stigma attributed to the social group (Du Bois 1903), and identity torque, the often psychologically painful experience of a person's self-conception differing from stigmatized perceptions reinforced by classification infrastructure (Bowker and Star 1999). The notion of blended identities is central here because it informs the idea that a single user can have multiple identities depending on the elements being projected.

Implementation and Findings

We have developed a model of multiple user identity data-structures and ways of displaying the contents of those data-structures via a GUI.
For example, a profile on the social networking site Facebook consists of structured data indicating friends, items a user likes, personal information (such as gender or location), etc.

Figure 2: A subgraph of a Facebook profile.

This can be represented as a graph in which items and attributes are nodes that are connected to users by relations such as ‘like' or ‘friend.' Some of these may also include numerical statistics such as integers for age (see Figure 2). In such a profile the number of friends and pages for a typical user may reach the hundreds or thousands, resulting in interesting graph structures to analyze. Similarly, for a character in a game (especially role-playing games in which character creation is a primary focus) a graph can be used to represent stats (numerical values for gameworld attributes like intelligence or dexterity), skills, race, class, gender, etc. (see Figure 3).

Figure 3: A subgraph of a role-playing game character.

Despite their differing structures, the similarities in these representations at the abstract data-structural level have allowed us to consider how multiple representations (or views on representations) can reflect identity phenomena from the real world such as self-presenting differently in different communities, attempting to "pass" as a member of another community, or being a central or marginal member of a community. In games, multiple representations can be used to implement phenomena such as critically modeling stereotyping (by making non-player characters uniformly respond to characters based on some subgraph of elements rather than the full graph), developing emergent profession/class models rather than top-down designations, and decoupling real-world racial, ethnic, and gender categories from game mechanics-oriented numerical statistics for combat and exploration of game worlds. Toward this end, our models support implementation of:
• Multiple Identities based upon:
o adding to, subtracting from, or reorganizing the graphs described above; this can be used to automatically customize a user's profile/character, or view of a profile/character, based upon who the profile/character is presented to
o users explicitly creating multiple profiles (or views of a single profile/character) based on privacy settings or membership in different groups
• Identity Categories emerging from finding clusters of users with analogous graphs
• Prototypical Members of categories based upon maximizing analogy with other users
• Critical Attributes, which are profile/character attributes that are most telling in revealing analogy with other users
It is not clear that only manipulating these data-structures provides the necessary affordances for modeling real-world identity experiences adequately. Further development may require augmenting these structures with metadata indicating salience of particular attributes or additional attributes. It will also require study of how users take up and deploy the data-structures beyond technical affordances of the systems (e.g., chatting in virtual worlds or flat text descriptions of characters in games). However, our model does introduce an extensible set of features to allow system designers to implement the semantics of social identity phenomena rather than hardcoding in racism as social critique (as in the game Dragon Age's portrayal of racism against elves) or simplistic models of group membership such as the opt-in/opt-out model in Facebook.
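To make the graph representation above concrete, here is a minimal Python sketch. The profile data is entirely invented for illustration; each profile is encoded as a set of (relation, target) edges, and a simple cosine overlap score stands in for the toolkit's actual analogy machinery (described in the next section).

from math import sqrt

def profile_graph(triples):
    """A graph as a set of (relation, target) edges from one user node."""
    return set(triples)

# Hypothetical Facebook-style profile: 'like' and 'friend' relations.
user_a = profile_graph([("friend", "UserX"), ("like", "Jazz"),
                        ("like", "Chess"), ("gender", "female")])
# Hypothetical role-playing game character: class, race and stats.
user_b = profile_graph([("class", "Ranger"), ("race", "Elf"),
                        ("like", "Chess"), ("stat:dexterity", 17)])

def similarity(g1, g2):
    """Cosine similarity over shared edges; 1.0 means identical graphs."""
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / (sqrt(len(g1)) * sqrt(len(g2)))

def project(g1, g2):
    """A filtered view of g1: only the edges also present in g2."""
    return g1 & g2

print(similarity(user_a, user_b))   # weak analogy via the shared 'like'
print(project(user_a, user_b))      # {('like', 'Chess')}

Because both the social networking profile and the game character reduce to the same edge-set form, the same comparison and projection operations apply to either, which is the point of working at this level of abstraction.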
In future AIR Project development, phenomena such as stereotyping, marginalization, naturalizing in communities, and stigmatization will be addressed.

Technology Development

There have been two main thrusts of technology development. These are:
(1) AIR Toolkit Development, and
(2) Application Development and Deployment (assessing popular software systems to use the AIR Toolkit with and deploying the toolkit in those systems).
Regarding (1), we are currently developing an interface, implemented in Python, capable of comparing and adapting user profiles. This interface is agnostic toward applications (it can be applied to games and social networking applications alike) and is agnostic toward algorithms used for comparing users. Initially, comparison is being done using a system called AnalogySpace developed by the Commonsense Reasoning research group led by Henry Lieberman at the MIT Media Lab (Speer, Havasi, and Lieberman 2008). We also have been considering using the Structure Mapping Engine developed by Ken Forbus, Dedre Gentner, Ron Ferguson, and others at Northwestern University (Ferguson, Forbus, and Gentner 1997; Forbus 2001; Gentner 1983). Finally, we also have considered using a matching algorithm developed in (Chow and Harrell 2009; Harrell 2010). Aside from potentially varying in effectiveness, these different approaches require differing amounts of background knowledge and may be more or less useful for particular applications. Regarding (2), we have deployed the toolkit to implement multiple identity representations, categories, and comparisons in Facebook. Before selecting Facebook for our initial deployment, we assessed popular systems used in both social networking and gaming in order to determine which would be optimal for initially testing the system.

Toolkit API

We are designing an API for the basic functionality of the toolkit. The current AIR Toolkit iteration uses Facebook's Graph API to download information about the user and his/her friends, including profile information, friends, and likes. The toolkit then creates a large, sparse n × n matrix and performs a truncated Singular Value Decomposition (SVD) using the Divisi library from AnalogySpace. It offers functions for the following purposes (using the term "object" to refer to a profile or character structure):
• Finding Similar Objects: The truncated SVD approximates dot products between each pair of objects. These approximated dot products are used as a similarity metric, and the toolkit can return the objects most similar to a given object.
• Predicting Features: The truncated SVD has a "smoothing" effect on the values in the matrix in a way that makes it useful for making inferences. The toolkit can use this to calculate the likelihood of a particular feature belonging to an object, whether or not it was represented in the original graph, as well as return the top predictions.
• Projecting one object onto another: The toolkit can return a filtered view of a particular object filtered by the predictions of another object. We shall discuss more of the potential uses of such a tool later.
• Creating Categories: The toolkit allows for the manual creation of a category by choosing initial seed objects, averaging the objects' feature vectors and then suggesting other objects to be included in the categories, as well as predicting important features for the category.
• Creating and Inserting Objects into the Graph: The toolkit also allows the creation and insertion of new objects into the graph.
This could be useful for creating prototypical objects and examining their relation to other objects, or for experimenting with the graph structure and seeing the changes it causes.

The first use of this API is a web interface for exploring a user's Facebook graph with the toolkit. We wrote a program that authenticates a Facebook user and downloads metadata from the user's profile as well as their "likes," then does the same for each of the user's friends. The web interface we created downloads this information and converts it to the graph structure that the toolkit can read. The website then provides an interface structured like a read-only social network site focused on exploring the user's network and examining other profiles. One key feature of this site is that it can allow the user to view other users' profile data based on their relationship to her/his own. That is, when a user visits a friend's profile, the user could see only the connections that they share or that the system thinks they should share (see Figure 4).

Figure 4: User2547 filtered to show only the links predicted to be present in User 6366's graph.

Figure 5: The interface allows the selection of groups of users (objects) to create categories based upon analogy between the users, find key features of those categories, and find other possible members of the category.

The interface enables exploration of basic toolkit functions such as comparing profiles, calculating predictions, adding profiles, and creating categories (see Figure 5).

Model and Toolkit Development

The AIR Toolkit is still under development and we hope to continue to implement mechanisms that allow those using the toolkit to represent the types of identity phenomena discussed above. Extensions to the models developed will consist of refining and extending techniques to implement a small subset of cognitive and social identity phenomena in software, initially addressing torque, metonymic category models, marginalization, markedness, naturalization, and category gradience. In addition to that work, we will add support for implementing modular graphical user-representations for users. Currently, our toolkit is limited to altering textual and semantic representations. Adding functionality for examining and altering graphical representations is potentially a more difficult problem, but would be helpful in systems that place an emphasis on avatars or other graphical models. With the progress made on the toolkit, it is possible to prototype further applications that take advantage of the models we have discussed. Examples might include:
• a social networking GUI for changing a user's self-representation for different social groups, as opposed to cumbersome alteration of privacy settings,
• integrated networking/gaming applications allowing social networking information to influence play style and vice versa,
• a system modeling the phenomenon of "passing" as a member of a different social group to facilitate a learner's transition from a novice to an expert member of a group,
• a social networking system supporting the ability to swap between multiple identities, perhaps based on the user's perception of which identities would be empowering, stigmatizing, or challenging in a given context, and more.

Evaluation

It will be important to assess whether or not users feel that our AIR Project systems are more empowering than current systems and if they can be used to minimize stigma built into identity representation structures.
Though this assessment has not been completed yet, sufficient development work has been done so as to warrant reporting. We also have been developing methods to pursue such assessments. In the spring of 2010, Harrell conducted a pilot study for the AIR Project with four female participants and two researchers. The subjects, who were novice computer users, engaged in identity creation via the manipulation of character creation systems in three game systems: The Sims, the Nintendo Mii Channel, and the game Elder Scrolls IV: Oblivion. As the subjects engaged in character creation, semi-structured clinical interviews were conducted regarding the character creation process and the relationship of the characters created to a range of identity issues, after users were first prompted to describe their creations "in their own words." The dialogue was captured via digital video and the sessions were screen captured, comprising raw data for analyses to be presented elsewhere. The dialogue captured in these files is being transcribed and will serve as the basis for crafting an empirical instrument for evaluating AIR systems, as well as for assessing whether users feel that these well-known games are adequately expressive. Transcripts and videos will be analyzed using grounded theory techniques (Glaser 1992; Strauss 1987), a well-known method of qualitative analysis.

Open Questions and Concluding Reflections

While we have made a good start with the preliminary framing and ongoing development of the AIR Toolkit, a number of interesting open questions remain. In particular, given our reliance on cognitive accounts of metaphor and analogy, we have been influenced by the critique of Chalmers et al. (Chalmers, French, and Hofstadter 1992) regarding computational approaches to the same, as they assert:

How are these data put into the correct form for the representation? Even if we have determined precisely which data are relevant, and we have determined the desired framework for the representation—a frame-based representation, for instance—we still face the problem of organizing the data into the representational form in a useful way.

The AIR Project takes heed of this concern; however, it asks a reciprocal question: how can one design idealized logical forms amenable to our algorithmic techniques and useful for modeling the social phenomena we are interested in? We see design of such ontologies as a creative problem requiring human judgment and do not intend the ontologies to be models of the real world. Rather, they are users' own expressive self-representations or subjective ontologies. Identifying methods to reduce identities to abstract data types is both a non-trivial problem and a double-edged sword, potentially both facilitating and hindering analysis of the data. Can these data types be effectively optimized for use with analogical reasoning systems like AnalogySpace or SME? We will need to further develop our rationale for adopting particular analogy systems, and the basis for our belief in their usefulness and validity. Another open question considers the relationships between OS-level and application-level GUIs. Turkle describes users toggling between online identities, arguing that this comprises a type of conversation between different identities, which enables a fluid, decentered, fragmented self to be deployed across different domains in creative and sometimes unexpected ways (Turkle 1995).
The experience she describes is linked to interactions with computer graphical user interfaces (GUIs) rather than specific applications. The AIR Project model will explore analytic methods and tools to identify and facilitate these changing presentations of self at either level. Finally, the core motivating observation for the AIR Project is that identity is a feat of imaginative cognition. Social categories are often reified in software systems; cognitive science theories have suggested that such categories are not objective, but are unconscious and based in metaphorical thought. Humans have great power in determining and shifting the meanings of our categories - the AIR Project is a modest step toward doing so in software.

Acknowledgments

We gratefully acknowledge the National Science Foundation's support provided by CAREER Award #0952896. We also thank Henry Lieberman, Catherine Havasi, Jason Alonso and others from the MIT Commonsense Reasoning Group.

2011_4 !2011 Knowledge-Level Creativity in Game Design Adam M. Smith and Michael Mateas Expressive Intelligence Studio University of California, Santa Cruz {amsmith,michaelm}@soe.ucsc.edu

Abstract

Drawing on inspirations outside of traditional computational creativity domains, we describe a theoretical explanation of creativity in game design as a knowledge-seeking process. This process, based on the practices of human game designers and an extended analogy with creativity in science, is amenable to computational realization in the form of a discovery system. Further, the model of creativity it entails, creativity as the rational pursuit of curiosity, suggests a new perspective on existing artifact generation challenges and prompts a new mode of evaluation for creative agents (both human and machine).

Introduction

Paintings (Colton 2008), melodies (Cope 2005), and poems (Hartman 1996) are familiar domains for artifact generation in computational creativity (CC), and much established theory in the field is focused on evaluating such artifacts and the systems that produce them. In this paper, we draw inspiration for a new understanding of creativity from the less familiar (but no less creative) domain of game design. In its full generality, game design overlaps visual art, music, and other areas where there are many existing results, but where it stands apart is in its unavoidably deep, active interaction with the audience: in gameplay. Crafting gameplay is the central focus of game design (Fullerton 2008). Play, however, is not an artifact to be generated directly. Instead, it is a result that emerges from the design of the formal rule system at the core of every game (Salen and Zimmerman 2004, chapter 12), a machine driven by external player actions. Where, in visual art, we might judge the creativity (as novelty and value) of an artifact on the basis of the work's similarity to known pieces and its affective qualities (Pease, Winterstein and Colton 2001), it is not so easy to make direct statements about the properties of the artifacts in game design. Desirable games are celebrated for their innovative gameplay or the fun experiences they enable - these are properties of the artifact's interaction with the audience, not of the artifact itself. The focus on predominantly passive artifacts in CC, those which can be appreciated via direct inspection rather than through interactive execution, has masked what is obvious in game design: that the desirability of artifacts is in their relationship to their environment.
Armed with such an understanding, we seek a theoretical explanation of creativity in game design - not the engineering application of established design knowledge, but the rarer experimentation that realizes new forms of gameplay and original player experiences. This theory should speak to both the artifacts and processes of game design, and do so in a way that meaningfully explains game design as done by humans as well as by computational means. Towards capturing the richness of existing human design activity, we are most interested in a theory of transformational creativity (Boden 2004) that explains how designers build new conceptual spaces of game designs and reshape them in response to feedback from experiences observing play. We introduce a new theoretical model, amenable to computational realization, which describes creative game design as a knowledge-seeking process (a kind of active learning). Our broader contribution, creativity as the rational pursuit of curiosity, can provide an explanation of and suggest new questions for applications in traditional CC artifact generation domains. In the following sections we will review established game design practices, draw an analogy between game design and scientific discovery, review and apply Newell's concept of the knowledge level, and then introduce our model of creativity. Finally we will conclude with a discussion of the implications of this theory for game design and the larger CC context. Game Design Practices In a standard text, Salen and Zimmerman (2004, p. 168) introduce the "second-order" problem of game design bluntly: "The goal of game design is meaningful play, but play is something that emerges from the functioning of the rules. As a game designer, you can never directly design play. You can only design the rules that give rise to it. Game designers create experience, but only indirectly." Play includes the objective choices made by a player and the conditions achieved in the game, along with the player's subjective reactions and expectations. At this point, it is straightforward to adopt the first tenet of our theory of creative game design: game designers are really designers of play. The idea of adopting an iterative, "playcentric" (Fullerton 2008) design process, in which games are continually tested to better understand their emergent (play) properties, is corroborated by others like Schell (2008), who further describes the supreme importance of "listening" in the design process (being able to process feedback from the player's experience of candidate designs). Beyond initial conceptualization of a game idea and the tuning and polish of the final product, the two most important practices of game design are prototyping and playtesting, both of which are intentionally focused on providing the designer with a better understanding of play. Prototypes are playable artifacts, working models of a game idea that permit asking and answering questions about how a game will interact with its environment without requiring the effort to create a complete, polished game. The aesthetics of prototypes are very different from those of complete games: most artwork and sound is stripped away from a design idea to produce an artifact that most effectively elicits feedback on a designer's current focus of interest (often how the interaction of game mechanics affects the trajectory of play). Prototypes must be set to interact with an audience to gain the answers they are designed to provide.
Playtesting is the practice of playing a game or gameplay prototype while observing the choices made, actions taken, or reactions expressed by sample players (the designer, a friend, a dedicated tester, or even a member of the target audience). Observations made during playtesting can reveal objective properties of a game such as unwritten-but-implied mechanics, exploits, and alternative puzzle solutions, or subjective properties such as the level of engagement, fun, or hesitation expressed by the sample players (Smith, Nelson, and Mateas 2009). Despite the ostensible purpose of game design being the production of complete, desirable games for play by end-users, the practices of playtesting and prototyping are centered on providing feedback to the designer. Through the rough generate-and-test process of iterative game design, where several prototypes are created and playtested during a single project, the underlying goal is to build up sufficient skill and understanding to later produce the high-quality, final game artifact. Such a self-affecting process is exactly what McGraw and Hofstadter (1993) call the "central loop of creativity". Beneath the surface, the practices of game design are almost exclusively about the collection of design knowledge, knowledge regarding the relationship between the component elements of a game system and that game's potential execution in interaction with a player. Such design knowledge spans what design patterns to employ, how to assemble them, and why such an assembly will produce a certain play experience. Existing design knowledge can be applied to realize familiar, well-understood play experiences, but creative game design demands a continuous source of new design knowledge. Thus, the second tenet of our theory is this: creative game design is about seeking design knowledge. An Analogy with Science To expand on knowledge-seeking in game design, we want to draw an extended analogy between game design and science. Doing so will allow us to connect the creative activity in the game design process with the activity carried out by scientific discovery systems in CC. For design generally, Dasgupta claims "design problem solving is a special instance of (and is indistinguishable from) the process of scientific discovery" (1991, p. 353). While Dasgupta focuses on explaining design activity specifically in terms of finding a confirmable theory which resolves a particular unexplained phenomenon, our analogy is intentionally softer, to enable applications in a variety of CC domains that do not immediately appear as "design problem solving" domains. Scientific Practices Roughly, the scientific method is a closed loop with the following phases: a hypothesis is generated from a working theory, the hypothesis drives the design of an experiment (usually realized with a physical apparatus), data from executing this experiment are collected, and conclusions are drawn which can be integrated into the working theory. A scientist will design an experimental setup, despite already possessing a theory which makes predictions about the situation, precisely because there is some uncertainty about the result. This result, whether matching the prediction or not, should provide informative detail about the natural laws at play in the experiment's environment. In our analogy, experiment design is prototyping; experiment execution and subsequent analysis is playtesting.
The combination of the declarative knowledge of natural laws and the procedural knowledge of operating laboratory equipment is game design knowledge. Finally, the closed loop of the overall scientific method corresponds to iterative game design. Making the parallel clearer, Gingold and Hecker (2006) talk about how gameplay prototypes should be informative, answer specific questions, and be falsifiable. In the philosophy of science, the notion of informative content (and its relation to falsifiability) guides the evaluation of theories and experimental designs. In their capacity to design and execute informative experiments and produce coherent and illuminating explanations of anomalous results, scientists are clearly creative. In Colton's terms (2008), we can easily perceive this creativity in the skill of precise experimental design, the appreciation of unexpected results in the context of a working theory, and the imagination of previously difficult-to-consider alternative theories and the invention of new instruments. By the analogy above, a scientist's kind of creativity can apply to the game designer as well, prompting our third tenet: designers act as explorers in a science of play, an "artificial science" in Simon's terms (1996, Ch. 5). Automated Discovery Though largely distinct from artifact generation, automating discovery in science and mathematics is an established CC tradition (Langley et al. 1984; Lenat 1976). Within these systems it is common to find subprocesses which generate artifacts as part of the larger discovery process. The GT system (Epstein 1988), an automated graph theorist, would periodically "doodle" random graphs within a specific design space as a means of generating relevant data which might spark a new conjecture about the desired area of focus. With our analogy in mind, such doodling is reminiscent of the exploratory, rapid prototyping process sometimes used in game design (in what Gingold and Hecker call the "discovery" phase of development). A heuristic to generate new concrete examples of abstract concepts was even present in the original automated mathematician, AM, working with number theory (Lenat 1976). Beyond generating artifacts with only the indirect intention of knowledge gain, more recent discovery systems internally optimize expected knowledge gain when deciding which experimental setup to test next, realizing an active learning process (Bryant et al. 2001). Where artifact generation provides opportunistic benefits in mathematical domains (in which graphs and conjectures are noninteractive, static artifacts), discovery systems working in the physical sciences fundamentally cannot avoid artifact generation (as experimental design) during active exploration of their domain. A notion of "interestingness" is the glue that binds the various subprocesses of automated discovery (artifact generation included) together into an overall control flow (Colton and Bundy 2000). In many cases, interestingness measures the likelihood or quality of knowledge expected to be discovered by taking a particular action (e.g. searching for a counterexample) or focusing on a particular concept (e.g. looking for new examples of special graphs). A system's overall notion of interestingness can be used to induce a measure of value for artifacts generated in its artifact generation subprocesses, a measure related to potential for knowledge gain as opposed to aesthetic value.
Returning to game design, our fourth tenet holds that automated discovery systems inspire a computational model of creative game design: one that explains the prototypes produced in exploratory game design as the doodles produced while trying to flesh out design theories residing in the designer's head, motivated by an interest in designs that have the potential to reveal new patterns. Newell's Knowledge Level To complete the image of the creative game designer as a discoverer, we need a better vocabulary for talking about a designer's knowledge, around which the entire discovery process revolves. Newell (1982) describes the "knowledge level" as a systems level set above the symbolic, program level. At the knowledge level, we find agents with bodies of knowledge that can take actions in some environment to make progress towards goals. The actions taken by an agent are said to be governed by a principle of rationality that states "If an agent has knowledge that one of its actions will lead to one of its goals, then the agent will select that action." This sense of rationality is distinct from decision-theoretic rationality in that it does not necessarily imply that an agent must optimize anything. While this radically underspecifies an agent's behavior from a computational perspective, the constellation of concepts at the knowledge level is useful for making statements about our game designer. The intention of knowledge-level modeling is to explain the behavior of knowledge-bearing agents (be they human or machine) without reference to how that knowledge is represented, and without access to an operational model of the agent's mode of processing. Understanding the game designer, at the knowledge level, starts with making an assumption about what is known and what is sought. From our theory so far, we can safely assume that one important body of knowledge the designer possesses is tentative game design knowledge. This knowledge permits the designer to use tools such as paper and trinkets for physical gameplay prototypes (often styled after board games) and programming languages and compilers for more detailed, computational prototypes. This same knowledge permits understanding to be gained from the observation of game artifacts in play, and suggests a tentative vocabulary for composition of those artifacts (i.e. knowledge of design patterns for game rules). The creative designer's goal, per our analogy with science, is clearly to gain more design knowledge. Given this, we expect the designer to rationally (specifically in the knowledge-level sense) go about the practices of game design as part of taking actions that lead towards the gain of design knowledge. That is, game design activity can be explained as the rational pursuit of design knowledge gain. Creativity as Rational Curiosity The knowledge level lets us talk about a kind of rationality, one that gives an explanation for why game designers take the actions they do. But not all game design activity is creative, any more than all of science is creative. So where does creativity come in? The most creative parts of game design, we claim, are the ones where the designer's behavior is best explained by the direct intention to gain new knowledge, to satisfy curiosity. The bulk of game production effort is a kind of engineering that applies the knowledge gained in the curiosity-driven creative mode.
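To make the knowledge-level framing concrete, here is a minimal sketch in Python of the principle of rationality applied to design-knowledge gain. Every name in it is invented for illustration; it restates the principle in executable form rather than reproducing anything from Newell or from an actual design system.

def believed_to_lead_to_goal(action, knowledge):
    # Placeholder belief test: the agent "knows" an action helps if that
    # action is recorded as useful in its current body of knowledge.
    return action in knowledge.get("useful_actions", set())

def select_action(actions, knowledge):
    # Principle of rationality: if the agent has knowledge that one of its
    # actions will lead to one of its goals, it selects that action. Note
    # that nothing is optimized; any believed-useful action will do.
    for action in actions:
        if believed_to_lead_to_goal(action, knowledge):
            return action
    return None  # no action is believed to help; behavior is unspecified

knowledge = {"useful_actions": {"build_prototype", "run_playtest"}}
print(select_action(["polish_artwork", "run_playtest"], knowledge))  # run_playtest

Note that select_action does no optimization: any action believed to lead to the goal suffices, which is exactly what distinguishes knowledge-level rationality from decision-theoretic rationality.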
As the motivation to reduce uncertainty and explore novel stimuli (Berlyne 1960), curiosity has long been known to be intertwined with the judgment of aesthetics (Berlyne 1971). Saunders' "curious design agents" (2002) generate aesthetic artifacts according to their potential to satisfy an internal measure of curiosity, doing so in order to learn about an outside environment. This framework has also been used to drive the behavior of a simulated society of curious visual artists and even a flock of curious sheep in a virtual world (Merrick and Maher 2009). How curiosity-driven behavior can explain the various processes and artifacts of human creativity at a high level has been demonstrated at great length (Schmidhuber 2010). Our unique claim, that creativity is a knowledge-level phenomenon, gains similar explanatory power without reference to algorithmic details (such as the use of reinforcement learning or optimization procedures) or human psychology, as in Loewenstein's comprehensive review of research on human curiosity (1994). Looking concretely at curiosity in the domain of game design, consider the example of speed runs, gameplay traces that demonstrate a way of doing something in a game (completing a level or collecting certain items) much faster than a designer previously expected. Speed runs are interesting to game designers, from a curiosity perspective, because they often represent a novel stimulus and quickly increase uncertainty about what is possible in a game, creating an urge to seek out related gameplay traces that would illustrate the general pattern by which the run was achieved. With additional experience, the designer can learn to either design out such speed runs by adjusting the game's rules, or create new mechanics that reward them. Putting together curiosity about design knowledge with knowledge-level rationality, we have our complete theory of creativity in game design: creativity is the rational pursuit of curiosity. This claim applies to human and machine design agents and gives a goal-oriented explanation to sequences of design activity that result in design knowledge gain (clearly including prototyping and playtesting). A creative game designer makes games, not because that is their function, but because they want to learn things about play that require experimentation with certain artifacts to illuminate. We call this theory rational curiosity because it is a knowledge-level treatment of the concept of curiosity that focuses on how curiosity explains the selection of actions towards a known goal. It claims that curiosity, applied rationally, will result in behavior recognizable as creative design activities. Transformational Creativity in Game Design Consider our model creative game designer, over time, producing increasingly complex and refined playable artifacts that are in line with the complexity of their currently operating design theory. That playable artifacts are produced is just an externally visible byproduct of the more interesting process going on in the designer's thoughts: the growth and refinement of design knowledge. Taking a snapshot at any one time, the designer's knowledge is fixed. The present knowledge describes a "conceptual space", in Boden's terms, of game designs and play possibilities. Combinational creativity within this space would entail the generation of artifacts from known structures and construction constraints or, perhaps, the enumeration of explanations of a player's behavior with respect to known patterns.
Taking a series of steps in terms of the current design theory - producing a new game using a design pattern of interest, producing a prediction of player behavior, and then performing a playtest and comparing the results with the prediction - is an example of exploratory creativity in this space. These activities are weakly creative in the rational curiosity view because, though they might indeed be motivated by potential knowledge gain, neither realizes an actual change to the designer's personal theory. Transformational creativity in game design, then, is design activity which results in a redefinition of design theories. Iterative game design, where many prototypes are produced in succession in response to feedback from playtesting, is an intensely transformative process. Such transformations can include the definition of a new design pattern which simplifies the explanation of how another designer's game was constructed, a constraint which limits the use of two patterns together, or a rule which predicts a certain kind of player's behavior when a certain combination of patterns is present. Discussion Having proposed a theoretical explanation of creativity in game design, let's look at what it entails. Computational Creativity in Game Design The theory of rational curiosity in game design can be realized computationally along two major paths: the development of a game design discovery system, and new, knowledge-oriented creativity support tools. More generally, however, it suggests new elements that need to be modeled computationally in support of either path. Game Design Discovery Systems Recalling the analogy with scientific and mathematical discovery systems, we can imagine the design of a new kind of discovery system that would work in the domain of game design knowledge. This discovery system would produce games as part of its experiments in exploring play, but it would also allocate significant attention to decomposing games made by other designers and producing explanations of observed human player actions. The notion of interestingness in this discovery system would correspond to a symbol-level realization of the agent's knowledge-level goal: the satisfaction of curiosity about design knowledge. By selecting actions (such as the construction of a prototype, the simulation of a playtest using a known player model, or the analysis of a previously produced game in light of a refined theory) according to their calculated prospects for improvement of the working design theory (a library of design patterns, predictive rules for player behavior, and constraints on play-model construction), the system's behavior would implement a rational pursuit of curiosity: creativity in game design. Constructing such a system would require new research into adapting symbol-level representations of design knowledge for use in game design, the development of a task decomposition of creative game design into subgoals and actions (such a concrete design methodology would be of interest to human designers as well), and the identification of the relevant external tools of game design (certainly paper prototyping materials and programming environments are some of these tools, but where are the CAD systems for games?). Knowledge-oriented Creativity Support Tools With recognition that design knowledge gain is the designer's goal, creativity support tools in game design should focus on easing this process.
In terms of Yeap's (2010) desiderata for creativity support tools, these tools should focus on ideation and empowerment. In game design, these translate to the generation of candidate design knowledge for the designer to consider, and then leaving the adoption of the new knowledge up to the designer (without undue interference). Knowledge-oriented creativity support tools should attempt to remove bottlenecks in the discovery process. We created a gameplay pattern language and corresponding search tool which is intended to accelerate the extraction of feedback about game designs from prerecorded traces of play (Smith and Mateas 2011). The system is also capable of compiling patterns into a lower-level form that can be used to search for additional evidence of gameplay patterns with the machine playtesting tools included in the BIPED early-stage computational prototyping tool (Smith, Nelson and Mateas 2009). Neither of these systems is itself creative, but they are designed to provide new actions to the creative designer for rational selection in the service of knowledge gain. In another project (Smith and Mateas 2010), we captured a design space of mini-games in a logic program and used model-finding techniques to automatically generate artifacts from the described space. Use of the logic programming tools automates a small slice of game design (the literal construction of artifacts) and provides a convenient symbolic representation for some types of design knowledge. While this project automates combinational creativity in game design, the design space representation and sampling tools are intended to support an external designer's transformational creativity in the realization of new forms of gameplay through rapid exploration and expressive redefinition of the mini-game design space. New Perspective for Computational Creativity We have mostly focused on game design, but rational curiosity is intentionally worded so as to apply to other CC domains. In fact, it should apply even to domains with apparently non-interactive artifacts. Where, in game design, we were concerned with the implications of game rule systems for player actions and reactions, in the domain of music we should explore the implications of sound patterns for audience anticipation and mood, in visual art the implications of perceptual details for where the viewer's eye lingers or flees, and in sculpture the implications of geometric arrangements for audience interest from particular viewpoints. Such domains are not as interactive as game design, but they could be equally deep in the subtlety of how an audience reacts to an artifact - depth enough to keep the rationally curious artist busy producing experiments for quite some time. The knowledge-level analysis of creativity suggests new questions to ask of CC systems: What does this system want to learn? How is knowledge represented in this domain? Is the system experimenting with the affordances of the raw medium or focusing on audience reactions achievable through it? (Both are equally creative assuming the desired kind of knowledge is gained.) Consider NEvAr (Machado and Cardoso 2002), a creative system in the domain of visual art. Unlike the straightforward interactive genetic algorithm in PicBreeder (Secretan et al. 2008), NEvAr does not ask its audience for feedback on every artifact it internally considers.
The system summarizes sparse feedback from its human audience in the form of a neural network which becomes a proxy for their ratings in an internal evolutionary process. Rational curiosity would describe NEvAr as a creative system, not because it produces novel and valuable images, but because, at the knowledge level, the system appears to be rationally soliciting reactions to believed high-value images and incorporating the responses in a way that transforms the space of images that the system will next produce - it behaves consistently with the explanation that it is rationally pursuing its curiosity (albeit with limited design knowledge storage capabilities). If redesigned from scratch with rational curiosity in mind, the system might incorporate a more interpretable representation of learned knowledge (one that is easier to read as a design theory improving over time) and put more computation into experimental design, reasoning over the expected knowledge gain from enticing the human audience to provide feedback on a particular work rather than always trying to display the estimated-best available artifacts. Orienting the system around active learning, we predict, would improve the system's apparent creativity. From our perspective, it is natural to ask what a system learns as it runs. Though established techniques in computational visual art such as design grammars and iterated function systems can, in some cases, produce very interesting (valuable) images, the static nature of these techniques in isolation implies that, over time, our sense of the novelty of the kinds of artifacts these techniques produce will necessarily wane, because these techniques do not learn. While rational curiosity would deem a technique that merely samples a fixed generative space uncreative, these techniques are still valuable to us - they encode very rich design spaces that, upon gaining experience through experimentation, a creative agent can alter as part of large-scale experiments in the design of these generative spaces. Conclusion We have followed the clues embedded in the practices of human game designers to a set of building blocks for a theory of creative game design. To recap: 1) Game designers are really designers of play. 2) Creative game design is about seeking knowledge. 3) Designers act as explorers in a science of play. 4) Automated discovery systems inspire a computational model of creative design. 5) Game design activity can be explained as the rational pursuit of design knowledge gain. This has led us to a new statement about creativity that can apply to human or machine design agents in any artifact generation domain: creativity is the rational pursuit of curiosity, a knowledge-level phenomenon. We hope this model of creativity will inspire the exploration of discovery system architectures for artifact generation systems and the development of a new space of knowledge-oriented creativity support tools. Acknowledgements This work was supported in part by the National Science Foundation, grant IIS-1048385. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. 2011_5 !2011 We Can Re-Use It For You Wholesale Serendipity and Objets Trouvés in Linguistic Creativity Tony Veale School of Computer Science and Informatics University College Dublin, Belfield D4, Ireland.
Tony.Veale@UCD.ie Abstract The objet trouvé or found object has become a staple of modern art, one which demonstrates that artistic creativity is just as likely to arise from serendipitous encounters in the real world as it is from purposeful exploration in the studio. These readymades, like Duchamp's Fountain (a urinal!), seem banal in their conventional contexts of use, but take on a new resonance and meaning when viewed in an artistic space. This paper considers the linguistic equivalent of the readymade object: a well-formed phrase that takes on a new significance when used in a new context with a potentially new and resonant figurative meaning. We show how linguistic readymades can be recognized and harvested on a large scale, and used to provide a robust and scalable form of creative language generation. For Best Results, Just Add Water Computationalists in the classical AI tradition generally prefer to model the creative process as a purposeful exploration of a conceptual space (e.g., see Boden, 1994). This concentration of computational effort makes good engineering sense, but little artistic sense, since much of what we consider to be creative insight occurs not in the studio or the laboratory but through everyday interaction with the real world. Serendipitous discovery is thus unlikely to arise in purposeful explorations, since specific applications have no remit beyond the immediate concerns of their programming (and programmers), and have no other lives to live when not actively pursuing these concerns. The objet trouvé or found object is perhaps the most potent example of serendipity in artistic creation. An artist encounters an object with aesthetic merits that are overlooked in its banal, everyday contexts of use, yet when this object is moved to an explicitly artistic context, such as an art gallery, viewers are better able to appreciate these merits. The transformational power of a simple context switch is most famously demonstrated by the case of Marcel Duchamp's Fountain, a humble urinal that becomes an elegantly curved piece of sculpture when viewed with the right mindset. Duchamp referred to his objets trouvés as "readymades", since they allow us to remake the act of artistic creation as one of pure insight and inspired recognition rather than one of manual craftsmanship (see Taylor, 2009). In computational terms, the Duchampian notion of a readymade allows artistic creativity to be modeled not as a construction problem but as a decision problem. A computational Duchamp need not explore an abstract conceptual space of potential ideas. Rather, the issue becomes: how do we expose our Duchampian agent to the multitude of potentially inspiring real-world stimuli that a human artist encounters every day? Readymades represent a form of creativity that is poorly served by exploratory models of creativity, such as that of Boden (1994), and better served by investment models such as the buy-low-sell-high theory of Sternberg and Lubart (1995). In this view, creators and artists find unexpected or untapped value in unfashionable objects or ideas that already exist, and quickly move their gaze elsewhere once the public at large comes to recognize this value. Duchampian creators invest in everyday objects, just as Duchamp found artistic merit in bottles and combs.
From a linguistic perspective, these everyday objects are commonplace words and phrases which, when wrenched from their conventional contexts of use, are free to take on enhanced meanings and provide additional returns to the investor. The realm in which a maker of linguistic readymades operates is not the real world, and not an abstract conceptual space, but the realm of texts: large corpora become rich hunting grounds for investors in linguistic objets trouvés. This proposal is realized in a computational form in the following sections. A rich vocabulary of cultural stereotypes is acquired from the web, and it is this vocabulary that facilitates the implementation of a decision procedure for recognizing potential readymades in large corpora - in this case, the Google database of web ngrams (Brants and Franz, 2006). This decision procedure provides the basis for a robust web application called The Jigsaw Bard, and the cognitive insights that underpin The Bard's conception of linguistic readymades are then put to the empirical test using statistical analysis. While readymades remain a contentious notion in the public's appreciation of artistic creativity - despite Duchamp's Fountain being considered one of the most influential artworks of the 20th century - we shall show that the notion of the linguistic readymade has significant practical merit in the realms of text generation and computational creativity. A Modest Proposal Readymades are the result of artistic appropriation, in which an object with cultural resonance - an image, a phrase, a thing - is re-used in a new context with a new meaning. As a fertile source of cultural reference points, language is an equally fertile medium for appropriation. Thus, in the constant swirl of language and culture, movie quotes suggest song lyrics, which in turn suggest movie titles, which suggest book titles, or restaurant names, or the names of racehorses, and so on, and on. The 1996 movie The Usual Suspects takes its name from a memorable scene in 1942's Casablanca, as does the Woody Allen play and movie Play it Again Sam. The 2010 art documentary Exit Through the Gift Shop, by graffiti artist Banksy, takes its name from a banal sign sometimes seen in museums and galleries: the sign, suggestive as it is of creeping commercialism, makes the perfect readymade for a film that laments the mediocrity of commercialized art. Appropriations can also be combined to produce novel mashups; consider, for instance, the use of tweets from rapper Kanye West as alternate captions for cartoons from the New Yorker magazine (see the hashtag #KanyeNewYorkerTweets). Hashtags can themselves be linguistic readymades. When free-speech advocates use the hashtag #IAMSpartacus to show solidarity with users whose tweets have incurred the wrath of the law, they are appropriating an emotional line from the 1960 film Spartacus. Linguistic readymades, then, are well-formed and highly quotable text fragments that carry some figurative content which can be reused and revitalized in many different contexts. In this spirit, the title of this paper, and all of its section headings, are readymades of varying provenance. Naturally, a quote like "round up the usual suspects" or "I am Spartacus" requires a great deal of cultural knowledge to appreciate. Since literal semantics provides only a small part of their meaning, a computer's ability to recognize linguistic readymades is only as good as the cultural knowledge at its disposal.
We need to explore a more modest form of readymade, as in the following phrases:

a wet haddock
snow in January
a robot fish
a bullet-ridden corpse

Each phrase can be found in the Google database of web ngrams, and each is likely a literal description of a real object or event - even "robot fish", which describes an autonomous marine vehicle whose movements mimic real fish. But each exhibits figurative potential as well, providing a memorable description of physical or emotional coldness. Whether or not each was ever used in a figurative sense before is not the point: once this potential is recognized, each phrase becomes a reusable linguistic readymade for the construction of a vivid figurative comparison, as in "as cold as a robot fish". We now consider the building blocks from which these comparisons can be ready-made. Round Up The Usual Suspects How does a computer acquire the knowledge that fish, snow, January, bullets and corpses are cultural signifiers of coldness, and that "heartless" robots in particular are cultural signifiers of emotional iciness? Much the same way that humans acquire this knowledge: by attending to the way these signifiers are used by others, especially when they are used in cultural clichés like proverbial similes (e.g., "as cold as a fish"). In fact, folk similes are an important vector in the transmission of cultural knowledge: they point to, and exploit, the shared cultural touchstones that speakers and listeners alike can use to construct and intuit meanings. Taylor (1954) catalogued thousands of proverbial comparisons and similes from California, identifying just as many building blocks in the construction of new phrases and figurative meanings. Only the most common similes can be found in dictionaries, as shown by Norrick (1986), while Moon (2008) demonstrates that large-scale corpus analysis is needed to identify folk similes with a breadth approaching that of Taylor's study. However, Veale and Hao (2007) show that the world-wide web is the ultimate resource for harvesting similes. Veale and Hao use the Google API to find many instances of the pattern "as ADJ as a|an *" on the web, where ADJ is an adjectival property and * is the Google wildcard. WordNet (Fellbaum, 1998) is used to provide a set of over 2,000 different values for ADJ, and the text snippets returned by Google are parsed to extract the basic simile bindings. Once the bindings are annotated to remove noise, as well as frequent uses of irony, this web harvest produces over 12,000 cultural bindings between a noun (such as fish, or robot) and its most stereotypical properties (such as cold, wet, stiff, logical, heartless, etc.). Stereotypical properties are acquired for approx. 4,000 common English nouns. This is a set of building blocks on a larger scale than even that of Taylor, allowing us to build on Veale and Hao (2007) to identify linguistic readymades in their hundreds of thousands in the Google ngrams. However, to identify readymades as resonant variations on cultural stereotypes, we need a certain fluidity in our treatment of adjectival properties. The phrase "wet haddock" is a readymade for coldness because "wet" accentuates the "cold" that we associate with "haddock" (via the web simile "as cold as a haddock"). In the words of Hofstadter (1995), we need to build a SlipNet of properties whose structure captures the propensity of properties to mutually and coherently reinforce each other, so that phrases which subtly accentuate an unstated property can be recognized.
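As a toy illustration of the harvesting step described above, the following Python sketch extracts (property, noun) simile bindings with a regular expression. The snippets are invented stand-ins for real web search results, and the pattern is a simplification of the "as ADJ as a|an *" queries used by Veale and Hao.

import re
from collections import defaultdict

# Simplified simile pattern: "as <adjective> as a/an <noun>".
SIMILE = re.compile(r"\bas (\w+) as an? (\w+)", re.IGNORECASE)

# Invented snippets standing in for text returned by web queries.
snippets = [
    "He remained as cold as a fish throughout the meeting.",
    "Her alibi was as watertight as a submarine.",
    "The new intern is as keen as a razor.",
]

stereotypes = defaultdict(set)  # noun -> its stereotypical properties
for text in snippets:
    for adj, noun in SIMILE.findall(text):
        stereotypes[noun.lower()].add(adj.lower())

print(dict(stereotypes))
# {'fish': {'cold'}, 'submarine': {'watertight'}, 'razor': {'keen'}}

The real harvest additionally filters noise and irony before the bindings enter the stereotype vocabulary; that curation step is omitted here.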
In the vein of Veale and Hao (2007), we use the Google API to harvest the elements of this SlipNet. Specifically, we hypothesize that the construction "as ADJ1 and ADJ2 as" shows ADJ1 and ADJ2 to be mutually reinforcing properties, since they can be seen to work together as a single complex property in the context of a single comparison. Thus, using the full complement of adjectival properties used by Veale and Hao (2007), we harvest all instances of the patterns "as ADJ and * as" and "as * and ADJ as" from Google, noting both the combinations that are found and their relative frequencies. These frequencies provide the link weights for the Hofstadter-style SlipNet that is then constructed. In all, over 180,000 links are harvested, connecting over 2,500 adjectival properties to each other. We put the intuitions behind this SlipNet to the empirical test in a later section. We Can Remember It For You, Wholesale In the course of an average day, a creative writer is exposed to a constant barrage of linguistic stimuli, any small portion of which can strike a chord as a potential readymade. In this casual inspiration phase, the observant writer recognizes that a certain combination of words may produce, in another context, a meaning that is more than the sum of its parts. Later, when an apposite phrase is needed to strike a particular note, this combination may be retrieved from memory (or from a trusty notebook), if it has been recorded and suitably indexed. Given a rich vocabulary of cultural stereotypes and their properties, computers are capable of indexing and recalling a considerably larger body of resonant combinations than the average human. The necessary barrage of stimuli can be provided by the Google 1T database of web ngrams - snippets of web text (of one to five words) that occur on the web with a frequency of 40 or higher (Brants and Franz, 2006). Trawling these ngrams, a modestly creative computer can recognize well-formed combinations of cultural elements that might serve as a vivid vehicle of description in a future comparison. For every phrase P in the ngrams, where P combines stereotype nouns and/or adjectival modifiers, the computer simply poses the following question: is there an unstated property A such that the simile "as A as P" is a meaningful and memorable comparison? The property A can be simple, as in "as dark as a chocolate espresso", or complex, as in "as dark and sophisticated as a chocolate martini". In either case, the phrase P is tucked away, and indexed under the property A until such time as the computer needs to produce a vivid evocation of A. The following patterns are used to identify potential readymades in the web ngrams:

(1) NounS1 NounS2, where both nouns denote stereotypes that share an unstated property AdjA. The property AdjA serves to index this combination. Example: "as cold as a robot fish".

(2) NounS1 NounS2, where both nouns denote stereotypes with salient properties AdjA1 and AdjA2 respectively, such that AdjA1 and AdjA2 are mutually reinforcing. The combination is indexed on AdjA1+AdjA2. Example: "as dark and sophisticated as a chocolate martini".

(3) AdjA NounS, where the noun is a known stereotype, and the adjective is a property that mutually reinforces an unstated, but salient, property of the stereotype. Example: "as cold as a wet haddock". The combination is indexed on this property.
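A minimal sketch, with invented link weights and salient properties, of how the SlipNet and the stereotype vocabulary might combine to index a readymade under pattern (3): an adjective-noun 2-gram is filed under a salient property of the noun whenever the adjective reinforces that property.

from collections import defaultdict

# Invented placeholder data: link weights would really come from web counts
# of "as ADJ1 and ADJ2 as", and salient properties from harvested similes.
slipnet = {("wet", "cold"): 120, ("dark", "sophisticated"): 80}
salient = {"haddock": {"cold"}, "martini": {"sophisticated"}}

def reinforces(a1, a2, threshold=40):
    # Two properties mutually reinforce if their pairing is frequent enough.
    return (slipnet.get((a1, a2), 0) >= threshold
            or slipnet.get((a2, a1), 0) >= threshold)

readymades = defaultdict(list)  # property -> phrases indexed under it
for adj, noun in [("wet", "haddock"), ("dark", "martini")]:
    for prop in salient.get(noun, ()):
        if reinforces(adj, prop):
            readymades[prop].append(adj + " " + noun)

print(dict(readymades))
# {'cold': ['wet haddock'], 'sophisticated': ['dark martini']}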
Other, syntactically richer structures for P are also possible, as in the phrases "a lake of tears" (a melancholy way to accentuate the property "wet") and "a statue in a library" (for "silent" and "quiet"). In the current work, we shall focus on 2-gram phrases only. Using these patterns, our application - the Jigsaw Bard - pre-builds a vast collection of figurative similes well in advance of the time it is asked to use or suggest any of them. Each phrase P is syntactically well-formed, and because P occurs relatively frequently on the web, it is likely to be semantically well-formed as well. Just as Duchamp side-stepped the need to physically originate anything, but instead appropriated pre-built artifacts, the Bard likewise side-steps the need for natural-language generation. Each phrase it proposes has the ring of linguistic authenticity; because this authenticity is rooted in another, more literal context, the Bard also exhibits its own Duchamp-like (if Duchamp-lite) creativity. We now consider the scale of the Bard's generativity, and the quality of its insights. Use Only The Finest Ingredients The vastness of the web, captured in the large-scale sample that is the Google ngrams, means the Jigsaw Bard finds considerable grist for its mill in the phrases that match (1)…(3). Thus, the most restrictive pattern, pattern (1), harvests approx. 20,000 phrases from the Google 2-grams, for almost a thousand simple properties (indexing an average of 29 phrases under each property, such as "swan song" for "beautiful"). Pattern (2) - which allows a blend of stereotypes to be indexed under a complex property - harvests approx. 170,000 phrases from the 2-grams, for approx. 70,000 complex properties (indexing an average of 12 phrases under each, such as "hospital bed" for "comfortable and safe"). Pattern (3) - which pairs a stereotype noun with an adjective that draws out a salient property of the stereotype - is similarly productive: it harvests approx. 150,000 readymade phrases for over 2,000 simple properties (indexing an average of 125 phrases per property, as in "youthful knight" for "heroic" and "zealous convert" for "devout"). The Jigsaw Bard is best understood as a creative thesaurus: for any given property (or blend of properties) selected by the user, the Bard presents a range of apt similes constructed from linguistic readymades. The numbers above show that, recall-wise, the Bard has sufficient coverage to work robustly as a thesaurus. Quality-wise, users must make their own determinations as to which similes are most suited to their descriptive purposes, yet it is important that the suggestions provided by the Bard are sensible and well-motivated. As such, we must be empirically satisfied about two key intuitions: first, that salient properties are indeed acquired from the web for our vocabulary of stereotypes (this point relates directly to the aptness of the similes suggested by the Bard); and second, that the adjectives connected by the SlipNet really do mutually reinforce each other (this point relates directly to the coherence of complex properties, as well as to the ability of readymades to accentuate an unstated property). Both intuitions can be tested using Whissell's (1989) dictionary of affect, a psycholinguistic resource used for sentiment analysis that assigns a pleasantness score of between 1.0 (least pleasant) and 3.0 (most pleasant) to over 8,000 commonplace words.
We should thus be able to predict the pleasantness of a stereotype noun (like fish) using a weighted average of the pleasantness of its salient properties (like cold, slippery). We should also be able to predict the pleasantness of an adjective using a weighted average of the pleasantness of its adjacent adjectives in the SlipNet. (In each case, weights are provided by the relevant web frequencies.) We can use a two-tailed Pearson test (p < 0.05) to compare the predictions made in each case to the actual pleasantness scores provided by Whissell's dictionary, and thereby assess the quality of the knowledge used to make the predictions. In the first case, predictions of the pleasantness of stereotype nouns based on the pleasantness of their salient properties (i.e., predicting the pleasantness of Y from the Xs in "as X as Y") have a positive correlation of 0.5 with Whissell; conversely, ironic properties yield a negative correlation of -0.2. In the second, predictions of the pleasantness of adjectives based on their relations in the SlipNet (i.e., predicting the pleasantness of X from the Ys in "as X and Y as") have a positive correlation of 0.7. Though pleasantness is just one dimension of lexical affect, it is one that requires a broad knowledge of a word and its usage to estimate accurately. In this respect, the Bard is well served by a large stock of stereotypes and a coherent network of informative properties. So Long, And Thanks For All The Robotic Fish Fishlov (1992) has argued that poetic similes represent a conscious deviation from the norms of non-poetic comparison. His analysis shows that poetic similes are longer and more elaborate, and are more likely to be figurative and to flirt with incongruity. Creative similes do not necessarily use words that are longer, or rarer, or fancier, but use many of the same cultural building blocks as non-creative similes. Armed with a rich vocabulary of building blocks, the Jigsaw Bard harvests a great many readymade phrases from the Google ngrams - from the evocative "chocolate martini" to the seemingly incongruous "robot fish" - that can be used to evoke an equally wide range of properties. This generativity makes the Bard scalable and robust. However, any creativity we may attribute to it comes not from the phrases themselves - they are readymades, after all - but from the recognition of the subtle and often complex properties they evoke. The Bard exploits a sweet-spot in our understanding of linguistic creativity just as much as Duchamp and his followers exploited a sweet-spot in the public's understanding of art and how it is practiced. But as presented here, the Bard is a starting point in our exploitation of linguistic readymades, and not an end in itself. By harvesting more complex syntactic structures, and using more sophisticated techniques for analyzing the figurative potential of these phrases, the Bard and its ilk may gradually approach the levels of poeticity discussed by Fishlov. For now, it is sufficient that even simple techniques serve as the basis of a robust and practical thesaurus application. Exit Through The Gift Shop A screenshot of The Jigsaw Bard is presented in Figure 1 (see overleaf).
The application can be accessed online at: http://www.educatedinsolence.com/jigsaw 2011_6 !2011 Negotiated Content: Generative Soundscape Composition by Autonomous Musical Agents in Coming Together: Freesound Arne Eigenfeldt School for the Contemporary Arts Simon Fraser University Vancouver, BC CANADA arne_e@sfu.ca Philippe Pasquier School of Interactive Arts and Technology Simon Fraser University Surrey, BC CANADA pasquier@sfu.ca Abstract Generative music systems have been successful in styles and genres where there are explicit rules that can be programmed into the system. Practices and procedures within soundscape composition have tended to be implicit, in which recordings are selected, combined, and processed based upon contextual relationships. We present a system - Coming Together: Freesound - in which four autonomous artificial agents choose sounds from a large pre-analyzed database of soundscape recordings (from freesound.org), based upon their spectral content and metadata tags. Agents analyze, in real time, the other agents' audio, and attempt to avoid dominant spectral areas of other agents by selecting sounds that do not mask the other agents' spectra. Furthermore, selections from the database are constrained by metadata tags describing the sounds. Example compositions have been evaluated through subject testing, comparing them to human-composed compositions, and the results are discussed. Introduction Generative music systems have been successful in styles and genres where there are explicit rules that can be programmed into the system. Practices and procedures within soundscape composition have tended to be implicit, in which recordings are selected, combined, and processed based upon contextual relationships. Any generative system that attempts to create music based upon implicit rules will, therefore, require an awareness of the musical environment within which it is currently active. Coming Together: Freesound is part of an ongoing exploration of musical metacreative systems (see http://www.metacreation.net/) that generate music that would be considered creative if generated by a human (Whitelaw 2004). It can be considered a (generative) real-time composition system that creates soundscape compositions. Generative Music Generative music systems are those that create musical output that is different with each iteration. Although there is no direct requirement for such systems to be software-based (Galanter 2006) - for example, Riley's In C can be viewed as a generative system - the ability for algorithmic methods to control software synthesizers directly has made the sonification of generative systems much more practical. Generative systems have varying degrees of autonomy. Fully algorithmic systems may only require the specification of parameters, and the system can then produce musical output, in or out of real time (Collins 2008). Others may involve a composer interacting with the system during performance, an approach Chadabe terms interactive composing (Chadabe 1984). These latter systems have tended to be top-down, in the sense that a composer can control the system much as a conductor can control an orchestra; our approach is bottom-up, in which intelligent musical agents interact. The approach described in this paper is different from Chadabe's, in that it relies more upon intelligent decision making by the agents, rather than controlled random processes: as such, it can be seen as real-time composition.
Real-time Composition Real-time composition (Eigenfeldt 2008) is the application of musical agents to interact in musically intelligent ways during performance. Each agent has the potential to control an independent musical gesture - either pitch-based or timbral - and the complexity of the interactions, along with the quantity of simultaneous gestures, cannot be controlled in any detailed way using existing performative actions. In other words, knowledge must be built into the agents on how to interact musically, and an environment created in which these agent interactions can result in artistically interesting and compositionally satisfying sonic artworks. Real-time composition (RTC) is not improvisation, just as improvisation is not real-time composition (Lewis 2000). Although RTC has evolved from improvisatory interactive systems, the complexity desired by composers in RTC cannot be controlled through the existing performative methods used in improvisational systems, nor through constrained random procedures (Eigenfeldt 2007). Imbuing multi-agents with musical knowledge and intelligence, and facilitating their interaction in real time, allows for the creation of compositional environments during performance. As RTC systems model the composer's decisions, rather than those of an improvising performer, RTC is, first and foremost, a compositional medium, albeit one that is based within performance. Soundscape Composition Soundscape composition is a form of electroacoustic music characterized by the presence of recognizable environmental sounds and contexts, the purpose being to invoke the listener's associations, memories, and imagination related to the soundscape (Truax 2002). Four of its basic principles (after Truax) are:

- listener recognisability of the source material is maintained;
- the listener's knowledge of the environmental and psychological context is invoked;
- the composer's knowledge of the environmental and psychological context influences the shape of the composition at every level;
- the work enhances our understanding of the world, and its influence carries over into everyday perceptual habits.

Soundscape composition tends to keep a degree of recognisability in its sounds in order to retain a listener's recognition of and associations with these sounds (Truax 2002). Successful soundscape composition plays with the listener's associations between the recordings, and the expectations arising from these associations. Truax points out that these relationships are intrinsic to the composition: montages or collages of random environmental sounds are rarely successful: "The problem here is that the arbitrary juxtaposition of the sounds prevents any coherent sense of a real or imagined environment from occurring. In addition, the lack of apparent semantic relationship between the sounds prevents a syntax from being developed in the listener's mind, hence it is impossible to construct a narrative for the piece" (Truax 2002). Furthermore, generative systems have also tended to be limited to symbolic representations - i.e. MIDI - as opposed to audio (e.g. Assayag 2006). A generative soundscape system must combine audio recordings in ways that rely upon an understanding of those recordings' spectral components and semantic contexts. Previous Work Some work in generative soundscape creation has been done using virtual environments as a model (Eckel 2001, Serafin 2004, Birchfield et al. 2005, Finney 2009, Janer et al. 2009).
These systems generate sonic environments in real time in response to user actions and movements through a virtual space. Misra and Cook (2009) provide a survey of synthesis tools and methods best suited to specific sound types, including complex environmental scenes and compositions. The authors provide an example of a completed synthesized "sound scene". Freesound radio (http://radio.freesound.org/) is a web-based system that allows users to collaborate and interact to create "sample-based music creations", using the freesound.org library as a source. An Editor interface allows users to program their own simple patches, while a Player interface allows users to vote on existing patches while bookmarking sounds and tags; this influences an evolutionary algorithm that creates new patches and remixes and mutates existing ones. The system described here generates soundscape compositions during performance, a style that is normally composed as a fixed medium. It is one system within a series of systems under the title Coming Together (Eigenfeldt 2010, Eigenfeldt and Pasquier 2010). Each of these systems explores the potential for autonomous agents to negotiate content within a predefined compositional framework. The goal in each case is a computationally creative system that produces music in real time which would be considered creative if generated by a human. Coming Together: Freesound is designed to generate soundscape compositions using a database derived from freesound.org. The system was designed based upon an autoethnographic analysis of one of the authors' own methods of soundscape composition. A recording will suggest a particular context and combination with other recordings based upon its spectral content and its semantic meaning; however, no effort is made to separate the listener's degree of recognition and/or relationship to the sounds - in other words, acoustic ecology models are not employed. For example, a recording of urban traffic that contains a preponderance of low frequencies may suggest a combination with the high-frequency squealing of truck brakes, without worrying whether the listener has a familiarity with urban soundscapes. We consider this system to be a generator of a single composition, with an infinite variety of possible realisations. As such, the composition has a formal structure that is repeated with each generation (see Predefined Formal Characteristics). Description Decisions on how to combine recordings are made using a selection method that combines metadata tags and pre-performance audio analysis of available sound material. A database of 227 soundscape recordings, varying in length from 15 seconds to 3 minutes, was assembled from freesound.org. Metadata tags for each file were generated by hand by the composer. Up to four metadata tags were applied per recording, ordered by recognition. One example is "voices, inside, foreign, ambience", while another is "voices, animals, outside". The order of the tags is important, in that initial tags are perceived almost immediately during listening (i.e. "I hear voices"), while subsequent tags take longer to perceive and understand (i.e. "the voices are inside...they are speaking a foreign language"). Next, each file was analysed using a 24-band Bark analysis (Zwicker and Terhardt 1980), for the maximum, mean, and standard deviation of each band. The database is randomly distributed between four agents prior to performance, with each agent receiving a unique combination of recordings.
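A brief sketch of this pre-performance analysis step, assuming numpy and using random placeholder data in place of a real 24-band Bark analysis (one row per analysis window, one column per band): the per-band maximum, mean, and standard deviation are stored as the recording's profile.

import numpy as np

# Placeholder frame matrix standing in for a real Bark analyzer's output.
frames = np.abs(np.random.randn(500, 24))  # 500 windows x 24 Bark bands

profile = {
    "max":  frames.max(axis=0),   # peak energy in each Bark band
    "mean": frames.mean(axis=0),  # average energy in each Bark band
    "std":  frames.std(axis=0),   # variability of each Bark band
}
print(profile["mean"].shape)  # (24,)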
Selection by Metatag Data

During performance, agents listen to the agent-generated sonic environment, and select material from their database based upon their perception of the current context. At different points of the composition, the selection methods vary: during the first of four sections, agents select material based entirely upon metadata tags; during the second section, both methods are used; during the final two sections, agents select material based entirely upon spectral regions. When an agent selects a recording (the initial selection is random), it places the associated metadata tags into a communal blackboard; agents access the blackboard, randomly selecting up to four tags, then rate their own database based upon similarity to this target. Scores are given based upon relative position to the request: a "hit" on the first tag scores 1.0, and each subsequent hit decreases by 0.2 (see Table 1). A Gaussian selection is made from the highest rankings, so as to avoid identical selections given the same request.

    File    Metadata tags            Scores         Rating
    file a  voices animals outside   1.0  0.0  0.0  1.0
    file b  footsteps inside water   0.0  0.8  0.0  0.8
    file c  inside office ambience   0.8  0.0  0.4  1.2

Table 1. A request (voices inside foreign ambience) and three metadata-tagged files, showing their scores based upon relative positions to the request, and a cumulative rating for each file.

Selection by Spectral Content

Agents generate beliefs about the spectral content of the current environment by analysing each individual agent's audio separately over five-second windows, using the same 24-band Bark analysis (see Figure 1).

Figure 1. Spectral bands in agents 1-3 over separate five-second windows.

Note that this analysis will be different from the information agents use to select their recordings: beliefs are generated over discrete time windows, while selection is made from each recording's whole-file statistical data. Agents can thus never really assemble an accurate understanding of their continually changing environment, a compositional decision that ensures variability. Combining the other agents' spectra, the agent generates a cumulative spectrum which represents its belief for that period in time. An inverse spectrum is calculated to determine low spectral regions, and this is used to generate a request to its own database (see Figure 2).

Figure 2. Generating a request using the inverse spectrum, and the returned result.

In all four sections, agents attempt to converge their selections using either contextual or spectral relationships. As the composition progresses, convergence is further facilitated by lowering the bandwidth of the agents' resonant filters, projecting an artificial harmonic field, derived from the spectral content of the recordings themselves, upon the recordings. Finally, in the last section, each agent adds granulated instrumental tones at the resonant frequencies, thereby completing the 'coming together'.
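The tag-matching scheme can be illustrated in code. The sketch below reproduces the Table 1 ratings; the Gaussian pick over the ranked list is one plausible reading of the "Gaussian selection" described above, not necessarily the system's exact implementation:

    import random

    def rate(request_tags, file_tags):
        """Score a file against a request: a hit on the first request tag
        scores 1.0, each subsequent position 0.2 less (0.8, 0.6, 0.4)."""
        return sum(1.0 - 0.2 * i
                   for i, tag in enumerate(request_tags)
                   if tag in file_tags)

    def select(request_tags, database):
        """Rank the agent's files, then make a Gaussian-weighted pick near
        the top, so identical requests need not yield identical selections."""
        ranked = sorted(database,
                        key=lambda f: rate(request_tags, f["tags"]),
                        reverse=True)
        index = min(abs(int(random.gauss(0, 1.5))), len(ranked) - 1)
        return ranked[index]

    request = ["voices", "inside", "foreign", "ambience"]
    database = [
        {"name": "file a", "tags": ["voices", "animals", "outside"]},   # 1.0
        {"name": "file b", "tags": ["footsteps", "inside", "water"]},   # 0.8
        {"name": "file c", "tags": ["inside", "office", "ambience"]},   # 1.2
    ]
    print(select(request, database)["name"])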
Predefined Formal Characteristics

Although limited performance control exists over the environment (overall duration can be set at initialization, and agent volumes are controlled in real-time), certain aspects of the environment's evolution in time are predefined: the four sections, which define agent interactions, with the relative length of these sections within the overall duration generated randomly at initialization; the global evolution of certain parameters (filter bandwidth, duration of files, delay between files), which use preset tendency masks whose ranges were set by listening to the system and deciding the best balance between variety and guaranteed success; the overall increase in resonant filtering, which can be considered the defining audio feature of the composition; and the initial selection of recordings from freesound.org.

Freesound demonstrates characteristics of a sonic ecosystem (Bown 2009) in that the environment is carefully designed, while the interaction between its components remains nondeterministic, yet not random. Musically, Freesound generates surprising and varied compositions. For example, in the initial section, agents react to one another's metadata tags, and the resulting relationship between selections is clear; during the second section, an agent may select a sound based upon spectral analysis, yet the metadata tags for this recording, with potentially little contextual relationship to the other sounds, will enter the blackboard and influence the further selection of recordings.

Validation

In many generative music systems, success is determined solely by the system's creator: if the system produces output with which the designer is artistically satisfied, then the creator can claim it to be successful. Any argument as to its artistic merits can be deflected by suggesting the criticism is one of the creator's artistic sensibilities rather than any failing of the system. However, such arguments are obviously moot in assessing the true success of computationally creative systems. Collins discusses various methods of analysis of generative systems (Collins 2008), while Colton provides a set of criteria for assessing whether a computational system is actually creative (Colton 2008). Finally, Boden's segregation of creativity into H- and P-creativity is also useful (Boden 2003). By these measures, Coming Together: Freesound is not a creative system (it is not aware of its own creativity, in that it cannot adjust its behaviour based upon prior output), and it is limited to P-creative output (it will only generate soundscape compositions within a predefined style; however, those will be original). Although a composition by the system was selected for performance at an international soundscape concert (Sound and Music Computing 2010, Barcelona), thereby seemingly validating its output at an artistic level, further validation was sought through subject testing.

Test Compositions

One composition was generated by the system, and recorded. Another composition was generated using the same parameters (database, methods of processing, overall duration) but without the contextual linking through metadata tags or the spectral combinations: in other words, a random selection of soundfiles from the database.
It should be pointed out that soundscape composition spans a continuum of aesthetics, from transparent recording (sometimes called phonography) to the more acousmatic, in which recordings are treated much like any other sound object, ripe for processing. As such, randomly selecting soundscape recordings for playback arguably does result in an appropriate soundscape composition.

Two additional soundscape compositions were created by a human: a composer who has received national awards for his soundscape compositions. One composition was limited to the same parameters (database, methods of processing, overall duration, static spatial distribution of four gestures in four channels) as the system, while the other was freely composed, restricted only by the duration and the selection of material from the same database. At the same time, the composer was asked to keep a journal of his compositional decisions. This will potentially allow a comparison between two different working methods (the commissioned composer and the system designer) and an assessment of whether Coming Together: Freesound could be expanded to include alternative methods of creative decision-making.

The four eight-minute compositions were played in a random order to discrete test groups consisting of 39 novice listeners (sound design students unfamiliar with the genre of soundscape composition), 8 expert listeners (composers and graduate students of a soundscape class), and 11 semi-expert listeners (electroacoustic composition students). The groups were unaware that two of the compositions were machine-generated, and were asked to rate each composition on a seven-point scale on twelve questions, grouped into four sections.

Results

All comparative claims made in the text were found statistically significant using a paired two-sided t-test (p < 0.05, and often an order of magnitude or more below this).

Soundscape characteristics

The first four questions sought to discover how accurately each work realised a soundscape composition (1 = Disagree, 7 = Agree):

1. Listener recognisability of the source material is maintained;
2. The listener's knowledge of the environmental and psychological context is invoked;
3. The composer's knowledge of the environmental and psychological context influences the shape of the composition at every level;
4. The work enhances our understanding of the world, and its influence carries over into everyday perceptual habits.

    Question  Human-limited  Random       System       Human-free
    1         6.05 (0.73)    5.41 (1.17)  4.82 (1.39)  4.44 (1.45)
    2         5.53 (1.16)    4.54 (1.48)  4.95 (1.32)  5.1 (1.21)
    3         5.27 (1.04)    4.69 (1.39)  4.95 (1.23)  5.26 (1.02)
    4         4.32 (1.55)    3.97 (1.32)  4.31 (1.58)  4.26 (1.63)

Table 2. Experimental results for novice listeners. Mean levels, with standard deviation in parentheses, for success within the genre of soundscape composition.

    Question  Human-limited  Random       System       Human-free
    1         5.52 (1.5)     6.1 (0.7)    5.62 (1.02)  3.81 (1.47)
    2         5.67 (0.86)    4.67 (1.32)  4.67 (1.56)  5.1 (1.26)
    3         5.76 (1.09)    4.33 (1.65)  4.62 (1.4)   5.43 (1.12)
    4         4.95 (1.47)    4.14 (1.56)  4.4 (1.57)   4.43 (1.6)

Table 3. Experimental results for expert and semi-expert listeners. Mean levels, with standard deviation in parentheses, for success within the genre of soundscape composition.

In almost all cases, both groups of listeners rated the system as a better generator of soundscape composition than random.
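For replication purposes, the analysis amounts to a paired two-sided t-test over per-listener ratings, as in this sketch (the ratings are placeholder data, not the study's responses):

    from scipy import stats

    # Hypothetical per-listener ratings of the same question for two of the
    # four compositions; pairing holds because each listener rated both works.
    system_ratings = [5, 4, 6, 5, 4, 5, 6, 4]
    random_ratings = [4, 3, 4, 4, 3, 5, 4, 3]

    t, p = stats.ttest_rel(system_ratings, random_ratings)  # paired, two-sided
    print(f"t = {t:.2f}, p = {p:.4f}")  # significant if p < 0.05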
The expert listeners could distinguish that the freely composed human work was slightly more acousmatic, while the random composition used the least amount of processing and thus remained truest to the first goal of soundscape recognisability of source (question 1).

Compositional success

The next five questions rated the success of each work on a comparative scale between two descriptors:

5. Boring vs. Interesting;
6. Predictable vs. Surprising;
7. Mechanical vs. Organic;
8. Sterile vs. Emotional;
9. Uncommunicative vs. Communicative.

    Question  Human-limited  Random       System       Human-free
    5         4.42 (1.37)    3.64 (1.38)  4.62 (1.71)  4.9 (1.76)
    6         3.73 (1.17)    4.19 (1.29)  4.56 (1.07)  5.08 (1.55)
    7         4.86 (1.18)    4.51 (1.39)  5.03 (1.16)  3.9 (1.71)
    8         3.92 (1.3)     3.46 (1.46)  4.82 (1.45)  4.49 (1.37)
    9         4.45 (1.11)    3.62 (1.44)  4.49 (1.3)   4.42 (1.45)

Table 4. Experimental results for novice listeners for compositional success.

The system was rated higher by novice listeners than the randomly generated work in every case, and was even considered better than the human-composed limited work in terms of interest and surprise. Furthermore, it was considered the most organic, the most emotional, and the most communicative of all four works.

    Question  Human-limited  Random       System       Human-free
    5         5.95 (0.89)    4.38 (1.6)   4.48 (1.72)  6.1 (1.04)
    6         5.43 (1.12)    4.67 (1.43)  5.14 (1.01)  6.14 (0.96)
    7         4.95 (1.32)    5.14 (1.24)  4.62 (1.24)  4.62 (1.72)
    8         5.62 (1.12)    3.48 (1.57)  4.81 (1.25)  5.71 (0.9)
    9         5.55 (1.1)     4.05 (1.66)  4.65 (1.63)  5.71 (0.9)

Table 5. Experimental results for expert and semi-expert listeners for compositional success.

Expert listeners judged the system to be better than random in every instance except mechanical vs. organic; however, the system was judged similar to the freely composed human work in that aspect.

Skill level

The next two questions assessed the skill level of the composer, on a comparative scale between two descriptors:

10. Student-like vs. Professional;
11. Poor craftsmanship vs. High craftsmanship.

    Question  Human-limited  Random       System       Human-free
    10        5 (1.19)       4.05 (1.25)  4.69 (1.22)  5.26 (1.18)
    11        5.32 (1.02)    4.57 (1.07)  4.86 (1.13)  5.38 (1.26)

Table 6. Experimental results for novice listeners for skill level.

    Question  Human-limited  Random       System       Human-free
    10        5.76 (0.89)    3.86 (1.74)  4.14 (1.59)  6.14 (1.01)
    11        6.05 (0.74)    4.25 (1.33)  4.52 (1.44)  6.29 (0.96)

Table 7. Experimental results for expert and semi-expert listeners for skill level.

Here, both sets of listeners were able to discern the human-composed from the machine-composed music. Although the expert listeners rated the system less successful than the novice listeners did, they also rated the random composition much lower. In all instances, the system was considered more skillful than the randomly assembled soundscapes.

Subjective Reaction

Finally, the last question asked whether the listener disliked or liked the composition, on a comparative scale between "Did not like it" and "Liked it a lot":

12. My feelings towards this soundscape composition.

    Question  Human-limited  Random       System       Human-free
    12        4.42 (1.27)    3.57 (1.41)  4.36 (1.51)  4.92 (1.44)

Table 8. Experimental results for novice listeners for subjective reaction.

    Question  Human-limited  Random       System       Human-free
    12        5.76 (0.94)    3.67 (1.53)  4.05 (1.69)  5.95 (0.8)

Table 9. Experimental results for expert and semi-expert listeners for subjective reaction.

Again, both sets of listeners preferred the human-generated soundscape compositions to the machine-generated ones.
Interestingly, the variation in responses was higher for the machine-generated works than for the human-composed works, and the spread of these differences is higher for the expert listeners than for the novices.

Qualitative Results

Respondents were allowed to add further comments on each of the works. One expert listener admitted to having a difficult time distinguishing between the success of the system piece and the limited human piece, only slightly preferring the latter for the sole reason that the signal processing was more closely correlated to the material itself, something that would be extremely difficult to automate.

Conclusions and Future Work

Listeners did prefer human-composed soundscape compositions to machine-generated ones. Interestingly, the freely composed human work was consistently rated higher than the piece composed under the same restrictions within which the system operated: the type of processing, and the limited spatial distribution. This suggests that the compositional decisions that define Coming Together: Freesound may, in fact, be limiting its artistic success. One aspect that differentiated both machine-generated compositions from the human-composed works was the static nature of the overall amplitude envelope. This is a very high-level parameter that would require subtle changes in volume based not only upon the overall density and amplitude, but also upon the recent past. This action is actually managed by the composer during performance, carefully balancing levels and, for example, bringing down the levels of more static recordings in favour of more dynamic ones. Creating such intelligent, autonomous high-level actions is currently being investigated, with the potential for a high-level "listener" agent analysing the cumulative result and communicating its suggestions to the four generative agents.

The research instrument discussed here is a contribution in itself. As this system is a musical metacreation, validation and evaluation of such a system's output is itself a challenging research area. Our future work will investigate and try to evaluate methodologies for doing so. One particularly challenging aspect is that the system is capable of generating numerous pieces, with possibly varying levels of success. Designing methodologies to measure that variability is an inherent challenge of the area.

Acknowledgements

This research was funded by a grant from the Canada Council for the Arts, and the Natural Sciences and Engineering Research Council of Canada. The authors would also like to thank Barry Truax for his suggestions, James O'Callaghan for his compositions, and Alireza Dovoodi for his data analysis.

2011_7 !2011 Shared Mental Models in Improvisational Digital Characters Brian Magerko, Peter Dohogne, and Daniel Fuller School of Literature, Communication and Culture Georgia Institute of Technology {magerko, pdohogne3}@gatech.edu, gentristar@gmail.com

Abstract. Improvisational theatre is a unique art form that requires actors to co-construct stories on stage in real-time without the benefit of any explicit communication. All negotiation about the content of the scene, including characters, setting, plot, and relationships, must be done within the context of the performance. This negotiation process is a special form of constructing shared mental models between the performers as well as with the audience.
This article explores the process of building shared mental models in improvisation and describes computational improv agents that employ this process in an interactive implementation of the improv game called Party Quirks.

Introduction

Improvisational theatre (improv) has been the subject of a handful of interactive narrative systems over the past two decades (Hayes-Roth et al. 1994; Bruce et al. 1999; Perlin and Goldberg 1996; Swartjes, Kruizinga, and Theune 2008; Harger 2008). The computational systems that have been developed typically focus on a specific aspect of improvisation teachings or practice. For example, Harger's (2008) system explores how to represent character status (i.e. how powerful or confident a character is) with virtual actors who walk out onto a stage. In an earlier system, Hayes-Roth et al. (1994) explored how status can affect the interactions between two virtual characters. The above computational approaches to improvisation have primarily based themselves on single phenomena described in seminal improv texts, or on concepts generally known by practitioners in improvisational theatre. This approach has yielded relatively shallow agents that can exhibit one particular aspect of improvisation; it has not produced larger, more complex agents capable of performing as an improvisational actor would. We aim to build computational representations of the formal understandings gained from studying human actors. Of particular note in our findings have been two broad categories of data: narrative development (how improvisers reason about and co-create stories on stage in real-time) and the construction of shared mental models (how improvisers reach a shared understanding of where the scene is going, what is true in the story world of the scene, etc.).

For example, two actors on stage during an experimental session established early on that a) they were in a national forest, b) they were both plumbers and friends, and c) one of them was unhappy with his life as a plumber. The progression of establishing these facts in the scene can be dissected into two parts. First, there is the story content, also called the frame. Second, there is the process through which the actors presented their characters, mutually agreed on who and where those characters were, offered hints when the other was unclear about what was being established, etc. Sawyer refers to this process of establishing the frame as the process of creative convergence: the act of co-creation in a creative act (2003). Through the lens of studying problem solving in organizational psychology, this process can be thought of as the process of actors building shared mental models with each other and with the audience as they perform.1 Shared mental models involve a) individuals having their own model of the world, b) individuals having their own model of what is publicly known, and c) a process for reconciling unknowns or conflicts in their models (e.g. actor A thinks that they are in a movie theatre, but actor B then states that they are in a baseball stadium). This article presents our work on studying shared mental models in human improvisers and the computational representation of the results of that study in agents that play an improv game called Party Quirks with human interactors.

Shared Mental Models in Improvisational Theatre

Misunderstandings and miscommunications are common in improv because coordination between improvisers is not an explicit act
(i.e. improvisers do not directly communicate their intentions in a scene outside of what occurs in the performance on stage). The free-flowing, unscripted nature of improv makes all the more transparent the process of recognizing and resolving divergences in mental models in order to achieve cognitive consensus (a state of agreement about some aspect of the scene) and create shared mental models among the improvisers.

1 This is closely related to Clark and Schaefer's contribution model (1989) and Traum's grounding acts model (1999). The key difference between these works and ours is the performative nature of the domain of improvisational theatre, as opposed to a narrow focus on the utterance level of human discourse.

Some improv "games" (scenes that have specific rules for the improvisers to follow), such as "Party Quirks," even have this mechanic (which we call "knowledge disparity") built into the structure of their performance. In Party Quirks, one improviser plays the part of a party host to three other improvisers, all of whom are given specific character quirks known to everyone except the host. It is the goal of the host to infer the quirks of all three other improvisers from their behavior and interactions on stage. In other words, the host must deliberately seek out cognitive consensus with his fellow improvisers, and vice versa. Other improv games that do not deliberately disrupt cognitive consensus still often involve divergences between improvisers. Improvisers constantly have to communicate their internal understanding of the scene's frame via their performance as a character, as opposed to explicitly saying what their understanding of the frame is. We have constructed a model of these communicative acts (Fuller and Magerko 2010). This paper presents the improvisational agents we have built based on this model.

Ambiguity in Knowledge

The main reason cognitive divergences occur in improvisation is that actors' communication of intention, knowledge, and goals on stage is imperfect and ambiguous; they do not coordinate entire scenes backstage, nor do they perfectly know and communicate everything on stage. Improvisers often execute actions on stage that can be interpreted in a variety of fashions (e.g. starting a scene doing a raking action on the ground may lead to another actor coming on stage and commenting on how they are sweeping the floor, mopping, or even dancing, depending on their interpretation of the raking motion). The communication and representation of ambiguous knowledge is a main feature of our current implementation of agents that can play as guests in Party Quirks. Guests in the game typically execute actions on stage that give hints to the party host about their quirk. A common strategy is to give hints that are very ambiguous and then to give more obvious hints over time, a strategy we call reverse scaffolding. Therefore, the agents must be able to reason about a) what kinds of actions their character may execute, and b) how ambiguous (or, conversely, iconic) those actions are in terms of communicating their character's quirk. Within this particular improv game, we view quirks as prototypical characters, such as "ninja" or "alien," for simplicity (e.g. we do not handle quirks that attempt to blend characters or concepts together). Each prototype has a degree-of-membership (DOM) value for each of a list of attributes.4
Attributes are characteristics of each persona, such as "strength," "attractiveness," or "cleverness." DOM values can run anywhere from 0 (no membership) to 1 (full membership).

4 This non-Boolean description of categorical knowledge is informed by contemporary views in cognition and category theory, such as work by Lakoff (1987) and Rosch et al. (1999).

Actions in the Virtual Stage

Attributes themselves describe character prototypes, but lend no information about how those descriptors are portrayed on stage. Actions, which are observable gestures, animations, and/or dialog that can be executed on our virtual stage, are associated with one or more attributes for a range of DOM values. For example, the action "hides behind things" is a member of the attributes stealth, fearless, and immunity to projectiles, with DOM ranges 0.7-1.0, 0.0-0.4, and 0.0-0.3 respectively. This means any prototype with a DOM from 0.7 to 1.0 for stealth can hide behind things, as can any prototype with a DOM from 0.0 to 0.4 for fearless. If an agent wants to do something on stage to portray something about its prototype / quirk, it knows which actions it can execute by reasoning about a) its DOM values for attributes (i.e. "what is my prototype's membership in each attribute set?") and b) what actions map to those pairs (i.e., "given my attribute values, what actions are associated with those attribute values?").

However, some DOM values for attributes are very generic; the attribute "eats," for example, has many prototypes with a DOM around 0.5, which represents "eats an average amount." This means that if an agent with a common value for this attribute chooses to portray it, nothing will really be learned (i.e., knowing that a character eats normally provides little information about its prototype). In order to determine which actions are more unique to a given prototype, we introduced the concept of DOM ambiguity. In terms of attributes, the ambiguity of a given DOM value is a factor of the number of other prototypes with a similar DOM value (unique values are very characteristic) and how distant the value is from the "normal" value for that attribute (e.g. a value of 0.2 for "facial hair" is fairly unique in our dataset, but is not very distant from the normal value, which is 0; however, only zombies have a high value for "eats_brains," which means it is very unique / not ambiguous for portrayal). In terms of actions, ambiguity is a factor of the number of attributes the action can represent and how many prototypes can naturally execute that action.
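The DOM machinery just described can be sketched in code (the attribute names, values, and data layout below are illustrative guesses based on the examples above, not the system's actual data):

    # Hypothetical prototype: attribute -> degree of membership (0..1).
    NINJA = {"stealth": 0.9, "fearless": 0.8, "eats": 0.5}

    # Hypothetical action: attribute -> (low, high) DOM range it portrays.
    HIDES_BEHIND_THINGS = {
        "stealth": (0.7, 1.0),
        "fearless": (0.0, 0.4),
        "immunity_to_projectiles": (0.0, 0.3),
    }

    def can_execute(prototype, action):
        """An action is available if the prototype's DOM for any of the
        action's attributes falls inside the action's range for it.
        Attributes missing from the prototype default to 0.0 here (an
        assumption of this sketch)."""
        return any(low <= prototype.get(attr, 0.0) <= high
                   for attr, (low, high) in action.items())

    print(can_execute(NINJA, HIDES_BEHIND_THINGS))  # True: stealth 0.9 in 0.7-1.0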
Portraying Prototypes

The selection of which attribute to portray depends on which portrayal technique the character is assigned (either randomly or predetermined by setting internal variables) to use for this scene. There are several techniques we have observed human actors employing while playing Party Quirks and other similar games that involve knowledge disparity. Many actors reference the idea of "pacing" in a scene, which relates to making the scene "interesting to watch". As such, actors often purposefully do not give obvious hints at the beginning of a scene, as that would cause the game to end too soon and not be interesting. Our agents represent this approach (called reverse scaffolding) by executing progressively less ambiguous (more iconic) actions over the course of the game. Another technique is caricaturization, in which an actor is very obvious about their prototype. In our system, agents that use this technique present only the least ambiguous actions for the attributes characteristic to their prototype (i.e. being an obvious caricature of their prototype). Finally, actors sometimes use a technique called opposing. When actors oppose, they choose an attribute characteristic to their prototype and invert its value (e.g., someone who can fly but cannot control their movement). This aids in comedic effect and makes the scene more interesting, as it conflicts with the normal model of the prototype.

Reaching Cognitive Convergence

Our implementation of Party Quirks guests has relied heavily on the computational representation of the process of reaching cognitive convergence. The guests' goal in Party Quirks is to help the host's mental model match their prototype; therefore, the guests react to the host's actions. Based on our observations of how improvisers communicate to build cognitive convergence, the first and most common host action is to defer (i.e. wait to see what happens next) and let the guest present naturally. In response to a deferment, an agent picks which action to present based on its selected technique (reverse scaffolding, caricature, or opposing) and how far it is into the scene. The next distinctive action hosts execute is to guess which quirk (i.e., prototype) the guest is representing. This is a type of verification, as detailed below. In response to a correct guess, the agent acknowledges the host's success and leaves the game. In response to an incorrect guess, the agent indicates the host was wrong and refutes the guess by presenting an action from an attribute with a significantly different DOM value than the corresponding value from the prototype the host guessed. This demonstrates to the host a reason why their guess was incorrect while providing guidance in the right direction.

A more proactive technique a host can use to get information from a guest is to make a blind offer. In our representation, an offer is a prompt for an attribute, essentially asking "What is your value for this attribute?" In response to an offer, the agent responds with a presentation representing its value for that attribute. Another type of blind offer involves the physical environment. Just as guests can assert something about the state of the environment, so can the host (e.g. "I'm turning off the lights"). Guests respond to this with an action that uses the new environmental state as a precondition, if possible ("Don't turn off the lights, I'm afraid of the dark!"). While an offer allows the host to attempt to gather new information, a host can also try to verify their assumptions about a guest, which often happens when the host has a specific guess about a guest's quirk but still has uncertainties they desire to resolve before committing to the guess. Verifying involves stating assumptions about the guest's value for an attribute. For example, if the host thinks the guest is a ninja, before making a direct guess they might say "I think you are very good with a sword." In response to this statement, the agent responds with either a confirmation or a denial of the host's assumption. Next, the guest makes a presentation for the attribute in question and then continues the scene as normal. In some cases, the host may be unsure exactly what the agent was trying to demonstrate with a presentation. For example, if the action is "strikes a pose," it might represent multiple attributes, such as fame or strength.
In this case, the host asks, "Did you mean this attribute?" This is another type of verification, in which the host is trying to clarify what the guest just presented. The agent responds with either a confirmation or a denial of the host's guess, as well as a different action for the same attribute, basically saying "Yes, that is what I meant, see?" Finally, when the host is completely lost, they can make a generic clarification request and ask the guest to give more obvious clues. In response to such a request, the agent becomes less ambiguous with its presentations by narrowing the list of possible attributes for presentation selection.

In summation, by better understanding how human improvisers construct shared mental models, we have taken steps towards building computational actors that can employ similar processes. This is one major step towards creating improvisational actors that can interact with each other and with human users within an improv theatre framework.

2011_8 !2011 Artificial Creative Systems and the Evolution of Language Rob Saunders Design Lab University of Sydney Sydney, NSW 2010 AUSTRALIA rob.saunders@sydney.edu.au

Abstract. Most studies of human creativity have focused on individuals, assuming that creativity can be defined with respect to the characteristics, processes or activities of extraordinary people. Computational models of creativity have often inherited this assumption and emphasised generative processes to the exclusion of considering social or cultural aspects. This paper presents work to extend computational models of social creativity to support the evolution of domain-specific languages. Artificial creative societies provide the opportunity for studying creativity-as-it-is in the context of creativity-as-it-could-be. The computational model of an artificial creative system presented here extends previous computational models by introducing a linguistic component that supports the production and sharing of works with associated descriptions. This paper examines the potential for this extended model of social creativity to support the study of the roles that language plays in the formation, interaction and maintenance of creative domains.

Introduction

The need to define the nature of creativity has haunted attempts to develop theories of creative thinking: the difficulty is apparent from the abundance of definitions; Taylor (1988) gives some 50 examples. Expressed in the definitions of creativity are some widely different opinions about what it means for an individual to be creative, yet two broad categories of definitions can be identified: (1) creativity as a mental phenomenon; and (2) creativity as a social construction. For example, the models of creativity proposed by Koestler (1964), Newell, Shaw, and Simon (1958), and Hofstadter (1979) go into great detail about the cognitive processes involved in creative thinking, particularly the processes involved in the generation of potentially creative ideas. Computational models of creativity are often based directly on such models, e.g., Langley et al. (1987) and Hofstadter (1995), or are based on similar models of creative thinking from psychology, e.g., Partridge and Rowe (1994). Creativity as a social construction has a strong honorific sense, being as much the result of an audience's appreciation of a work as of the creator's production.
Proponents of these definitions contend that creativity cannot occur in a vacuum and must be studied in the context of the socio-cultural environment of the creator (Gruber 1974; Simonton 1984; Martindale 1990). Attempts to combine these two views of creativity into unified theoretical frameworks often maintain the distinction between personal and socio-cultural notions of creativity, as in Boden's P-creativity and H-creativity (Boden 1990) and Gardner's small-c and big-c creativity (Gardner 1993). Dong (2009) argues that language plays a central role in creative behaviours; semantic and sentiment analyses of the use of language in design texts have been used to illustrate how the reality-producing effect of language is itself an enactment of design. This insight is compatible with Clark's argument that language is the 'ultimate artefact', whose primary purpose is not to communicate ideas between individuals but to overcome cognitive limitations of the human brain through the externalisation of complex thought in a grounded symbolic form (Clark 1996).

A Systems View of Creativity

The systems view of creativity was developed by Csikszentmihalyi as a model of creativity that includes the interactions between individuals and the social and cultural environments they are embedded within (Csikszentmihalyi 1988). A map of the systems view of creativity is presented in Figure 1.

[Figure 1: The Systems View of Creativity, showing the individual, the field (social), and the domain (cultural), linked by novel works, personal evaluations, creative works, and information.]

An individual's role in the systems view is to bring about some transformation of the knowledge held in the domain. The field is a set of social institutions that selects from the variations produced by individuals those that are worth preserving. The domain is a repository of knowledge held by the culture that preserves the ideas or forms selected by the field. In a typical cycle, an individual takes some knowledge provided by the culture and transforms it; if the transformation is deemed valuable by society, it will be included in the domain of knowledge held by the culture, thus providing a new starting point for the next cycle of transformation and evaluation. Using the language of Gardner, what distinguishes small-c creativity from big-c creativity is that big-c creativity effects changes to the domain whereas small-c creativity does not. In Csikszentmihalyi's view, creativity is not to be found in any one of these elements, but in the interactions between them.

Computational Models of Creative Systems

Liu's dual generate-and-test model of creativity was the first attempt to produce a computational model of Csikszentmihalyi's creative systems (Liu 2000). The dual generate-and-test model encapsulates two generate-and-test loops: one at the level of the individual and the other at the level of the society. The generate-and-test loop at the individual level provides a model of creative thinking, incorporating problem finding, solution generation and evaluation of potential creativity. The outer generate-and-test loop models the field in Csikszentmihalyi's systems view of creativity, providing a model of peer evaluation and a repository of works. The limitations of Liu's computational model lie in the centralised nature of the socio-cultural test and the limited notion of the domain in the model as a repository of artefacts.
The dual generate-and-test model provides a way to integrate computational models of creative thinking with models of social creativity, but can say little about how fields and domains emerge as a consequence of the actions of individuals. The artificial creativity approach proposed by Saunders and Gero (2001) provides a framework for developing computational models of individual and social creativity to support the emergence of social structures as the result of the actions of multiple individuals. Early implementations explored the role that an individual's search for novelty plays in social creative systems. Individuals who produce works that are considered interesting by other agents are rewarded. Works communicated between agents that are considered worth sharing by peers are added to the domain.

Other multi-agent models of social creativity have examined the relationship between the field and the domain. Gero and Sosa (2002) explored the emergence of 'gatekeepers' in creative fields, i.e., individuals with the ability to strongly affect the contents of the domain. Bown (2008) developed multi-agent models to explore cohesion, competition and maladaptation in the evolution of musical behaviour. Colton, Bundy, and Walsh (2000) present a computational model involving multiple agents working together to explore a mathematical domain, which proved so successful that the agents produced new knowledge that has been accepted into the domain of number sequences. Axelrod's model of the dissemination of culture, while not attempting to model cultural creativity, illustrates the significance that individual acts of communication can have on the formation of multi-cultural societies (Axelrod 1997). Meme and Variations (MAV) is a computational model of cultural evolution in a society of interacting individuals (Gabora 1995), based on the premise that novel ideas are variations of existing ones. Each agent in an artificial society can acquire new ideas through innovation, by mutating a previously learned idea, or through imitation, by copying a neighbouring agent. Thus, cultural evolution occurs through the collective choices of individual agents about which ideas to mutate, how to mutate them, and which ideas to copy.

Miranda, Kirby, and Todd (2003) developed a model of the evolution of simple musical forms using a language game, the imitation game, similar to the one presented later and used in this study. In the society of musical agents, compositions are shared through agents performing for each other. The success of a tune is measured by the ability of another agent to reproduce it. A tune is successfully reproduced when the agent who produced the initial performance knows no tunes that are more similar to the imitator's recital than the one it initially performed. Simulations show that the society of agents quickly develops coherent sets of tunes and is capable of successful recitals. The tunes, as a set of artefacts collectively agreed upon by a society of individuals, represent a simple form of domain distributed across the memories of the individuals, in a similar way to Gabora's MAV. Creative domains, as described by Csikszentmihalyi (1988), are dynamically maintained and contain symbolic as well as archive material. Domains are distributed across creative fields, existing within a variety of media, with each individual in the field having a partial view of the whole. The computational models described above share a limited notion of domains as repositories of artefacts.
Some of these models use a centralised database, while others, e.g., MAV, capture the distributed nature of a domain described by Csikszentmihalyi, where each agent maintains some part of the whole in memory. None of the models described here, however, maintains a distinct symbolic description of the knowledge stored in the domain, i.e., none of them models a domain-specific language that can be used to describe the artefacts or practices. Computationally modelling the evolution of language in creative domains opens up the possibility of computationally investigating a range of important aspects of creativity that are outside the scope of studies focussed on individuals, including: the emergence of specialised languages that are grounded in the practices of a field; the effects of a common education on the production and evaluation of creative works; and the emergence of subdomains as a consequence of differences in language use across a field. The computational model below attempts to address some of the limitations of the earlier implementation of an artificial creative system by Saunders and Gero (2001) by including the evolution of language as a central component in the negotiation of works between individuals and the distribution of domain knowledge across a field.

The Evolution of Language

In the extended model of artificial creative systems presented here, agents continue to share works with peers in a field as before, sending 'interesting' works for evaluation to other agents; in addition, agents communicate descriptions of works as simple linguistic expressions. This extended model incorporates a model of the evolution of domain-specific languages using models of the evolution of language proposed by Steels (1996b) based on the playing of 'language games'. A language game is an abstract and simplified method of communication, first proposed by Wittgenstein to study the use of language in society. Wittgenstein (1953) describes language games in which participants can communicate to describe or learn about objects, report events, give commands or solve problems. Steels (1995) introduced the use of language games to simulate the evolution of language in multi-agent systems. In the language game first proposed by Steels, the guessing game, one agent, the initiator, describes an object using a simple utterance to a second agent, the recipient, who attempts to identify the topic of the utterance based on their experience of previous utterances. Steels has shown that the repeated playing of such language games is capable of evolving languages grounded in shared experiences to describe, for example, other agents (Steels 1996a) and coloured shapes in a shared context (Steels 1996b; 1998). In the course of attempting to succeed at as many language games as possible, the society of agents is driven to adopt common meanings for their initially random words, and as a consequence a shared lexicon emerges. Steels uses this model to support the position that language is an autonomous adaptive system and that its emergence in humans could have been the result of self-organisation rather than the acquisition of a specific language-capable area of the brain. Other types of language games have been developed by researchers to explore the evolution of language under different conditions. For example, imitation games have been used to explore the self-organisation of vowel systems (de Boer 2000) and the evolution of simple musical forms (Miranda, Kirby, and Todd 2003) described earlier.
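A toy version of the guessing game conveys the mechanics (a minimal sketch: the score-based lexicon update below is a common simplification from the language-game literature, not necessarily the exact update rule used in the model presented here):

    import random
    from collections import defaultdict

    class Agent:
        def __init__(self):
            # word -> meaning -> association score
            self.lexicon = defaultdict(lambda: defaultdict(float))

        def name(self, meaning):
            """Pick the best-scoring word for a meaning, inventing one if needed."""
            candidates = [(scores[meaning], word)
                          for word, scores in self.lexicon.items()
                          if meaning in scores]
            if not candidates:
                word = "w%04d" % random.randrange(10000)  # initially random word
                self.lexicon[word][meaning] = 0.1
                return word
            return max(candidates)[1]

        def interpret(self, word, context):
            """Guess which meaning in the shared context the word refers to."""
            scores = self.lexicon[word]
            return max(context, key=lambda m: scores.get(m, 0.0))

    def guessing_game(speaker, hearer, context):
        topic = random.choice(context)
        word = speaker.name(topic)
        guess = hearer.interpret(word, context)
        success = (guess == topic)
        # The speaker reinforces its own word-meaning pair; the hearer
        # reinforces on success, or penalises its guess and learns the
        # revealed topic on failure.
        speaker.lexicon[word][topic] += 0.1
        if success:
            hearer.lexicon[word][guess] += 0.1
        else:
            hearer.lexicon[word][guess] -= 0.1
            hearer.lexicon[word][topic] += 0.1
        return success

    agents = [Agent() for _ in range(10)]
    context = ["red_square", "blue_circle", "green_triangle"]
    for _ in range(5000):
        speaker, hearer = random.sample(agents, 2)
        guessing_game(speaker, hearer, context)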
The distinction between the evolution of musical forms through the playing of language games and the model presented here is that, while the model of the evolution of tunes distributes domain knowledge of acceptable forms across the individuals in the associated field, it does not support the co-evolution of a set of symbolic descriptions. The evolution of language is distributed and self-organising; through the repeated playing of language games between pairs of agents, a shared lexicon of words and their associated meanings evolves in combination. Of particular interest, from the perspective of modelling domain-specific languages, are the ambiguities that arise in the languages evolved: a single word may have multiple meanings, and multiple words may have the same meaning. Anyone who has tried to communicate across disciplinary boundaries, no matter how similar they may appear at first, will likely have experienced something similar, e.g., familiar words having unfamiliar meanings. But the resolution of the tensions created when individuals from different fields communicate has the potential for creative output, as the meanings of words are negotiated (Gemeinboeck and Dong 2006). By extending previous models of social creativity with the capacity to negotiate a commonly understood lexicon, the model presented here distributes the domain of recognised works with associated descriptions across its associated field, or fields, with each agent holding a subset of all known works and descriptions within its internal model of the domain. Such a model opens up new opportunities for simulating social institutions, e.g., education, and for studying the effects on domains when fields come into contact through the interaction of individuals.

The Computational Model

Individuals are modelled as curious design agents (Saunders 2002): each agent is capable of generating new works and assessing their novelty. If a generated work is appropriately novel, the agent produces an utterance and uses this to communicate the work and its description to another agent. To assess the novelty of new works and determine an appropriate utterance for them, each agent maintains an associative memory based on a Category Adaptive Resonance Theory (CART) network (Weenink 1997). This memory maintains vector prototypes for classes of work with an associated label, in this case the utterance. A threshold around each prototype defines a hyper-ellipsoid within which similar works will be associated with the same label. In the following experiments, individuals explore the design space of simple, coloured shapes of varying sizes, similar to the space of coloured shapes that Steels used in "Talking Heads" (Steels 1998). Unlike the "Talking Heads" experiment, however, shapes are not selected from a relatively small finite set, but rather are generated by the individual agents. The process of generation implemented for the following simulations is simple: an agent uses the prototypes of shapes that it has stored in its ontology to generate a variant. Generated shapes are perceived by the agents using a set of sensory channels similar to those used by Vogt (2003), i.e., the agents can sense the type (square, circle, etc.), size and colour hue of shapes. All sensory channels defined for the agents in the following simulations have been normalised to fall in the range [0..1], with types mapped to specific values within this range and size and colour taking continuous values.
The perceived novelty of a generated shape is assessed as the city-block distance from the closest known prototypes. If the novelty of the new shape falls into the preferred range for the agent, the shape may be used as the topic in a guessing game. The preferred range of novelty for an agent is defined by an internal model of preference based on the Wundt curve (Saunders 2002), where similar-but-different perceptual experiences are preferred. In the simulations that follow, the centroids for preferred novelty vary between 0.025 and 0.125, with a fixed range of 0.05 about the centroid, representing suitably small distances from known prototypes in the perceptual space. Each field in the following experiments consists of between 10 and 40 individuals. The communication policy between the individuals follows Steels (1995) and implements either the guessing game or a variant upon it, the education game. Through the interaction of members of a field, the development of domain-specific lexicons is modelled as a consequence of individuals generating and exchanging 'interesting' works with associated utterances. In the model, a domain is determined to have formed when a population of agents agrees upon a stable lexicon of words with agreed meanings for the associated works. In the experiments that follow, a stable lexicon is said to have formed when communicative success exceeds 80%.

Simulations and Results

This section describes the results of three simulations using the computational model. The experiments conducted so far with the extended model have focused on modelling the domain. The results below explore how domains are (1) formed under the influence of novelty-seeking behaviour; (2) combined through the interaction of individuals that are members of multiple fields; and (3) effectively maintained through the use of education.

Domain Formation

In the computational model presented here, the formation of a domain occurs when the members of a field agree upon a stable lexicon for describing a corpus of works. The model does not require a central repository of all knowledge; rather, the domain is distributed amongst the members of the field, such that no individual has a complete record of the domain. Consequently, small differences in the characteristics of individuals can have a large impact on the formation of a domain. Figure 2 illustrates how individual preference for novelty affects the size of the lexicon and ontology stored in the domain as a consequence of the field's actions. It shows the size of the active lexicon and ontology for a field of 10 individuals after playing a total of 10,000 language games. The preferred novelty reported is the mean of each individual's preferred novelty, with the range of preferred values ±0.025 either side of the mean.

[Figure 2: Domain growth (words and meanings in the active lexicon and ontology) as a consequence of individual preference for novelty.]

The results of these simulations show that, for this artificial creative system, increasing the preference for novelty used by individuals to select the topic of a language game has a modest effect on the size of the active lexicon compared to the increase in the size of the active ontology developed across the domain. In other words, the variety of meanings held by a field for a single word increases significantly as a consequence of individuals searching for novel topics.
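The novelty assessment and Wundt-style preference described at the start of this section can be sketched as follows (a minimal illustration under the parameter values quoted above; the feature vectors and the particular centroid chosen are assumptions for the example):

    def city_block(a, b):
        """City-block (Manhattan) distance between two feature vectors."""
        return sum(abs(x - y) for x, y in zip(a, b))

    def novelty(work, prototypes):
        """Novelty as distance to the closest known prototype."""
        return min(city_block(work, p) for p in prototypes)

    def is_interesting(work, prototypes, centroid=0.075, half_range=0.025):
        """Wundt-style preference: only similar-but-different works, i.e.
        those whose novelty falls within a narrow preferred band, are
        worth communicating (centroid 0.025-0.125, band width 0.05)."""
        n = novelty(work, prototypes)
        return centroid - half_range <= n <= centroid + half_range

    # Features: [type, size, colour hue], each normalised to [0, 1].
    prototypes = [[0.0, 0.4, 0.2], [0.5, 0.6, 0.8]]
    print(is_interesting([0.0, 0.45, 0.22], prototypes))  # novelty 0.07 -> True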
Domain Interactions

Simulations based on the evolution of language are open: agents can be added or removed at any time. Agents that are added to a system can adapt to the lexicon in use. We use this capacity to develop models of the interaction between domains as individuals migrate between their associated fields. This type of movement allows individuals both to adapt to the lexicons used in different domains and to affect the development of language, as agents transport meanings and words from one domain to another. Figure 3 shows the results of simulating the interaction of two domains through the communication of individuals taken from two distinct fields. The results have been averaged over 10 simulation runs with 20 individuals in the combined field. The degree of overlap between fields reported is the minority percentage of the new field, i.e., where the degree of overlap is reported as 20%, this means that 20% (4) of the agents have been taken from one field and 80% (16 agents) have been taken from the other field.

[Figure 3: The number of games for a domain to re-form, i.e., reach a communication success rate of 80%, as a function of the percentage of overlap between two existing fields.]

The results from the simulations shown in Figure 3 indicate that, in this artificial creative system, the time taken for the contents of two pre-existing domains to be combined reduces as the number of individuals combined from each existing field approaches 50%. The degree of disruption caused by a small minority of agents is perhaps surprising, but it can be easily understood: with few opportunities for interaction with agents from the minority percentage, many more language games are required for the combined field to reach agreement.

Education

Education, whether through self-study or a more formal education process, plays an essential role in an individual's mastery of a domain: it is only by learning the history of valued works and the language used to describe them that an individual can hope to contribute something new and describe it in such a way as to have it accepted by the 'gatekeepers' of a domain (Csikszentmihalyi 1988). The guessing game can be used to model informal education through exposure to domain knowledge via interactions with members of a field. To model formal education within institutional frameworks, a modified form of the guessing game, called the education game, can be played, in which the initiator of a game is assumed to be an expert in the domain. In this modified version the initiator takes on the role of teacher, and selects topics for the game with which it has high confidence based on previous communicative success. The recipient agent takes on the role of student and must choose the object in the shared context that best matches the teacher's utterance. When the identity of the topic is revealed, only the student updates its mappings between words and meanings. To test the efficacy of the education game versus the guessing game at initiating new individuals into a field, a series of simulations was performed to compare how quickly new individuals achieved a communicative success rate of 80% with instructors drawn from a pre-existing field using a stable domain.
In each run of the simulation, a population of 10 individuals engages in language games with an existing field, also containing 10 individuals, where the initiator (teacher) is always chosen from the pre-existing field and the recipient (student) is always chosen from the population of introduced individuals. Figure 4 compares the communicative success for simulations using guessing games and education games.

[Figure 4: A comparison of communicative success rates of guessing games versus education games with the introduction of individuals to an existing domain.]

The results suggest that, in these simulations, the use of the model of formal education significantly decreases the number of language games required for an individual to be able to communicate effectively with a field. To reach a communicative success rate of 80% between the existing field of initiators (teachers) and the recipients (students), the number of games required is reduced by about 40%, i.e., from an average of 8,920 to 5,143 language games.

Discussion

There is no doubt that computational modelling will continue to focus on developing analogs for creative cognition and individual creative behaviour. After all, the promise of developing computer programs able to solve problems in ways that are obviously "creative" is so tantalising that we cannot help ourselves. What this paper seeks to accomplish, however, is to show that the potential exists for developing computational models that capture how creativity works within a cultural environment. The model presented here represents a first attempt to implement a computational model of creative activity within a field that involves language. There are different kinds of creative individual (Policastro and Gardner 1999), and each kind may take part in different types of language games as they interact. For example, Saunders and Grace (2008) proposed the use of the generation game, where a speaking agent takes on the role of the client and multiple listener agents take on the role of designers attempting to satisfy the design brief encapsulated in the client's utterance. Unlike Steels' guessing game and the education game presented here, there may be many possible designs that satisfy a single design requirement. This opens up the possibility of judging success or failure on more than just the ability of a design to satisfy a set of required features: there may be an implicit requirement for all designs to be 'interesting', according to some function of interest that does not contradict the intended meaning of words within a lexicon. The generation game highlights the role that clients often play in the creative process.

The computational model presented here advances the modelling of artificial creative systems by introducing a way for domain-specific languages to develop from the interactions of individuals within a creative system. The simulations have shown that it is possible to integrate language games with models of individual and social creativity without undermining the grounding of words for describing works within an evolving language. Future work in this area will need to incorporate similar mechanisms for the evolution of languages to describe processes, policies and rules. The artificial creative system supporting the evolution of language that has been presented in this paper is limited in a number of ways that will need to be addressed.
In particular, the language implemented here is holistic, i.e., the words evolved cannot be decomposed into components that describe properties of the shape. To address this limitation, a computational model that supports the evolution of compositional languages, similar to that described by Vogt (2003), is being investigated. Computational models of the evolution of compositional languages support the emergence of words that play particular roles in a linguistic construct, e.g., adjective, noun, etc. The meaning of utterances is then formed by the composition of words. The use of compositional languages in computational models of cultural creativity opens up new and interesting possibilities for modelling the role that language plays in the creative process; e.g., using a compositional language, it is possible for an agent to form a sentence such that all of the words have familiar and agreed-upon meanings, but the combination of words is novel. This has implications for the modelling of creative processes; the ability to produce and evaluate novel descriptions as hypothesised experiences opens up the possibility of modelling grounded forms of specific curiosity (Berlyne 1960).

2011_9 !2011

Understanding Human Creativity for Computational Play

Alexander E. Zook and Mark O. Riedl, School of Interactive Computing, Georgia Institute of Technology, {a.zook, riedl}@gatech.edu; Brian S. Magerko, School of Literature, Communication and Culture, Georgia Institute of Technology, magerko@gatech.edu

Abstract

Play is a creative activity involving the construction, use, and modification of game frameworks. Developing computational agents capable of play with humans requires a formal categorization of the key aspects of play. We propose a theoretical framework to differentiate the knowledge, actions, and intentions employed by play agents. Play knowledge may be pre-conventional (lacking formal rules), conventional (composed of domain-specific rules), or post-conventional (including both domain-specific and out-of-domain rules). Actions may exploit, explore, generate, or modify play knowledge to create play experiences. These experiences may be pursued with ego-centric (self-oriented) or exo-centric (other-oriented) intentions. We illustrate this framework with examples from research on play and relate it to existing creativity models.

Introduction

Play is a fundamental human activity with a central role in creativity and human interactions (Caillois 2001; Huizinga 2003). Fields including social robotics and virtual agents have begun to address the ways computational agents can interact with humans and integrate into our everyday lives. However, little work has been done on how these agents can intentionally engage in play. The field of game AI has focused on building agents that can realistically (or optimally) play digital games with humans or other agents, but these agents have no formal concept of what it means to "play" or "be playful." For instance, game AI approaches have yet to explore how agents can be engaged within a social context to co-create a game together, as children do. Programming agents and robots with concepts of play within a social context can improve their capacities to relate to humans, increase their social acceptance, encourage human companionship and interest, and stimulate human creativity, learning, and motivation. Playing with humans requires the capacity to construct, inhabit, and modify an open-ended make-believe world as humans do.
This article discusses a theoretical framework for computational play, categorizing types of human play as the first steps toward developing computational agents that play with humans. Play spans a wide range of activities, most focally games. Games are systems of rules that define a set of game configurations, legal moves between these configurations, and winning and losing conditions (Salen and Zimmerman 2003). We can distinguish agents - both human and machine - through how they make use of the rules of a game. This article discusses six categories of players along two axes: one of knowledge, the other of intent. Players may possess knowledge that is pre-conventional (i.e., lacking fixed predetermined rules), conventional (i.e., confined by a prearranged rule set), or post-conventional (i.e., using prearranged rules along with outside rules). Differentiating the knowledge employed by creative agents enables detailed investigation of the ways a creative domain is constructed and modified during interaction. Playing a game is the process of choosing specific actions during a game to achieve a total trajectory through the space of game states. These actions operate on the player's knowledge and therefore may not be restricted by the known rules of a game. As such, an agent can potentially act outside the space of the rule-based play knowledge. At any given time an agent may attempt to exploit its knowledge to act according to game rules, explore the game state or rules when they are ambiguous, generate new game states or rules, or modify existing game states or rules. These different actions are used both to construct a play experience over a series of actions and to potentially alter the game itself. Agents may thus be creative in how they act within the space (exploitation), how they interact with the space and other players (exploration), or how they manipulate the space itself (generation and modification). We contend that agents, human or computer, differ in their intentions when acting within this space of play experiences, depending on the social purposes of their actions. Ego-centric agents attempt to maximize their own reward when playing the game. Conversely, exo-centric agents seek to optimize the experience of all participants in the game. These intentions direct the creative process towards particular outcomes and provide an agent with an orientation to how it acts. In both cases other players form a key context to the creation of a play experience. We define play as the combination of the knowledge of the rules of a game or play activity, the processes for using that knowledge to enact an experience, and the goals guiding this process. In the next section we will discuss Boden and Wiggins's work on creativity. We will then elaborate the knowledge, action, and intention components of our framework and support our definitions with examples of play both from humans and from existing computational frameworks. Finally, we will compare our framework to Boden and Wiggins's frameworks for creativity, contextualizing play activities within the domain of research on creativity. We conclude with research directions for developing computational systems capable of play with humans.

Previous models of creativity

Boden

Boden (2009) proposed a general framework for understanding creativity involving the production of ideas that are both new and valuable. She subdivides methods for producing creative ideas into combinational, exploratory, and transformational modes.
Combinational creativity is defined as the unfamiliar combination of familiar ideas, performed by associating ideas that were previously only indirectly linked. This process is guided by associative knowledge rules, connecting and transferring ideas between domains. Exploratory creativity involves moving through a conceptual space, where the space is defined by a culturally accepted style of thinking. Generative rules define means to produce concepts that fit the defined style. Finally, transformational creativity is the alteration of a conceptual space through modification of the rules defining that space (Boden 2009, 24-25).

Wiggins

Wiggins (2006; 2001) provides a computational formalization of Boden's framework as a search process. He defines several sets of rules employed by the searching agent, including:

• U - the universe of possible complete or partial artifacts
• R - the rules that define an artifact of interest
• T - the rules that define methods to traverse the space of artifacts defined by R
• E - the rules that define how to attribute value to an artifact

R is external to an agent and agreed upon among a group of agents, while T captures the individual agent's methods for searching within this space (Wiggins 2003). T generates artifacts that may or may not fall within the space defined by R, and E may place value on concepts outside of the space defined by R. This means the conceptual spaces covered by R, T, and E are not necessarily co-extensive. This mismatch enables creativity through reaching novel artifacts valued by E but not within R. In response, creative systems may engage in either R-transformation or T-transformation, changing the set of rules used to represent or construct artifacts, respectively. As a computational formalism, Wiggins's work articulates the difference between the rules used to define an artifact, the rules used to generate an artifact in the creative process, and the rules used to evaluate an artifact. Both Boden's model and Wiggins's formalization, however, focus on creativity at the level of individual, isolated acts of creation. Other agents only act to define the rules for a conceptual space before a given agent acts to create artifacts within that space. This model leaves open questions for building a play agent regarding the means of co-creation among groups of interacting agents. In play the rules of a game are continually negotiated by groups of agents, while their choice of play actions dynamically responds to those of their playmates. Addressing these processes will require an understanding of the knowledge, processes, and intentions involved in play activities.

Play Framework

We model play agents as possessing knowledge of play, the capacity to engage in play actions, and a set of intentions regarding the play experience. An agent reasons about its game rule knowledge to select play actions toward particular play intentions. Play knowledge describes the defined rules of a particular play activity, delimiting a space for taking play actions. Play actions are the ways an agent may use its knowledge during a play activity, potentially acting outside the domain of game rules. A total sequence of play actions made by all players from the beginning to the end of a game constitutes a play trajectory or experience. Agents evaluate potential trajectories according to their play intentions. We categorize play according to the kinds of knowledge employed, types of actions used, and intentions guiding agent actions.
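As a minimal illustration of the framework developed in the following subsections - the type names below are ours, not the authors' - the three components can be written down as plain enumerations, with the six play style categories of Table 1 arising as the intention-by-knowledge product:

    from enum import Enum
    from itertools import product

    class Knowledge(Enum):
        PRE_CONVENTIONAL = "pre-conventional"    # no fixed rules; outside knowledge
        CONVENTIONAL = "conventional"            # confined to the game's rule set
        POST_CONVENTIONAL = "post-conventional"  # game rules plus outside rules

    class Action(Enum):
        EXPLOIT = "exploitation"   # act according to known states and rules
        EXPLORE = "exploration"    # elicit information about ambiguous rules
        GENERATE = "generation"    # declare a new game state or rule
        MODIFY = "modification"    # alter an existing game state or rule

    class Intention(Enum):
        EGO_CENTRIC = "ego-centric"  # optimize one's own play experience
        EXO_CENTRIC = "exo-centric"  # optimize the experience of all players

    # The six play style categories are the intention x knowledge product;
    # actions are the moves available within any category.
    PLAY_STYLES = list(product(Intention, Knowledge))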
Knowledge

Play knowledge formally consists of the set of rules that define the legal states within a game and the transitions between these states. Agents vary in the types of knowledge they possess, being pre-conventional, conventional, or post-conventional. Pre-conventional players lack knowledge of the game rules, but possess outside sources of knowledge (e.g., rules of other games, social norms regarding turn-taking among peers). This typically occurs in the context of a game not yet formally defined from the viewpoint of the play agents, but co-created by the agents through their activity. Conventional players have fully defined rules for an activity and are restricted to the use of game-specific rules. Post-conventional players employ both the rules of the game and outside sources of rules, enabling modification of the game rules. Young children at play exemplify pre-conventional play, where rules are freely created. Professional sports exhibit prototypical conventional play, where all activity is constrained to obey the set rules of the game. Post-conventional play occurs when players modify games to follow house rules, altering the rules of an activity using outside rule sources and norms. These models of knowledge capture differences in the formal game structures that players employ: spontaneous rules and shared outside knowledge, the defining rules of an activity, or both sets of rules. In addition, different knowledge types vary in their flexibility: pre-conventional players construct a game, conventional players bar modification of game-specific knowledge, and post-conventional players use game-based knowledge while having the capacity to modify it. Playing in any of these ways involves differences in what structures are defined and how they may be modified. Computational agents that implement these knowledge structures will require flexible schemas capable of adding, removing, and modifying potential game states and transitions among these states. The computational model will need to capture the semantics of game rules, incorporating the relations of game states and rules to one another and the actions available to agents. As pre-conventional and post-conventional players both draw from beyond the knowledge of a specific activity, these computational agents will require the capacity to relate game knowledge to other outside frameworks.

Actions

Play actions differentiate how agents may use, learn, build, or modify the structure of play. Play agents reason about a given game state and the agent's game knowledge to decide how to act. Play actions can be divided into four primary categories:

• Exploitation - acting according to known game state and rules
• Exploration - eliciting information from outside (e.g., from other players or the game environment) regarding the game rules
• Generation - declaring a new game state or rule
• Modification - declaring a modification to the current game rules

Exploitation is illustrated by chess players who capture pieces according to their prescribed movements. Exploration is employed in pre-conventional or post-conventional knowledge contexts when ambiguity exists regarding the structure of the activity. During charades players may ask for a repetition or clarification of an action. Generation is used in pre-conventional and post-conventional knowledge situations to add to game structure and knowledge. Modification occurs principally in post-conventional knowledge contexts when existing game structure is altered.
Gottman and Graziano (1983) found that children playing together employ this set of actions during friendship formation. Children initially exchange information and establish a common-ground activity (exploration) before escalating to play activities (exploitation). During play, conflicts may arise, which are resolved through conflict resolution processes (including generation and modification) and message clarification (exploration). A sequence of play actions made by all players defines a play trajectory or play experience. While play knowledge defines the space for play, play trajectories represent the actualized experiences. Employing play actions enables the construction of play experiences. Actions are capable of manipulating both the play knowledge (via exploration, generation, and modification) and the play experience (via exploitation). Conventional play restricts the actions available to exploitation, remaining strictly within known states. Pre-conventional play employs exploratory actions for clarification of poorly-known states and generation to construct an activity from existing knowledge. Post-conventional play enables the alteration of game states and rules through exploration, generation, and modification. It is important to note that these actions need not be arranged into discrete phases of constructing a game and playing a game, but may be interwoven during any play experience. Interacting with other players requires interpretation of the play actions those players take. Ambiguity in these interactions involves interpreting player actions in the context of the play activity and identifying the specific type of action performed. Research on play has identified the role of meta-communication in mediating between player actions and game meanings. Meta-communication is a form of communication where a message has different meanings with respect to different levels of an activity (Bateson 1972). In play, an action has both a real-world meaning and a meaning with respect to a particular play activity. Exploration involves clarification of this relationship, while exploitation involves conveying action information to others. Generation and modification both involve constructing new meta-communicative relationships, either ignoring the existing game framework or working with respect to it. Computational play agents will need the capacity to engage in this set of play actions. In playing within a game, this entails reasoning about existing game states to identify missing information regarding game rules. Being able to engage in pre-conventional or post-conventional play will additionally require methods to generate new and relevant game states and rules, using outside sources of knowledge and existing known rules. All of these actions implicitly require the capacity for meta-communication, involving interpretation of ambiguous actions in the context of a play activity.

Intentions

Play experience intentions guide the selection of actions toward the construction of a play trajectory. These intentions capture the combination of features of a play trajectory that an agent values. Play intentions capture the focus of the set of goals employed by an agent - itself or others. Ego-centric players evaluate trajectories with respect to desired personal play experiences, while exo-centric players aim for group experiences. Competitive sports professionals exemplify an ego-centric approach, where all actions during a play activity are chosen for the ultimate goal of winning.
In contrast, a parent who intentionally makes bad moves when playing with a child exhibits an exo-centric play style, hoping to create an interesting experience for both themselves and their child. Ego-centric and exo-centric styles highlight differences in the goals players work towards during a play experience, embedding the role of social interactions into the means for enacting play. Creating a computational play agent will require the definition of the relevant features of a play experience as well as means to evaluate any given experience with respect to these criteria. Evaluation cannot consider only the end state of a play experience, but must incorporate the full trajectory of actions. This involves accounting for the experience of both the given agent and any other playmates. When selecting actions, agents will need to calculate the impact of any choice over the remainder of a play activity with respect to all participants. This entails weighing different goals and assessing their value at different points along a total experiential trajectory. Evaluation also involves reasoning about the relationship of particular rules to the set of available trajectories in order to generate or modify game structure towards particular intentions.

Play Style Categories

The knowledge and intention axes of play styles we propose intersect to define six play style categories: ego-pre-conventional, exo-pre-conventional, ego-conventional, exo-conventional, ego-post-conventional, and exo-post-conventional. We support these categories with examples of human play activities and computational models that illustrate these distinctions.

Pre-conventional play

Ego-pre-conventional play involves the construction of games that support individual interests through the use of outside knowledge. Solo play with toys is a typical example of children constructing a game for personal enjoyment (Sutton-Smith 2001). Constructive play with blocks and objects often involves the invention of structured meanings applied to game states, where activities are oriented towards personal satisfaction. Exo-pre-conventional play involves constructing activities towards the enjoyment of a group of players using outside knowledge. Children's group pretend play demonstrates this category, where a play structure emerges to support all players through iterative negotiations (Sawyer 1997; Sawyer 2002; Eckler and Weininger 1989). Caillois (2001) describes the unstructured play of taking up social roles to construct an imaginary world. In both of these cases players create the structure of the activity spontaneously, without prior agreed-upon rules. Meckley (1994) describes the establishment of a play society among 12 three- to four-year-old nursery school children over a period of five months. Here, the games and play activities gradually became ritualized from various actions the children engaged in, with particular groups coming to emphasize different norms of conduct. For example, a group of girls developed a game of playing house, where group enjoyment norms included methods to cope with disruptive intrusions from boys not part of the game.

Conventional play

Ego-conventional play involves acting solely within game rules to achieve personal play experiences (including, but not limited to, victory). Ego-conventional play is exhibited by most players in tournaments, where all actions aim toward personal victory.
Game artificial intelligence (AI), in particular adversarial search techniques, exemplifies this approach, as the agent evaluates actions in the service of optimizing its personal score. Exo-conventional play is the adherence to game rules while pursuing play trajectories that optimize certain group experiences. Exo-conventional play occurs when players intentionally make poor-quality moves, seeking to keep the game interesting for themselves and others by evening the odds of winning. Beaudry (2010) presents an application of Markov decision theory to a Snakes and Ladders-like game, where a computational agent generates plans to avoid creating too large a gap between its score and that of an opponent. Roberts, Riedl, and Isbell (2009) similarly argue for the application of narrative storytelling techniques to AI systems in an effort to produce interesting trajectories of actions during a game, rather than optimal end states for a game-playing agent. Social goals can vary in their guiding goal states and criteria. Caillois and Kohlberg both recognize two common exo-centric intentions employed by humans (Caillois 2001; Kohlberg 1987). Caillois's competition (agôn) play category and Kohlberg's reciprocity justice operation both emphasize merit-based rewards for players. Caillois's chance (alea) play category and Kohlberg's equality justice operation instead emphasize fairness in games and evenness of chances. Merit and fairness are two criteria agents may use to evaluate experiences, such as Beaudry (2010) above emphasizing fairness.

Table 1. Six categories of play.
• Ego-centric / Pre-conventional: generation of new game states and rules to structure a play experience towards personal reward
• Ego-centric / Conventional: adherence to game rules in pursuit of personal reward
• Ego-centric / Post-conventional: modification of game states and rules toward personal reward
• Exo-centric / Pre-conventional: generation of new game states and rules to structure a play experience towards group experience
• Exo-centric / Conventional: adherence to game rules in pursuit of group experience
• Exo-centric / Post-conventional: modification of game states and rules toward group experiences

Post-conventional play

Ego-post-conventional play is the modification of game rules to maximize personal reward in play. Ego-post-conventional players are exemplified by cheaters, who violate game rules for personal gains in the game. In video games this can include becoming invincible, skipping ahead of sequences that are difficult or tedious, or gaining powers to fly or move through obstacles to explore the game world more fully without its normal constraints. Other examples include stacking the deck in poker for monetary gain or covertly taking money from the bank in Monopoly™. In commercial video games, AI systems are often allowed to cheat through unfair advantages in resources or having access to information not available to human players (e.g., ignoring the "fog of war" in strategy games). The procedural particle generation system employed by Galactic Arms Race (Hastings, Guha, and Stanley 2009) is an example of an ego-post-conventional system that modifies the game weapon mechanics to match player play style preferences. Exo-post-conventional play is the modification of game rules toward desired group play experiences. Examples of this play include human players imposing handicaps to ensure more even chances among players of varying skill (fairness) or using house rules to modify a game towards particular group interests. Young children most commonly resolve conflicts among players by adding rules to a game (Kolominskii and Zhiznevskii 1992).
In pretend play, children often draw from a cultural narrative to ground their play activity, subsequently modifying the narrative to suit their particular play interests and desires (Sawyer 1997; Sawyer 2002). Meckley (see above) found that children gradually increased the complexity and diversity of the games they played. As exo-post-conventional play, these children demonstrated manipulation of game rules as the group sought different sorts of play experiences.

Discussion

Play activities can be understood through the lenses of the creative frameworks set forth by Boden and Wiggins. Comparing our framework to these models, we map conceptual spaces to the game knowledge (i.e., defined game rules) and creative artifacts to the play experience trajectories. This captures the distinction between a generative space (game rules) and specific instances within that space (play experiences). Boden's model subdivides creative activities into combinational, exploratory, and transformational types. Pre-conventional play can be seen as a form of combinational creativity, where players construct a game space by combining elements from other domains. Conventional play matches exploratory creativity, where social conventions (game rules) define the space used by individual agents. In conventional play, agents adhere to game rules in the process of exploring the space of play trajectories defined by these rules. As rules are generative, their implications are not necessarily known in advance. Post-conventional play maps onto transformational creativity, where the guiding rules of a game are modified. Boden notes that transformation requires making new creations possible and involves interaction with the outside world. Post-conventional play meets these requirements by altering the game rule space to enable play trajectories not previously possible and by using outside knowledge in the creation or modification of game rules. With respect to Boden's framework, types of play are avenues for the interactive creation of game rules and play experiences. Wiggins specifies both R-transformation and T-transformation as modes of creativity. R maps onto the game rules, T maps onto the play actions employed, and E maps onto the play intentions pursued for a given game. As in Wiggins's model, actions may result in experiences not possible within the bounds of the game rules. R-transformation may be involved in pre-conventional or post-conventional play activities. In pre-conventional play, agents form a set of rules for a game from a null set of rules using outside knowledge. Post-conventional play modifies the existing rules of a game by drawing from rules beyond the set defined by R. T-transformation alters the actions employed by agents, involving specific types of exploitation, exploration, and creation. Agents may alter the set of actions they employ that obey the rules of an activity, potentially restraining themselves to a particular subset of legal actions (e.g., refraining from killing enemies in a shooting game). Transforming exploration rules involves seeking different types of information from the environment, examining the bounds of rules. Altering the creative rules employed changes what aspects of the game rules may be modified and what kinds of rules may be proposed. Wiggins's framework brings to the fore two different levels of creativity in play: manipulating the rules of the game themselves and changing the experiences achieved when playing.
Thus, our framework provides a model of creative activity that incorporates the role of interaction and interdependence among agents into the creative process, by allowing agents to alter both the rules forming an activity and the means of playing it.

Conclusions

We propose a framework to classify play activities according to the play knowledge and intentions employed by play agents. Play knowledge may be pre-conventional, conventional, or post-conventional, where game rules are not previously defined, strictly obeyed, or subsumed within a larger set of rules, respectively. Play actions involve using, clarifying, adding to, or modifying existing game states and rules. Agents may pursue ego-centric goals in playing towards personal experiences or exo-centric goals in pursuing desired group experiences. Play actions give play agents the means to construct play experiences from their knowledge toward particular intentions. The intersection of play knowledge and intent defines six types of play. The knowledge, action, and intent division we draw maps onto similar distinctions made by Boden and Wiggins in describing creative systems, while extending their work towards creativity involving interaction with others. Future research will examine the interactions among players in different play categories. How do players categorize one another, and how does this impact their play styles? Computational play agents can leverage this knowledge both in co-creating games based on the play category of a user and in playing games to create interesting play experiences for a specific kind of player. Our categorization defines a space for future research towards computational agents capable of playing with other agents and humans to co-create particular activities and experiences. Computational formalizations of play knowledge will investigate what particular knowledge and knowledge structures agents require when involved in open-ended pre-conventional play. Computational models of play actions will explore how agents can reason about the relationship between game states, rules, and player experiences. When should an agent seek information about the game space? What processes and information are involved in adding rules to a game? How can an agent reason about existing rules to modify those rules? How can agents and players communicate about a play activity when engaged in unstructured play? We speculate that fuzzy schemas will be required to represent the ambiguity involved in meta-communication, modeling the distribution of potential game actions being performed by any given agent action. Computational implementations of play intentions will examine what features agents must account for during play experiences and how they can be employed in evaluations. How should agents evaluate a trajectory of play actions? What goals should be used, and how do they fall along the ego-centric/exo-centric axis? How can they address the open-ended nature of potential play experiences, where the set of possible game states and rules does not remain fixed? Researching these questions will enable agents that can creatively interact with humans.

2012_1 !2012

From Conceptual "Mash-ups" to "Bad-ass" Blends: A Robust Computational Model of Conceptual Blending

Tony Veale, School of Computer Science and Informatics, University College Dublin, Belfield D4, Ireland. Tony.Veale@UCD.ie

Abstract

Conceptual blending is a cognitive phenomenon whose instances range from the humdrum to the pyrotechnical.
Most remarkable of all is the ease with which humans regularly understand and produce complex blends. While this facility will doubtless elude our best efforts at computational modeling for some time to come, there are practical forms of conceptual blending that are amenable to computational exploitation right now. In this paper we introduce the notion of a conceptual mash-up, a robust form of blending that allows a computer to creatively re-use and extend its existing common-sense knowledge of a topic. We show also how a repository of such knowledge can be harvested automatically from the web, by targeting the casual questions that we pose to ourselves and to others every day. By acquiring its world knowledge from the questions of others, a computer can eventually learn to pose introspective (and creative) questions of its own.

The Plumbing of Creative Thought

We can think of comparisons as pipes that carry salient information from a source to a target concept. Some pipes are fatter than others, and thus convey more information: think of resonant metaphors or rich analogies that yield deeper meaning the more you look at them. By convention, pipes carry information in one direction only, from source to target. But creativity is no respecter of convention, and creative comparisons are sometimes a two-way affair. When the actor and writer Ethan Hawke was asked to write a profile of Kris Kristofferson for Rolling Stone magazine, Hawke had to create an imaginary star of his own to serve as an apt contemporary comparison. For Hawke, Brad Pitt is as meaningful a comparison as one can make, but even Pitt's star power is but a dim bulb to that of Kristofferson when he shone most brightly in the 1970s. To communicate just how impressive the singer-actor-activist would have seemed to an audience in 1979, Hawke assembled the following Frankenstein-monster from the body of Pitt and other assorted star parts: "Imagine if Brad Pitt had written a No. 1 single for Amy Winehouse, was considered among the finest songwriters of his generation, had been a Rhodes scholar, a U.S. Army Airborne Ranger, a boxer, a professional helicopter pilot - and was as politically outspoken as Sean Penn. That's what a motherfuckin' badass Kris Kristofferson was in 1979." Pitt comes off poorly in the comparison, but this is precisely the point: no contemporary star comes off well, because in Hawke's view, none has the wattage that Kristofferson had in 1979. The awkwardness of the comparison, and the fancifulness of the composite image, serve as a creative meta-description of Kristofferson's achievements. In effect Hawke is saying, "look to what lengths I must go to find a fair comparison for this man without peer". Notice how salient information flows in both directions in this comparison. To create a more rounded comparison, Hawke finds it necessary to mix in a few elements from other stars (such as Sean Penn), and to also burnish Pitt's résumé with elements borrowed from Kristofferson himself. Most of this additional structure is imported literally from the target, as when we are asked to imagine Pitt as a boxer or a helicopter pilot. Other structure is imported in the form of an analogy: while Kristofferson wrote songs for Janis Joplin, Pitt is imagined as a writer for her modern counterpart, Amy Winehouse. This Pitt 2.0 doesn't actually exist, of course. Hawke's description is a conceptual blend that constructs a whole new source concept in its own counterfactual space.
Blending is pervasive in modern culture, and can be seen in everything from cartoons to movies to popular fiction, while the elements of a blend can come from any domain of experience, from classic novels to 140-character tweets to individual words. As defined by the cognitive linguists Gilles Fauconnier and Mark Turner (1998, 2002), conceptual blending combines the smoothness of metaphor with the structural complexity and organizing power of analogy. We can think of blending as a cognitive operation in which conceptual ingredients do not flow in a single direction, but are thoroughly stirred together, to create a new structure with its own emergent meanings. The Kristofferson-as-Pitt blend shows just how complex a conceptual blend can be, while nonetheless remaining intelligible to a reader: when we interpret these constructs, we are not aware of any special challenge being posed, or of any special machinery being engaged. Nonetheless, this kind of blend poses significant problems for our computers and their current linguistic/cognitive-modelling abilities. In this paper we propose a computational middle-ground, called a conceptual mash-up, that captures some of the power and utility of a conceptual blend, but in a form that is practical and robust to implement on a computer. From this starting point we can begin to make progress toward the larger goal of creative computational systems that - to use Hawke's word - can formulate truly badass blends of their own. Creative language is a knowledge-hungry phenomenon. We need knowledge to create or comprehend an analogy, metaphor or blend, while these constructs allow us to bend and stretch our knowledge into new forms and niches. But computers cannot be creative with language unless they first have something that is worth saying creatively, for what use is a poetic voice if one has no opinions or beliefs of one's own that need to be expressed? This current work describes a re-usable resource - a combination of knowledge and of tools for using that knowledge - that can allow other computational systems to form their own novel hypotheses from mash-ups of common stereotypical beliefs. These hypotheses can be validated in a variety of ways, such as via web search, and then expressed in a concise and perhaps creative linguistic form, such as in a poem, metaphor or riddle. The resource, which is available as a public web service called Metaphor-Eyes, produces conceptual mash-ups for its input concepts, and returns the resulting knowledge structures in an XML format that can then be used by other computational systems in a modular, distributed fashion. The Metaphor-Eyes service is based on an approach to creative introspection first presented in Veale & Li (2011), in which stereotypical beliefs about everyday concepts are acquired from the web, and then blended on demand to create hypotheses about topics that the computer may know little about. We present the main aspects of Metaphor-Eyes in the following sections, and show how the service can be called by clients on the web. Our journey begins in the next section, with a brief overview of relevant computational work in the areas of metaphor and blending. It is our goal to avoid hand-crafted representations, so in the section after that we describe how the system can acquire its own common-sense knowledge from the web, by eavesdropping on the revealing questions that users pose every day to a search engine like Google.
This knowledge provides the basis for conceptual mash-ups, which are constructed by re-purposing web questions to form new introspective hypotheses about a topic. We also introduce the notion of a multi-source mash-up, which allows us to side-step the vexing problem of context and user-intent in the construction of conceptual blends. Finally, an empirical evaluation of these ideas is presented, and the paper concludes with thoughts on future directions.

Related Work and Ideas

We use metaphors and blends not just as rhetorical flourishes, but as a basis for extending our inferential powers into new domains (Barnden, 2006). Indeed, work on analogical metaphors shows how metaphor and analogy use knowledge to create knowledge. Gentner's (1983) Structure-Mapping Theory (SMT) argues that analogies allow us to impose structure on a poorly-understood domain, by mapping knowledge from one that is better understood. SME, the Structure-Mapping Engine (Falkenhainer et al., 1989), implements these ideas by identifying sub-graph isomorphisms between two mental representations. SME then projects connected substructures from the source to the target domain. SMT prizes analogies that are systematic, yet a key issue in any structural approach is how a computer can acquire structured representations for itself. Veale and O'Donoghue (2000) proposed an SMT-based model of conceptual blending that was perhaps the first computational model of the phenomenon. The model, called Sapper, addresses many of the problems faced by SME - such as deciding for itself which knowledge is relevant to a blend - but succumbs to others, such as the need for a hand-crafted knowledge base. Pereira (2007) presents an alternative computational model that combines SMT with other computational techniques, such as using genetic algorithms to search the space of possible blends. Pereira's model was applied both to linguistic problems (such as the interpretation of novel noun-noun compounds) and to visual problems, such as the generation of novel monsters/creatures for video games. Nonetheless, Pereira's approach was just as reliant on hand-crafted knowledge. To explore the computational uses of blending without such a reliance on specially-crafted knowledge, Veale (2006) showed how blending theory can be used to understand novel portmanteau words - or "formal" blends - such as "Feminazi" (Feminist + Nazi). This approach, called Zeitgeist, automatically harvested and interpreted portmanteau blends from Wikipedia, using only Wikipedia itself and WordNet (Fellbaum, 1998) as resources. The availability of large corpora and the Web suggests a means of relieving the knowledge bottleneck that afflicts computational models of metaphor, analogy and blending. Turney and Littman (2005) show how a statistical model of relational similarity can be constructed from web texts for handling proportional analogies of the kind used in SAT and GRE tests. No hand-coded or explicit knowledge is employed, yet Turney and Littman's system achieves an average human grade on a set of 376 SAT analogies (such as mercenary:soldier::?:?, where the best answer among four alternatives is hack:reporter). Almuhareb and Poesio (2004) describe how attributes and values can be harvested for word-concepts from the web, showing how these properties allow word-concepts to be clustered into category structures that replicate the semantic divisions made by a curated resource like WordNet (Fellbaum, 1998).
Veale and Hao (2007a,b) describe how stereotypical knowledge can be acquired from the web by harvesting similes of the form "as P as C" (as in "as smooth as silk"), and go on to show, in Veale (2012), how a body of 4,000 stereotypes is used in a web-based model of metaphor generation and comprehension. Shutova (2010) combines elements of several of these approaches. She annotates verbal metaphors in corpora (such as "to stir excitement", where the verb "stir" is used metaphorically) with the corresponding conceptual metaphors identified in Lakoff and Johnson (1980). Statistical clustering techniques are then used to generalize from the annotated exemplars, allowing the system to recognize other metaphors in the same vein (e.g. "he swallowed his anger"). These clusters can also be analyzed to identify literal paraphrases for a given metaphor (such as "to provoke excitement" or "suppress anger"). Shutova's approach is noteworthy for the way it operates with Lakoff and Johnson's inventory of conceptual metaphors without actually using an explicit knowledge representation. The questions people ask, and the web queries they pose, are an implicit source of common-sense knowledge. The challenge we face as computationalists lies in turning this implicit world knowledge into explicit representations. For instance, Pasca and Van Durme (2007) show how knowledge of classes and their attributes can be extracted from the queries that are processed and logged by web search engines. We show in this paper how a common-sense representation that is derived from web questions can be used in a model of conceptual blending. We focus on well-formed questions, found either in the query logs of a search engine or harvested from documents on the web. These questions can be viewed as atomic properties of their topics, but they can also be parsed to yield logical forms for reasoning. We show how, by representing topics via the questions that are asked about them, we can also grow our knowledge-base via blending, by posing these questions introspectively of other topics as well.

"Milking" Knowledge from the Web

Amid the ferment and noise of the Web sit nuggets of stereotypical world knowledge, in forms that can be automatically harvested. To acquire a property P for a topic T, one can look for explicit declarations of T's P-ness, but such declarations are rare, as speakers are loath to explicitly articulate truths that are tacitly assumed by listeners. Hearst (1992) observes that the best way to capture tacit truths in large corpora (or on the Web) is to look for stable linguistic constructions that presuppose the desired knowledge. So rather than look for "all Xs are Ys", which is logically direct but exceedingly rare, one can use Hearst patterns like "Xs and other Ys", which presuppose the same hypernymic relations. By mining presuppositions rather than declarations, a harvester can cut through the layers of noise and misdirection that are endemic to the Web. If W is a count noun denoting a topic TW, then the query "why do W+plural *" allows us to retrieve questions posed about TW on the Web, in this case via the Google API. (If W is a mass noun or a proper name, we instead use the query "why does W *".)
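A sketch of how these extraction queries might be assembled (the pluralization helper and the count-noun flag are hypothetical stand-ins; the retrieval step through the Google API is elided):

    def why_query(topic, is_count_noun, plural_of=lambda w: w + "s"):
        """Build the web query used to harvest questions about a topic.

        Count nouns take "why do <plural> *"; mass nouns and proper names
        take "why does <topic> *". The default pluralizer is naive and
        stands in for a real morphological lookup.
        """
        if is_count_noun:
            return f'"why do {plural_of(topic)} *"'
        return f'"why does {topic} *"'

    # why_query("pirate", True)   -> '"why do pirates *"'
    # why_query("Google", False)  -> '"why does Google *"'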
These two formulations show the benefits of using questions as extraction patterns: a query is framed by a WH-question word and a question mark, ensuring that a complete statement is retrieved (Google snippets often contain sentence fragments); and number agreement between "do"/"does" and W suggests that the question is syntactically well-formed (good grammar helps discriminate well-formed musings from random noise). Queries with the subject TW are dispatched whenever the system wishes to learn about a topic T. We ask the Google API to return 200 snippets per query, which are then parsed to extract well-formed questions and their logical forms. Questions that cannot be so parsed are rejected as being too complex for later re-use in conceptual blending. For instance, the topic pirate yields the query "why do pirates *", to retrieve snippets that include these questions:

Why do pirates wear eye patches?
Why do pirates hijack vessels?
Why do pirates have wooden legs?

Parsing the 2nd question above, we obtain its logical form:

∀x pirate(x) → ∃y (vessel(y) ∧ hijack(x, y))

A computational system needs a critical mass of such common-sense knowledge before it can be usefully applied to problems such as conceptual blending. Ideally, we could extract a large body of everyday musings from the query logs of a search engine like Google, since many users persist in using full NL questions as Web queries. Yet such logs are jealously guarded, not least over concerns about privacy. Nonetheless, engines like Google do expose the most common queries in the form of text completions: as one types a query into the search box, Google anticipates the user's query by matching it against past queries, and offers a variety of popular completions. In an approach we call Google milking, we coax completions from the Google search box for a long list of strings with the prefix "why do", such as "why do a" (which prompts "why do animals hibernate?"), and "why do aa" (which prompts "why do aa batteries leak?"). We use a manual trie-driven approach, using the input "why do X" to determine if any completions are available for a topic prefixed with X, before then drilling deeper with "why do Xa" … "why do Xz". Though laborious, this process taps into a veritable mother lode of nuggets of conventional wisdom. Two weeks of milking yields approx. 25,000 of the most common questions on the Web, for over 2,000 topics, providing critical mass for the processes to come.

Conceptual "Mash-ups"

Google milking yields these frequent questions about poets:

Why do poets repeat words?
Why do poets use metaphors?
Why do poets use alliteration?
Why do poets use rhyme?
Why do poets use repetition?
Why do poets write poetry?
Why do poets write about love?

Querying the web directly, the system finds other common presuppositions about poets, such as "why do poets die poor?" and "why do poets die young?", precisely the kind of knowledge that shapes our stereotypical view of poets yet which one is unlikely to find in a dictionary. Now suppose a user asks the system to explore the ramifications of the blend Philosophers are Poets: this prompts the system to introspectively ask "how are philosophers like poets?". This question spawns others, which are produced by replacing the subject of the poet-specific questions above, yielding new introspective questions such as "do philosophers write poetry?", "do philosophers use metaphors?", and "do philosophers write about love?".
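The re-purposing step itself amounts to subject substitution over the harvested questions. A minimal sketch, assuming questions are stored as plain strings of the form shown above:

    def repurpose(questions, source_plural, target_plural):
        """Turn stored "Why do <source> ...?" questions into introspective
        hypotheses about a target topic."""
        prefix = f"why do {source_plural} "
        hypotheses = []
        for q in questions:
            if q.lower().startswith(prefix):
                body = q[len(prefix):].rstrip("?")
                hypotheses.append(f"do {target_plural} {body}?")
        return hypotheses

    poet_questions = ["Why do poets use metaphors?",
                      "Why do poets write about love?"]
    print(repurpose(poet_questions, "poets", "philosophers"))
    # -> ['do philosophers use metaphors?', 'do philosophers write about love?']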
Each re-purposed question can be answered by again appealing to the web: the system simply looks for evidence that the hypothesis in question (such as "philosophers use metaphors") is used in one or more web texts. In this case, the Google API finds supporting documents for the following hypotheses: "philosophers die poor" (3 results), "philosophers die young" (6 results), "philosophers use metaphors" (156 results), and "philosophers write about love" (just 2 results). The goal is not to show that these behaviors are as salient for philosophers as they are for poets, but rather that they can be meaningful for philosophers. We refer to the construct Philosophers are Poets as a conceptual mash-up, since knowledge about a source, poet, has been mashed up with a given target, philosopher, to yield a new knowledge network for the latter. Conceptual mash-ups are a specific kind of conceptual blend, one that is easily constructed via simple computational processes. To generate a mash-up, the system starts from a given target T and searches for the source concepts S1 … Sn that might plausibly yield a meaningful blend. A locality assumption limits the scale of the search space for sources, by assuming that T must exhibit a pragmatic similarity to any vehicle Si. Budanitsky and Hirst (2006) describe a raft of term-similarity measures based on WordNet (Fellbaum, 1998), but what is needed for blending is a generative measure: one that can quantify the similarity of T to S as well as suggest a range of likely S's for any given topic T. We construct such a measure via corpus analysis, since a measure trained on corpora can easily be made corpus-specific and thus domain- or context-specific. The Google n-grams (Brants and Franz, 2006) provide a large collection of word sequences from Web texts. Looking to the 3-grams, we extract coordinations of generic nouns of the form "Xs and Ys". For each coordination, such as "tables and chairs" or "artists and scientists", X is considered a pragmatic (rather than semantic) neighbor of Y, and vice versa. When identifying blend sources for a topic T, we consider the neighbors of T as candidate sources for a blend. Furthermore, if we consider the neighbors of T to be features of T, then a vector space representation for topics can be constructed, such that the vector for a topic T contains all of the neighbors of T that are identified in the Google 3-grams. In turn, this vector representation allows us to calculate the similarity of a topic T to a source S, and rank the neighbors S1 … Sn of T by their similarity to T. Intuitively, writers use the pattern "Xs and Ys" to denote an ad-hoc category, so topics linked by this pattern are not just similar but truly comparable, or even interchangeable. Potential sources for T are ranked by their perceived similarity to T, as described above. Thus, when generating mash-ups for philosopher, the most highly ranked sources suggested via the Google 3-grams are: scholar, epistemologist, ethicist, moralist, naturalist, scientist, doctor, pundit, savant, explorer, intellectual and lover.

Multi-Source Mash-Ups

The problem of finding good sources for a topic T is highly under-constrained, and depends on the contextual goals of the speaker. However, when blending is used for knowledge acquisition, multi-source mash-ups allow us to blend a range of sources into a rich, context-free structure.
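Before detailing multi-source construction, the neighborhood-based source ranking just described can be sketched as follows; the count weighting and the cosine measure are our assumptions, as the text specifies only that neighbors harvested from "Xs and Ys" coordinations serve as vector features:

    import math
    from collections import defaultdict

    def neighbor_vectors(coordinations):
        """Build pragmatic-neighbor vectors from "Xs and Ys" coordinations,
        given as (X, Y) noun pairs, e.g. ("artist", "scientist")."""
        vec = defaultdict(lambda: defaultdict(int))
        for x, y in coordinations:
            vec[x][y] += 1  # X is a neighbor of Y, and vice versa
            vec[y][x] += 1
        return vec

    def cosine(u, v):
        dot = sum(s * v.get(k, 0) for k, s in u.items())
        norm = (math.sqrt(sum(s * s for s in u.values())) *
                math.sqrt(sum(s * s for s in v.values())))
        return dot / norm if norm else 0.0

    def ranked_sources(topic, vec):
        """Rank the topic's own neighbors as candidate blend sources."""
        return sorted(vec[topic],
                      key=lambda s: cosine(vec[topic], vec[s]),
                      reverse=True)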
If S1 … Sn are the n closest neighbors of T as ranked by similarity to T, then a mash-up can be constructed to describe the semantic potential of T by collating all of the questions from which the system derives its knowledge of S1 … Sn, and by re-purposing each for T. A complete mash-up collates questions from all the neighbors of a topic, while a 10-neighbor mash-up for philosopher, say, would collate all the questions possessed for scholar … explorer and then insert philosopher as the subject of each. In this way a conceptual picture of philosopher could be created, by drawing on beliefs such as naturalists tend to be pessimistic and humanists care about morality. A 20-neighbor mash-up for philosopher would also integrate the system's knowledge of politician into this picture, to suggest e.g. that philosophers lie, philosophers cheat, philosophers equivocate and even that philosophers have affairs and philosophers kiss babies. Each of these hypotheses can be put to the test in the form of a web query; thus, the hypotheses "philosophers lie" (586 Google hits), "philosophers cheat" (50 hits) and "philosophers equivocate" (11 hits) are each validated via Google, whereas "philosophers kiss babies" (0 hits) and "philosophers have affairs" (0 hits) are not. As one might expect, the most domain-general hypotheses show the greatest promise of taking root in a target domain. Thus, for example, "why do artists use Macs?" is more likely to be successfully re-purposed for the target of a blend than "why do artists use perspective drawing?". The generality of a question is related to the number of times it appears in our knowledge-base with different subjects. Thus, "why do ___ wear black" appears 21 times, while "why do ___ wear black hats" and "why do ___ wear white coats" each appear just twice. When a mash-up for a topic T is presented to the user, each imported question Q is ranked according to two criteria: Qcount, the number of neighbors of T that suggest Q; and Qsim, the similarity of T to its most similar neighbor that suggests Q (as calculated using a WordNet-based metric; see Seco et al., 2006). Both combine to give a single salience measure Qsalience in (1):

  Qsalience = Qsim × Qcount / (Qcount + 1)    (1)

Note that Qcount is always greater than 0, since each question Q must be suggested by at least one neighbor of T. Note also that salience is not a measure of surprise, but of aptness, so the larger Qcount, the larger Qsalience. It is time-consuming to test every question in a mash-up against web content, as a mash-up of m questions requires m web queries. It is more practical to choose a cut-off w and simply test the top w questions, as ranked by salience in (1). In the next section we evaluate the ranking of questions in a mash-up, and estimate the likelihood of successful knowledge transfer from one topic to another.

Empirical Evaluation

Our corpus-attested, neighborhood-based approach to similarity does not use WordNet, but is capable of replicating the same semantic divisions made by WordNet. In earlier work, Almuhareb and Poesio (2004) extracted features for concepts from text-patterns found on the web. These authors tested the efficacy of the extracted features by using them to cluster 214 words taken from 13 semantic categories in WordNet (henceforth, we denote this experimental setup as AP214), and report a cluster purity of 0.85 in replicating the category structures of WordNet.
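Cluster purity here is the standard measure: each induced cluster is credited with its best-represented gold category, and the credits are summed over all items. A minimal sketch:

    from collections import Counter

    def cluster_purity(clusters, gold):
        """clusters: list of lists of items; gold: item -> true category."""
        total = sum(len(c) for c in clusters)
        hits = sum(Counter(gold[item] for item in c).most_common(1)[0][1]
                   for c in clusters if c)
        return hits / total

    # cluster_purity([["week", "year"], ["spider", "virus", "gene"]],
    #                {"week": "time", "year": "time", "spider": "animal",
    #                 "virus": "organism", "gene": "organism"})
    # -> (2 + 2) / 5 = 0.8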
But if the neighbors of a term are instead used as features for that term, and if a term is also considered to be its own neighbor, then an even higher purity/accuracy of 0.934 is achieved on AP214. Using neighbors as features in this way requires a vector space of just 8,300 features for AP214, whereas Almuhareb and Poesio's original approach to AP214 used approx. 60,000 features. The locality assumption underlying this notion of a pragmatic neighborhood constrains the number of sources that can contribute to a multi-source mash-up. Knowledge of a source S can be transferred to a topic T only if S and T are neighbors, as identified via corpus analysis. Yet the Google 3-grams suggest a wealth of neighboring terms, so locality does not unduly hinder the transfer of knowledge. Consider a test-set of 10 common terms, artist, scientist, terrorist, computer, gene, virus, spider, vampire, athlete and camera, where knowledge harvested for each of these terms is transferred via mash-ups to all of their neighbors. For instance, "why do artists use Macs?" suggests "musicians use Macs" as a hypothesis because artists and musicians are close neighbors, semantically (in WordNet) and pragmatically (in the Google n-grams); this hypothesis is in turn validated by 5,700 web hits. In total, 410,000 hypotheses are generated from these 10 test terms, and when posed as web queries to validate their content, approx. 90,000 (21%) are validated by usage in web texts. Just as knowledge tends to cluster into pragmatic neighborhoods, hypotheses likewise tend to be validated in clusters. As shown in Figure 1, the probability that a hypothesis is valid for a topic T grows with the number of neighbors of T for which it is known to be valid (Qcount).

[Figure 1. Likelihood of a hypothesis in a mash-up being validated via web search (y-axis) for hypotheses that are suggested by Qcount neighbors (x-axis).]

Unsurprisingly, close neighbors with a high similarity to the topic exert a greater influence than more remote neighbors. Figure 2 shows that the probability of a hypothesis for a topic being validated by web usage grows with the number of the topic's neighbors that suggest it and its similarity to the closest of these neighbors (Qsalience). In absolute terms, hypotheses perceived to have high salience (e.g. > .6) are much less frequent than those with lower ratings. So a more revealing test is the ability of the system to rank the hypotheses in a mash-up so that the top-ranked hypotheses have the greatest likelihood of being validated on the web. That is, to avoid information overload, the system should be able to distinguish the most plausible hypotheses from the least plausible, just as search engines like Google are judged on their ability to push the most relevant hits to the top of their rankings.

[Figure 2. Likelihood of a hypothesis in a mash-up being validated via web search (y-axis) for hypotheses with a particular Qsalience measure (x-axis).]

Figure 3 shows the average rate of web validation for the top-ranked hypotheses (ranked by salience) of complete mash-ups generated for each of our 10 test terms from all of their neighbors. Since these are common terms, they have many neighbors that suggest many hypotheses. On average, 85% of the top 20 hypotheses in each mash-up are validated by web search as plausible, while just 1 in 4 of the top 60 hypotheses in a mash-up is not web-validated.

[Figure 3. Average % of top-n hypotheses in a mash-up (as ranked by Qsalience) that are validated by Web search.]
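The ranking that produces these curves follows directly from definition (1). A sketch, assuming a mash-up arrives as (question, suggesting-neighbor) pairs and that a similarity function from the topic to each neighbor is available:

    def rank_by_salience(suggestions, sim, cutoff=20):
        """Rank a mash-up's imported questions by Qsalience (formula (1)).

        suggestions: list of (question, neighbor) pairs; sim(neighbor) is
        the neighbor's similarity to the topic. Only the top `cutoff`
        questions would then be tested against the web.
        """
        by_question = {}
        for q, n in suggestions:
            by_question.setdefault(q, []).append(n)

        def salience(q):
            q_count = len(by_question[q])                 # Qcount
            q_sim = max(sim(n) for n in by_question[q])   # Qsim
            return q_sim * q_count / (q_count + 1.0)

        return sorted(by_question, key=salience, reverse=True)[:cutoff]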
Figures 1-3 show that the system is capable of extracting knowledge from the web which can be successfully transferred to neighboring terms via metaphors and mash-ups, and then meaningfully ranked by salience. But just how useful is this knowledge? To determine whether it is the kind of knowledge that is useful for categorization (and thus the kind that captures the perceived essence of a concept), we use it to replicate the AP214 categorization test of Almuhareb and Poesio (2004). Recall that AP214 tests the ability of a feature-set / representation to support the category distinctions imposed by WordNet, so that 214 words can be clustered back into the 13 WordNet categories from which they are taken. Thus, for each of these 214 words, we harvest questions from the web, and treat each question body as an atomic feature of its subject.

Figure 4. Performance on AP214 improves as knowledge is transferred from the n closest neighbors of a term.

Clustering over these features alone offers poor accuracy when reconstructing WordNet categories, yielding a cluster purity of just over 0.5. One AP214 category in particular, for time units like week and year, offers no traction to the question-based approach, and accuracy/purity increases to 0.6 when this category is excluded. People, it seems, rarely question the conceptual status of an abstract temporal unit. But as knowledge is gradually transferred to the terms in AP214 from their corpus-attested neighbors, so that each term is represented as a conceptual mash-up of its n nearest neighbors, categorization markedly improves. Figure 4 shows the increasing accuracy of the system on AP214 (excluding the vexing time category) when using mash-ups of increasing numbers of neighbors. Blends really do bolster our knowledge of a topic with insights that are relevant to categorization.

Conclusions: A Metaphor-Eye to the Future

We have shown here how common questions on the web can provide the world knowledge needed to drive a robust, if limited, form of blending called conceptual mash-ups. The ensuing powers of introspection, though basic, can be used to speculate upon the conceptual make-up of a given topic, not only in individual metaphors but in rich, informative mash-ups of multiple concepts. The web is central to this approach: not only are questions harvested from the web (e.g., via Google "milking"), but newly-formed hypotheses are validated by means of simple web queries. The approach is practical, robust and quantifiable, and uses an explicit knowledge representation that can be acquired on demand for a given topic. Most importantly, the approach makes a virtue of blending, and argues that we should view blending not as a problem of language but as a tool of creative thinking.

The ideas described here have been computationally realized in a web application called Metaphor-Eyes. Figure 5 overleaf provides a snapshot of the system in action. The user enters a query, in this case the provocative assertion "Google is a cult", and the system provides an interpretation based on a mash-up of its knowledge of the source (cults) and of the target (Google). Two kinds of knowledge are used to provide the interpretation of Figure 5. The first is common-sense knowledge of cults, of the kind that we expect most adults to possess.
This knowledge includes widely-held stereotypical beliefs such as that cults are led by gurus, that they worship gods and enforce beliefs, and that they recruit new members, especially celebrities, who often act as apologists for the cult. The system possesses no stereotypical beliefs about Google, but using the Google 2-grams (somewhat ironically, in this case), it can find linguistic evidence for the notions of a Google guru, a Google god and a Google apologist. The corresponding stereotypical beliefs about cults are then projected into the new blend space of Google-as-a-cult.

Metaphor-Eyes derives a certain robustness from its somewhat superficial treatment of blends as mash-ups. In essence, the system manipulates conceptual-level objects (ideas, blends) by using language-level objects (strings, phrases, collocations) as proxies: a combination at the concept-level is deemed to make sense if a corresponding combination at the language-level can be found in a corpus (or in the Google n-grams). As such, any creativity exhibited by the system is often facile or glib. Because the system looks for conceptual novelty in the veneer of surface language, it follows in the path of humour systems that attempt to generate interesting semantic phenomena by operating at the punning level of words and their sounds.

We have thus delivered on just one half of the promise of our title. While conceptual mash-ups are something a computer can handle with relative ease, "bad-ass" blends of the kind discussed in the introduction still lie far beyond our computational reach. Nonetheless, we believe the former provides a solid foundation for development of the tools and techniques that are needed to achieve the latter. Several areas of future research suggest themselves in this regard, and one that appears most promising at present is the use of mash-ups in the generation of poetry. The tight integration of surface-form and meaning that is expected in poetry means this is a domain in which a computer can serendipitously allow itself to be guided by the possibilities of word combination while simultaneously exploring the corresponding idea combinations at a deeper level. Indeed, the superficiality of mash-ups makes them ideally suited to the surface-driven exploration of deeper levels of meaning.

Metaphor-Eyes should thus be seen as a community resource through which the basic powers of creative introspection (as first described in Veale & Li, 2011) can be made available to a wide variety of third-party computational systems. In this regard, Metaphor-Eyes is a single instance of what will hopefully become an established trend in the maturing field of computational creativity: the commonplace sharing of resources and tools, perhaps as a distributed network of web-services, that will promote a wider cross-fertilization of ideas in our field. The integration of diverse services and components will in turn facilitate the construction of systems with an array of creative qualities. Only by pooling resources in this way can we hope to go beyond single-note systems and produce the impressive multi-note "bad-ass blends" of the title.

2012_10 !2012 Computational and Collective Creativity: Who's Being Creative? Mary Lou Maher University of Maryland mlmaher@umd.edu

Abstract

Creativity research has traditionally focused on human creativity, and even more specifically, on the psychology of individual creative people.
In contrast, computational creativity research involves the development and evaluation of creativity in a computational system. As we study the effect of scaling up from the creativity of a computational system and individual people to large numbers of diverse computational agents and people, we have a new perspective: creativity can be ascribed to a computational agent, an individual person, collectives of people and agents and/or their interaction. By asking "Who is being creative?" this paper examines the source of creativity in computational and collective creativity. A framework based on ideation and interaction provides a way of characterizing existing research in computational and collective creativity and identifying directions for future research.

Human and Computational Creativity

Creativity is a topic of philosophical and scientific study considering the scenarios and human characteristics that facilitate creativity as well as the properties of computational systems that exhibit creative behavior. "The four Ps of creativity", as introduced in Rhodes (1987) and more recently summarized by Runco (2011), decompose the complexity of creativity into separate but related influences:
• Person: characteristics of the individual,
• Product: an outcome focus on ideas,
• Press: the environmental and contextual factors,
• Process: cognitive process and thinking techniques.
While the four Ps are presented in the context of the psychology of human creativity, they can be modified for computational creativity if process includes a computational process. The study of human creativity has a focus on the characteristics and cognitive behavior of creative people and the environments in which creativity is facilitated. The study of computational creativity, while inspired by concepts of human creativity, is often expressed in the formal language of search spaces and algorithms.

Why do we ask who is being creative? Firstly, there is an increasing interest in understanding computational systems that can formalize or model creative processes and therefore exhibit creative behaviors or acts. Yet there are still skeptics who claim that computers aren't creative because the computer is just following instructions. Secondly, and in contrast, there is increasing interest in computational systems that encourage and enhance human creativity but that make no claims about whether the computer is being or could be creative. Finally, as we develop more capable socially intelligent computational systems and systems that enable collective intelligence among humans and computers, the boundary between human creativity and computer creativity blurs. As the boundary blurs, we need to develop ways of recognizing creativity that make no assumptions about whether the creative entity is a person, a computer, a potentially large group of people, or the collective intelligence of human and computational entities. This paper presents a framework that characterizes the source of creativity from two perspectives, ideation and interaction, as a guide to current and future research in computational and collective creativity.

Creativity: Process and Product

Understanding the nature of creativity as process and product is critical in computational creativity if we want to avoid any bias that only humans are creative and computers are not.
While process and product in creativity are tightly coupled in practice, a distinction between the two provides two ways of recognizing computational creativity: by describing the characteristics of a creative process and, separately, the characteristics of a creative product. Studying and describing the processes that generate creative products focuses on the cognitive behavior of a creative person or the properties of a computational system, while describing ways of recognizing a creative product focuses on the characteristics of the result of a creative process.

When describing creative processes there is an assumption that there is a space of possibilities. Boden (2003) refers to this as conceptual spaces and describes these spaces as structured styles of thought. In computational systems such a space is called a state space. How such spaces are changed, or the relationship between the set of known products, the space of possibilities, and the potentially creative product, is the basis for describing processes that can generate potentially creative artifacts.

There are many accounts of the processes for generating creative products. Two sources are described here: Boden (2003) from the philosophical and artificial intelligence perspective and Gero (2000) from the design science perspective. Boden (2003) describes three ways in which creative products can be generated: combination, exploration, and transformation. Each one describes the way in which the conceptual space of known products provides a basis for generating a creative product and how the conceptual space changes as a result of the creative artifact. Combination brings together two or more concepts in ways that have not occurred in existing products. Exploration finds concepts in parts of the space that have not been considered in existing products. Transformation modifies concepts in the space to generate products that change the boundaries of the space. Gero (2000) describes computational processes for creative design as combination, transformation, analogy, emergence, and first principles. Combination and transformation are similar to Boden's processes. Analogy transfers concepts from a source product that may be in a different conceptual space to a target product to generate a novel product in the target's space. Emergence is a process that finds new underlying structures in a concept that give rise to a new product, effectively a re-representation process. First principles as a process generates new products without relying on concepts as defined in existing products.

While these processes provide insight into the nature of creativity and provide a basis for computational creativity, they have little to say about how we recognize a creative product. As we move towards computational systems that enhance or contribute to human creativity, the articulation of process models for generating creative artifacts does not provide an evaluation of the product. Computational systems that generate creative products need evaluation criteria that are independent of the process by which the product was generated. There are also numerous approaches to defining characteristics of creative products as the basis for evaluating or assessing creativity. Boden (2003) claims that novelty and value are the essential criteria and that other aspects, such as surprise, are kinds of novelty or value.
Wiggins (2006) often uses value to indicate all valuable aspects of a creative product, yet provides definitions for novelty and value as different features that are relevant to creativity. Oman and Tumer (2009) combine novelty and quality to evaluate individual ideas in engineering design as a relative measure of creativity. Shah, Smith, and Vargas-Hernandez (2003) associate creative design with ideation and develop metrics for novelty, variety, quality, and quantity of ideas. Wiggins (2006) argues that surprise is a property of the receiver of a creative artifact, that is, it is an emotional response. Cropley and Cropley (2005) propose four broad properties of products that can be used to describe the level and kind of creativity they possess: effectiveness, novelty, elegance, genesis. Besemer and O'Quin (1987) describe a Creative Product Semantic Scale which defines the creativity of products in three dimensions: novelty (the product is original, surprising and germinal), resolution (the product is valuable, logical, useful, and understandable), and elaboration and synthesis (the product is organic, elegant, complex, and well-crafted). Horn and Salvendy (2006), after analyzing many properties of creative products, report that consumer perception of creativity rests on three critical perceptions: affect (our emotional response to the product), importance, and novelty. Goldenberg and Mazursky (2002) report on research that has found the observable characteristics of creativity in products to include "original, of value, novel, interesting, elegant, unique, surprising."

Amabile (1982) says it most clearly when she summarizes the social psychology literature on the assessment of creativity: while most definitions of creativity refer to novelty, appropriateness, and surprise, current creativity tests or assessment techniques are not closely linked to these criteria. She further argues that "There is no clear, explicit statement of the criteria that conceptually underlie the assessment procedures." In response to an inability to establish and define criteria for evaluating creativity that is acceptable to all domains, Amabile (1982, 1996) introduced a Consensual Assessment Technique (CAT) in which creativity is assessed by a group of judges that are knowledgeable of the field. Since then, several scales have been developed to guide human evaluators, for example, Besemer and O'Quin's (1999) Creative Product Semantic Scale, Reis and Renzulli's (1991) Student Product Assessment Form, and Cropley et al.'s (2011) Creative Solution Diagnosis Scale.

Maher (2010) presents an AI approach to evaluating the creativity of a product by measuring novelty, value and surprise, providing a formal model for evaluating creative products. Novelty is a measure of how different the product is from existing products and is measured as a distance from clusters of other products in a conceptual space, characterizing the artifact as similar but different. Value is a measure of how the creative product compares to other products in its class in utility, performance, or attractiveness. The measure of value uses clustering algorithms and distance measures operating on the value attributes of existing products. Surprise has to do with how we develop expectations for the next new idea. This is distinguished from novelty because it is based on tracking the progression of one or more attributes, and changing the expected next difference.
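As a hedged illustration of the distance-based novelty measure Maher (2010) describes (a sketch of the idea, not her exact formulation), a candidate product can be scored by its distance from the centroids of clusters of existing products in an attribute space:

    import numpy as np
    from sklearn.cluster import KMeans

    def novelty_score(existing, candidate, k=5):
        # existing: (n, d) array of attribute vectors for known products;
        # candidate: (d,) attribute vector for the product being evaluated.
        centroids = KMeans(n_clusters=k, n_init=10).fit(existing).cluster_centers_
        # Novelty = distance from the nearest cluster of existing products,
        # so a 'similar but different' product scores low but non-zero.
        return float(np.min(np.linalg.norm(centroids - candidate, axis=1)))

Value could be computed analogously over value attributes, and surprise by comparing the candidate against an expectation extrapolated from the progression of attributes over time.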
Computational creativity can be described by identifying the generative processes that are associated with being creative and how the process changes the conceptual space. Alternatively, computational creativity can be asserted when the product is recognized as creative, independently of the process. However, computational creativity is more complicated than a single process that generates a self-contained product, partly due to the different roles that people and computers play in computational creativity but also due to the recent phenomenon of scaling up participation to achieve collective human-computer creativity.

Collective Creativity

Collective creativity is associated with two or more people contributing to a creative process. Using the internet to develop and encourage creative communities has led to large-scale collective creativity. Some examples of such creative communities are Designcrowd.com, Quirky.com, 99Designs.com and OpeningDesign.com. Designcrowd and 99Designs are examples of websites that source creative work from a very large community of people who identify themselves as designers. Quirky crowdsources innovative product development, where the community works together with an in-house design team to design products from idea to market. OpeningDesign is a platform for architecture and urban planning, encouraging people from different backgrounds to participate in projects and providing a space for opinion polls and crowdsourcing jobs. These platforms rely on community participation, both amateur and professional, and their websites support community discussion and various amounts of involvement. They attract a range of contributions, from the casual observer who might be motivated to comment once or twice, to the active contributor who closely tracks progress, contributes new ideas, and responds often and with minimal delay. Maher, Paulini, and Murty (2010) show how the nature of the contributions and collaboration can be considered along a spectrum of approaches, ranging from collected intelligence to collective intelligence: DesignCrowd collects individual designs and is an example of collected intelligence, while Quirky is an example of collective intelligence in design because it encourages collaboration and voting.

Large-scale participation from individuals who may or may not have expertise in the class of products being designed or created can synthesize ideas that go beyond the capability of a single person or a more carefully constructed team. Page (2007) describes how diverse individuals bring different perspectives and heuristics to problem solving, and shows how that diversity can result in better solutions than those produced by a group of like-minded individuals. Hong and Page (2004) prove a theorem that "Diversity Trumps Ability". Page (2007) argues that diversity improves problem solving, even though our individual experiences in working with a diverse group may be associated with the difficulty of understanding other viewpoints and reaching consensus. Many of the successful examples of collective creativity encourage diversity but do not require that everyone understand others' perspectives or even necessarily reach consensus. A recent study of communication in Quirky.com shows how the crowd contributes to ideation and evaluation as part of a larger design process (Paulini, Maher, and Murty, 2011).
Their analysis shows that a design process that includes crowdsourcing shares processes of ideation and evaluation with individual and team design, and also includes a significant amount of social networking. Collective creativity is an emergent property of an online community, whereas team design is structured and managed intentionally to produce an innovative product.

Who is being creative?

Creativity can be the result of introducing a novel and surprising idea and developing that idea into a product that is valuable in the context of an existing conceptual class of products. A creative idea can originate by bringing a different perspective or set of heuristics (as described by Page (2007)) to a conceptual class or existing patterns of design. This diversity can be achieved through social, computational and collective creativity. The various processes described by Boden and Gero show how different algorithms or heuristics result in creative ideas. The various approaches to evaluating creativity show how creative ideas can be evaluated. The field of computational creativity now has the basis for developing and evaluating creative systems, and can benefit from characterizing individual contributions to the field. By asking "Who is being creative?" we can identify where the focus of computational creativity is now, and where there are research opportunities. Did the computer generate the creative idea or did a person, or was it an emergent structure from the interaction of people and computational systems? In this section we structure a framework around the concept of human/computer ideation and interaction, and map individual contributions onto a space of possibilities. The contributions include a sample of computational creativity research drawing on the proceedings of ICCC 2011 (Ventura et al., 2011) and ICCCX (Ventura et al., 2010), including Quirky (quirky.com) and Scratch (Maloney et al., 2010) to fill gaps in ICCC coverage.

Ideation

Ideation is a process of generating, synthesizing, evaluating, and implementing ideas that lead to a potentially creative product or solution. Ideation is a creative process and an idea is a product of that process. Using the term ideation to characterize computational creativity provides a basis for analyzing human, computational, and collective creativity with respect to the origin of a creative idea. While it may be hard to track precisely where an idea comes from in a complex creative process, we can identify where in a human-computer collective there is potential for creative ideas to be expressed and evaluated. Figure 1 places systems that contribute to computational creativity within a space according to the origin of the creative idea as human or computational agent. The "human" and "computational agent" dimensions of this space characterize the role of each in the computational creativity system.

Along the human dimension the framework includes two categories that describe the role of the human in computational creativity: model or generate.
• Model: the role of the human is developing a computational model or process. The computational system is effectively being creative because it is the source of the creative ideas or artifact. For example, The Painting Fool is a computational system that generates artistic paintings (Colton, 2011).
• Generate: the human generates the creative idea and the computational system facilitates or enhances human creativity by providing information, by providing a digital environment for generating the creative artifact, and/or by providing a perceptual interface to digital content that influences creative cognition. For example, Scratch is a computational system for people to create interactive stories, animations, games, music, and art (Maloney et al., 2010).
Along the computational dimension the framework includes three categories that describe the role of the computational system: support, enhance, generate.
• Support: the computational system supports human creativity by providing tools and techniques. Scratch is an example of a creativity support tool.
• Enhance: the computational system extends the ability of the person to be creative by providing knowledge or changing human perception in ways that encourage creative cognition. For example, Scuddle uses a genetic algorithm to generate movement catalysts for dancers (Carlson et al., 2011).
• Generate: the computational system generates creative ideas that the human then interprets, evaluates or integrates as a creative product. The Painting Fool is a computational system that generates creative paintings.
Figure 1 shows a distribution of computational systems between those in which the human generates the creative ideas (aka creativity support tools) and those in which the computational system generates creative ideas. The space that is empty in Figure 1 corresponds to theoretical contributions rather than the development of computational systems, for example contributions to models of process that generate creative products and models for evaluating creative products.

Interaction

Interaction plays an important role in computational creativity, particularly interaction between computers and humans (as the generators or users of the computational system). Traditionally, human-computer interaction has been a one-to-one interaction in which one person interacts with one computational device or environment. Recently, interaction has changed in scale. Figure 2 places the same systems from Figure 1 within a space that characterizes the interaction between people and computers, where the dimensions of this space express scale: from a single human or computational system to many.

Figure 1. Ideation and Computational Creativity

Along the human dimension there are three categories that describe the scale of the human interaction: individual, group, or society.
• Individual: the computational system is developed to support one person working alone, for example Scuddle.
• Group: the computational system supports a group or a predefined team of people. This is exemplified by the collaborative technologies that support design and drawing such as Groupboard1. This area is not well represented in the ICCC series.
• Society: the computational system encourages crowdsourcing and collective intelligence, for example Quirky.
Along the computational agent dimension there are three categories that describe the scale of the computational agent interaction: individual, team, or multi-agent society.
• Individual: there is one computational system with centralized control that is interacting with a person or people, for example The Painting Fool.
• Team: there are multiple, centrally organized agents that interact with one or more people. For example, Curious Whispers is a collection of autonomous mobile robots that communicate through simple songs (Saunders et al., 2010).
• Multi-agent society: the computational system is a multi-agent society with distributed control. For example, the designer agents and consumer agents in Gomez de Silva Garza and Gero (2010). This area is not well represented in the ICCC series.
From Figure 2 we see that the contributions in the ICCC series focus on interaction between one person and one computational system.

1 http://www.groupboard.com/products/

Figure 2. Interaction and Computational Creativity

Conclusions

As we develop a better understanding of processes and products in creative people or systems, we are able to develop more capable computational creativity. Ideation and interaction distinguish research in computational creativity by asking: Who is being creative? The word "who" is used to refer to one or more people or computational systems. When creativity is ascribed to the plural "who", that is, when the ideas come from multiple sources, there is an assumption of interaction. An area of research in computational creativity that has received little attention is the role and scale of interaction. Interaction at the scale of one person and one computational system has been the norm in computational creativity, with a recent trend in developing collaborative environments to support or enhance creativity, multi-agent models of creativity and online communities that achieve collective creativity. This paper shows that there is an opportunity for researchers in computational creativity to build on our theoretical and practical advances in understanding creative processes and the evaluation of creative products to address the concepts of interaction and scale.

2012_11 !2012 A Quantitative Study of Creative Leaps Lior Noy*, Yuval Hart*, Natalie Andrew*, Omer Ramote, Avi Mayo and Uri Alon Molecular Cell Biology Weizmann Institute of Science Rehovot, Israel lior.noy@weizmann.ac.il

Abstract

We present a novel quantitative approach for studying creative leaps. Participants explored the space of shapes composed of ten adjacent squares, searching for ‘interesting and beautiful' shapes. By recording players' actions we were able to quantitatively study aspects of their exploration process. In particular, our goal is to identify populated sub-regions in the shape space and study the dynamics of ‘creative leaps': jumps from one such area to another. We present here the experimental system, our methods of analysis and some preliminary results. We show that the network of shapes created by human participants is different from the class of networks created by applying a simple random-walk algorithm. Chosen shapes show an interesting negative correlation between their abundance and the probability of being chosen as beautiful. We further analyzed the human network's unique signature using its network-motif profile. Intriguingly, this signature shows similarity to word-adjacency networks extracted from texts. Lastly, we find preliminary evidence that human players exhibit two types of exploration: ‘scavenging', where shapes similar in their visual-iconic meaning are quickly accumulated, and ‘creative leaps', where players shift to a new region in the shape space after a prolonged search. We plan to build upon this result to quantitatively study creative processes in general and creative leaps in particular.
Introduction

In his book "The Act of Creation" the author Arthur Koestler describes the similarities between three types of creative acts: the pun of the joker, the discovery of the scientist and the lyric expression of the poet (Koestler 1964). The crux of the creative act is the creative leap, the momentary intersection of two different matrices of association (Fig. 1, left). Consider a search resulting in a creative solution for a given problem. Before the creative leap the search is confined to some familiar sub-space (the horizontal plane in Fig. 1, left). Using chance or intuition the solver has managed somehow to reach a point on the plane which also belongs to another plane, a totally different class of solutions (the vertical plane in Fig. 1, left). The creative leap is the ability to recognize this transition point and to jump from one class of solutions to another.

Figure 1. A symbolic representation of creative leaps. Left: according to Koestler the heart of any creative act is a creative leap between two intersecting domains. Right: a hypothetical creative space. Solutions are grouped into two clusters. Searching within a cluster requires short moves and creates similar solutions. In order to move to a different cluster of solutions the agent needs to perform a creative leap.

Little is known about the dynamics of creative leaps. Previous work has described creative leaps of exceptional creators (Miller 1996) while empirical work has focused mainly on moments of insight in problem solving, such as the Remote Association Test, using both behavioral (Dominowski and Dallob 1995) and brain studies (Sandkühler 2008). It is difficult to capture creative leaps in a laboratory setting. Moreover, many solution spaces might be high-dimensional and complex, with no clear metric defining the similarity between points. For example, consider the space of all answers to the following question used in a group creativity test: "how can the number of tourists visiting your city be increased" (Nijstad and Stroebe 2006). While this problem has solutions that belong to different classes (for example ‘increase advertisement' vs. ‘improve infrastructure') it is not clear how to define and construct the space of all such ideas.

Our goal is to study a creative task with an underlying solution space that is (a) simple and well defined, to enable a quantitative investigation of the search dynamics, and (b) containing clusters of solutions, with the possibility of performing creative leaps between them (see Fig. 1, right). Our approach resembles recent work by Jennings that similarly studied people's search trajectories in a visual domain (Jennings 2010; Jennings et al. 2011). We searched for a parameterized space that would be complex enough to allow for possible creative leaps, but not too complex to allow a computational description of human search in this space. We suggest using the set of all N-size polyominoes, the set of two-dimensional shapes composed of N adjacent squares (Golomb 1994). Besides its well-defined structure, which allows for establishing a metric on the search space, the polyomino space provides a crucial advantage: the complexity of exploring the shape space is tunable by changing the parameter N. We can thus aim to have an exploration process which is, on the one hand, not too trivial and, on the other hand, not too complex to quantify. In that we hope to capture the gist of what Boden describes as ‘an exploratory frame of mind' (Boden, 2004).
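Polyominoes admit a simple computational representation, a set of grid cells reduced to a canonical form, which is what makes notions such as ‘unique shapes up to rotation and mirroring' directly computable. A minimal Python sketch (an illustration of the representation, not the authors' code):

    def canonical(cells, free=True):
        # cells: a polyomino given as a set of (x, y) squares.
        # free=True identifies shapes up to translation, rotation and mirroring
        # (free polyominoes); free=False up to translation only (fixed shapes).
        def normalize(cs):
            mx = min(x for x, _ in cs)
            my = min(y for _, y in cs)
            return tuple(sorted((x - mx, y - my) for x, y in cs))
        variants = [set(cells)]
        if free:
            for _ in range(3):  # the three further 90-degree rotations
                variants.append({(-y, x) for x, y in variants[-1]})
            variants += [{(-x, y) for x, y in v} for v in variants[:4]]  # mirrors
        return min(normalize(v) for v in variants)

Counting distinct canonical forms over all 10-square shapes gives the two decomino counts quoted in the next section: 4,655 with free=True and 36,446 with free=False.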
Since this exploration process resembles a creative process undertaken by, say, a graphic designer designing a new icon in a limited space, we hope to gain insights relevant to the growing field of computational models for design processes (Gero, 2000). We analyzed the network of shapes and moves created by human participants and compared the human exploration with a simple random-walk algorithm that traverses the network of shapes discovered by the human participants. This comparison shows that the human search behavior is not simply the result of random travel between the shapes. Our results suggest that humans perform two types of searches: ‘scavenging', a simple search in an area of shapes, which can be explained by an algorithmic search, and ‘insight' moves, or leaps, which cannot be explained by a simple algorithm. The first type of move corresponds to the within-cluster exploration in Fig. 1, while the second type contains, we hope, the creative leaps. We next describe our experimental setup, the methods of analysis we employed and some initial findings supporting the notion that creative leaps can be quantitatively studied using the suggested approach.

Experimental Setup

System

We developed a system to experimentally test human trajectories in the shape space of polyominoes. We are currently experimenting with decominoes, 10-size polyominoes (consisting of 4,655 unique shapes, and 36,446 shapes if rotations and mirror images are counted). We tested several variants of the creative task and report here results from the ‘journey in shape space': exploring the space by moving one square at a time, transforming one legitimate shape into another. The starting-point shape is always the horizontal line. We ask people to "explore the space of ‘shifting shapes' and to discover shapes that you find interesting and beautiful". We developed an experimental setup using Processing, an open-source, cross-platform programming language used for visualization (see Fig. 2).

Figure 2. Exploring the space of shapes. Left: a screenshot of the ‘Shape Shifter' game. At each step players move one square to create a new polyomino. Shapes can be stored in the ‘shape gallery' by pressing the gray rectangle at the top-right corner. Right: examples of different shapes created by human players.

Procedure

123 participants (58 females and 65 males, ages 12-75 years, mean = 34.3), recruited through emails and social networks, were invited to participate in a short experiment in creativity. At any point players could store the current shape in a ‘shape gallery'. The players moved freely between shapes, within a time limit of 25 minutes (no participant reached this limit). When choosing to finish the exploration they continued to the ‘rating stage'. In this last stage players observed the ‘shape gallery' and were asked to choose ‘the five most creative shapes you discovered'. We recorded square moves between shapes and their timing, as well as each player's chosen gallery shapes and the final five shapes.

Analysis

A random-walk algorithm over the entire shape network

We used a network representation (a graph) of the shape space in the following way. Each shape is a node in the graph. Shapes A and B are connected by an edge if shape A can be reached from shape B by moving a single square in a valid way. This structure is a directed graph representing all possible valid moves. The algorithm explores the network by first randomly removing one square from the current shape.
The next decomino in the path is then generated by placing the 10th square in a new random location (self-loops are not excluded). This extends the path by one step. The path is further extended by repeating these steps up to a predetermined number of steps. This algorithm was used to establish both the entire shape-space network and a random-walk-generated network to compare with the human-generated network of travelled shapes. For the entire shape space the algorithm was run until all possible 36,446 decominoes were generated (with a mean path length of 150,000 steps). For comparison with the human network, the algorithm was run 123 times (the number of human participants) with the number of steps sampled from the distribution of step counts of the human players.

A random-walk algorithm over the human-generated network

In order to create computer-generated networks which are more closely related to the human networks, we restricted the algorithm to travel only on edges which were travelled by at least one human player. First the human-generated decomino network is built and the allowed steps are listed. Although the network is naturally directed, the computerized walker is allowed to move on the undirected network (that is, the computer can also move backward on any human edge). The algorithm is seeded and a new shape is chosen randomly from the set of shapes which are connected by allowed edges. The length of the path is sampled from the distribution of lengths of paths traversed by the human players. This process is repeated 123 times.

Figure 3. Comparing human and computational exploration networks. The number of occurrences of edges, where edges are grouped by the number of times they were traversed. Shown are the values for the human players' network (red) and the random-walk network restricted to the human network shapes (mean of 10 simulations in dark blue, each specific simulation in light blue).

Our current goal is to compare the features of the human-generated network to a network generated by a random-walk algorithm and to study if there is a noticeable difference between the two, in order to show that the human behavior cannot be explained as the result of a random walk in the shape space.

Triad Significance Profile Calculation

The 13 network-motif frequencies of the human and randomly generated networks were calculated. The normalized Z score of each of the 13 possible triads was then calculated. The Z score is computed as the difference between a triad's frequency and the mean frequency of the same triad in a computerized agents' network, measured in STD units. The frequency mean and STD were calculated from 10 simulations of the computational networks.

Results

Human and Computational Networks

We first asked whether the exploration network created by human players is different from the network created by a random-walk algorithm travelling the entire shape network. We find that the exploration network created by human players is much more compact. Furthermore, the players' network obeys a power-law distribution of node-degree frequencies (how many edges go in or out of a specific node), while the computational algorithm produces a Gaussian-like distribution of node-degree frequencies. In addition, human exploration on the network of all allowed edges is very constrained and compact relative to a random exploration process of the whole shape space.
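For concreteness, the walk step described in the Analysis section might be implemented as follows (a sketch of our reading: lift a random square, then re-place it at a random empty cell adjacent to the remaining nine; how a removal that disconnects the shape is handled is not specified in the text, so the retry below is our assumption):

    import random

    def neighbors(cell):
        x, y = cell
        return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

    def is_connected(cells):
        # Flood-fill from an arbitrary square; a valid polyomino is connected.
        cells = set(cells)
        seen, frontier = set(), [next(iter(cells))]
        while frontier:
            c = frontier.pop()
            if c not in seen:
                seen.add(c)
                frontier += [n for n in neighbors(c) if n in cells]
        return len(seen) == len(cells)

    def random_step(shape, rng=random):
        # One step: remove a random square, then place it at a random empty
        # cell adjacent to the rest. Self-loops (placing the square back where
        # it was) are not excluded, matching the description above.
        shape = set(shape)
        while True:
            removed = rng.choice(sorted(shape))
            rest = shape - {removed}
            if not is_connected(rest):
                continue  # assumption: retry if the removal splits the shape
            sites = sorted({n for c in rest for n in neighbors(c) if n not in rest})
            return frozenset(rest | {rng.choice(sites)})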
We next asked whether the type of exploration players perform is dictated only by some constraint on the shapes available to people's perception. We thus compared the human exploration network with an ensemble of networks created by allowing a random-walk algorithm to choose shapes randomly, but restricting it to shapes that were selected by the human players. We find that the algorithm travels much less than the human players and so creates a much smaller network than the players' network. Furthermore, the properties of the computational exploration networks, such as the distribution of node degrees, are markedly different from those of the human exploration network (Fig. 3).

Consensus in Participants' Choices

A possible concern regarding our creative task is whether there is some consensus among different participants regarding their aesthetic choices. While we do not expect total agreement (for example, some players preferred iconic shapes, while others preferred more abstract ones), a total lack of consensus could raise doubts about the validity of this task as a measure of human creativity. To assess the consensus in participants' choices we plotted the selection ratio, the percentage of times a shape was chosen (number of times chosen divided by number of times traversed), against the number of times a shape was traversed (Fig. 4). We differentiated between shapes ranked as interesting in the last stage of the game (in blue) and those that were only chosen for the gallery (in red). We note that there is a large number of shapes with a high (>50%) selection ratio, with a few shapes exhibiting selection ratios of more than 90%. At least for these shapes there seems to be a consensus among the different human participants. In addition, shapes that were ranked in the last stage had a statistically significantly higher selection ratio (ranked: centered around (23.34, 50) with STD (19.41, 20); not ranked: centered around (15.6, 20) with STD (6.7, 13); non-paired t-test, p < 10^-7). We also note the negative correlation (Pearson correlation = -0.25, p < 0.05) between the prevalence of a shape (how many times it was traversed) and its selection ratio. Intriguingly, this might suggest that shapes ‘less traveled by' are appreciated more by the people who have reached them.

A Network-Motif Signature

In order to further characterize the human exploration network we measured its network-motif signature, termed the triad significance profile (TSP). This network signature is calculated by taking the frequencies of all three-node sub-graphs of a network and normalizing each frequency by the triad frequency in a network created by a similar random process (Milo 2002). In our case, we compared triad frequencies of the human network with triad frequencies created by the random-walk algorithm on the human network (see Analysis). Previous studies in our lab showed that networks with similar structure and function have a similar TSP signature. Thus, this method offers another quantitative classification of networks. This preliminary calculation (Fig. 5) indicates that the network-motif significance profile shares a similar frequency signature with text networks (Milo 2004), suggesting that the human visual exploration process in shape space follows visual rules similar to those of language networks, which have categories of words and certain formulated ways of combining the different categories.
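The TSP computation described above is short; the sketch below follows the normalization of Milo et al., with the data layout (triad counts for the human network and for the 10 restricted random-walk simulations) as our assumption:

    import numpy as np

    def triad_significance_profile(human_counts, random_counts):
        # human_counts: length-13 vector of triad frequencies in the human network.
        # random_counts: (n_runs, 13) matrix of triad frequencies from repeated
        # random-walk simulations (10 runs in this paper).
        mean = random_counts.mean(axis=0)
        std = random_counts.std(axis=0)
        z = (human_counts - mean) / std   # Z score per triad, in STD units
        return z / np.linalg.norm(z)      # normalized profile, comparable across networks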
Future work should check the dependency of the calculated triad significance profile on the randomization process used to create the baseline random network.

Figure 4. Consensus in participants' choice of shapes. Y-axis: the number of times a shape was admitted into the gallery out of the number of times it was visited. X-axis: the number of times a shape was visited. Only shapes that were visited at least 10 times are presented. Dots in blue represent shapes that were also ranked in the final stage while red dots represent shapes that were chosen for the gallery but were not ranked. Correspondingly, shapes shaded in blue are representative of the set of finally chosen shapes.

Initial Evidence for Creative Leaps

In order to examine the exploration process of individual players more closely, we focused on the ‘chosen to the gallery' shapes (Fig. 6), enumerating both the number of steps between two sequential shapes (the number above each shape) and the time interval between selections of the two shapes (the y-axis). For several players we observe an interesting pattern: the time and the number of steps between two sequentially chosen shapes decline at the beginning, usually producing shapes with similar content. Then a long traversal exploration process commences, usually leading to shapes belonging to a new cluster of similar shapes. As exemplified in Fig. 6, the player moves from "Animals" shapes to a "Space invaders" shape to "Symbolic male/female" shapes. One can interpret this saw-tooth pattern as consisting of scavenger explorations connected by creative leaps, each of which serves to reach a new iconographic domain. We hope to utilize these processes to cluster the shapes automatically into different domains and thus create a semi-metric on the shape space. Another aid in building the metric comes from the rating process at the end of the game. Subjects are requested to choose the five most creative shapes. Our assumption is that subjects will choose shapes that they see as most distinct from one another, thus providing another metric measure on the shape space.

Figure 5. The triad significance profile (TSP) of the human players' network suggests a similarity to word-adjacency networks of texts. The main feature of the TSP is the under-representation of triangle-shaped triads 7 to 13.

Conclusions and Future Work

We presented a novel quantitative approach for studying creative leaps. Our goal is to study a creative task using computational tools. Specifically, we aim to define the space of products of the creative task, to detect clusters of similar products and to study creative leaps between them. Working toward this goal we developed a web-based game in which players explored a visual space composed of 10-size polyominoes, while searching for interesting and beautiful shapes. As a first step we tested whether human behavior in this task can be explained as the result of a random-walk algorithm. We therefore compared the exploration network created by human players to two computational exploration networks. The first network was created by random walks on all possible shapes, and the second one was created by random walks restricted to shapes chosen by human players. We compared general properties of these networks, such as in/out degree, and found a significant difference between the human and the computational networks.
The network made by a random walk restricted to shapes chosen by players is much smaller than the human network, suggesting that the trajectories of human exploration also contain segments of directed motion toward interesting regions of the space. Following the fogginess metaphor of Jennings (2011), these segments might correspond to the areas of the landscape with ‘good visibility'. We also used the concept of network motifs to characterize the human search network. We identified which of the known super-families of networks (e.g. social networks, transcription networks, and language-originated networks) matches the human exploration network. We find that the human network is similar to language-originated networks, and we plan to further study the connection between these two kinds of network. We further find preliminary evidence of a paradigm shift by players while playing the game. Players show periods of ‘scavenging', where they exploit shapes similar in iconic meaning (e.g. animals, letters, symmetric shapes), punctuated by long walks on the grid of possible shapes, which lead to a different region in the shape space. The ‘saw-tooth' pattern we have found in the time between chosen shapes (Fig. 6) might be the first clue to the existence of clusters in our shape space. We plan to corroborate these findings by different methods that can be used to detect clusters of shapes in this visual domain. In particular, we plan to use the human choices embedded in our task at multiple levels (which shape to move to; which shapes to insert into the gallery; which shapes to choose in the final stage) as a different probe into the structure of the shape space. This paper presents work-in-progress aiming to develop a computational platform for studying human search in creative tasks, and in particular to study creative leaps. We are currently performing a large-scale human experiment with this platform and plan to apply a host of quantitative methods to further test the preliminary results presented here. Using these methods we hope to be able to measure and study the dynamics of creative leaps.

Fig. 6. Preliminary evidence for clusters in the shape space. Looking at the time differences between chosen shapes we often see ‘saw-tooth' patterns. Humans seem to reach a fruitful region, ‘scavenge' it (that is, quickly pick a few similar shapes), and then move to another region, a move that takes much more time. Notice for example the two clusters of similar shapes around 100 and 180 seconds. Only chosen shapes are shown, and shapes in the ‘top five' (chosen from all gallery shapes) appear with a blue background. The number above each shape is the number of moves from the previous shape.

2012_12 !2012 On the Notion of Framing in Computational Creativity John Charnley, Alison Pease and Simon Colton Computational Creativity Group Department of Computing, Imperial College 180 Queens Gate, London SW7 2RH, United Kingdom. ccg.doc.ic.ac.uk

Abstract

In most domains, artefacts and the creativity that went into their production are judged within a context, where a context may include background information on how the creator feels about their work, what they think it expresses, how it fits in with other work done within their community, their mood before, during and after creation, and so on. We identify areas of framing information, such as motivation, intention, or the processes involved in creating a work, and consider how these areas might be applicable to the context of Computational Creativity.
We suggest examples of how such framing information may be derived in existing creative systems and propose a novel dually-creative approach to framing, whereby an automated story generation system is employed, in tandem with the artefact generator, to produce suitable framing information. We outline how this method might be developed and some longer term goals.

Introduction

Michael Craig-Martin's 1973 work, An Oak Tree, comprises a glass of water on a shelf and an accompanying text, in which Craig-Martin claims that the object which appears to be a glass of water is really an oak tree. The text takes the form of a question and answer session written by Craig-Martin about how he has changed the water into a tree:

A. [.....] I've changed the physical substance of the glass of water into that of an oak tree.
Q. It looks like a glass of water.
A. Of course it does. I didn't change its appearance. But it's not a glass of water, it's an oak tree.
...
Q. Haven't you simply called this glass of water an oak tree?
A. Absolutely not.

Craig-Martin is rather mysterious as to how he has accomplished the change:

Q. Was it difficult to effect the change?
A. No effort at all. But it took me years of work before I realised I could do it.
Q. When precisely did the glass of water become an oak tree?
A. When I put the water in the glass.
Q. Does this happen every time you fill a glass with water?
A. No, of course not. Only when I intend to change it into an oak tree.

The status of the piece as a work of art is then raised:

Q. Do you consider that changing the glass of water into an oak tree constitutes an art work?
A. Yes.
Q. What precisely is the art work? The glass of water?
A. There is no glass of water anymore.
Q. The process of change?
A. There is no process involved in the change.
Q. The oak tree?
A. Yes. The oak tree.

This is an example of human creativity which is taken seriously in its field. First shown in 1974, it was bought by the National Gallery of Australia in Canberra in 1977, and has been exhibited all over the world with the text translated into at least twenty languages. As with many important works, opinion is divided: artist Michael Daley referred to it as "self-deluding" and "pretentious" (Daley August 31 2002), while art critic Richard Cork wrote: "I realise that one of the most challenging moments occurred in 1974 when the Rowan Gallery mounted an exhibition of Michael Craig-Martin's work." (Cork October 9 2006).

Researchers in Computational Creativity (CC) can learn from this work. The main point that we consider in this paper is that the artefact (the glass of water) has no creative value without the title and accompanying text. The value, the creativity associated with the piece, lies in the narrative surrounding the glass of water. This point has clear implications for CC, which has traditionally focused on artefact generation, to the extent that the degree of creativity judged to be in the system is often considered to be entirely dependent on characteristics of the set of artefacts it produces (for instance, see (Ritchie 2007)). Very few systems in CC currently generate their own narrative, or framing information. Artefacts are judged either in isolation or in conjunction with a human-produced narrative, such as the name of the system and any scientific papers which describe how it works.
Researchers in CC will be familiar with what Bedworth and Norwood call "carbon fascism" (Bedworth and Norwood 1999), the bias that only biological creativity can produce valuable artefacts, and, for the most part, computer-generated creative artefacts are not taken seriously by experts in the domain in which the artefacts belong. We believe that enabling creative software to produce its own framing information will help to gain acceptance from these experts. While An Oak Tree may be a rather extreme example of the importance of framing information, we hold that such information almost always plays some role in creative acts, and is a fundamental aspect of human creativity. We consider here which types of framing information we could feasibly expect a piece of software to produce, and begin to propose ways in which we could formalise this. Specifically, we consider three areas in computational terms: motivation (why did you do X?), intention (what did you mean when you did X?), and processes (how did you do X?). We make the following contributions:
1. We highlight the importance of framing information in human creativity.
2. We propose an approach to automatically generating framing information, in which a separate creative act of automated story generation is performed alongside traditional artefact generation.

Framing in human creativity

Sir John Tusa is a British arts administrator and radio and television journalist, known for his BBC Radio 3 series The John Tusa Interview, in which he interviews contemporary artists. These interviews have been reproduced as two books in which he explores the processes of creativity (Tusa 2003; 2006). We have analysed all thirteen interviews in his most recent collection, in order to provide the starting point for a taxonomy of framing information. In the following discussion, unless otherwise specified, all page numbers refer to this collection of interviews (Tusa 2006). We identified two categories which artists spoke about: INTERNAL, or inward looking, in which the artist talks about their own Work, Career and Life, and EXTERNAL, or outward looking, in which the artist talks about their view of their Audience and Field.

The artist's work

Discussion about artists' work is very common in the Tusa interviews. This might concern a specific piece, such as how an artist feels about it, what they think it expresses, or how it relates to everyday concepts; or it might concern details of the generative process, such as how the work is created, how processes involved in its creation fit together, or whether a new technique or material changed the way that something was done. In the example below, Cunningham (MC) relates his work to scientific and religious concepts:

MC: ...it was the statement of Einstein's which I read at that time, where he said, ‘There are no fixed points in space.' And it was like a flash of lightning; I felt, Well, that's marvellous for the stage. Instead of thinking it's front and centre, to allow any point, very Buddhist, any point in the space to be as important as any other. (p. 66)
Examples from (Tusa 2006) include the questions: "So you think you are recognisably the same person, creatively the same person as you would have been if you'd stayed in New York?" (to Forsythe, p. 93); "What's the next stage of your evolution as a maker of ballets?" (also to Forsythe, p. 105), and "When you look back over the last twenty years, would you ever have guessed that the work that you do would have travelled so far ..... I mean this is an extraordinary journey. How aware have you been of the evolution as you've been through it?" (to McBurney, pp. 181-2).

The artist's life

Audiences are interested in the personalities and influences behind society's "creative heroes". Topics of interest include political, intellectual, personal, cultural and religious influences; value systems; reasons for working in a particular area; important events in the life of the artist, and so on. John Tusa asks many questions in this vein. For instance, he asks: "Are you an optimist or a pessimist as a person?" (to Rovner); "When did you discover that you had this condition called Dysgraphia, where I think the brain wants to write words as pictures?" (to Viola); "What do you feel, as you're coming in to work?" (to McBurney); and "What music do you like?" (to Piano), and has some rather poignant exchanges, such as one with Viola in which he asks about a near-death experience (pp. 221-3), and this exchange with Rovner (MR):

JT: Are you lonely as an artist?
MR: You mean as an artist or as a person?
JT: Well as a person who is an artist.
MR: I'm alone. I don't know if I'm lonely. I am single, you know I'm a single person, I'm a single person. (p. 213)

The artist's view of their audience

The perception that an artist has of his or her audience may influence their work. Queries on this topic included questions about the effect of a particular field on audiences, and about the effect of certain pieces of work on the collective subconscious. Egoyan, for instance, discusses responsibility to one's audience with Tusa (p. 75).

The artist's view of the field

Embedding a particular artist's work into the context of a body of work is one of the purposes of framing information. Queries include definitional questions about particular fields, and their relationship to other fields; how a piece fits into a field; in which field an artist sees themselves; the influence of external characteristics such as politics, or how modern advancements such as new techniques have affected a field; the history of a field and directions in which it could go, and so on. For instance, Egoyan discusses how he thinks video compares to film (p. 76), and Forsythe talks about how his work fits in with the great classical ballets (p. 106).

Framing for Computational Creativity

Analysis of the interview responses suggests a new direction for CC: enabling creative software to generate some of its own framing information. As with human artworks, the appeal of computer creativity will be enhanced by the presence of framing. However, there are obvious restrictions on the scope to which the various forms of framing apply in the computer-generated context. Here we consider three areas in computational terms: motivation (why did you do X?), intention (what did you mean when you did X?), and processes (how did you do X?).

Motivation

Many creative systems currently rely upon human intervention to begin, or guide, a creative session, and the extent to which the systems themselves act autonomously varies widely.
In some sense, the level to which these systems could be considered self-motivating is inversely proportional to the amount of guidance they receive. However, it is possible to foresee situations where this reliance has been removed to such an extent - and the human input rendered so remote - that it is considered inconsequential to the creative process. For instance, the field of Genetic Programming (Koza 1992) has resulted in software which can, itself, develop software. In the CC domain, software may eventually produce its own creative software which, in turn, produces further creative software, and so forth. In such a scenario, there could be several generations in an overall genealogy of creative software. As the distance between the original human creator and the software that directly creates the artefact increases, the notion of self-motivation becomes blurred. Beyond this, the scope for a system's motivation towards a particular generative act is broad. For example, a suitably configured system may be able to perform creative acts in numerous fields and be able to muster its effort in directions of its own choosing. With this in mind, we can make a distinction between motivation to perform creative acts in general, motivation to create in a particular field, and motivation to create specific instances. In the human context, the motivation towards a specific field may be variously influenced by the life of the artist, their career and their attitudes, in particular towards their field and audience. Several of these are distinctly human, and it currently makes limited sense to speak of the life or attitudes of software in any real way. By contrast, we can speak of the career of a software artist, that is, the corpus of its previous output. This may be used as part of a process by which a computer system decides which area to operate within. For example, we can imagine software that chooses its field of operation based upon how successful it has previously been in that area. For instance, it could refer to external assessments of its historic output to rate how well-received it has been, focusing its future effort accordingly. The fact that a computer has no life from which to draw motivation does not preclude its use as part of framing information. All those aspects missing from a computer could, alternatively, be simulated. For example, we have seen music software that aims to exhibit characteristics of well-known composers in attempts to capture their compositional style (Cope 2006). The extent to which the simulation of human motivation enhances the appeal of computer-generated artefacts is, however, still unquantified. The motivation of a software creator may come from a bespoke process which has no basis in how humans are motivated. The details of such a process, and how it is executed for a given instance, would form valid framing information, specific to that software approach.

Intention

The aims for a particular piece are closely related to motivation, described above. A human creator will often undertake an endeavour because of a desire to achieve a particular outcome. Factors such as attitudes to the field contribute to this desire. Certainly, by virtue of producing some output, every computer generative act displays intent. The aims of the process exist and they can, therefore, be described as part of the framing.
In the context of a computer generative act, we might distinguish between a priori intent and intentions that arise as part of the generative process. That is, the software may be pre-configured to achieve a particular goal although with some discretion regarding details of the final outcome, which will be decided during the generative process. The details of the underlying intent will depend upon the creative process applied. For example, as above, software creators might simulate aspects of human intent. Intent has been investigated in collage-generation systems (Krzeczkowska et al. 2010). Here, the software based its collage upon events from the news of that day, with the aim of inviting the audience to consider the artwork in the context of the wider world around them. This method was later generalised to consider wider combinations of creative systems and to analyse more closely the point in the creative process at which intentionality arose (Cook and Colton 2011).

Processes

In an act of human creativity, information about the creative process may be lost due to human fallibility, memory, awareness, and so on. However, in a computational context there is an inherent ability to perfectly store and retrieve information. The majority of creative systems would have the ability to produce an audit trail, indicating the results of key decisions in the generative process. For example, an evolutionary art system might be able to provide details of the ancestry of a finished piece, showing each of the generations in between. The extent to which the generative process can be fully recounted in CC is, nevertheless, limited by the ability to fully recreate the sources of information that played into the generative process. Software may, for instance, use information from a dynamic data source in producing an artefact, and it may not be possible to recreate the whole of this source in retrospect. One system that produces its own framing is an automated poetry generator currently being developed (Colton, Goodwin, and Veale 2012). In addition to creating a poem, this system produces text which describes particular aspects of its poetry that it found appealing and aspects of how it generated its output. In order to fully engage with a human audience, creative systems will need to adopt some or all of the creative responsibility in generating framing information. Details of the creative process are valid aspects of framing information, which are relevant to both computational and human creative contexts. There is a notion of an appropriate level of detail: extensive detail may be dull, and the appreciation of artefacts is sometimes enhanced by the absence of information about the generative process.

Examples of framing for Computational Creativity

There are many ways in which creative systems might generate their own framing information. For example, an automated art system, such as AARON (McCorduck 1991), could store details of all its previous artworks and provide an assessment of how a new piece differs, in various respects, from its past output. A poetry system, such as that of Colton, Goodwin, and Veale (2012), might reveal the general mood of the inspiring source it used as a basis for an affective poem. Mathematical software, such as HR (Colton 2002), could be given the ability to compare the conjectures it finds against on-line mathematical databases and report on how its output relates to known theorems.
Similarly, an art system could appeal to image databases to suggest similarities to other artists. A simple enhancement to the collage generation program of Krzeczkowska et al. (2010) could see it provide the text of the news story that formed the inspiration for the collage. In this mode, the framing information would become as important an aspect of the overall presentation as the collage itself. The artwork would be a combination of both the collage and the underlying story, rather than the collage alone. This list is by no means exhaustive. The varied nature of the framing information that we have been describing shows that the opportunities for enhancing works with framing are extensive.

A dually-creative approach to framing

Framing information has the potential to greatly impact an audience's assessment of an artefact. In some instances, framing is arguably as much a part of the overall creative presentation as the artefact itself: this was seen in Craig-Martin's An Oak Tree, described above, as well as in, for example, elements of Marcel Duchamp's readymades series, such as Fountain. The information can be as simple as a title for the artefact, or might encompass much of the type of framing indicated in our analysis. Framing can add to the mystique and mystery surrounding an artefact, as we have described. Framing information need not be factually accurate. Information surrounding human creativity can be lost, deliberately falsified or made vague for artistic impact. Thus, the generation of framing information can itself be seen as a creative act. The overall impact of the package - namely the artefact and the associated framing information - will depend both on the assessed quality of the artefact and on the impression given by the framing information. We propose one approach to artefact-with-framing generation, where the two are produced simultaneously, by a dually-creative process. Under this approach, the most appropriate creative paradigm for the framing information would be a form of automated storytelling. One part of a combined system would create the artefact itself and a storytelling aspect would generate a framing story. The framing story could be as simple or as complex as those which accompany human creations. Tools able to perform tasks such as metaphor and analogy (see Gentner, Holyoak, and Kokinov 2001; Gibbs Jr. 2008) might be integrated into the storytelling aspect. In the previous section, we discussed aspects of framing which might be relevant to the CC setting. This information could form much of the input to the story generation system, becoming part of the basis of the story. For example, purely factual information about how the software arrived at the final product could be retained. Given that there is no requirement for the framing story to be factually correct, some or all of the story might be fictional, and there is no prescription for the extent to which the framing story should directly correspond with the artefact. Consider, for example, a framing story which describes all aspects of the creative process in full detail, compared with a framing story consisting of a random, seemingly unrelated word. Both have artistic value, but in entirely different ways.
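The paper does not give an implementation of this dual pipeline, but its shape can be sketched. In the Python sketch below, everything - the function names, the palette strings, the canned fictions, and the fictional_ratio parameter - is our own invented illustration; the only idea carried over from the text is the architecture: one component produces an artefact together with a trace of its decisions, and a second, story-generating component turns that trace, plus optional deliberate fiction, into framing.

import random

def generate_artefact():
    # Hypothetical artefact generator: returns an artefact plus a trace
    # of the decisions taken while producing it.
    palette = random.choice(["muted greys", "saturated reds", "sea greens"])
    artefact = "an abstract composition in " + palette
    trace = ["chose a palette of " + palette,
             "arranged forms by iterated random placement"]
    return artefact, trace

def generate_framing(trace, fictional_ratio=0.5):
    # Hypothetical story generator: weaves the factual trace into a
    # framing story, optionally blending in deliberate fiction.
    fictions = ["It recalls a window seen once from a night train.",
                "The work began as an argument with an earlier piece."]
    story = ["This piece emerged as follows."]
    for step in trace:
        if random.random() < fictional_ratio:
            story.append(random.choice(fictions))
        story.append("The system " + step + ".")
    return " ".join(story)

artefact, trace = generate_artefact()
print(artefact)
print(generate_framing(trace))

A real system would replace both placeholder components with genuine generators; the point of the sketch is only that the framing generator consumes the artefact generator's decision trace, rather than the artefact alone.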
No configuration of a particular automated story generation system is likely to be able to generate the full variety of framing stories that we have witnessed in human creativity. An initial approach might therefore be to develop a small number of story-telling paradigms, each based upon a particular story template. One challenge might then be to achieve an appropriate balance between fact and fiction in the generated stories. In future, we might hand such decisions over to the software. For example, a sufficiently able software suite might decide which story-telling paradigm is most appropriate for a particular effect, the balance between fact and fiction, and how extensive the framing should be. In a more complex manifestation, the story might form an interactive dialogue, providing answers to audience queries in a manner akin to an interview. As with human creativity, the answers to those questions may be entirely at the whim of the generating system. Going further, software might employ story generation approaches to simulate aspects of the framing information which might otherwise be absent, such as a religious belief or other motivation. This could, in turn, feed back into the generation of the creative artefact itself. Storytelling for framing information represents an interesting challenge for our existing and future automated story generation systems.

Related work

In (Colton, Pease, and Charnley 2011; Pease and Colton 2011), two generalisations were introduced with the aim of enabling more precise discussion of the kinds of behaviour exhibited by software when undertaking creative tasks. The first generalisation places the notion of a generative act, wherein an artefact such as a theorem, melody, artwork or poem is produced, into the broader notion of a creative act. During a creative act, multiple types of generative acts are undertaken which might produce framing information, F, aesthetic considerations, A, concepts, C, and exemplars, E, in addition to generative acts which lead to the invention of novel generative processes for the invention of information of types F, A, C and/or E. The second generalisation places the notion of assessment of the aesthetic and/or utilitarian value of a generated artefact into the broader notion of the impact of a creative act, X. In particular, an assumption was introduced that in assessing the artefacts resulting from a creative act, we actually celebrate the entire creative act, which naturally includes information about the methods underlying the generation of the new material, and the framing information, which may put X into various contexts or explain motivations, etc., generally adding value to the generated artefacts over and above their intrinsic value. The introduction of these two generalisations enabled the FACE and IDEA descriptive models to be introduced as the first in the fledgling formalisation known as Computational Creativity Theory. In this paper we have extended this model by exploring the notion of framing.

Future work and conclusions

Creativity is not performed in a vacuum, and the human context gives an artefact meaning and value. Implicit in the Computational Creativity Theory models so far developed is the notion that the FACE information/artefacts resulting from creative acts can be seen as invitations to a dialogue.
For instance, when a person appreciates a painting, they are encouraged to ask questions of it, and look for answers, either explicitly from the artist or from some perceived notion of how artists work, via visual interrogation of the piece itself, or through certain cultural contexts, for example by understanding the culture of the time and place in which the painting was produced. Despite the importance of framing information as part of the overall artistic endeavour, we are only aware of a very small number of systems that generate framing information to accompany their creative output. We have proposed one approach to this, whereby automated story generation is used to generate framing information. There are no real bounds to what information such framing can contain, its basis in fact versus fiction, or the format in which it is presented. Consequently, we suggest that initial attempts be restricted to a small number of simplified paradigms, taking their basis from a more complete investigation into how human-produced framing information relates to CC. Expanding upon this starting point, we imagine software taking over some of the creative responsibility for the framing information, such as determining the story-telling paradigm and the story's emphasis or level of detail. Craig-Martin, via his narrative of An Oak Tree, opens up a dialogue with the viewer on the nature of essence, proof, faith, matter, reality, art, and so on. The viewer engages with this narrative, which includes the manner of presentation of the piece, Craig-Martin's background as an artist and a person, critics' and artists' responses to the piece, stories surrounding the work, and effects that it has on everyday life (for instance, there is a myth that Australian customs officials barred it from entering the country since it was classified as "vegetation", and as of February 2012 the first three hits from Google Images on the search term "an oak tree" were images of Craig-Martin's work). We anticipate that the direction outlined in this paper will form an important axis of development for CC systems. Our long-term goal is to help to develop CC to such an extent that one day a piece of creative software will appear in the table of contents of a collection of Tusa-style interviews, to discuss its work and itself, alongside other contemporary artists.

Acknowledgements

This work has been funded by EPSRC grant EP/J004049. We are grateful to the three reviewers, who raised interesting points.

2012_13 !2012 Small-Scale Systems and Computational Creativity Nick Montfort and Natalia Fedorova Program in Writing & Humanistic Studies Massachusetts Institute of Technology 77 Massachusetts Ave, 14N-233 Cambridge, MA 02139 nickm@nickm.com phd.natali@gmail.com

Abstract

Creative computational systems have often been large-scale endeavors, based on elaborate models of creativity and sometimes featuring an accumulation of heuristics and numerous subsystems. An argument is presented for facilitating the exploration of creativity through small-scale systems, which can be more transparent, reusable, focused, and easily generalized across domains and languages. These systems retain the ability, however, to model important aspects of aesthetic and creative processes. Examples of extremely simple story generators are presented along with their implications for larger-scale systems. A case study focuses on a system that implements the simplest possible model of ellipsis.
Introduction

For a variety of institutional, intellectual, and other reasons, the typical computational system developed to model or produce creativity is a sizable one. Some of these systems, such as Harold Cohen's AARON, even become lifelong projects of their creators, continuing to accumulate rules and heuristics for decades. There are certainly virtues to large-scale systems, which have revealed a great deal about formal models of creativity and creative computing. We present the argument that small-scale systems can also make contributions, serving to complement more extensive projects and to lead into them. Specifically, the argument is advanced that it makes sense to welcome such systems in new ways in conferences, in thesis work, and in developing large-scale systems. Rather than directly claiming that these small-scale systems are creative based on some formal definition, we argue that they engage creativity and are relevant to larger-scale systems that have been argued to be creative.

Small-Scale Systems that Engage Creativity

Many of the systems that will be discussed here are small - often limited to around 1 KB - and most were developed in a matter of hours or days. These are not systems built around a model of creativity; many of them, in fact, were not created with any particular research purpose in mind. However, each of these systems does explore one or more aspects of creativity relevant to its domain. These systems, without modeling creativity directly, nevertheless inquire about creativity. They also can focus larger-scale investigations of creativity that implement complete models. The systems discussed here all use randomness within some framework of regularity. It can be creative to introduce randomness in a context where, individually or as a culture, regularity is the norm - and vice versa. But the connection between regular elements (a recurring vocabulary, a poetic form, etc.) and randomness (deployed in many different ways) is much more complex, as is the question of when randomness is a quick and easy substitute for a more sophisticated process and when it is the best method. While we believe that small-scale systems can be used to address issues of randomness and creativity, full discussion of this topic must be left for later.

Creative Text Generators of the 1950s and 1960s

By 1952, Christopher Strachey's innovative and certainly small-scale love letter generator was running on the Manchester Mark I and producing texts such as "YOU ARE MY EROTIC APPETITE: MY SWEET ENTHUSIASM. MY LOVE FONDLY WOOS YOUR CURIOUS TENDERNESS. YOU ARE MY WISTFUL SYMPATHY." The system runs today in emulation (Link 2007) and has been discussed recently as "the first experiment in digital literature" (Wardrip-Fruin 2011). Its purpose, it seems, was not to shine with brilliance but to parody the formulaic process of love-letter writing. By being a parody of a banal writing process, this small-scale system did serve as a model - a model of a lack of creativity - and demonstrated that computational processes could relate to human writing processes. In 1959 Theo Lutz published a description of his small-scale system to generate stochastic texts based on Kafka's The Castle: pairs of "elementary sentences" joined by a logical connective. These include (in English translation from the German) "A CASTLE IS FREE AND EVERY FARMER IS FAR." and "NO COUNT IS QUIET THEREFORE NOT EVERY CHURCH IS ANGRY."
By drawing on a well-known author and transforming the text in a way that intensified his disquieting juxtapositions, Lutz created a system with a literary purpose. His system's operation, and its results, were consonant with Kafka's description of a formally valid social system in which the particular combinations were often meaningless. In the United States, and in connection with the Fluxus movement, Alison Knowles and James Tenney published the 1968 chapbook A House of Dust. It consisted of 20 connected sheets of computer paper on which a poem generated by a Fortran program was printed, with each stanza of the same form. An example:

A HOUSE OF GLASS
IN A DESERTED FACTORY
USING ALL AVAILABLE LIGHTING
INHABITED BY COLLECTORS OF ALL TYPES

This project showed that variations of a regular stanza could be interesting when lined up and read one after the other, and that a creative language generator could produce a reasonably lengthy work that is compelling and worth reading.

"about so many things" and "Arrested" from Electronic Flipbooks, Nannette Wylde, 1998

These very simple systems for text generation were written in Macromedia Director. They merely place some strings that are selected uniformly at random into a simple template, the nature of which is self-evident. "Arrested" presents a sort of stanza that describes different situations in which people are arrested, and is, as Wylde describes it, "a play on preconceptions regarding social, ethnic, religious, and political affiliations." In the case of "about so many things," the template is simply "He" followed by a sentence completion and then "She" followed by a sentence completion, to produce text such as: "He likes chocolate / She thinks things should be different." The sentence completions range from being rather gender-neutral to carrying quite different connotations when applied to people of different genders. For instance: "feels stressful," "is a good parent," "has a crush on the teacher," "is a firefighter." The READ_ME file for "about so many things" explains: "the activities are drawn from the same pool of possibilities. Any line of text could be applied to either subject. In essence, the work explores the release of societal constraints regarding gender roles." Many sentences have different connotations when associated with people of different genders. By simply assigning sentences at random to be about either "He" or "She," "about so many things" produces interesting texts that provoke the reader to think about cultural preconceptions related to gender. The lesson for large-scale creative text generators is that determining the gender of a character, or transforming the gender of a character in an existing story, can be an important decision that is part of the creative process.

"The Two" and "Through the Park", Nick Montfort, 2008

"The Two" builds on the core conceit of "about so many things." It uses even less text and only a slightly more complex template, such that the original Python program fits into 1 KB and the JavaScript version is not much larger. (The 1 KB limit is inspired in part by the demoscene, but also by poetic compression. While limitations of this sort are useful in many ways, and do enforce certain types of simplicity, they do not guarantee algorithmic simplicity or clear and readable code.)
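Wylde's Director source is not reproduced here, but the mechanism as described above is small enough to sketch. In the following Python sketch, the function name is ours, and the pool holds only the six completions quoted above (the actual pool is larger); the essential move, as the READ_ME notes, is that both subjects draw from a single shared pool.

import random

# Completions quoted in the discussion above; the real pool is larger.
POOL = [
    "likes chocolate",
    "thinks things should be different",
    "feels stressful",
    "is a good parent",
    "has a crush on the teacher",
    "is a firefighter",
]

def about_so_many_things():
    # Fill the two-slot template from one shared pool, so any
    # completion can be assigned to either gendered subject.
    he, she = random.sample(POOL, 2)
    return "He " + he + "\n" + "She " + she

print(about_so_many_things())

Running this a few times makes the point of the piece concrete: the same completion reads differently depending on which pronoun it lands after.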
Two people described by their roles are introduced in the first line of each generated stanza; pronouns introduced in the second line require that the reader assign specific genders to the people in those two roles; and a conclusion involving both of them is provided:

The indigent turns to the librarian.
She smacks him.
They pray together.

In this case, two characters are introduced, the first of which is stereotypically male in U.S. culture. The second is usually culturally assumed to be female. Then, "she" and "he" appear on the second line, suggesting an obvious but disturbing resolution of reference: the librarian smacking the indigent. Since the typical reader's assumptions about the behavior of librarians and indigents will not line up with this interpretation, the reader may be compelled to consider the other interpretation, that the indigent is female and the librarian male. In either case, this generated text (and many of the texts that are generated) will challenge the reader's assumptions and stereotypes. This and other small-scale systems by this developer have been described and compared to systems of other sizes (Montfort 2012). Another 1 KB Python program that has also been made available in a JavaScript version is "Through the Park." This system is an attempt to build a very simple model of ellipsis or elision, the omission of part of a story. A carefully constructed list of sentences is reduced by a fixed number (removing sentences at random but keeping their order) and the resulting shorter story is output. The method of ellipsis has no intelligence or creativity to it, but with carefully constructed sentences it can nevertheless be effective. "Through the Park" is the subject of a short case study in the next section.

"The Semi-Automatic Doodle Machine" from Microcodes, Páll Thayer, 2010

This tiny program (at 756 characters, "a bit longer than most" in the Microcodes series, as Thayer notes) produces some simple instructions for that non-artistic but potentially creative drawing practice known as doodling. The program first prints "Use a pencil and a 210mm x 210mm sheet of paper. Start with your hand at the upper-left corner." and then prints some instruction such as "With pencil up, move 8mm to the right," printing a new one endlessly each time ENTER is pressed. As a creative text generator, the program is curious because it generates instructions rather than a story or poem. Of course, the program is not framed as generating creative writing, but rather that non-artistic form of drawing known as a doodle. Seen as a generator of visual art, the program is rather hilarious. It uses a person as a sort of plotter, inverting the typical relationship between "user" and computer. With its tedious, precise instructions about how to do a task that has no external value, it might be a parody of creativity assistance software. It also highlights how computation can be applied at different stages of the creative process, questioning whether the entire pipeline of creative generation needs to be built for a system to be effective.

"Through the Park": A Case Study

The small-scale system "Through the Park" is about as simple as it can be while incorporating any computational elements at all. It provides, however, a highly simplified model of an important narrative technique, one useful in full-scale story generation systems.

The Importance of Ellipsis

One way of understanding ellipsis is as one possible tempo at which a narrative may be related.
In this view, it is the leaping over of one or more events in no time at all, which corresponds to telling the story at the fastest possible speed - an infinite speed (Prince 1982). The importance of this narrative technique has been articulated by narrative theorists, including Seymour Chatman: "Ellipsis is as old as The Iliad. But ... ellipsis of a particularly broad and abrupt sort is characteristic of modern narratives" (1978, p. 71). These omissions can allow the reader's imagination to fill the story in, as Fielding explains at the beginning of book III of Tom Jones:

The reader will be pleased to remember, that … we gave him a hint of our intention to pass over several large periods of time ... In so doing, we do not only consult our own dignity and ease, but the good and advantage of the reader: for besides that by these means we prevent him from throwing away his time, in reading without either pleasure or emolument, we give him, at all such seasons, an opportunity of employing that wonderful sagacity, of which he is master, by filling up these vacant spaces of time with his own conjectures; for which purpose we have taken care to qualify him in the preceding pages.

Understanding ellipses has been the subject of some research, but generating ellipses has not been as well studied. As recently as 2006, it appeared that computational narrative systems did not incorporate an ability to use ellipsis (Gervás et al.). Those in the field have noted the relevance of this technique to cinematic and textual story generation, however. The interactive fiction system Curveship (Montfort 2007, p. 107) can generate ellipses but does not determine how to do so. Ellipsis was also supported in the Mimesis system, because "narrative effects in [3D] environments are often achieved by selecting elements of the story world to elide from the narrative discourse (e.g., temporal and causal ellipsis) ..." (Young 2007, p. 14).

A Minimal Ellipsis System

"Through the Park" was prompted by a conversation with Michael Mateas about how to develop the simplest story generator grounded in a meaningful narrative technique. The first version of it, a 1 KB Python program, was posted on Grand Text Auto on November 20, 2008. It has 25 sentences. Nine are removed during execution and the remaining 16 are printed in their original order. The sentences are:

The girl grins and grabs a granola bar.
The girl puts on a slutty dress.
The girl sets off through the park.
A wolf whistle sounds.
The girl turns to smile and wink.
The muscular man paces the girl.
Chatter and compliments cajole.
The man makes a fist behind his back.
A wildflower nods, tightly gripped.
A snatch of song reminds the girl of her grandmother.
The man and girl exchange a knowing glance.
The two circle.
Laughter booms.
A giggle weaves through the air.
The man's breathing quickens.
A lamp above fails to come on.
The man dashes, leaving pretense behind.
Pigeons scatter.
The girl runs.
The man's there first.
Things are forgotten in carelessness.
The girl's bag lies open.
Pairs of people relax after journeys and work.
The park's green is gray.
A patrol car's siren chirps.

The system is meant to tell a version of, or at least allude to, the folktale Little Red Riding Hood. On Grand Text Auto, readers were asked if they considered a system this simple to be a story generator. While not all commenters agreed that it was one, game developer Gregory Weir was the first to reply, echoing Fielding in some ways:

It's definitely a story generator.
I like how my interpretation of the story can vary drastically on which cues are included. This is partly due to a few sharply-charged cues: the girl's smile, the knowing glance, the blank stare, and the police siren. Depending on which of these are included, cues like the girl's bag or the movement can be erotic or horrific. It does depend heavily on the mind's ability to fill in gaps … (Montfort 2008a)

The sentences were consciously written to suggest (although not directly assert) that the two characters might be in a friendlier or more antagonistic relationship, and that the situation is more playful or sinister. Developing this generator led to an improved understanding of ellipsis and of the characteristics (both ontological and linguistic) of story elements and their representations. In this simple system, there is no representation of the underlying fabula or story levels that is separate from a potential text, which may or may not be included in the final, realized discourse. Linguistically, it is problematic to include pronouns or other words that refer to other sentences; if such words are used, "she" or "he" might appear before "the girl" or "the man" are introduced. The more cohesive a text is, the harder it is to elide a sentence from it without adjusting the other sentences. The underlying events in a story also should be able to stand apart, but for narrative interest, it is appropriate that they are, in Weir's terms, "charged" with varying emotional implications. While it seems valuable for the events to be of different valences, it is also helpful that they contribute to a consistent scenario and agree on, for instance, who the two main characters are and what the setting is. A more general model would allow different events/sentences to have different probabilities of being omitted; an even more general one would allow for conditional probabilities. Since experience with "Through the Park" suggests some qualities of the relationship between intersentential cohesion, the relationship between underlying events, and the opportunity for ellipsis, there are insights that could be applied in the development of more elaborate systems.

Generality across Languages

Gregory Rabassa has stated that "translation is essentially the closest reading one can give a text" (1989), suggesting that the translation of a computational system to produce linguistic or narrative creativity would at least have to involve a very deep analysis and understanding of the system. Large-scale systems are seldom translated because of the great effort that would be needed; small-scale systems are more manageable and can be translated in fairly short amounts of time, sometimes even by volunteers. Fedorova translated "Through the Park" to Russian, demonstrating that the system does not only work in English. The small size of the system and the simplicity of its operation facilitated this. The need to maintain an ambiguity of tone or emotion did complicate the translation process to some extent, further highlighting the particular way in which the original sentences were constructed. However, each of the sentences could be translated, resulting in a Russian system that produced ellipses with the same sorts of effects as the original English system. Because "Through the Park" works at the sentence level, modifying the discourse without making adjustments to syntax, it is less language-specific than some other creative text generators are.
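Montfort's original 1 KB program is not reprinted here, but the mechanism is fully specified above: remove nine of the twenty-five sentences at random and print the rest in order. A minimal re-implementation sketch in Python follows, using only five of the listed sentences and removing two, to keep the excerpt short.

import random

# Five of the 25 sentences listed above; the full program uses all 25
# and removes nine of them.
SENTENCES = [
    "The girl grins and grabs a granola bar.",
    "The girl sets off through the park.",
    "A wolf whistle sounds.",
    "The man dashes, leaving pretense behind.",
    "A patrol car's siren chirps.",
]

def elide(sentences, n_removed):
    # Remove n_removed sentences at random, keeping the survivors
    # in their original order -- the entire model of ellipsis.
    kept = sorted(random.sample(range(len(sentences)),
                                len(sentences) - n_removed))
    return " ".join(sentences[i] for i in kept)

print(elide(SENTENCES, 2))

The generalisations mentioned above fit naturally into this frame: per-sentence omission probabilities would replace the uniform random sample, and conditional probabilities would make each omission decision depend on which sentences had already been kept.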
"The Two" uses the ambiguity of gender of noun phrases in its first lines to achieve its effect; this ambiguity is not easy to achieve in all languages. Generality across Story Domains A system that is so specific that it can only tell one story, or one class of stories, is probably not worth much time or attention. While large systems are often difficult to convert to other story domains, adaptation is a good sign that the system is general. In the case of large-scale systems, such adaptations would often be difficult and time consuming; they are easier in smaller-scale systems. The simple underlying system in "Through the Park" was re-used by writer and artist J. R. Carpenter to create two story generators, "Excerpts from the Chronicles of Pookie & JR" and "I've Died and Gone to Devon." The former program was used to produce much of the text of Carpenter's book Generation[s]. "Excerpts" was ported to JavaScript in 2009 by Ravi Rajakumar (independently of the port of "Through the Park"), translated into Spanish and Catalan in 2011 by Laura Borràs Castanyer, and translated into Russian by Natalia Fedorova in 2012. Another system that uses "Through the Park" as a basis is Fedorova's "Halfway Through." This system has one Russian and one English array of sentences; it mingles an inner soliloquy with overheard phrases. "Through the Park" is not the most-reused small-scale creative text generator (for instance, Montfort's Taroko Gorge, which was also originally a 1 KB Python program, has been appropriated and reworked online more than ten times) nor the most-translated (for instance, Montfort's "The Two," another originally 1 KB Python program ported to JavaScript, has been translated to French, Spanish, and Russian). Still, that it has been ported, translated, and re-used attests to its accessibility and flexibility. Benchmarks, Baselines, and Subsystems for Larger-Scale Systems A small-scale system can be used as a benchmark or baseline for evaluating larger-scale systems use similar techniques, driven by more elaborate methods. For instance, it could be worthwhile to compare a sophisticated system that elides parts of a story for a particular purpose (to generate suspense, to increase reader interest) against a system that elides at random, as "Through the Park" does. Even without a purpose-built story, such a system would reveal something about how effective the technique of elision or omission is when applied without any special logic, intelligence, or creativity. As a first step, developers of a creative system for ellipsis should show that it can exceed, by whatever metric, the effectiveness of a random one. If an elaborate creative system to address one particular aspect of story-generation does not exceed the small-scale baseline, all is not lost. A larger-scale system that incorporates several subsystems can simply use the simple, random system to deal with that particular technique (ellipsis, assignment of gender to characters, or something else) while using more elaborate methods elsewhere. Allowing for Small-Scale Work Small-scale systems can be of direct as well as indirect significance. They can be easily understood and modified, even without the involvement of their original creators. The new systems that are developed in this way can contribute to new types of cultural production, having value 2012 85 inside and outside the computational creativity community. 
They can be provocative, challenging the ideas that have been developed using large-scale systems and helping to develop some that have been overlooked. They can be used in teaching as the starting point for literary work or more elaborate exercises in computational expression. Finally, they can be used to sketch, as an artist would, in preparation for undertaking a large-scale work. Despite the worth of small-scale projects and the slight effort that is needed to execute them, the context of computer science, and many interdisciplinary contexts, discourage work on such sketches and encourage researchers to proceed more directly to the development of large-scale systems. There are a few cases where small-scale systems are seen to have a place - for use as examples, for instance, or as subsystems in a larger system - but not many. A dissertation project usually corresponds to a large-scale system, and master's theses and undergraduate capstone projects are typically reduced versions. Most conference papers are based on work with large-scale systems; even short papers, such as this one, are often invited not for the discussion of small-scale systems but for the dissemination of intermediate results about work in progress. Ph.D. students in every field are already expected to understand their research area thoroughly by reviewing and understanding the relevant literature. It seems appropriate for them to spend as much time as they would spend reading a handful of articles on the development of one, or a few, small-scale systems. Such systems allow for different perspectives and approaches to be attempted; they also encourage a focus on the essential and on extreme abstraction of method and of the domain of creativity. There are institutions that support, or could support, the development of small-scale systems. In particular, the hackathon, codefest, demo party, or other sort of competition, as often arranged outside of an academic context as inside it, could be employed to encourage the development of small-scale creativity systems. Although adding such an event to an existing conference would not change the paradigm for system development radically - those who were able to attend and compete would be there because their paper about a large-scale system was accepted - an event for quick development of systems could call attention to the value of such systems. Small-scale systems have definite benefits, despite the institutional preference for using and discussing large-scale ones. These systems are easily portable across platforms, easily translated, easily generalized to different domains, and capable of capturing the essential aspects of important narrative techniques. Since they are also quick to put together, it would be sensible to do more to allow and encourage their development.

Acknowledgements

Our thanks to the anonymous reviewers, particularly for the suggestions that led to the discussion of randomness and the section on benchmarks, baselines, and subsystems.

2012_14 !2012 Automatic Generation of Melodic Accompaniments for Lyrics Kristine Monteith, Tony Martinez, and Dan Ventura Computer Science Department Brigham Young University Provo, UT 84602 USA kristinemonteith@gmail.com, martinez@cs.byu.edu, ventura@cs.byu.edu

Abstract

Music and speech are two realms predominantly species-specific to humans, and many human creative endeavors involve these two modalities.
The pairing of music and spoken text can heighten the emotional and cognitive impact of both, the complete song being much more compelling than either the lyrics or the accompaniment alone. This work describes a system that is able to automatically generate and evaluate musical accompaniments for a given set of lyrics. It derives the rhythm for the melodic accompaniment from the cadence of the text. Pitches are generated through the use of n-gram models constructed from melodies of songs with a similar style. This system is able to generate pleasing melodies that fit well with the text of the lyrics, often doing so at a level similar to that of human ability.

Introduction

Programmers and researchers have often attempted to endow machines with some form of intelligence. In some cases, the end goal of this is purely practical; a machine with the capacity to learn could provide a multitude of useful and resource-saving tasks. But in other cases, the goal is simply to make machines behave in a more creative or more "human" manner. As one author explains, "Looked at in one way, ours is a history of self-imitation... We are ten times more fascinated by clockwork imitations than by real human beings performing the same task." (McCorduck 2004). One major area of human creativity involves the production of music. Wiggins (2006) states that "...musical behavior is a uniquely human trait... further, it is also ubiquitously human: there is no known human society which does not exhibit musical behaviour in some form." Naturally, many computer science researchers have turned their attention to musical computation tasks. Researchers have attempted to classify music, measure musical similarity, and predict the musical preferences of users (Chai and Vercoe 2001; McKay and Fujinaga 2004). Others have investigated the ability to search through, annotate, and identify audio files (Dannenberg et al. 2003; Dickerson and Ventura 2009). More directly in the realm of computational creativity, researchers have developed systems that can automatically arrange and compose music (Oliveira and Cardoso 2007; Delgado, Fajardo, and Molina-Solana 2009). Like music, speech is an ability that is almost exclusively human. While species such as whales or birds may communicate through audio expressions, and apes may even be taught simple human-like vocabularies and grammars using sign language, the complexities of human language set us apart in the animal kingdom. Major research efforts have been directed toward machine recognition and synthesis of human speech (Rabiner 1989; Koskenniemi 1983). Computer programs have been designed to carry on conversations, some of them doing so in a surprisingly human-like manner (Weizenbaum 1966; Saygin, Cicekli, and Akman 2000). More creative programming endeavors have involved the generation of poetry (Gervás 2001; Rahman and Manurung 2011) or text for stories (Riedl 2004; Pérez y Pérez and Sharples 2004; Gervás et al. 2005; Ang, Yu, and Ong 2011). Gfeller (1990) points out the similarities between speech and music: "Both speech and music are species specific and can be found in all known cultures. Both forms of communication evolve over time and have structural similarities such as pitch, duration, timbre, and intensity organized through particular rules (i.e. syntax or grammar) that result in listener expectations." Studies show that music and the spoken word can be particularly powerful when paired together.
For example, in one study, researchers found that a sung version of a story was often more effective at reducing an undesirable target behavior than a read version of the story (Brownell 2002). Music can help individuals with autism and auditory processing disorders more easily engage in dialog (Wigram 2002). The pairing of music with language can even help individuals regain lost speech abilities through a process known as Melodic Intonation Therapy (Gfeller 1990; Schlaug, Marchina, and Norton 2008). On the other hand, lyrics have the advantage of being able to impart discursive information where the more abstract nature of music makes it less fit to do so (Kreitler and Kreitler 1972). Lyrics can also contribute to the emotional impact of a song. One study found that lyrics enhanced the emotional impact of a selection with sad or angry music (Ali and Peynircioglu 2006). Another found that lyrics tended to be a better estimator of the overall mood of a song than the melody when the lyrics and the melody disagree (Wu et al. 2009). This work describes a system that can automatically compose melodic accompaniments for any given text. For each given lyric, it generates hundreds of different possibilities for rhythms and pitches and evaluates these possibilities with a number of different metrics in order to select a final output. The system also incorporates an awareness of musical style. It learns stylistic elements from a training corpus of melodies in a given genre and uses these to output a new piece with similar elements. In addition to self-evaluation, the generated selections are further evaluated by a human audience. Survey feedback indicates that the system is able to generate melodies that fit well with the cadence of the text and that are often as pleasing as the original accompanying tunes. Colton, Charnley, and Pease (2011) suggest a number of different measures that can be used to evaluate systems during the creative process. We direct particular attention to two of these, precision and reliability, and demonstrate that, for simpler styles, our system is able to perform well with regard to these metrics.

Related Work

Conklin (2003) summarizes a number of statistical models which can be used for music generation, including random walk, Hidden Markov Models, stochastic sampling, and pattern-based sampling. These approaches can be seen in a number of different studies. For example, Chuan and Chew (2007) use Markov chains to harmonize given melody lines, focusing on harmonization in a given style. Cope (2006) also uses statistical models to generate music in a particular style, producing pieces indistinguishable from human-generated compositions. Pearce and Wiggins (2007) provide an analysis of a number of strategies for melodic generation, including one similar to the generative model used in this paper. Delgado, Fajardo, and Molina-Solana (2009) use a rule-based system to generate compositions according to a specified mood. Oliveira and Cardoso (2007) describe a wide array of features that contribute to emotional content in music and present a system that uses this information to select and transform chunks of music in accordance with a target emotion. Researchers have also directed efforts towards developing systems intended for accompaniment purposes. Dannenberg (1985) presents a system of automatic accompaniment designed to adapt to a live soloist. Lewis (2000) also details a "virtual improvising orchestra" that responds to a performer's musical choices.
While not directly related to generating melodic accompaniment for lyrics, a number of studies have looked at aligning musical signals to textual lyrics (the end result being similar to manually-aligned karaoke tracks). For example, Wang and associates (2004) use both low-level audio features and high-level musical knowledge to find the rhythm of the audio track and use this information to align the music with the corresponding lyrics.

Methodology

In order to generate original melodies, a set of melodies is compiled for each different style of composition. These melodies were isolated from MIDIs obtained from the Free MIDI File Database (http://www.mididb.com/) and the "I Love MIDIs" website (http://www.ilovemidis.com/ForKids/NurseryRhymes/). These selections help determine both the rhythmic values and pitches that will be assigned to each syllable of the text. The system catalogs the rhythmic patterns that occur for each of the various numbers of notes in a given measure. The system also creates an n-gram model representing what notes are most likely to follow a given series of notes in a given set of melodies. Models were developed for three stylistic categories: nursery rhymes, folk songs (bluegrass), and rock songs (Beatles). For each lyric, the system first analyzes the text and assigns rhythms. It determines where the downbeats will fall for each given line of the text. One hundred different downbeat assignments are generated randomly and evaluated according to a number of aesthetic measures. The system selects the random assignment with the highest score for use in the generated melody. The system then determines the rhythmic values that will be assigned to each syllable in the text by counting the number of syllables in a given measure and finding a rhythm that matches that number of syllables in one of the songs of the training corpus. Once rhythmic values are assigned, the system assigns pitches to each value using the n-gram model constructed from the training corpus. Once again, one hundred different assignments are generated and evaluated according to a number of metrics. Further details on the rhythm and pitch generation are provided in the following subsections.

Rhythm Generation

Rhythms are generated based on patterns of syllabic stress in the lyrics. Each word of the text is located in the CMU Pronunciation Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) to determine the stress patterns of the constituent phonemes. (Each phoneme in the dictionary is labeled 0, 1, or 2 for "No Stress," "Primary Stress," or "Secondary Stress.") The system also looks up each word to determine if it occurs on a list of common articles, prepositions, and conjunctions. The system then attempts to find the best positions for downbeats. For each given line of text, the system generates 100 possible downbeat assignments. The text of each line is distributed over four measures, so four syllables are randomly selected to carry a downbeat. Each assignment is then scored, and the system selects the assignment receiving the highest score for use in the melodic accompaniment. Downbeat assignments that fall on stressed syllables are rated highly, as are downbeats that fall on the beginning of a word and ones that do not fall on articles, prepositions, or conjunctions. Downbeat assignments that space syllables more evenly across the allotted four measures are also rated more highly (i.e. assignments that have a lower standard deviation for number of syllables per measure receive higher scores). See Figure 4 for further details on the precise downbeat scoring metrics.
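To make the scoring concrete, here is a hedged Python sketch of this select-best-of-100 step. The weights follow those given in Figure 4; the data encoding (parallel lists over syllables), the derivation of syllables-per-measure from downbeat positions, and the example data are our own illustrative choices, not the authors' code.

import random
import statistics

def score_downbeats(db_idx, stress, begins_word, is_function_word,
                    syll_per_measure, num_pickup_sylls):
    # Weights follow Figure 4; the data encoding is our own.
    score = 0.0
    for i in db_idx:
        if stress[i] == 1:
            score += 1.0   # downbeat on a stressed syllable
        if not is_function_word[i]:
            score += 0.5   # not an article, preposition, or conjunction
        if begins_word[i]:
            score += 0.5   # downbeat at the start of a word
    x = max(syll_per_measure)
    score += (x - statistics.pstdev(syll_per_measure)) * 0.5  # even spread
    score += (x - num_pickup_sylls) * 0.25                    # few pickup syllables
    score += (x - syll_per_measure[-1]) * 0.25                # light final measure
    return score

# Illustrative data for "Pat a cake pat a cake baker's man" (nine syllables).
stress           = [1, 0, 1, 1, 0, 1, 1, 0, 1]
begins_word      = [True, True, True, True, True, True, True, False, True]
is_function_word = [False, True, False, False, True, False, False, False, False]

def random_assignment():
    # Four downbeat syllables chosen at random, as in the paper.
    db = sorted(random.sample(range(len(stress)), 4))
    bounds = db + [len(stress)]
    spm = [bounds[k + 1] - bounds[k] for k in range(4)]  # syllables per measure
    return db, spm, db[0]  # syllables before the first downbeat are pickups

best = max((random_assignment() for _ in range(100)),
           key=lambda c: score_downbeats(c[0], stress, begins_word,
                                         is_function_word, c[1], c[2]))
print("best downbeat positions:", best[0])

The same generate-and-test pattern - generate many random candidates, keep the highest-scoring one - is reused for pitch assignment below.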
Figure 1 illustrates a possible downbeat assignment for a sample lyric.

Lyrics:    Pat   a     cake  pat   a     cake
Phonemes:  PAET  AH    KEYK  PAET  AH    KEYK
Stress:    1     0     1     1     0     1
Downbeats: true  false false true  false false

Lyrics:    baker's      man
Phonemes:  BEY   KERZ   MAEN
Stress:    1     0      1
Downbeats: true  false  true

Figure 1: Sample downbeat assignments for Pat-A-Cake lyrics

Figure 2: Default rhythm assignments for Pat-A-Cake lyrics (musical notation not reproduced)

Once the downbeats are assigned, a rhythmic value is assigned to each syllable. The system randomly selects a piece in the training corpus to provide rhythmic inspiration. This selection determines the time signature of the generated piece (e.g. three beats or four beats to a measure). For each measure of music generated, the system looks to the selected piece and randomly chooses a measure that has the necessary number of notes. For example, if the system needs to generate a rhythm for a measure with three syllables, it randomly chooses a measure in the training corpus piece that has three notes in it and uses its rhythm in the generated piece. If no measures are available that match the number of syllables in the lyric, the system arbitrarily assigns rhythmic values, with longer values being assigned to earlier syllables. For example, in a measure with three syllables using a three-beat pattern, each syllable would be assigned to a quarter note. In a measure with four syllables, the first two syllables would be assigned to quarter notes and the last two syllables to eighth notes. Figure 2 illustrates the default rhythm assignment for a sample lyric.

Pitch Generation

Once the rhythm is determined, pitches are selected for the various rhythmic durations. Selections from a given style corpus are first transposed into the same key. Then an n-gram model with an n value of four is constructed from these original melodic lines. The model was created simply from the original training melodies, with no smoothing. For the new, computer-generated selections, melodies are initialized with a series of random notes, selected from a distribution that models which notes are most likely to begin musical selections in the given corpus. In order to foster song cohesion, each line of the song is initialized with the same randomly generated three notes. Additional notes for each line are randomly selected based on a probability distribution of what note is most likely to follow the given three notes, as indicated by the n-gram model of the style corpus. The system generates several hundred possible series of pitches for each line. Each possible pitch assignment is then scored. To encourage melodic interest, higher scores are given to melodic lines with a higher number of distinct pitches, and melodies featuring excessive repeated notes are penalized. Melodies with a range greater than an octave and a half or with interval jumps greater than an octave are penalized, since these are less "sing-able." Melodic lines that do not end on a note in a typical major or minor scale, and final melodic lines that do not end on a tonic note, are given a score of zero. More precise details about the scoring of pitch assignments are given in Figure 4. Possible pitch assignments for a sample lyric are shown in Figure 3.

Figure 3: Sample pitch assignments for Pat-A-Cake lyrics (musical notation not reproduced)
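The pitch model can likewise be sketched. The Python sketch below follows the description above - a 4-gram model (three-note context) built from corpus melodies, every line seeded with the same three opening notes, and continuations sampled from the empirical follower distribution with no smoothing. The toy corpus, the names, and the fallback for unseen contexts are ours; the paper does not say how unseen contexts are handled.

import random
from collections import defaultdict

def build_ngram_model(melodies, n=4):
    # For every (n-1)-note context in the corpus melodies, record
    # which note follows and how often.
    model = defaultdict(list)
    for melody in melodies:
        for i in range(len(melody) - n + 1):
            context = tuple(melody[i:i + n - 1])
            model[context].append(melody[i + n - 1])
    return model

def generate_line(model, seed, length):
    # Extend a three-note seed by sampling each next note from the
    # empirical distribution for the current three-note context.
    line = list(seed)
    while len(line) < length:
        followers = model.get(tuple(line[-3:]))
        if not followers:           # unseen context: restart from the seed
            followers = model[tuple(seed)]
        line.append(random.choice(followers))
    return line

# Toy corpus of MIDI pitch numbers; real training data is full melodies
# transposed into a common key.
corpus = [[60, 62, 64, 65, 64, 62, 60, 62, 64, 64, 62, 60],
          [60, 62, 64, 62, 60, 64, 65, 67, 65, 64, 62, 60]]
model = build_ngram_model(corpus)
seed = (60, 62, 64)   # in the paper, the same seed opens every line
print(generate_line(model, seed, 8))

In the full system, each sampled line would then be scored against the ScorePitches criteria of Figure 4, with the best of several hundred candidates kept.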
Results

Accompaniments were generated for lyrics in three stylistic categories: nursery rhymes, folk songs (bluegrass), and rock songs (Beatles). In each case, an attempt was made to find less commonly known melodies, so that the generated music could be more fairly compared to the original melodic lines. Melodic lines were generated for the following:

Nursery rhymes:
• Goosey Goosey Gander
• Little Bo Peep
• Pat-a-Cake
• Rub-a-Dub-Dub
• The Three Little Kittens

Folk songs:
• Arkansas Traveler
• Battle of New Orleans
• Old Joe Clark
• Sally Goodin
• Wabash Cannonball

Rock songs:
• Act Naturally
• Ask Me Why
• A Taste of Honey
• Don't Pass Me By
• I'll Cry Instead

Three melodies were generated for each of the fifteen lyrics considered. One was generated using a corpus of songs that matched the style of the lyrics (e.g. to generate a melody for Goosey Goosey Gander the four other nursery rhyme songs were used to build the n-gram model) and two more were generated in the remaining two creative styles4.

1: MelodicAccompaniment(Lyric, StyleCorpus)
2: for all LINE_i in Lyric do
3:   STR_i ← patterns of syllabic stress in LINE_i
4:   POS_i ← parts of speech for each syllable in LINE_i
5:   BEG_i ← boolean values indicating that a syllable in LINE_i begins a word
6:   for j = 1 to 100 do
7:     DB_j ← randomly assign downbeats to four syllables
8:     score_j ← ScoreDownbeats(DB_j, STR_i, POS_i, BEG_i)
9:   end for
10:  DB_i ← the DB_j that coincides with the largest score_j
11:  RHYTHM_i ← SelectRhythms(DB_i)
12:  for j = 1 to 100 do
13:    PITCHES_j ← assign pitches using n-gram model from StyleCorpus
14:    score_j ← ScorePitches(PITCHES_j)
15:  end for
16:  PITCHES_i ← the PITCHES_j that coincides with the largest score_j
17:  MELODY_i ← combine RHYTHM_i and PITCHES_i
18:  MELODY += MELODY_i
19: end for
20: return MELODY

1: ScoreDownbeats(DB_j, STR_i, POS_i, BEG_i)
2: for each syllable k do
3:   if DB_jk and STR_ik = 1 then score += 1
4:   if DB_jk and POS_ik ≠ Art|Prep|Conj then score += 0.5
5:   if DB_jk and BEG_ik then score += 0.5
6: end for
7: x ← maxSyllablesPerMeasure
8: score += (x − stdDevSyllablesPerMeasure) × 0.5
9: score += (x − numPickupSyllables) × 0.25
10: score += (x − numSyllablesLastMeasure) × 0.25
11: return score

1: SelectRhythms(D_i, S_i)
2: M ← divide S_i into measures based on D_i
3: C ← randomly select a song in StyleCorpus
4: R ← ∅
5: for all M_j in M do
6:   R_j ← randomly selected measure from C with the same number of notes as syllables in M_j
7:   R += R_j
8: end for
9: return R

1: ScorePitches(PITCHES_j)
2: score ← uniquePitches(PITCHES_j) / size(PITCHES_j)
3: if MaxRepeatPitches(PITCHES_j) < maxRepeatPitches then score += 1
4: if Range(PITCHES_j) < maxRange then score += 1
5: if MaxInterval(PITCHES_j) < maxInterval then score += 1
6: if not EndsOnScaleNote(PITCHES_j) then score ← 0
7: if LastLine(j) and not EndsOnTonic(PITCHES_j) then score ← 0
8: return score

Figure 4: Algorithm for automatically generating melodic accompaniment for text

Table 1: Average responses to the question "How familiar are you with these lyrics?" Each row represents a compositional style and each column a category of lyrics.

            bluegrass  nursery  rock   average
bluegrass     1.34      3.09    1.19    1.87
nursery       1.14      3.32    1.19    1.88
rock          1.25      3.28    1.11    1.88
original      1.50      3.50    1.47    2.16

Study participants were divided into four groups. Each group was asked to listen to versions of songs for each of the fifteen lyrics, with selections for each group being a mixture of lyrics with the original human-composed melodies and lyrics with the three types of computer-generated melodies.
Subjects were not informed that any of the melodies were computer-generated until after data collection. Fifty-two subjects participated in the study, and each melodic version was played for thirteen people. After each selection, subjects were asked to respond to the following questions (1 = not at all, 5 = very much):

• How familiar are you with these lyrics? 1 2 3 4 5
• How familiar are you with this melody? 1 2 3 4 5
• How pleasing is the melodic line? 1 2 3 4 5
• How well does the music fit with the lyrics? 1 2 3 4 5
• Is this the style of melody you would have expected to accompany these lyrics? 1 2 3 4 5
• Are you familiar with any other melodies for these lyrics? YES NO

Table 1 shows the average responses to the question about familiarity of lyrics for each of the three categories. In each case, lyrics were rated as more familiar when they were paired with their original melodies as opposed to the computer-generated melodies. However, none of these differences were significant at the p < 0.05 level. The majority of subjects were relatively unfamiliar with the bluegrass and rock lyrics. The nursery rhyme lyrics were slightly more familiar, but in many cases, subjects were familiar with the lyrics but not any specific tune.

Table 2 shows the average responses to the question about familiarity of melody for each of the three categories. On average, subjects were slightly more familiar with the original melodies in the bluegrass and rock categories than they were with the lyrics. The original nursery rhyme melodies were rated as slightly less familiar on average than the lyrics. System-generated melodies received an average score of less than two for familiarity in each of the three categories (significantly lower than original melodies, with a statistical significance of p < 0.01).

Table 2: Average responses to the question "How familiar are you with this melody?" Each row represents a compositional style and each column a category of lyrics.

            bluegrass  nursery  rock   average
bluegrass     1.62      1.49    1.40    1.50
nursery       1.53      2.17    1.34    1.68
rock          1.41      1.39    1.24    1.35
original      2.31      2.94    1.81    2.35

Table 3: Average responses to the question "How pleasing is the melodic line?" Each row represents a compositional style and each column a category of lyrics.

            bluegrass  nursery  rock   average
bluegrass     3.50      3.50    3.56    3.52
nursery       3.37      3.24    3.09    3.23
rock          2.70      2.17    2.16    2.34
original      3.79      3.79    2.95    3.51

Subjects were likely to be less receptive to new melodies if they were very familiar with the old ones. (One respondent mentioned that hearing a new melody to a familiar childhood song was a little "unnerving".) Tables 3 through 7 report only the responses where subjects indicated that they were not familiar with an alternate melody for a given set of lyrics. As shown in Table 3, the system was able to generate melodies that received the same average ratings for pleasing melodic lines as the original melodies. The average rating for songs in the bluegrass style was almost identical to that of the original melodies. The average ratings for pleasantness of generated nursery rhyme melodies were not significantly different from those of the original tunes. For over a third of the lyrics, a computer-generated melody in at least one style was rated as more pleasing than the original melody. These tunes are listed in Table 4 along with their average ratings.

4 Selections generated for these experiments are available at http://axon.cs.byu.edu/emotiveMusicGeneration
For example, the original melody for Battle of New Orleans received a rating of 3.33 for average melodic pleasantness. The computer-generated melody for this lyric in a nursery rhyme style received a rating of 3.92. The original melody for Little Bo Peep received an average melodic pleasantness rating of 3.22. The bluegrass-styled computer-generated melody received a rating of 3.80, and the nursery-rhyme-styled generated melody received a rating of 3.43.

Table 4: Average responses to the question "How pleasing is the melodic line?" for six songs where a system-generated melody in one or more styles scored higher than the original melody.

            Battle of New Orleans  Little Bo Peep  Rub A Dub Dub  Act Naturally  Ask Me Why  I'll Cry Instead
bluegrass          3.23                3.60            3.80           3.50          4.23          3.79
nursery            3.92                3.43            3.17           2.91          3.14          2.92
rock               2.83                2.60            2.13           2.54          2.00          2.36
original           3.33                3.22            3.50           2.70          2.83          2.12

Table 5 shows that the original melodies were rated on average as fitting a little better with the lyrics (although the difference between the original melodies and the songs composed in the bluegrass style is not statistically significant). However, as shown in Table 6, a number of the individual computer-generated melodies were still rated as fitting better with the lyrics than the original melodies. For example, the rock version of Old Joe Clark received a rating of 3.00 on this metric while the original version received a rating of 2.75. Both the bluegrass and nursery-rhyme versions of Ask Me Why received higher ratings than the original version.

Table 5: Average responses to the question "How well does the music fit with the lyrics?" Each row represents a compositional style and each column a category of lyrics.

            bluegrass  nursery  rock   average
bluegrass     3.59      3.20    3.18    3.32
nursery       3.35      3.36    2.71    3.14
rock          3.23      2.18    2.26    2.56
original      3.88      4.27    2.90    3.68

Table 7 reports responses to the question "Is this the style of melody you would have expected to accompany these lyrics?" Not surprisingly, the original melodies were more "expected" on average than melodies composed in new styles. The computer-generated melodies composed in the style of the original melodies were also generally more expected, with one exception: bluegrass melodies for rock lyrics tended to receive higher expectation ratings. In a number of cases, the system was able to compose an unexpected melody that still received high ratings for pleasing melodies and a lyric/note match. Two such examples are shown in Table 8. In both cases, the songs received above average ratings for melodic pleasantness and average ratings for music/lyric match, but below average ratings for style expectedness.

Discussion

The original nursery rhymes were composed predominantly with notes of the major scale, and the rhythms in these songs were similarly simple. (Songs generated with corpus-inspired rhythms were quite similar to songs generated with the system's default rhythms.) With the exception of a flat seventh introduced by the mixolydian scale of Old Joe Clark, the bluegrass melodies also feature pitches exclusively from the major scale. Bluegrass rhythms also tended to be similarly straightforward. With simpler rhythms and fewer accidentals, more of the melodies generated in these two styles are likely to "work." The original bluegrass melodies tended to have more interesting melodic motion, and this appears to have translated into more interesting system-generated melodies. In contrast, the rock songs featured a much wider variety of scales and accidentals.
Table 6: Average responses to the question "How well does the music fit with the lyrics?" for six songs where a system-generated melody in one or more styles scored higher than the original melody.

            Arkansas Traveler  Old Joe Clark  Three Little Kittens  Ask Me Why  A Taste of Honey  I'll Cry Instead
bluegrass        4.08              2.71             4.25               3.54           2.57             3.43
nursery          3.08              2.75             3.80               3.07           2.85             2.38
rock             3.08              3.00             2.18               1.77           2.08             2.27
original         3.91              2.75             4.17               2.75           2.79             2.15

Table 7: Average responses to the question "Is this the style of melody you would have expected to accompany these lyrics?"

            bluegrass  nursery  rock   average
bluegrass     3.47      2.85    2.91    3.08
nursery       3.22      3.46    2.44    3.04
rock          3.12      1.82    2.14    2.36
original      3.69      4.27    2.79    3.58

These extra tones do add color to the generated selections, but further refinements may be necessary to select which more complicated melodies are "fresh" or "original" instead of just "weird."

Wiggins (2006) proposes a definition for computational creativity as "The performance of tasks which, if performed by a human, would be deemed creative." The task of simply composing any decent new melody for an established tune could be considered creative. Composing one that improves on the original constitutes an even greater degree of creative talent. By this metric, our system fits the definition of "creative."

Table 8: Average responses to questions for two songs where the melodic accompaniment was surprising but still worked.

                                                       Pat-A-Cake (bluegrass)  Act Naturally (bluegrass)
How pleasing is the melodic line?                              3.80                    3.50
How well does the music fit with the lyrics?                   3.20                    3.17
Is this the style of melody you would have expected?           2.60                    2.50

Colton (2008) suggests that, for a computational system to be considered creative, it must be perceived as possessing skill, appreciation, and imagination. A basic knowledge of traditional music behavior allows a system to meet the "skillful" criterion. Our system takes advantage of statistical information about rhythms and melodic movement found in the training songs to compose new melodies that behave according to traditional musical conventions. A computational system may be considered "appreciative" if it can produce something of value and adjust its work according to the preferences of itself or others. Our system addresses this criterion by producing hundreds of different possible rhythm and pitch assignments and evaluating them against some basic rules for pleasantness and singability. The "imaginative" criterion can be met if the system can create new material independent of both its creators and other composers. Since all of the generated melodies can be distinguished from songs in the training corpora, this criterion is met at least on a basic level. Our system further demonstrates its imaginative abilities by composing melodies in alternate styles that still manage to demonstrate an acceptable level of melodic pleasantness and synchronization with the cadence of the text.

Boden (1995) argues that unpredictability is also a critical element of creativity, and a number of researchers have investigated the role of unpredictability in creative systems (Macedo 2001; Macedo and Cardoso 2002). Our system meets the requirement of unpredictability with its ability to compose in various and sometimes unexpected styles. It is able to generate melodies that surprise listeners but still achieve high ratings for pleasantness.
Colton, Charnley, and Pease (2011) propose a number of different metrics in conjunction with their FACE and IDEA models that can be used to assess software during a session of creative acts. Equations for calculating these metrics are listed in Figure 5, where S is the creative system, (c_i^g, e_i^g) is a concept/expression pair generated by the system, a^g is an aesthetic measure of evaluation, and t is a minimum acceptable aesthetic threshold. Two of the measures suggested are precision (the proportion of generated works that meet a minimum acceptable aesthetic threshold) and reliability (obtained by taking the system's best creation as calculated by some aesthetic measure and subtracting the system's worst).

average(S) = (1/n) Σ_{i=1..n} a^g(c_i^g, e_i^g)
best_ever(S) = max_{i=1..n} a^g(c_i^g, e_i^g)
worst_ever(S) = min_{i=1..n} a^g(c_i^g, e_i^g)
precision(S) = (1/n) |{(c_i^g, e_i^g) : a^g(c_i^g, e_i^g) ≥ t}|
reliability(S) = best_ever(S) − worst_ever(S)

Figure 5: Assessment metrics proposed by Colton, Charnley, and Pease (2011)

Table 9 reports the results of these calculations for the system's compositions in each of the three styles and compares them to the same metrics calculated for the original songs, using responses to the question "How pleasing is the melodic line?" as the scoring metric. In order to calculate precision, we consider the worst score obtained by an original, human-composed melody to be the minimum acceptable threshold value.

Table 9: Assessment metrics calculated from average responses to the question "How pleasing is the melodic line?"

              bluegrass  nursery  rock   original
average         3.52      3.23    2.34     3.51
best ever       4.23      3.92    3.83     4.50
worst ever      2.93      2.58    1.73     2.12
precision       1.00      1.00    0.67     1.00
reliability     1.30      1.33    2.11     2.38

While the prize for most pleasing melody still goes to a human-composed song, all of the songs composed in a bluegrass and nursery style and two-thirds of the rock songs meet the basic criterion of being better than the worst original melody. The system is generating original melodies that are better than some established, human-generated songs a remarkable percentage of the time. The reliability of the system in generating bluegrass and nursery-style melodies is also worth mentioning. The reliability measures for these two categories are 1.30 and 1.33, as compared to the 2.38 reliability measure for original songs. (Note that, for reliability, smaller scores are more desirable.) While the system probably shouldn't quit its day job to become a classic rock songwriter quite yet, it is quite reliable at producing reasonable and pleasing melodies in the other two genres.

Similar results can be seen in Table 10, where responses to the question "How well does the music fit with the lyrics?" are used as the aesthetic measure. As with the previous calculations, the "worst ever" score for an original melody was used as a minimum aesthetic threshold for the generated melodies. Again, all of the nursery rhyme and bluegrass-styled compositions meet this threshold, as do two-thirds of the rock-styled songs. A song generated in the nursery rhyme or bluegrass style also more reliably matches the lyrics than an arbitrarily selected human-generated song.
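The session metrics of Figure 5 reduce to a few lines of arithmetic. The sketch below is a direct transcription under the definitions above; the function and key names are ours, not from Colton, Charnley, and Pease (2011).

```python
def session_metrics(scores, threshold):
    """scores: aesthetic values a^g(c_i, e_i) for the n generated artefacts;
    threshold: the minimum acceptable aesthetic value t."""
    n = len(scores)
    best, worst = max(scores), min(scores)
    return {
        "average": sum(scores) / n,
        "best_ever": best,
        "worst_ever": worst,
        # proportion of artefacts meeting the minimum aesthetic threshold t
        "precision": sum(s >= threshold for s in scores) / n,
        # spread between best and worst output; smaller means more reliable
        "reliability": best - worst,
    }

# e.g. with four hypothetical per-song ratings and the worst original
# melody's score from Table 9 (2.12) as the threshold t
print(session_metrics([3.8, 2.9, 4.2, 2.2], threshold=2.12))
```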
Previous versions of our system analyzed each melody in a given training corpus according to a number of different metrics and used the results in the construction of neural networks designed to evaluate generated melodies (Monteith, Martinez, and Ventura 2010). For the sake of simplicity and computational speed, the most pertinent of these findings were distilled into rules for use by the system in these experiments. In other words, the information gathered by the system to date about melody generation has been simplified and codified so that more focus could be directed towards matching rhythms to text. However, the system could likely benefit from the use of additional metrics and further "observation" of human-generated and approved tunes in its attempts to create pleasing melodies.

A similar process of evaluation could be applied to the process of rhythm generation, particularly in the assignment of downbeats. Currently, the system relies on a small set of arbitrary, pre-coded rules to determine downbeat placement. It would likely require a much larger training corpus than we currently have available, but perhaps more natural-sounding placements could be obtained if the system could learn from a corpus of "good" lyric/melody pairings the types of words and syllables best suited for supporting downbeats. Audience feedback could help determine an optimal weighting of the various evaluation criteria.

Table 10: Assessment metrics calculated from average responses to the question "How well does the music fit with the lyrics?"

              bluegrass  nursery  rock   original
average         3.32      3.14    2.56     3.68
best ever       4.25      3.86    4.23     4.75
worst ever      2.57      2.36    1.63     2.15
precision       1.00      1.00    0.67     1.00
reliability     1.68      1.49    2.61     2.60

2012_15 !2012 Full-FACE Poetry Generation Simon Colton1, Jacob Goodwin1 and Tony Veale2 1 Computational Creativity Group, Department of Computing, Imperial College London, UK. ccg.doc.ic.ac.uk 2 School of Computer Science and Informatics, University College Dublin, Ireland. afflatus.ucd.ie

Abstract. We describe a corpus-based poetry generation system which uses templates to construct poems according to given constraints on rhyme, meter, stress, sentiment, word frequency and word similarity. Moreover, the software constructs a mood for the day by analysing newspaper articles; uses this to determine both an article to base a poem on and a template for the poem; creates an aesthetic based on relevance to the article, lyricism, sentiment and flamboyancy; searches for an instantiation of the template which maximises the aesthetic; and provides a commentary for the whole process to add value to the creative act. We describe the processes behind this approach, present some experimental results which helped in fine-tuning, and provide some illustrative poems and commentaries. We argue that this is the first poetry system which generates examples, forms concepts, invents aesthetics and frames its work, and so can be assessed favourably with respect to the FACE model for comparing creative systems.

Introduction

Mainstream poetry is a particularly human endeavour: written by people, to be read by people, and often about people. Therefore - while there are some exceptions - audiences expect the opportunity to connect on an intellectual and/or emotional level with a person, which is often the author. Even when the connection is made with characters portrayed in the poem, the expectation is that the characters have been written from a human author's perspective. In the absence of information about an author, there is a default, often romantic, impression of a poet which can be relied upon to provide sufficient context to appreciate the humanity behind a poem.
Using such an explicit, default or romantic context to enhance one's understanding of a poem is very much part of the poetry reading experience, and should not be discounted.

Automated poetry generation has been a mainstay of computational creativity research, with dozens of computational systems written to produce poetry of varying sophistication over the past fifty years. In the literature review given below, it is clear that the emphasis has been almost entirely on artefact generation, i.e., producing text to be read as poetry, rather than addressing the issues of context mentioned above. Therefore, without exception, each of these systems has to be seen as an assistant (with various levels of autonomy) for the system's user and/or programmer, because that person provides the majority of the context. This is usually achieved by supplying the background material and templates; or curating the output; or writing technical papers to describe the sophistication of the system; or writing motivational text to enhance audience understanding, etc. While such poetry assistants are very worthwhile, we aim instead to build a fully autonomous computer poet, and for its poems to be taken seriously in full disclosure of the computational setting.

The first step towards this aim is to acknowledge that the poems generated will not provide the usual opportunities to connect with a human author, as mentioned above. A second step therefore involves providing a suitable substitute for the missing aspects of humanity. To partly address this, we have built a system to construct poems via a corpus-based approach within which existing snippets of human-written text are collated, modified and employed within the stanzas of poem templates. In particular, in the Corpus-Based Poetry section below, we describe how a database of similes mined from the internet, along with newspaper articles, can be used to generate poems.

A third step, which also addresses the missing human element to some extent, involves providing a context within which a poem can be read. Software may not be able to provide an author-centric human context, but it can provide a context which adds value to a poem via an appeal to aspects of humanity, in particular emotions. In the section below entitled Handing over High-Level Control, we describe how the software uses a corpus of newspaper articles to (a) determine a mood for the day in which it is writing a poem, which it uses to (b) generate an aesthetic and templates within which to generate poems, then (c) selects and modifies corpus material to instantiate the templates with, ultimately producing poems that express the aesthetic as best as possible. To communicate aspects of the context, a final step has been to enable it to provide a commentary on its work, which can be referred to by readers if required. In the Illustrative Results section, we present some poems along with the commentaries generated alongside them.

Given our aim for the poems to be considered in full disclosure of their computational context, along with various other arguments given in (Pease and Colton 2011b), we believe it is not appropriate to use Turing-style tests in the evaluation of this poetry generation project. Instead, we turn initially to the FACE descriptive model described in (Colton, Charnley, and Pease 2011) and (Pease and Colton 2011a), which suggests mechanisms for evaluating software in terms of the types of generative acts it performs.
In the Conclusions and Future Work section below, we argue that we can reasonably claim that our software is the first poetry generator to achieve ground artefact generation of each of the four types prescribed in the FACE model, namely: examples, concepts, aesthetics and framing information. We believe that such full-FACE generation is the bare minimum required before we can start to properly assess computer poets in the wider context of English literature, which is a longer term aim for this project. We describe how we plan to increase the autonomy and sophistication of the software to this end.

Background

Perhaps the first computational poetry generator, the Stochastische Texte system (Lutz 1959), sought recognisably Modernist literary affect using a very small lexicon made from sixteen subjects and sixteen predicates from Kafka's Das Schloß. The software randomly fitted Kafka's words into a pre-defined grammatical template. Poems by software in this genre - where a user-selected input of texts are processed according to some stochastic algorithm and assigned to a pre-defined grammatical and/or formal template - have been published, as in (Chamberlain and Etter 1984) and (Hartman 1996), and they remain popular on the internet, as discussed in (Addad 2010). Such constrained poetry generation follows on from the OULIPO movement, who inaugurated the poetics of the mathematical sublime with Cent mille milliards de poèmes (Queneau 1961), an aesthetic expressed today in digital poems like Sea and Spar Between (Montfort and Strickland 2010).

Most of the more sophisticated poetry generation software available on the internet is designed to facilitate digital poetry, that is, poetry which employs the new rhetorics offered by computation. For examples, see (Montfort and Strickland 2010), (Montfort 2009) and (Roque 2011). We distinguish this from a stronger definition of computationally creative poetry generation, where an autonomous intelligent system creates unpredictable yet meaningful poetic artefacts. Recent work has made significant progress towards this goal; in particular, the seminal evolutionary generator McGONAGALL (Manurung 2004) has made a programmatic comeback, as described in (Rahman and Manurung 2011) and (Manurung, Ritchie, and Thompson 2012). This work is based on the maxim that "deviations from the rules and norms [of syntax and semantics] must have some purpose, and not be random", and the authors specify that falsifiable poetry generation software must meet the triple constraints of grammaticality, meaningfulness and poeticness. McGONAGALL, the most recent incarnation of the WASP system described in (Gervás 2010), and the system described in (Greene, Bodrumlu, and Knight 2010) all produce technically proficient poems satisfying these criteria.

There are a number of systems which use corpora of human-generated text as source material for poems. In particular, (Greene, Bodrumlu, and Knight 2010) and (Gervás 2010) rely on small corpora of already-poetic texts. The Hiveku system (www.prism.gatech.edu/~aledoux6/hiveKu/) uses real-time data from Twitter; (Wong and Chun 2008) use data from the blogosphere and search engines; and (Elhadad et al. 2009) have used Project Gutenberg and the Google n-grams corpus. These approaches all rely on user-provided keywords to start a search for source material and seed the poetry generation process.
The haikus produced by the system described in (Wong and Chun 2008) using Vector Space manipulation demonstrate basic meaningfulness, grammaticality and poeticness, but are tightly constrained by a concept lexicon of just 50 keywords distilled from the most commonly used words in the haiku genre. The Electronic Text Composition (ETC) poetry engine (Carpenter 2004) is one of a few generators to use a very large corpus of everyday language in the service of meaningful poetry generation. Its knowledge base is constituted from the 85 million parsed words of the British National Corpus, which has been turned into a lexicon of 560,000 words and 49 million tables of word associations. ETC generates its own poem templates, and its corpus magnitude encourages surprising, grammatically well-formed output. A dozen of its poems were published under a pseudonym (Carpenter 2004).

The creative use of figurative language is essential to poetry, and is a notion alluded to, but declared beyond the scope of, (Manurung, Ritchie, and Thompson 2012). One example of prior research in this direction is the system of (Elhadad et al. 2009), which generates haiku based on a database of 5,000 empirically gathered word association norms. It was reported that this cognitive-associative source principle produced better poems than a WordNet-based search. Other aspects of small-scale linguistic creativity relevant to poetry generation include the production and validation of neologisms (Veale 2006), and the elaborations of the Jigsaw Bard system (Veale and Hao 2011), which works with a database of simple and ironic similes to produce novel compound similes.

Concerning aspects of computational poetry at a higher level than example generation, the WASP system can be considered as performing concept formation, as it employs a cultural meta-level generation process, whereby a text is considered and evolved by a group of distinct expert subsystems "like a cooperative society of readers/critics/editors/writers" (Gervás 2010). However, the results of the iterative evaluation are not presented with the final output, and the system does not generate the aesthetics it evaluates, which are "strongly determined by the accumulated sources used to train the content generator", in a similar way to (Greene, Bodrumlu, and Knight 2010) and (Díaz-Agudo, Gervás, and González-Calero 2002).

To the best of our knowledge, there are no poetry generation systems which produce an aesthetic framework within which to assess the poems they produce. Moreover, none of the existing systems provide any level of context for their poetry. In general, the context within which the poems can be appreciated is either deliberately obfuscated to attempt to facilitate an objective evaluation, as per Turing-style tests, or is provided by the programmer via a technical paper, foreword to an anthology, or web page. There are a myriad of websites available which generate poems in unsophisticated ways and then invite the reader to interpret them. For instance, the RoboPoem website (www.cutnmix.com/robopoem) states that: "A great deal of poetry mystifies it's readers: It may sound pretty, but it leaves you wondering 'what the hell was that supposed to mean?'", then extols the virtue of randomly generating mysterious-sounding poetry. This misses the point that poets use their intellect to write poetry which might need unpicking, in order to better convey a message, mood or style. A random sequence of words is just that, regardless of how poem-shaped it may be.
The RoboPoet (the smartphone version of which enables you to "generate nonsensical random poems while waiting at the bus-stop") and similar programs only serve to highlight that people have an amazing capacity to find meaning in texts generated with no communicative purpose.

Corpus-Based Poetry Generation

As we see above, using human-produced corpora is common in computational poetry. It has the advantages of (a) helping to avoid text which is utterly un-interpretable (as most human-written text is not like this), which would likely lead to a moment where readers remember that they are not reading the output of a fully intelligent agent, and (b) having an obvious connection to humanity which can increase the semantic value of the poem text, and can be used in framing information to add value to the creative act. However - especially if corpora of existing poems are used - there is the possibility of accusations of plagiarism, and/or the damning verdict of producing pastiches, inherent with this approach. Hence, we have chosen initially to work with very short phrases (similes) mined from the internet, alongside the phrases of professional writers, namely journalists writing for the British Guardian newspaper. The former fits into the long-standing tradition of using the words of the common man in poetry, and the latter reflects the desire to increase quality while not appropriating text intended for poems.

The simile corpus comes from the Jigsaw Bard system1, which exploits similes as readymades to drive a "modest form of linguistic creativity", as described in (Veale and Hao 2011). Each simile is provided with an evidence score that indicates how many times phrases expressing the simile were seen in the Google n-gram corpus2 from which they were mined. There are 21,984 similes in total, with 16,579 having evidence 1 and the simile "As happy as a child's life" having the most evidence (1,424,184). Each simile can be described as a tuple of ⟨object, aspect, description⟩, for instance ⟨child, life, happy⟩.

Our database of Guardian newspaper articles was produced by using (i) their extensive API3 to find URLs of articles under headings such as World and UK on certain days, (ii) the Jericho package4 to extract text from the web pages pointed to by the URLs, and (iii) the Stanford CoreNLP package5 to extract sentences from the raw text. As of writing, the database has all 12,820 articles made available online since 1st January 2012, with the World section containing the most articles at 1,232.

In addition to the corpora from which we directly use text, we also employ the following linguistic resources:

[1] The CMU Pronunciation Dictionary6 of 133,000 words.
[2] The DISCO API7 for calculating word similarities, using a database of distributionally similar words (Kolb 2008).
[3] The Kilgarriff database of 208,000 word frequencies (Kilgarriff 1997), mined from the British National Corpus8. This database also supplies detailed part-of-speech (POS) tagging for each word, with major and minor tags given.
[4] An implementation9 of the Porter Stemmer algorithm (Porter 1980) for extracting the linguistic stems of words.

1 afflatus.ucd.ie/jigsaw
2 books.google.com/ngrams/datasets
3 www.guardian.co.uk/open-platform
4 jericho.htmlparser.net
5 nlp.stanford.edu/software
6 www.speech.cs.cmu.edu/cgi-bin/cmudict
7 www.linguatools.de/disco/disco_en.html
8 www.natcorp.ox.ac.uk
9 www.tartarus.org/~martin/PorterStemmer
10 wordnet.princeton.edu
11 lit.csci.unt.edu
12 fnielsen.posterous.com/tag/afinn
[5] The well-known WordNet10 lexical database.
[6] An implementation11 of the TextRank keyphrase extraction algorithm (Mihalcea and Tarau 2004).
[7] The Afinn12 sentiment dictionary, containing 2,477 words tagged with an integer from -5 (negative affect) to 5 (positive affect). We expanded this to a dictionary of around 10,000 words by repeatedly adding in synonyms for each word identified by WordNet.

Poetry generation is driven by a four-stage process of: retrieval, multiplication, combination and instantiation. In the first stage, similes are retrieved according to both sentiment and evidence. That is, a range of relative evidence values can be given, between 1% (very little evidence) and 100% (the most evidence), along with a sentiment range of between -5 and 5 (as per [7]). Note that the sentiment value of the ⟨object, aspect, description⟩ triple is calculated as the average of the three words, with a value of zero being assigned to any word not found in [7]. Constraints on word frequencies, as per [3], can also be put on the retrieval, as can constraints on the pronunciation of words in the simile, as per [1]. In addition, an article from the Guardian can be retrieved from the database (with details of how the article is chosen given later), keyphrases can be extracted using [6], and these can be further filtered to only contain relatively unusual words (as per [3]), which often contain the most pertinent information in the article.

Simile Multiplication

In the second stage, variations for each simile are produced by substituting either an object, aspect or description word, or any combination thereof. The system is given a value n for the number of variations required of a given simile G, plus a substitution scheme specifying which parts should be substituted, and a choice of three substitution methods to use. Denoting by G_o, G_a and G_d the object, aspect and description parts of G, the three methods are:

(d) Using DISCO [2] to retrieve the n most similar words to each word, as determined by that system.

(s) Using the corpus of similes to retrieve the n most similar words to each word. This is calculated with reference to G and the whole corpus. For instance, suppose G_d is to be substituted. Then all the matching similes, {M^1, ..., M^k}, for which M^i_o = G_o or M^i_a = G_a are retrieved from the database. The words M^i_d for i = 1, ..., k are collated, and a repetition score r(M^i_d) for each one is calculated as: r(M^i_d) = |{ j ∈ {1, ..., k} : M^j_d = M^i_d }|. Informally, for a potential substitute, this method calculates how many similes it appears in with another word from G. The n words with the highest score are used as substitutes.

(w) Using WordNet [5] to retrieve the n most frequent synonyms of each word, with frequency assessed by [3].

Each variation, V, of G is checked and pruned if (i) the simile exists already in the database, (ii) the major POS of either V_o, V_a or V_d differs from the corresponding part of G, or (iii) the overall sentiment of V is positive when that of G is negative (or vice-versa).
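A minimal sketch of substitution method (s) follows, scoring candidate description words by how many matching similes they appear in, per the definition of r above. The data layout (triples as tuples) and the names are illustrative assumptions.

```python
from collections import Counter

def description_substitutes(g, corpus, n):
    """g and corpus entries are (object, aspect, description) triples."""
    g_o, g_a, g_d = g
    # the matching similes M^1..M^k sharing G's object or aspect
    matches = [m for m in corpus if m[0] == g_o or m[1] == g_a]
    # r(M_d): number of matching similes each candidate description appears in
    r = Counter(m[2] for m in matches if m[2] != g_d)
    return [word for word, _ in r.most_common(n)]

corpus = [("child", "life", "happy"), ("child", "smile", "warm"),
          ("child", "life", "carefree"), ("dove", "heart", "kind")]
print(description_substitutes(("child", "life", "happy"), corpus, 2))
# -> ['warm', 'carefree']: each appears in one simile matching the object/aspect
```

The pruning checks (i)-(iii) above would then be applied to each variation built from these substitutes.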
To determine the yield of variations each method can produce, we ran the system to generate 100 variations - before pruning - of 100 randomly chosen similes, for each method, with every possible substitution scheme. The results are given in Table 1.

Table 1: Top: the average yield (to 2 d.p.) of variations returned by each method and substitution scheme when asked to produce 100 variations for 100 similes. Bottom: the average interpretation level required for similes generated by the method and scheme. Note that 101 means that the object and description were substituted, but not the aspect, in the ⟨o, a, d⟩ simile triple, etc.

Average yield:
Scheme      d      s      w    Average
001       61.68  23.16   0.04   28.29
010       59.04  25.58   4.50   29.71
100       37.06  28.38   2.26   22.57
011       44.50  47.78   0.26   30.85
101       39.68  41.94   0.10   27.24
110       37.06  40.54   5.84   27.81
111       27.84  39.44   0.01   22.43
Average   43.69  35.26   1.86   26.94

Average interpretation level:
Scheme      d      s      w    Average
001        2.02   1.68   3.22    2.31
010        2.27   1.99   2.09    2.12
100        2.08   1.75   1.93    1.92
011        2.27   2.25   3.35    2.62
101        2.25   1.89   2.83    2.32
110        2.21   2.02   2.21    2.15
111        2.40   2.10   2.67    2.39
Average    2.21   1.95   2.61    2.26

We see that the d and s methods yield high numbers of variations, but the w method delivers very low yields, especially when asked to find substitutes for G_d. This is because the number of synonyms for a word is less than the number of similar words, and the number of synonyms for adjectives is particularly low. Unexpectedly, replacing more parts of a simile does not necessarily lead to more similes. On inspection, this is because the increase in degrees of freedom is balanced by an increase in the likelihood of pruning due to (i), (ii) or (iii) above.

In addition to observing the quantity of variations produced, we also checked the variations qualitatively. We noticed subjectively that, even out of context, certain variations were very easy to interpret, others were more challenging, and for some no suitable interpretation could be derived. For each of the methods d, s and w, we extracted 1,000 variations from those produced for Table 1, and the first author subjectively hand-annotated each variation with a value 1 to 4, with 1 representing obvious similes, 2 representing similes for which an interpretation was more difficult but possible, 3 representing similes which took some time to form an interpretation for, and 4 representing similes for which no interpretation was possible. Some example similes with annotations are given in Table 2. On inspection of the level 4 variations, we noted that often the problem lay in the POS-tagging of an adjective as a noun. For instance, in Table 2, kind is identified as a noun, hence similes with nouns like form instead of kind are allowed, producing syntactically ill-formed sentences. We plan to rule this out using context-aware POS tagging, available in a number of NLP packages. The average interpretation level for each of the substitution methods and schemes is given in Table 1. We turned this analysis into a method enabling the software to control (to some extent) the level of interpretation required.

Table 2: Example simile variations, given with the interpretation level required and the original versions.

Interp. Level | Method | Variation                               | Scheme | Original
1             | d      | as sad as the groan of a widow          | 011    | as lonely as the moan of a widow
2             | s      | as deadly as the face of a dagger       | 110    | as deadly as the sting of a scorpion
3             | d      | as shallow as the space-time of a fork  | 110    | as shallow as the curve of a spoon
4             | w      | as form as the pump of a dove           | 011    | as kind as the heart of a dove
To do this, given a required interpretation level n for simile variations, pairings of substitution (method, scheme) which produce an average interpretation level between n and n + 1 in Table 1 are employed. So, for instance, if similes of interpretation level 1 are required, the software uses a (s, 001), (s, 010), (s, 100), (s, 101) or (w, 100) pairing to generate them.

To increase the performance of the approach, we used the WEKA machine learning system (Hall et al. 2009) to train a predictor for the interpretation levels, which could be used to prune any variation predicted to have an interpretation level different to n. To produce the data to do so, we recorded 22 attributes of each of the 3,000 annotated similes, namely: the word frequencies [3] of each part and the minimum, average and maximum of these; the pairwise similarity [2] of each pair of parts, and the min/av/max of these; the pairwise number of collocations of each pair in the corpus of similes and the min/av/max of these; the method used for finding substitutions (d, s or w); whether the object, aspect and/or description parts have been substituted from the original; and the interpretation level. Unfortunately, using 30 different machine learning methods in WEKA (with default settings for each), the best predictive accuracy we could achieve was 47.3%, using the RotationForest learning method. We deemed this insufficient for our purposes. However, for each variation method, we were able to derive adequate predictors for two associated binary problems, in particular (i) to predict which side of the 1/2 boundary the level of interpretation of an unseen simile will be on, and (ii) the same for the 2/3 boundary. The best methods, assessed under 10-fold cross-validation, and their predictive accuracy for the boundary problems for the d, s and w variation methods, are given in Table 3.

Table 3: Ten-fold cross-validation results for the best classifier on the boundary problems for each method.

Meth.   Bound.  Naïve %  Best %  Best Method
d       1/2      72.00    75.20  RandomForest
s       1/2      60.40    65.80  LogitBoost
w       1/2      68.10    72.00  Bagging
d       2/3      59.20    68.30  OneR
s       2/3      71.20    75.70  RotationForest
w       2/3      63.20    71.10  RandomCommittee
average          65.68    71.35

We found that in each case, a classifier which is significantly better (as tested by WEKA using a paired T-test) than the naïve classifier had been learned, and we can expect a predictive accuracy of around 71% on average. The best learning method was different for each boundary problem, but some methods performed well in general. While not the best for any, the RandomSubspace method was the only one which achieved a significant classifier for all the problems. The Bagging, RotationForest, and RandomForest methods all produced significant classifiers for five of the six problems. WEKA enables the learned predictors to be used externally, so we implemented a process whereby the generative procedure above produces potential simile variants of a given level, then the result is tested against both boundary predictors appropriate to the method. If it is predicted to fall on the wrong side of either boundary, it is rejected. As a final validation of the process, we generated 300 new simile variations, with 100 each of levels 1, 2 and 3. We mixed them randomly and tagged them by hand as before. Our hand tagging corresponded with what the software expected 82% of the time, which we believe represents sufficient control.
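The two-boundary filtering step reduces to training one binary classifier per boundary and rejecting any variation predicted on the wrong side of either. The sketch below uses scikit-learn classifiers as a stand-in for the WEKA learners the authors actually used; the feature extraction, names and the collapsing of levels 3 and 4 are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

def train_boundary_models(features, levels):
    """features: one attribute vector per annotated simile; levels: 1-4."""
    above_1 = [lvl >= 2 for lvl in levels]   # labels for the 1/2 boundary
    above_2 = [lvl >= 3 for lvl in levels]   # labels for the 2/3 boundary
    m12 = RandomForestClassifier().fit(features, above_1)
    m23 = RandomForestClassifier().fit(features, above_2)
    return m12, m23

def keep_variation(m12, m23, feats, target_level):
    """Reject a variation predicted on the wrong side of either boundary."""
    above_1 = m12.predict([feats])[0]
    above_2 = m23.predict([feats])[0]
    predicted = 1 if not above_1 else (2 if not above_2 else 3)
    return predicted == target_level
```

Under this scheme a generated variant is only admitted when both boundary predictions agree with the requested interpretation level, mirroring the rejection process described above.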
Combination and Instantiation

The third and fourth phases of poetry generation are more straightforward. In the combination phase, similes, variations of them and keyphrases extracted from newspaper articles are combined as per user-given templates. The templates dictate what words in each of a pair of text fragments must match exactly, what the POS tags of these words and others in the fragments must be, and how they are to be combined. Templates often simply pair two phrases together according to certain constraints, to be used in the instantiation phase later. Alternatively, they can provide more interesting ways of producing a compound phrase. The process can be iterated, so that triples, quadruples, etc., can be generated.

As an example, suppose we have the keyphrase "excess baggage" from a newspaper article about travel. This can be matched with the simile "the emotional baggage of a divorce", and presented in various ways, from simple expressions such as "the emotional excess baggage of a divorce", to the more elaborate "Oh divorce! So much emotional excess baggage", as determined by the combination template. It is possible to drop certain words; for instance, the keyphrase "gorgeous history" (about a 1980s pop group) and the simile "As gorgeous as the nature of a supermodel" could produce "a supermodel-gorgeous history", and variations thereof. As a final example, keyphrases such as "emotional jigsaw puzzle" (describing a surreal play in a review) can be elaborated by combination with the simile "As emotional as the journey of a pregnancy" to produce: "An emotional jigsaw puzzle, like the journey of a pregnancy".

The retrieval, multiplication and combination stages of the process perform the most important functions, which leaves the instantiation process able to simply choose from the sets of elaborated phrases at random, and populate the fields of a user-given template. Templates allow the extraction of parts of phrases to be interleaved with user-given text, and there are also some final constraints that can be applied to the choice of phrases for the template, in particular to reduce repetition by only choosing sets of phrases where the word stems (constructed by [4]) are different. In terms of linguistic and semantic constraints, the four-stage process is quite powerful, as highlighted with the example poem given in Figure 1, produced using a highly constrained search for pairs of similes. We used no simile multiplication here, in order to highlight the linguistic rather than inventive abilities.
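The following sketch illustrates the flavour of the combination phase on the "emotional jigsaw puzzle" example above: a keyphrase is paired with similes that share its adjective, and the pair is rendered through simple templates. The template strings, the matching rule and the data layout are illustrative assumptions, not the authors' actual template language.

```python
# similes as (aspect, description, object) triples; illustrative toy corpus
SIMILES = [("baggage", "emotional", "divorce"),
           ("journey", "emotional", "pregnancy")]

def combine(keyphrase, templates):
    adjective, noun = keyphrase.split(" ", 1)
    for aspect, description, obj in SIMILES:
        if description == adjective:          # the word that must match exactly
            for t in templates:
                yield t.format(adj=adjective, noun=noun,
                               aspect=aspect, obj=obj)

templates = ["the {adj} {noun} of a {obj}",
             "An {adj} {noun}, like the {aspect} of a {obj}"]
for phrase in combine("emotional jigsaw puzzle", templates):
    print(phrase)
# first outputs: "the emotional jigsaw puzzle of a divorce",
# "An emotional jigsaw puzzle, like the baggage of a divorce", ...
```

The real system additionally checks POS tags and can iterate the process to build triples and quadruples of fragments, as described above.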
Stealthy swiftness of a leopard,
Happy singing of a bird.

In the morning, I am loyal
Like the comfort of a friend.
But the morning grows more lifeless
Than the fabric of a rag.
And the mid-day makes me nervous
Like the spirit of a bride.

Active frenzy of a beehive,
Dreary blackness of a cave.

In the daytime, I am slimy
Like the motion of a snake.
But the sunlight grows more comfy
Than the confines of a couch.
And the day, it makes me tasty
Like the flavor of a coke.

Shiny luster of a diamond,
Homey feeling of a bed.

In the evening, I am solid
Like the haven of a house.
But the evening grows more fragile
Than the mindset of a child.
And the twilight makes me frozen
Like the bosom of a corpse.

Famous fervor of a poet,
Wily movement of a cat.

In the night-time, I am hollow
Like the body of a drum.
But the moonlight grows more supple
Than the coating of an eel.
And the darkness makes me subtle
Like the color of a gem.

Stealthy swiftness of a leopard,
Happy singing of a bird.

Circadian No. 39

Figure 1: An example instantiation of a user-given template.

The circadian aspects of the poem are part of the template, with only the similes provided by the software. We see that the poem contains only straightforward words, because during the retrieval stage, only similes with words having frequencies in the top 5% were retrieved (as determined by [3]). Moreover, the only direct repetition is there by design in the template, and no repetition even of word stems is allowed anywhere else. This was achieved during the instantiation process, which recorded the similes used, and avoided using any word where [4] suggested the same word stem as an existing word in the poem.

The poem also has strictly controlled meter and stress. For instance, each two-line stanza firstly uses a simile with ⟨sw, sw, sw⟩ pronunciation (where s and w are syllables, with s being the stressed one), and then uses a simile with ⟨sw, sw, s⟩ pronunciation. This is achieved during the retrieval stage, which uses the pronunciation dictionary [1] to select only similes of the right form, and the combination process, which puts together appropriate pairs of lines. There is similar regularity in the six-line stanzas. Possibly less obvious is the subtle rhyming at play, with the final phonemes of selected pairs of lines being the same (such as beehive and cave, snake and coke, drum and gem). Moreover, inadvertent rhyming - which can be jarring - is ruled out elsewhere; for instance, snake and couch were constrained to have no rhyming, as were house and child, drum and eel, etc. The rhyming constraints come into play during the combination phase, when sets of lines are collated for final use in the stanzas. Finally, we notice that the stanzas alternate in sentiment during the course of the poem; for instance, the line "Happy singing of a bird" in the first two-line stanza contrasts starkly with the line "Dreary blackness of a cave" in the second. This is also achieved during the combination phase, which can be constrained to only put together similes of certain sentiments, as approximated by [7].

Handing over High-Level Control

We see automated poetry generation as the simultaneous production of an artefact and a context within which that artefact can be appreciated. Normally, the context is provided by the programmer/user/curator, but, as described below, to give more autonomy to the software, we enabled it to provide its own context, situated in the events of the day in which it is writing poems. In order to deliver the context alongside each poem, we also implemented a rudimentary ability to provide a commentary on the poem, and how it was produced, as described in the second subsection below.

Context Generation

In overview, the software determines a mood for the day, then uses this to choose both a Guardian article from which to extract keyphrases, which will be combined with simile variations to form lines of the poem, and an aesthetic within which to assess the generated poems. These are then used to produce a set of templates for the four-stage poem generation process described above. Finally, the software instantiates the templates to produce a set of poems, and chooses to output the one which maximises the aesthetic. As in the automated collage generation of (Krzeczkowska et al. 2010), the software appeals to daily newspaper articles for raw material. We extend that approach by also using the articles to derive a mood, from which an aesthetic is generated.
In particular, each of the 12,820 articles in the corpus has been assigned a sentiment value between -5 and 5, as the average of the sentiment of the words in the article, assessed by [7]. Thus, when a poetry generation session begins, the software is able to check the sentiment of the set N of newspaper articles posted during the previous 24 hours, and if it is less than the average, the software determines the mood as bad, or good otherwise. If the mood is good, then an article, A, from the happiest five articles from N is chosen, with melancholy articles similarly chosen during a bad mood. The keyphrases, key(A), are then extracted from the article, and we denote by words(A) the set of words appearing in key(A). Note that very common words such as "a", "the", "of", etc., are removed from words(A).

As an example, on 17/01/2012, the mood was assessed as bad, and a downbeat article about the Costa Concordia disaster was retrieved. In contrast, on 24/01/2012, the mood was assessed as good, and an article describing the buoyant nature of tourism in Cuba was retrieved, from which keyphrases such as "beach resorts", "jam-packed bar", "seaside boulevard" and "recent sunny day" were extracted using [6]. Note that [6] also returns a relevancy score for each keyphrase, e.g., "recent sunny day" was given a score of 0.48 for relevance, while "jam-packed bar" only scored 0.31.

The mood is sufficient to derive an aesthetic within which to create poems, but this will be projected partly through members of words(A) appearing in the poem, and mood is only one aspect of the nature of a poem. Letting words(P) denote the words in poem P, for more variety, the software can choose from the following four measures:

• Appropriateness: the distance of the average sentiment of the words in words(P) from 5 if it is a good mood day, or from -5 if it is a bad mood day.
• Flamboyance: the average of f(w) over words(P), where f(w) = 0 if w ∈ words(A) and f(w) = 1/frequency(w) if w ∉ words(A), with frequency calculated by [3].
• Lyricism: the proportion of linguistic constraints adhered to by P, with the constraints determined by the set of templates generated for the poem, as described below.
• Relevancy to the Guardian article: the average of rel(w) over words(P), where rel(w) = 0 if w ∉ words(A), and rel(w) is the relevancy [6] of w if w ∈ words(A).

The choice of which set of measures, M, to use in the aesthetic for a poem is determined somewhat by A and key(A). In particular, if A is assessed as being in the most emotive 10% of articles ever seen (either happy or sad), then M is chosen as either {Appropriateness} or {Appropriateness, Relevance} in order to give due consideration to the gravity or levity of the article. If not, and the size of key(A) is less than 20% of the average over the corpus, then it might be difficult to gain relevancy to A in the poem, hence M is chosen as {Relevance}. In all other cases, M is chosen randomly to consist of either 1 or 2 of the four measures - we found that mixing more than 2 diluted their effect, leading to poems with little discernible style.

The software also generates templates to dictate the structure of the poem. The number of stanzas, z, is taken to be between 2 and 10, with the number dictated by the size of key(A), i.e., larger poems are produced when key(A) is relatively large compared to the rest of the corpus.
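Two of the measures above are simple enough to sketch directly under the stated definitions. Here sentiment() and frequency() stand for lookups in the Afinn dictionary [7] and the Kilgarriff counts [3] respectively; treating them as plain callables is our assumption for illustration.

```python
def appropriateness(poem_words, good_mood, sentiment):
    """Distance of the poem's average sentiment from +5 or -5
    (a smaller distance means a more mood-appropriate poem)."""
    target = 5 if good_mood else -5
    avg = sum(sentiment(w) for w in poem_words) / len(poem_words)
    return abs(avg - target)

def flamboyance(poem_words, article_words, frequency):
    """Average of f(w): article words score 0, rare outside words score highly."""
    return sum(0 if w in article_words else 1 / frequency(w)
               for w in poem_words) / len(poem_words)
```

In a session, one or two such measures would then be combined by average rank over the candidate poems, as described in what follows.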
The structure of the poem can be equal, i.e., of the form A1 A2 ... Az, with each stanza Ai being of the same length (chosen randomly between 2 and 6 lines). The structure can also be chosen to be alternating, of the form A1 B2 A3 ... Az or A1 B2 A3 ... Bz; or bookended, of the form A1 B2 ... Bz-1 Az. The choice of structure is currently made randomly, and there is no relationship between pairs of stanzas, except that the templates constrain against the usage of a new phrase (combined from a keyphrase and simile as described above) in the template if one of the words has the same stem as a word in an already-used phrase. As part of the template generation, the software chooses the number of times (between 0 and 5) this constraint is allowed to be broken per phrase, as a level of repetition can add flavour to a poem. Note that the counts per phrase are reset to zero if the software runs out of phrases to add to the template.

If M contains the Lyricism measure, then the templates are also constrained to express some linguistic qualities, which are added at the stanza level. In particular, the line structure of all stanzas of type A is chosen to be either equal, alternating or bookended in the same fashion as the stanza structure, with stanzas of type B also given a structure. This structure allows linguistic constraints to be added. For instance, if a stanza has alternating structure abab, the software chooses a single linguistic constraint from: syllable-count, end-rhyme, start-rhyme, and constrains all lines of type a accordingly. It does the same (with a possibly different linguistic constraint) for lines of type b. Note that syllable-count means that the two lines should have the same number of syllables as each other (within a threshold of two syllables), end-rhyme means that the two lines should at least end in the same phoneme, with start-rhyme similar.

The random nature of the choices to fill in the final poem template ensures variety. In each session, the software generates 1,000 poems, and their scores for each of the measures in M are calculated. The average rank over the measures is taken as an overall rank for each poem, and the highest ranked is presented as the poem for the day. If the templates over-constrain the problem and no poems are produced, then a single constraint is chosen to be dropped, and the session is re-started iteratively until poems are produced.

Commentary Generation

In addition to the four-stage process of retrieval, multiplication, combination and instantiation, the software chooses a Guardian article, performs sentiment analysis, aesthetic invention and template construction, and searches for appropriate poems. While some of these methods are at present rather rudimentary and perhaps a little arbitrary, it is our hypothesis that a well-formed commentary about how the software has produced a poem will provide a context for the poem and add value to the appreciation process, as argued above. In order for the software to generate the commentary, we re-use the four-stage process, but with the retrieval stage sampling not from corpora of human-produced text, but rather from a set of recorded statements about how each of the processes worked, and what they produced.
The recorded statements cover details such as: (a) the mood of the day; (b) the Guardian article it retrieved and how emotive it was; (c) the keyphrases extracted, which sentences they came from, and which were used in the final poem; (d) the combinations of keyphrase and simile it produced; (e) the nature of the poem structure dictated by the template; (f) the aesthetic weightings used; and (g) what successes and failures it had in instantiating the templates. We have produced by hand a number of (sets of) commentary templates that can present the statements in a supportive way. Currently, the software randomly chooses which set of templates to use to generate the commentary. The software chooses the title for each poem as the keyphrase occurring most often in the poem, choosing randomly if there is a tie for the most used. Illustrative Examples We artificially situated the software in the days from 1/01/2012 to 10/02/2012, and asked it to produce a single poem for each day, along with a commentary. We added the constraint that the poem should be exactly four stanzas in length for presentation purposes in this paper. We curated three for presentation here, in figure 2 below. The commentaries are meant to provide enough context for proper appreciation of each poem, so we will not add detail here to the commentaries of the individual poems. Viewing the entire set of generated poems and commentaries subjectively, we were disappointed by how few compound sentences were available for the templates. Even with large sets of keyphrases extracted from an article, and extensive simile multiplication employed, we found that there were few opportunities for a simile to be used for embellishment, which meant that the software had limited choices for the final poem template, leading to an over-reliance on repeated or similar lines. More importantly, the differences in the aesthetic evaluations over the 1,000 poems generated for a day were not great, hence the aesthetic generation was driving the production of poems less than we would have liked. Conclusions and Future Work We agree with Pease and Colton (2011b) that Turing-style tests encourage naïvety and pastiche in creative systems. However, eschewing their use leaves a hole regarding proper evaluation of our poetry generation system. Instead, we can turn to the FACE descriptive model put forward by Colton, Charnley, and Pease (2011) and Pease and Colton (2011a), which advocates describing a creative system in terms of the creative acts it performs, which are in turn tuples of generative acts. The generative acts produce outputs of four types: examples of concepts, concepts themselves, aesthetic measures which can evaluate concept/example pairs, and framing information. Looking at the literature review above, the WASP and Electronic Text Composition systems can be considered as generating concepts, as can any system which generates and employs a statistical model of written or verbal language (such as in Markovian approaches). It does not appear that any system invents aesthetic measures or produces framing information such as a commentary which can be used as a context for the poem. Hence, according to the FACE model, our approach can be considered favourably, as it has processes producing examples (instantiation), concepts (template generation), aesthetics (choosing measures) and framing information (producing commentaries), within the creative act of poem generation.
This represents an advance in the state of the art of automatic poetry generation. It is clear that many aspects of the process presented here are fairly rudimentary, often with random choice substituting for a reasoned approach. Our main contribution has been to implement a rounded system which can function on the majority of levels required to be taken seriously as a poet, albeit in a simplistic manner. We plan further enhancements to all of the processes described above, including: (i) implementing improved ways to generate phrases for templates, as the yield is currently too low to enable the software to use its more advanced linguistic constraining features; (ii) working with other corpora; (iii) enabling the software to automatically add higher-level structures to poems via the kinds of narratives seen in the circadian poem given above; and (iv) turning the commentary generation processes into full story-telling, which may include the introduction of fictions. After the enhancements, we will work with a poet and explore gaining critical feedback via the publication of anthologies. While the imitation-game aspect of Turing-style tests is not conducive to creativity, we do applaud the usage of dialogue they prescribe. Indeed, in the future, we imagine all creative systems being enhanced with a story generator able both to produce static framing information and to reply with a story to any question asked of it in a dialogue situation. We believe that only with such abilities will software systems be taken seriously as creative entities in the cultural world. It was generally a bad news day. I read an article in the Guardian entitled: "Police investigate alleged race hate crime in Rochdale". Apparently, "Stringer-Prince, 17, has undergone surgery following the attack on Saturday in which his skull, eye sockets and cheekbone were fractured" and "This was a completely unprovoked and relentless attack that has left both victims shocked by their ordeal". I decided to focus on mood and lyricism, with an emphasis on syllables and matching line lengths, with very occasional rhyming. I like how words like attack and snake sound together. I wrote this poem. Relentless attack a glacier-relentless attack the wild unprovoked attack of a snake the wild relentless attack of a snake a relentless attack, like a glacier the high-level function of eye sockets a relentless attack, like a machine the low-level role of eye sockets a relentless attack, like the tick of a machine the high-level role of eye sockets a relentless attack, like a bloodhound It was generally a good news day. I read a story in the Guardian culture section entitled: "South Africa's ANC celebrates centenary with moment in the sun". It talked of south africans, interfaith prayers and monochrome photos. Apparently, "The heroic struggle against a racist regime was remembered: those thousands who sacrificed their lives in a quest for human rights and democracy that took more than eight decades" and "At midnight he watched with amusement as Zuma lit the centenary flame, at the second attempt, with some help from a man in blue overalls marked 'Explosives'". I wanted to write something highly relevant to the original article. I wrote this poem. Blue overalls the repetitive attention of some traditional african chants a heroic struggle, like the personality of a soldier an unbearable symbolic timing, like a scream blue overalls, each like a blueberry some presidential many selfless leaders oh!
such influential presidents such great presidents blueberry-blue overalls lark-blue overalls a knight-heroic struggle It was generally a bad news day. I read a story in the Guardian entitled: "Thai police hunt second bomb plot suspect in Bangkok". It talked of suspected bomb plotters, lebanese men and travel alerts. Apparently, "Sketches released late on Friday night by Thai police showed the suspect as a white Middle-Eastern man with short hair and stubble, around 1.8m (5ft 9in) tall". It's a serious story, but I have concentrated on flourishes today. I wrote this poem. Foreign embassies the wiry militant arm of a doorman a white middle-eastern man, like a snowball spaceship-foreign embassies foreign embassies, each like a stranger an impersonal suvarnabhumi international airport a white middle-eastern man, like the surface of a porcelain the sturdy design of a bangkok post foreign embassies, each like a spaceship an impersonal suvarnabhumi international airport stranger-foreign embassies the stout engineering of a bangkok post a white middle-eastern man, like the skin of an earthenware foreign embassies, each like a stranger spaceship-foreign embassies Figure 2: Illustrative poems and commentaries. For the Guardian articles on which these poems are based, please see: www.guardian.co.uk, followed by: /uk/2012/feb/09/police-race-hate-crime-rochdale /world/2012/jan/08/south-africa-anc-centenary /world/2012/jan/15/thai-second-bomb-suspect-bangkok Acknowledgements This work has been funded by EPSRC grant EP/J004049. Many thanks to the reviewers for their insightful comments. 2012_16 !2012 Illustrating a Computer Generated Narrative Rafael Pérez y Pérez, Nora Morales, Luis Rodríguez División de Ciencias de la Comunicación y Diseño Universidad Autónoma Metropolitana, Cuajimalpa Av. Constituyentes 1054 C. P. 11950, México D. F. {rperez/nmorales/lrodriguez}@correo.cua.uam.mx Abstract This work describes a computer model that generates visual narratives. It is part of a research project on narrative generation. A visual narrative is defined as a sequence of pictorial scenes; each scene contains characters, locations and symbols representing dramatic tensions. A computer-generated plot is transformed into a visual narrative by converting each textual action into a pictorial scene. We present details of the composition process and explain how the graphic elements employed to produce a coherent narration are generated. We describe the questionnaire that we employed to evaluate the system, discuss the results and outline future developments. Introduction Narrative is a fundamental manifestation of human culture. "Most scholars now see narrative… and a host of rhetorical figures not as 'devices' for structuring or decorating extraordinary texts but instead as fundamental social and cognitive tools" (Eubanks 2004). Traditionally, the word "narrative" has been understood as a kind of synonym for written text. However, many forms of storytelling (and knowledge) are visual; that is, "what we see is as important, if not more so than what we hear or read" (Rose, 2001:1). The use of images in the construction of narratives has given rise to the concept of Visual Narrative. Thus, following McCloud, in this work visual narrative is defined "as juxtaposed pictorial and other static images in deliberate sequence, intended to convey information and/or to produce an aesthetic response in the viewer" (McCloud, 1993).
The processes involved in the codification and understanding of visual stories have been the subject of study in the field of psychology for some time (see e.g. Arnheim, 1969), and this work has firmly established the links between thought and perception. The human ability to organize our experiences in the form of stories or narrative structures has been called Narrative Intelligence (Blair and Meyer, 1997). Narrative Intelligence has also been defined as "the human ability and perhaps even compulsion to make sense of the world through narrative and storytelling" (Mateas and Sengers, 1999). In this way, "Human narrative intelligence might have evolved because the structure and format of narrative is particularly suited to communicate about the social world." (Dautenhahn, 2001). Thus, we envision Visual Narrative as a form of Narrative Intelligence. Because of its importance in shaping human experience and knowledge, it is not surprising that AI researchers have developed a substantial amount of work related to understanding stories and how to generate them. One of the common aspects of these efforts is their inherent interdisciplinary approach. Research on AI and narrative has drawn on ideas and theories from different fields such as art, cultural studies, drama, psychology and, more recently, design. As a result of this interdisciplinary work, we can distinguish three main outcomes: narrative is now recognized as a source for informing system design; research paradigms and methodologies that address complex questions have been developed and validated; and the relationship between AI, Computational Creativity and the Humanities has proven enriching and useful. The work presented in this paper is part of a research project in narrative generation. We have developed a computer model of creative writing called E-R; a program called MEXICA (Pérez y Pérez and Sharples 2001) is an implementation of such a model. The purpose of this work is to expand our plot-generation model with mechanisms that allow it to illustrate its textual outputs to produce visual narratives. We refer to this new module as the Visual Narrator. This paper describes our first prototype. Although it is possible to find computer systems that generate or evaluate images (e.g. Norton et al. 2011; Colton 2011), or systems where a visual portrayal of characters plays an important role (e.g. Riedl et al. 2008; Cassell 2001; Rickel and Johnson 1999), as far as we know this is the first plot generator capable of illustrating its own output. It is worth noticing that, as antecedents of this work, we published a paper that employs animations to represent computer-generated daydreams (Pérez y Pérez et al. 2007) and a grammar that generates pre-Hispanic images (Álvarez et al. 2007). Our computerised storyteller has the following characteristics. It generates fictional narratives about pre-Columbian cultures. This seems especially fitting because "The history of sequential art could be traced to pre-Columbian picture manuscripts since they were pictorial representations painted over strips that convey a story" (McCloud, 1993:9). The system includes 16 predefined characters, amongst them: Tlatoani (the ruler), Jaguar Knight, Princess, Enemy, Fisherman, and so on. It also includes 9 possible locations, e.g. Chapultepec Forest, Popocateptl volcano, Tenochtitlan City.
The system generates a sequence of actions representing plots; the following lines are an example: the Enemy kidnapped the Princess; Jaguar Knight found the Princess; Jaguar Knight and Enemy fought; Enemy ran away; Jaguar Knight rescued the Princess; and so on. Once the narrative is finished, the system substitutes predefined texts for the sequence of actions. So, the action where the Knight and the Enemy fought in the previous example is substituted by "Suddenly, Jaguar Knight and Enemy were involved in a violent fight". The same happens for all actions. One of the core characteristics of our computer model of creative writing is the idea that plots can be represented as groups of emotional links and tensions between characters that progress over time (Pérez y Pérez 2007). The current version of the storyteller system includes two emotional links, brotherly love and amorous love, and seven dramatic tensions: when a character is killed (Actor dead); when the life of a character is at risk (Life at risk); when the health of a character is at risk (Health at risk); when a character is made a prisoner (Prisoner); when two characters are in love with a third character (Love competition); when a character hates and loves another character (Clashing emotions); when one character hates another character and both are positioned in the same location (Potential danger). Each time an action is performed within the story, the current set of emotional links and tensions is updated. Thus, our storyteller not only produces plots but also generates detailed information about the emotional links and tensions between characters for each action in the tale. The Visual Narrator collects all this information to generate a visual narrative. As mentioned earlier, in this work a visual narrative is defined as a sequence of pictorial scenes. A scene is a composition made up of images representing one or two characters, a location and a tension. So, the Visual Narrator transforms a plot into a visual narrative by converting each textual action into a pictorial scene. The process of composing a scene involves: 1) Building characters. We provide the system with a group of primitive graphic elements that represent different parts of the character's picture. Primitives include body parts, clothes, accessories and emotional and tensional facial expressions. We developed a grammar that drives the construction of the image. 2) Choosing a glyph that represents the core active tension in the story. We have defined a set of glyphs that represents each of the possible tensions within a story. 3) Choosing an image that represents the current location of the characters involved in the action. We provide the system with images representing all possible locations within a story. 4) Putting together all these elements in a scene. These four points summarise the core characteristics of the Visual Narrator. Following the theory of narrative from the structuralist point of view (Seymour, 1997), every narrative has two elements: a discourse (the means by which the content is communicated, the set of actual narrative statements) and a story or what is portrayed (the content, events, characters and context or setting). The text is produced by one person and is meant to be read by another; the proper understanding of the narration requires that both share the same code (Acaso, 2009; Eco, 2005). However, "any text can be interpreted infinitely" (Eco, 1991).
In Peirce's words, this is the Dynamical Interpretant: "The Dynamical Interpretant is whatever interpretation any mind actually makes of a sign" (cited in Atkin, 2008:66). In this way, our main concern in this work is to evaluate whether the code produced by our computer model satisfies (at least partially) this requirement of proper understanding of the narration; that is, whether the visual composition produced by the computer model conveys ideas similar to those produced by its textual counterpart. This paper is organised as follows: the second section describes the features of the pre-Columbian iconography relevant for this project; the third section provides details of the composition process; the fourth section describes the evaluation of the system; the fifth section includes the discussion. Characteristics of Pre-Columbian Iconography and Graphic Conventions In the following lines we describe some of the graphic elements and conventions used in pre-Columbian codices that inspired the creation of our visual narrative. For this work we mainly considered the Boturini Codex (Galarza J. and Libura, 2004) and the following codices from the Borgia group: Vaticanus and Laud (Galarza, J. 1997), Borgia and Nuttall (Mohar, L. 1997) and Moctezuma (Lopez A. 1999). 1) Human figures and objects. In the codices human figures are line-drawn in black and white, at an abstract-graphical iconic level. Their body positions, either sitting or standing, are always represented in profile, facing right or left and never in a front view. Objects and animals are drawn employing the same graphic conventions. 2) Social hierarchy. Pre-Hispanic peoples had a very complex and hierarchical social structure, which was represented in their codices. A strict dress code and the use of specific ornaments show a figure's status within the group. Attire, ornaments, a stick held in the hands, or a triangular tiara crowning the head are symbols representing a high rank within the hierarchy. Hairstyle is another sign of social status, gender and role distinction. Priests are always represented with the whole body or parts of their face painted in black and their hair tied back in a ponytail. A regular representation of women's attire is a huipil or skirt and a quechquemetl worn over the bust. In certain codices it is common to illustrate women showing their breasts. In this way, an image depicting a subject wearing a tilma or cape and sandals indicates that the person is a member of the nobility; by contrast, wearing a loincloth without sandals indicates that the person belongs to the group of common people. 3) Locations. Locations are represented by a chain of iconic images. For example, the representation of Chapultepec forest is composed of a stylised image of a hill with a grasshopper on the top and a wavy line coming out from its base. The pictogram could be read as "The big hill of the grasshoppers, where the water springs". 4) Representing emotions and death. Pre-Hispanic artists depicted emotions in their works. For example, in the Boturini codex there is a passage that shows a group of people crying, wearing dirty clothes. The action of crying is represented by stylised tears in the eyes of the characters. Such tears have an amorphous wavy shape that ends with a circle or oval. In this case, the weeping, reinforced by the dirty clothes, conveys a message of suffering. Other codices, like the Borgia and the Vaticanus, employ mouth and eye gesticulations to produce richer facial expressions.
Finally, a human image with its eyes closed represents death. We would like to point out that our visual narrative is inspired by pre-Columbian iconography. However, we do not attempt to reproduce it or contribute to its understanding; we leave that to the experts in the area. We only employ such iconography to provide a framework for our research in computational creativity. Thus, some of the images presented in this work are free interpretations made by the authors. Composition Process The following lines describe the four processes involved in composing a scene. Character Generation Characters' portraits are the result of bringing together 9 layers of basic images or primitives (as we also refer to them). Each layer groups similar elements. Layer zero includes left arms; layer 1 includes bodies with legs; layer 2 includes heads; layer 3 includes eyes and emotional expressions; layer 4 includes hairstyles; layer 5 includes clothes; layer 6 includes right arms; layer 7 includes ornaments; layer 8 includes weapons and tools (see figure 1). Layers can include any number of primitives, although in some cases it is important to have at least one. The number of different portrayals that the system can create depends on the number of available images. We have 4 types of characters: males standing, males sitting, females standing and females sitting. All figures can be painted facing east or west. In this work we only present males and females standing, and employed 272 primitives. We classify images into two types: universals and specifics. Universals are used in the construction of any character, while specifics are only employed in the construction of those personae they were designed for. In other words, some features might appear in all portrayals while others are specific to a single character. For example, every person can use a shell but only the fisherman can have a fishing-net. Figure 1. Construction of the character fisherman. Thus, the Visual Narrator portrays a character by selecting one image from each layer and then painting all of them on a canvas. For this work we implemented the selection process as a random function to provide variety and surprise. In future versions we might include some constraints that help to select the images based on the needs of the narrative. The user of the Visual Narrator can define in a text file a set of rules to associate one image in a concrete layer with others in different layers. In this way, it is possible to associate particular types of clothes with particular types of ornaments, or extended arms with specific weapons, and so on. For the experiments in this work we defined 25 rules. Figure 1 illustrates the process of characterising a fisherman; it shows two possible options for representing accessories. Figure 3 shows the portraits of a Jaguar Knight, a Princess and an Enemy. The Visual Narrator depicts characters with emotional expressions. Our automatic storyteller generates information regarding emotional links and tensions between characters. In this work we only represent tensions. Thus, a person can be responsible for triggering a tension or can be a victim. For example, if the enemy kidnaps the princess, the Enemy triggers the tension Prisoner, and the Princess is the captive. We refer to the former as the giver and to the latter as the receiver. As part of our work we have developed giver and receiver facial emotional expressions for each of the possible tensions that can be triggered during the development of a narrative.
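As a rough illustration of this layer-based construction, the following Python sketch selects one primitive per layer at random and then applies the user-defined association rules. The layer names follow the paper; the primitive ids and the rule format are invented for the example, and for simplicity a rule is assumed only to constrain a layer drawn later in the order.

    import random

    LAYERS = ["left_arm", "body_legs", "head", "eyes", "hairstyle",
              "clothes", "right_arm", "ornaments", "weapons_tools"]

    def build_portrait(primitives, rules):
        # primitives: layer name -> list of image ids (the universals plus
        # any specifics allowed for this character).
        # rules: (layer, image id) -> (layer, image id), e.g. forcing an
        # extended right arm to carry a particular weapon.
        portrait = {}
        for layer in LAYERS:
            if layer in portrait:
                continue  # already fixed by an earlier rule
            portrait[layer] = random.choice(primitives[layer])
            forced = rules.get((layer, portrait[layer]))
            if forced is not None:
                portrait[forced[0]] = forced[1]
        return portrait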
Examples of these giver and receiver expressions are wide-open eyes, a wide-open mouth and tears (e.g. figure 5a shows a princess crying). The Visual Narrator analyses the narrative in order to determine which tensions should be represented in the scene. Deciding what tension to characterize is a complex task because the Visual Narrator must figure out which of the active tensions is the most appropriate to be represented in the current scene, for how long it should be visually represented, when it is necessary to reintroduce a tension, and so on. In this way, our implementation resembles those flip-a-face books or board books where a person can create several different characters by combining different predefined elements. Various videogames employ similar tools. Thus, the Visual Narrator models some of the decisions that humans take to represent emotional characters when using flip-a-face-like tools. Building a Scene Scenes comprise three elements: a location, characters and a glyph representing a tension. We have 9 possible locations: Texcoco Lake, Popocateptl Volcano, Tlatelolco Market, Palace, Tenochtitlan City, Temple, Chapultepec Forest, Jail and Uncivilized Land. The representations of these locations are inspired by pre-Hispanic codices. For example, figure 5 shows the representation of Chapultepec Forest. Chapultepec means grasshopper's hill in Nahuatl; notice on the right of figure 5 the stylised image of a hill with a grasshopper on the top. As a result of performing a story-action, one or more tensions can be triggered or deactivated within a narrative. For example, if the Enemy wounds the Jaguar Knight, the tension Life at risk is triggered. We have designed glyphs to represent them; figure 2 shows some examples. Currently, a scene only includes a single tension. When several tensions are active at the same time, the Visual Narrator needs to choose one to be painted in the composition. So, we have assigned them ranks. The following list shows the tensions ordered from the highest to the lowest rank: Health at Risk, Life at Risk, Actor Dead, Prisoner, Potential Danger, Clashing Emotions, Love Competition, Prisoner Free. In this way, the system includes in the scene the tension with the highest rank. Figure 2. Glyphs representing tensions. Thus, a scene comprises a location in the background, two characters facing each other, and, between them, a glyph representing the core tension of the scene (see figure 5). The Visual Narrator employs the text to define which characters are participating in the scene as well as its location; it also employs internal representations to establish which tensions should be used. Evaluation We were interested in evaluating whether the code produced by our system satisfied (at least partially) the requirement that both author and reader share the same code. Thus, the goals of the survey were: a) to evaluate the degree of proper understanding of characters and scenes; b) to establish whether the sequence of scenes communicates a clear and congruent narrative. To perform the test the Visual Narrator illustrated a brief narrative generated by our plot generator; then, we asked a group of people to evaluate it. We developed a questionnaire that was answered by 44 persons: 91% Mexicans, 7% Spanish and 2% Guatemalans. 66% were female and 34% male.
5% had a PhD degree; 25% had a master's degree; 52% had a bachelor's degree; 18% had other types of degree. The questionnaire was written and answered in Spanish. The questionnaire was divided into three sections. The first section showed three images of different characters developed by the Visual Narrator (see figure 3). For each picture subjects were requested to perform the following tasks: 1) to answer whether they recognized the character portrayed as pre-Hispanic; 2) if they did, to select which of the following options best described the character: Tlatoani, Enemy, Jaguar Knight, Princess, Female Peasant; 3) if they did not recognize the character as pre-Hispanic, to briefly explain why. Figure 3. Portraits of a) a Jaguar Knight, b) a Princess and c) an Enemy. The second section showed two glyphs (see figure 4). Subjects were instructed that pre-Hispanics employed such glyphs to represent concepts. Then, they were asked to describe what they thought each glyph symbolized. Figure 4. Two glyphs representing a) Actor dead and b) Life at Risk. In the third section subjects were presented with an individual scene (see figure 5a) and then with the whole sequence of three scenes (see figure 5), all of them developed by the Visual Narrator. Subjects were requested to describe what they thought the individual scene denoted; after that, they were asked to describe what the sequence of scenes denoted. Evaluation of Characters' Representation Figure 3a characterises a Jaguar Knight. When asked whether they recognized the character portrayed as pre-Hispanic, 95% of the subjects answered yes and 5% answered no. When asked to choose the best description of the character depicted in figure 3a, 93% selected the option Jaguar Knight, 2% selected the option Tlatoani and 5% did not answer the question. In figure 3b the Visual Narrator characterised a Princess. When asked whether they recognized the character portrayed as pre-Hispanic, 89% of the subjects answered yes, 9% answered no, and 2% did not answer the question. When asked to choose the best description of the character depicted in figure 3b, 63% selected the option Princess, 25% selected the option Peasant, 2% selected Tlatoani and 10% did not answer. In figure 3c the Visual Narrator attempted to characterise an Enemy. When asked whether they recognized the character portrayed as pre-Hispanic, 75% of the subjects answered yes and 22% answered no. Those who did not identify the character as pre-Columbian cited the type of ornaments the character was wearing and some of his features, such as his baldness. In fact, some subjects identified this character as Egyptian. 2% did not answer the question. When asked to choose the best description of the character depicted in figure 3c, 41% selected the option Tlatoani, 36% selected the option Enemy, and 23% did not answer. Figure 5. Three scenes generated by the Visual Narrator. Evaluation of Glyphs In this section two glyphs were presented to the subjects and they were asked to describe what they thought each image symbolised. Following the pre-Hispanic tradition, the Visual Narrator employed figure 4a to represent death. None of the subjects associated the picture with passing away.
Participants associated the following meanings with the glyph: 36% of the participants described a quiet man (meditating, praying or dreaming); 18% described a man that is thinking, which is related to the former description; 18% related this glyph to the idea of life (birth, harvest, vintage or a foetus in the womb); 13% detected the presence of graphic elements associated with the numeric system of the pre-Hispanic civilization and therefore linked this glyph to a date; 15% gave different interpretations, such as someone sitting, learning, or a young person. The Visual Narrator employed figure 4b to represent that the life of a character was at risk. As in the previous case, subjects associated different meanings with this graphic element. 48% related this glyph to the notion of death; interestingly enough, 30% of this group elaborated their descriptions of death with notions like worship and rituals, the sun (which was a god among the pre-Columbian civilizations) and fights between life and death. 43% of the 44 participants associated the glyph with fear of the divinity, danger, and the fight between good and evil. These descriptions are closer to what the glyph intended to represent. 9% did not answer. Evaluation of Scenes In the last section subjects were presented with a complete scene (see figure 5a). The Visual Narrator built it from the action Enemy kidnapped the Princess, which triggers the tension Princess prisoner. 70% of the participants described the scene as representing lack of freedom, submission, slavery, capture, kidnapping and conquest; 13% related it to confrontation between two groups or tribes, and the conquest of territories; 13% linked it to concepts such as anger, machismo and bullying; 4% related it to other themes, e.g. conversation between characters, or did not answer. In most cases the male (referred to mainly as the Enemy or the Tlatoani) was identified as the antagonist and the female (referred to as the Princess and sometimes as the Peasant) as the victim. 36% of all descriptions involved an explicit interaction, mainly as a dialogue, between the male and the female characters. Finally, subjects were presented with a sequence of three scenes (see figure 5). The Visual Narrator developed this sequence from the following actions: the Enemy kidnapped the Princess; Jaguar Knight fought against the Enemy; Jaguar Knight rescued the Princess. 75% of the participants described the sequence with the same or a similar group of events; 25% described the sequence with different accounts. 80% of the subjects made explicit reference in their reports to the main roles of the characters: the Enemy was the antagonist, the knight was the hero and the princess was the victim (9% made an implicit reference to the Enemy but an explicit reference to the other two personae). 20% did not make references to their roles. Discussion The first section of the questionnaire provides feedback regarding the automatic construction of characters. Most of them are clearly identified as pre-Hispanic, although a few characters' features seem to produce a degree of confusion. For example, a number of subjects commented that characters are portrayed with occidental facial features. In the case of the enemy, his baldness seems to puzzle some people. Figure 3a is easily identified as a Jaguar Knight. In the case of figure 3b, most subjects identify the image with a princess, although one quarter of the participants identified it with a peasant.
This situation suggests that participants are not aware of the meaning of some pre-Hispanic symbols, e.g. the use of two hands holding a stick to represent social status. Figure 3c was the most confusing. Most people identified it as a Tlatoani and a slightly smaller number identified it as an Enemy. A possible cause of this situation is hinted at by one of the participants. This person reports having difficulties in identifying the character in figure 3c. He explains that the image lacks the majesty to be considered a Tlatoani and, at the same time, lacks the aggressiveness to be considered an Enemy. Therefore, this person concludes that the image is closer to representing a priest. Thus, these results suggest that the process employed by the Visual Narrator for building characters works adequately. The grammar allows the construction of characters that people can clearly differentiate and which, in general, are associated with what they attempt to represent. That is, the system is capable of producing original images that satisfy the requirements associated with particular characters. Nevertheless, it is necessary to improve the graphic elements to solve problems like the ones just described. An important issue that arises from this analysis, and which will be repeated in the following lines, is the fact that people are not familiar with some important pre-Hispanic symbols. That is the case of the image of the princess and the stick held by her hands. This point will be discussed later. The second section of the questionnaire provides feedback regarding the interpretation of the glyphs. From the beginning, glyphs were designed to be part of a scene. However, we are interested in knowing the type of concepts that they evoke when they are not presented within a visual context. None of the subjects associated figure 4a with death. This situation is understandable because the use of closed eyes as a symbol of death is particular to pre-Hispanic civilizations and not a well-known fact among the population that answered the questionnaire. However, the glyph clearly triggers lots of associations that make sense to us. On the other hand, figure 4b represents that the health of a character is at risk (i.e. the character is wounded or ill). Although an important number of subjects associated it with death, a similar number associated it with concepts close to health at risk. In fact, we can think of death as a concept close to the intended meaning. The reason why the second glyph was better interpreted than the first one probably has to do with the fact that this second glyph was designed by the authors of this work; that is, it does not belong to the pre-Columbian tradition. Thus, glyphs seem to fulfil their function of evoking concepts and ideas although, again, we have the problem of a lack of knowledge about the significance of pre-Columbian symbols. The third section of the questionnaire provides feedback regarding scenes. In the first case, the majority of subjects interpreted figure 5a as expected. Thus, the composition, i.e. the interaction between the characters, the glyph and the location, seems to be working appropriately. The majority of the descriptions of the sequence of three scenes (see figure 5) provided by the subjects are similar to the text generated by our storyteller. That is, participants interpreted the sequence as expected. In the same way, the role of the three characters in most of the participants' narrations matches the intended role.
Three glyphs are employed in such a sequence: the first represents that a character is a prisoner (a tied pair of hands); the second represents Potential Danger, i.e. one character hates another (a hand holding a flint knife); and the last represents that a character that was a prisoner has been released (a broken prisoner's glyph). The three glyphs were designed by the authors of this work. From our point of view, the first and third glyphs can easily be interpreted. However, the intended meaning of the second one is not obvious, if not obscure. Nevertheless, the context provided by the three scenes seems to give the user enough information to interpret it correctly. This is an interesting result that illustrates the importance of context. This brings us back to the issue of the lack of knowledge of pre-Hispanic symbolism. We are interested in achieving good communication with those interested in the Visual Narrator. At the same time, we would like to be as faithful as possible to the pre-Hispanic traditions, which were the inspiration for this work. So, we need to find an adequate balance. The first step is to keep researching the role of context in the interpretation of visual narratives. The results in this work seem to suggest that we might be able to employ unknown symbols that, with the help of an adequate context, people can interpret as intended. One of the most interesting characteristics of the whole project is the use of emotional links and tensions between characters. The Visual Narrator employs this information to depict emotions in its characters. It is interesting to notice that 23% of the subjects made comments about the emotional states of characters, with descriptions that include words like anger, surprise and sadness (or crying). This result seems to suggest that characters' portraits express emotional states. However, we need to perform a deeper evaluation of this aspect. In this way, the Visual Narrator is capable of constructing a short visual narrative (three scenes) that, in general terms, is understood by a group of human evaluators. The primitives employed to build characters' portraits and the composition process seem to be satisfactory. This result suggests that we are moving in the right direction. There are several challenges ahead of us. The most important is to provide the Visual Narrator with mechanisms that allow more freedom during the composition process. Most people that answered the questionnaire are unfamiliar with the meaning of pre-Hispanic iconography. Therefore, for them it might be difficult to comprehend this kind of visual narrative. However, it is this lack of prior experience that provides a great opportunity to better understand the mechanisms required to generate satisfactory visual narratives. In summary, we have a plot generator system called MEXICA which is based on the E-R model of creative writing, and MEXICA is now capable of illustrating its own narratives. The illustration process consists of analysing the dramatic tensions of the narrative and, employing a grammar, composing a sequence of images that represents such a narrative. This paper reports on how the grammar is employed to create the images. It is worth noticing that any system based on the E-R model can employ the Visual Narrator. It might be necessary to modify the image-base, but the grammar and the process for analysing active tensions can be used.
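To recap the illustration pipeline in code, here is a minimal Python sketch of scene assembly using the tension ranking given earlier; the rank order is the paper's, while the function and field names are invented for the example.

    # Tensions ordered from the highest to the lowest rank.
    RANK = ["Health at Risk", "Life at Risk", "Actor Dead", "Prisoner",
            "Potential Danger", "Clashing Emotions", "Love Competition",
            "Prisoner Free"]

    def compose_scene(location, characters, active_tensions):
        # A scene: a location in the background, two characters facing
        # each other, and the glyph of the highest-ranked active tension
        # placed between them.
        glyph = next((t for t in RANK if t in active_tensions), None)
        return {"location": location,
                "characters": characters[:2],
                "glyph": glyph}

For instance, compose_scene("Chapultepec Forest", ["Enemy", "Princess"], {"Prisoner"}) would yield the kidnapping scene of figure 5a.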
We expect that this work will contribute to a better comprehension of this fascinating area. 2012_17 !2012 Generating a Complete Multipart Musical Composition from a Single Monophonic Melody with Functional Scaffolding Amy K. Hoover, Paul A. Szerlip, Marie E. Norton, Trevor A. Brindle, Zachary Merritt, and Kenneth O. Stanley Department of Electrical Engineering and Computer Science University of Central Florida Orlando, FL 32816-2362 USA {ahoover@eecs.ucf.edu, paul.szerlip@gmail.com, marie.norton@knights.ucf.edu, tabrindle@gmail.com, zbmerritt@gmail.com, kstanley@eecs.ucf.edu} Abstract This paper advances the state of the art for a computer-assisted approach to music generation called functional scaffolding for musical composition (FSMC), whose representation facilitates creative combination, exploration, and transformation of musical concepts. Music in FSMC is represented as a functional relationship between an existing human composition, or scaffold, and a generated accompaniment. This relationship is encoded by a type of artificial neural network called a compositional pattern producing network (CPPN). A human user without any musical expertise can then explore how accompaniment should relate to the scaffold through an interactive evolutionary process akin to animal breeding. While the power of such a functional representation has previously been shown to constrain the search to plausible accompaniments, this study goes further by showing that the user can tailor complete multipart arrangements from only a single original monophonic track provided by the user, thus enabling creativity without the need for musical expertise. Introduction Among the most important functions of any approach to enhancing human creativity is what Boden (2004) terms transformational creativity. That is, key creative obstacles faced by human artists and musicians are the implicit constraints acquired over a lifetime that shape search space structure. By offering an instance of the search space (e.g. of musical accompaniments) with a radically different structure, a creativity-enhancing program can potentially liberate the human to discover unrealized possibilities. In effect, the familiar space of the human artist is transformed into a new structure intrinsic to the program. Once the user is exposed to this new world, as a practical matter the program must provide to the user the ability to explore and combine concepts within the newly-conceived search space, which corresponds to Boden's combinatorial and exploratory classes of creativity (Boden, 2004). That way, the user experiences a rich and complete creative process within a space that was heretofore inconceivable. The danger with transformational creativity in computational settings is that breaking hard-learned rules may feel unnatural and thereby unsatisfying (Boden, 2007). Any attempt to facilitate transformational creativity should respect the relationships between key artistic elements even as they are presented in a new light. Thus for a given domain, such as musical accompaniment, a delicate balance must be struck between unfettered novelty and respect for essential structure. Many approaches to generating music focus on producing a natural sound at the cost of restricting creative exploration. Because structure is emphasized, the musical space is defined by rules that constrain the results to different styles and genres (Todd and Werner, 1999; Chuan, 2009; Cope, 1987).
The necessity for a priori rules potentially facilitates the combination of musical structures or exploration of the defined space, but precludes transformational outcomes. In contrast, musical structures in the approach examined in this paper, functional scaffolding for musical composition (FSMC), are defined as the very functions that relate one part of a piece to another, thereby enabling satisfying transformational creativity (Hoover, Szerlip, and Stanley, 2011a,b). Based on the idea that music can be represented as a function of time, FSMC inputs a simple, isolated musical idea into a function that outputs accompaniment that respects the structure of the original piece. The function is represented as a special type of artificial neural network called a compositional pattern producing network (CPPN). In practice, the CPPN inputs existing music and outputs accompaniment. The user-guided creative exploration itself is facilitated by an interactive evolutionary technique that in effect allows the user to breed the key functional relationships that yield accompaniment, which supports both combinatorial and exploratory creativity (Boden, 2004) through the crossover and mutation operators present in evolutionary algorithms. By representing music as relationships between parts of a multipart composition, FSMC creates a new formalism for a musical space that transforms its structure for the user while still respecting its fundamental constraints. Hoover, Szerlip, and Stanley (2011a,b) showed that FSMC can produce accompaniments that are indistinguishable by listeners from fully human-composed pieces. However, the accompaniment in these studies was only a single monophonic instrument, leaving open the key question of whether a user with little or no musical expertise can perhaps generate an entire multipart arrangement with this technology from just a single-instrument monophonic starting melody. If that were possible, then anyone with only the ability to conceive a single, monophonic melody could in principle expand it into a complete multilayered musical product, thereby enhancing the creative potential of millions of amateur musicians who possess inspiration but not the expertise to realize it. This paper demonstrates that FSMC indeed makes such achievement possible. Background This section relates FSMC to traditional approaches to automated composition and previous creativity-enhancing techniques. Automatic Music Generation Many musical representations have been proposed before FSMC, although their focus is not necessarily on representing the functional relationship between parts. For example, from long before FSMC, Holtzman (1980) creates a musical grammar that generates harp solos based on the physical limitations imposed on harp performers. Similarly, Cope (1987) derives grammars from the linguistic principles of haiku to generate music in a particular style. These examples and other grammar-based systems are predicated on the idea that music follows grammatical rules and thus, by modeling musical relationships as grammars, they are representing the important structures of music (Roads, 1979; McCormack, 1996). While grammars can produce a natural sound, deciding which aspects of musical structure should be represented by them is often difficult and ad hoc (Kippen and Bel, 1992; Marsden, 2000). Impro-Visor helps users create monophonic jazz solos by automatically composing any number of measures in the style of famous jazz artists (Keller et al., 2006).
Styles are represented as grammars that the user can invoke to complete compositions. Creativity enhancement in Impro-Visor occurs through the interaction of the user's own writing and the program's suggestions. When users have difficulty elaborating musical concepts, they can access predictions of how famous musicians would approach the problem within the context of the current composition. By first learning different professional compositional techniques, students can then begin developing their own personal styles. While Impro-Visor is an innovative tool for teaching jazz styles to experienced musicians, it focuses on emulating prior musicians over exploration. Enhancing Creativity in Music Composition A problem with traditional approaches to music composition is that standard representations can potentially limit creative exploration. For instance, MySong generates chord-based accompaniment for a vocal piece from hidden Markov models (Simon, Morris, and Basu, 2008). Users select any vocal piece and MySong outputs accompaniment based on a transition table, a weighting factor that permits greater deviation from the table, and musical style (e.g. rock, big band). MySong thus allows users to create accompaniment for their own melodies in a variety of different predefined styles from which users cannot deviate. Zicarelli (1987) describes an early interactive composition program, Jam Factory, that improvises on human-provided MIDI inputs from rules represented in transition tables. Users manipulate the output in several ways including the probability distributions of eight different transition tables; there are four each for both rhythm and pitch. Users are provided more creative control in designing and consulting the transition tables, but the increased flexibility results in unnatural outputs that thereby limit the utility of the main algorithms (Zicarelli, 2002). The approach described by Chuan (2009) balances user control by training transition tables based on only a few user-provided examples. The tables then reflect the "style" inherent in the examples and can generate chord-based accompaniment for a user's own piece. While each of these systems offers users varied levels of control, rule manipulation alone may not be sufficient to access all three forms of creativity described by Boden (2004). For example, the representations cannot easily combine musical ideas or transform the musical space (due to inherent rule restrictions). Alternatively, most interactive evolutionary computation (IEC) (Takagi, 2001) approaches facilitate creativity through the evolutionary operators of crossover and mutation, and require human involvement in the creative process. In GenJam a human player and computer "trade fours," a process whereby the human plays four measures and the computer "answers" them with four measures of its own (Biles, 1998). Musical propositions are mutated and combined into candidates that the user rates as good or bad. Similarly, Jacob (1995) introduces a system in which human users rate, combine, and explore musical candidates at three different levels of the composition process, and Ralley (1995) generates melodies by creating a population from mutations of a provided user input. Finally, CACIE creates atonal pieces by concatenating musical phrases as they are generated over time (Ando and Iba, 2007). Each phrase is represented as a tree structure that users can interactively evolve or directly manipulate.
However, most such systems impose explicit musical rules conceived by the developer to constrain the search spaces of possible accompaniment, thus narrowing the potential for discovery. Previous Work in FSMC The FSMC approach in this paper is based on previous work by Hoover, Szerlip, and Stanley (2011a,b), who focused on evolving a single monophonic accompaniment for a multipart MIDI. Figure 1: How CPPNs Compute a Function of the Input Scaffold. The rhythm CPPN in (a) and pitch CPPN in (b) together form the accompaniments of FSMC. The inputs to the CPPNs are the scaffold rhythms and pitches for the respective networks and the outputs indicate the accompaniment rhythms and pitches. Each rhythm network has two outputs: OnOff and NewNote. The OnOff node controls volume and whether or not a note is played. The NewNote node indicates whether a note is re-voiced or sustained at the current tick. If OnOff indicates a rest, the NewNote node is ignored. The pitch CPPN output decides what pitch the accompaniment should play at that particular tick. The internal topologies of these networks, which encode the functions they perform, change over evolution. The functions within each node depict that a CPPN can include more than one activation function, such as Gaussian and sigmoid functions. Two monophonic accompaniment outputs are depicted, but the number of instruments a CPPN can output is unlimited. The number of input instruments also can vary. These accompaniments are generated through two functions, one each for pitch and rhythm, that are represented as compositional pattern producing networks (CPPNs), a special type of artificial neural network (ANN). CPPNs can evolve to assume an arbitrary topology wherein each neuron is assigned one of several activation functions. Through IEC, users explore the range of accompaniments with NeuroEvolution of Augmenting Topologies (NEAT), a method for growing and mutating CPPNs (Stanley and Miikkulainen, 2002). Unlike traditional ANN learning, NEAT is a policy search method, i.e. it explores accompaniment possibilities rather than optimizing toward a target. While existing songs with generated accompaniments were indistinguishable in a listener study from fully-composed human pieces, the real achievement for this approach would be to help the user generate entire polyphonic and multi-instrument accompaniment from just a single voice of melody (Hoover, Szerlip, and Stanley, 2011a). This paper realizes this vision. Approach: Extending Functional Scaffolding for Music Composition This section extends the original FSMC approach, which only evolved a single monophonic accompaniment (Hoover, Szerlip, and Stanley, 2011a,b). It explains the core principles of the approach and how they are applied to producing multipart accompaniments. Defining the Musical Space A crucial aspect of any creativity-enhancing approach for music composition is first to define the musical space. Users can help define this space in FSMC by first selecting a musical starting point, i.e. the monophonic melody or scaffold. Initial scaffolds can be composed in any style and, if they are only single monophonic parts as in this paper, they can be composed by users within a wide range of musical skill and expertise.
The main insight behind the representation in FSMC is that a robust space of accompaniments can be created with only this initial scaffold. Because of the relationship of different accompaniment parts to the scaffold, and therefore to each other, the space is easily created and explored. Each instrument part in the accompaniment is the result of two separate functions that independently relate rhythmic and pitch information in the scaffold (i.e. the inputs) to the generated accompaniment. Depicted in figure 1, these functions are represented as CPPNs, the special type of ANN described in the background (Stanley, 2007). As figure 1 shows, multiple inputs can be sent to the output and many different instruments can be represented by the same CPPN. CPPNs incrementally grow through the NEAT method, which means they can in principle evolve to represent any function (Stanley and Miikkulainen, 2002; Cybenko, 1989). Together, the rhythmic and pitch CPPNs that will be evolved through NEAT define the musical space that the user can manipulate. In effect, pitch information from the scaffold is fed into the pitch CPPN at the same time as rhythmic information is fed into the rhythm CPPN. Both CPPNs then output how the accompaniment should behave in response. That way, they compute a function of the scaffold. Accompaniments are divided into a series of discrete time intervals called ticks that are concatenated together to form an entire piece. Each tick typically represents the length of an eighth note, but this division can be altered through an interface. Figure 2: Input Representation. The spike-decay representation for rhythmic inputs is shown in (a) and the pitch representation is in (b). Rhythm is encoded as a set of decaying spikes that convey the duration of each note. Because the CPPN sees where within each spike it is at any given tick, in effect it can synchronize its outputs with the timing of the input notes. Pitch, on the other hand, is input simply as the current note at the present tick. At each tick, outputs are gathered from both the rhythmic and pitch CPPNs and combined to determine the accompaniment at that tick. As shown in figure 1a, the two outputs of the rhythm network for each line of accompaniment are OnOff, which indicates whether a note or rest is played and its volume, and NewNote, which indicates whether or not to sustain the previous note. The single pitch output for each line of accompaniment in figure 1b determines instrument pitch at the current tick relative to a user-specified key. To produce the outputs, rhythmic and pitch information from the scaffold is sent to the CPPNs at each tick. The continuous-time graph in figure 2a illustrates how rhythmic information in the scaffold is sent to the CPPN. When a note strikes, it is represented as a maximum input level that decays linearly over time (i.e. over ticks) until the note ends. At the same tick, pitch information on the current note is input as a pitch class into the pitch CPPN (figure 2b). That is, two C notes in different octaves (e.g. C4 and C5) are not distinguished. The sound of instruments in FSMC can be altered through instrument choice or key. A user can pick any of 128 pitched MIDI instruments and can request any key.
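A minimal Python sketch of this per-tick input encoding follows; the fixed peak level and the note representation are our assumptions for illustration, not part of FSMC itself.

    def rhythm_inputs(notes, total_ticks, peak=2.0):
        # notes: list of (onset_tick, duration_ticks, midi_pitch) triples.
        # Each note onset spikes to `peak` and decays linearly toward zero
        # until the note ends, so the CPPN can see where it is within a note.
        levels = [0.0] * total_ticks
        for onset, duration, _ in notes:
            for t in range(onset, min(onset + duration, total_ticks)):
                levels[t] = peak * (1.0 - (t - onset) / duration)
        return levels

    def pitch_inputs(notes, total_ticks):
        # Octave-free pitch classes: C4 (MIDI 60) and C5 (MIDI 72) both
        # map to pitch class 0, as in the paper.
        classes = [0] * total_ticks
        for onset, duration, midi_pitch in notes:
            for t in range(onset, min(onset + duration, total_ticks)):
                classes[t] = midi_pitch % 12
        return classes

At each tick these two streams feed the rhythm and pitch CPPNs respectively, whose OnOff, NewNote and pitch outputs are then decoded into the accompaniment.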
Once a user decides from what preexisting piece the scaffold is provided and which output instruments are most appropriate for the piece, candidate CPPNs can be generated, thus establishing the musical space of accompaniments. The theory behind this approach is that by exploring the potential relationships between scaffolds and their accompaniments (as opposed to exploring direct representations of the accompaniment itself), the user is constrained to a space in which candidate accompaniments are almost all likely to be coherent with respect to the scaffold. The next section describes how users can combine, explore, and transform this space to harness their own musical creativity.

Navigating the Musical Space

Figure 3: Program Interface. This screenshot of the program (called MaestroGenesis) that implements FSMC shows accompaniments for a melody input by the user. The instrument output is currently set to Grand Piano on the left-hand side, but can be changed through a menu. Accompaniments are represented as horizontal bars and are named by their ID. The user selects his or her favorite and then requests a new generation of candidates.

Exploration of musical space in FSMC begins with the presentation to the user of the output of ten randomly-generated CPPN pairs, each defining the key musical relationships between the scaffold and output accompaniment. These accompaniments can be viewed in a graphical depiction (as shown in the screenshot in figure 3) or in standard musical notation. They can be played and heard in either MIDI or MP3 format. The user-guided process of exploration that combines and mutates these candidates is called interactive evolutionary computation (IEC) (Takagi, 2001). Because each accompaniment is encoded by two CPPNs, evolution can alter both the pitch and rhythm CPPNs or adjust them individually. The user combines and explores accompaniments in this space by selecting and rating one or more accompaniments from one generation to parent the individuals of the next generation. The idea is that the good musical ideas from both the rhythmic and pitch functions are preserved, with slight alterations, or combined to create a variety of new but related functions, some of which may be more appealing than their parents. The space can also be explored without combination by selecting only a single accompaniment; the next generation then contains slight mutations of the original functions.

While IEC inherently facilitates these types of creativity, the approach in this paper extends the reach of the transformational creativity offered by FSMC. Previously, FSMC generated single-voice accompaniments to be played with a fully-composed, preexisting human piece (Hoover, Szerlip, and Stanley, 2011a,b). This paper introduces a new layering technique whereby generated accompaniments from previous generations can serve as inputs to new CPPNs that then generate more layers of harmony. The result is the ability to spawn an entire multi-layered piece from a single monophonic starting melody. One such layering approach is performed by generating one new monophonic accompaniment at a time, as sketched in the code below. The first layer is the monophonic melody composed by the human user. The second layer is generated through FSMC from the first. The third layer is then generated through FSMC by now inputting into the CPPNs the first and second layers, and so on.
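The following is a minimal sketch, assumed rather than taken from the authors' code, of this incremental layering scheme. The hypothetical `evolve_accompaniment` stands in for one full IEC session that breeds a CPPN pair computing a function of the given input layers.

```python
def build_layers(melody, n_layers, evolve_accompaniment):
    """melody: the user-composed scaffold (layer 1).
    evolve_accompaniment(inputs) -> a new monophonic line evolved via
    IEC as a function of the supplied input layers."""
    layers = [melody]                              # layer 1: human melody
    for _ in range(n_layers - 1):
        new_layer = evolve_accompaniment(layers)   # function of all so far
        layers.append(new_layer)
    return layers                                  # combined into the piece
```

The alternative technique described next would instead call `evolve_accompaniment([melody])` for every layer, so each line is a function of the scaffold alone.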
All of the layers are finally combined to create an entire accompaniment, resulting in accompaniments that are functionally related to both the initial melody and previous accompaniment lines. In this way, each accompaniment line is slightly more removed from the original melody, and subsequent accompaniment lines are based functionally on both the scaffold and previously-generated lines. To create accompaniments more closely related to the original melody, another layering technique is for users to generate all accompaniment layers from only the single monophonic starting point. For this purpose, the CPPNs are given enough outputs to represent all the instruments in the accompaniment at the same time. Because the melody and the accompaniments are functionally related, any accompaniment will follow the contours of the melodic starting point. However, in this case, the only influence on each accompaniment is this starting point itself, yielding a subtly different feel.

With either of these approaches, or a combination of them, users can further influence their accompaniments by holding constant the rhythm CPPN or pitch CPPN while letting the other evolve. Interestingly, when two accompaniments share the same rhythm network but differ slightly in the pitch network, the two monophonic instruments effectively combine to create the sound of a polyphonic instrument. Similarly, the pitch networks can be shared while the rhythm networks are evolved separately, creating a different sound. Notice that this approach requires no musical expertise to generate multiple lines of accompaniment.

Experiments

The experiments in this paper are designed to show how users can generate multipart pieces from a single monophonic melody with FSMC. They are divided into accompaniment generation and a listener study that establishes the quality of the compositions.

Accompaniment Generation

For this experiment, three members of our team composed a total of three monophonic melodies. From each of these user-composed melodies, a multipart accompaniment was generated through FSMC by the author of the originating melody. Two other multipart accompaniments were generated for the folk song Early One Morning. We chose to include each of these FSMC composers, who were undergraduate independent study students at the University of Central Florida, as authors of this paper to recognize their pioneering efforts with a new medium. The most important point is that no musical expertise need be applied to the final creations beyond that necessary to compose the initial monophonic melody in MIDI format. Thus, although the results may sound consciously arranged, it is important to bear in mind that all the polyphony you hear is entirely the output of FSMC. The original melodies, accompaniments, and CPPNs are available at http://eplex.cs.ucf.edu/fsmc/iccc2012. The program, called MaestroGenesis, is available at http://maestrogenesis.org.

As noted in the approach, FSMC provides significant freedom to the user in how to accumulate the layers of a multipart piece. In general, the user has the ability to decide from which parts to generate other parts. For example, from the original melody, five additional parts could be generated at once. Or, instead, the user might accumulate layers incrementally, feeding each new part into a new CPPN to evolve yet another layer. Some layers might depend on one previous layer, while others might depend on multiple previous layers.
In effect, such decisions shape the subtle structural relationships, and hence the aesthetic, of the final composition. For example, evolving all of the new parts from just the melody gives the melody a commanding influence over all the accompaniment, while incrementally training each layer from the last induces a more delicate and complex set of harmonious partnerships. As the remainder of this section describes, the student composers took advantage of this latitude in a variety of ways.

Early One Morning (Song 1), versions 1 and 2, with four- and five-part accompaniments, began from an initial monophonic melody transcribed from the traditional, human-composed folk song. The second layer is identical in both versions and was evolved from Early One Morning itself. The third, fourth, and fifth parts of version 1 were all evolved from the second layer. The third, fourth, fifth, and sixth parts of version 2 were evolved from the pitch network of the second layer of version 1, and the rhythm network from the original Early One Morning monophonic melody. This experiment illustrates that the results of FSMC given the same starting melody are not deterministic and in fact do provide creative latitude to the user, even without the need for traditional composition techniques.

Song 2 started from an original monophonic melody composed by undergraduate Marie E. Norton. The second layer was added by inputting this melody into the rhythm and pitch networks of the subsequent accompaniment populations. This second layer then served as input to the pitch and rhythm CPPNs for layers 3 and 4. The pitch CPPN for layer 5 took layer 2 as input, but the rhythm network had only a bias input. Finally, the inputs to the pitch network for layer 6 were layers 3, 4, and 5, while the inputs to the rhythm CPPN were layer 4 and a measure timing signal, first introduced for FSMC by Hoover and Stanley (2009), that gives the network a sense of where the song is within the measure. All of the layers were finally combined to create a single, multipart piece in which each line is functionally related to the others. Each layer took as few as three and as many as five generations to evolve.

For Song 3, Zachary Merritt first created a layer that influences most of the other layers but is not heard in the final track. The fourth layer was generated from the third, which is influenced by the monophonic melody and the unheard layer. The fifth layer was generated from the population of the fourth layer with the rhythm network held constant to create a chordal feel. The sixth layer was generated from only the initial starting melody and a special timing signal that imparts a sense of the position in the overall piece (Hoover and Stanley, 2009). Similarly, the seventh layer was generated from only the initial starting melody, but adds a separate function input, sin(πx), where x is the time in the measure. Although seven layers are described in this experiment, only six were selected to be heard, meaning that there is a five-part accompaniment.

Trevor A. Brindle created an initial piece and evolved all five accompaniment lines for Song 4 directly from it. Instead of inputting results from previous generations, he started new runs for each voice from the same scaffold, giving a strong influence to the melody. Notice that the key decisions made by the users are in general from which tracks to generate more tracks. Of course, the users also performed the IEC selection operations to breed each new layer.
Importantly, none of these decisions requires musical expertise.

Listener Study

The contribution of users to the quality of the generated works, and accordingly the effectiveness of the creativity enhancement, is evaluated through a listener study. The study consists of five surveys, one for each generated arrangement. Each survey presents two MP3s to the listener, who is asked to rate the quality of both. The first MP3, called the collaborative accompaniment, is an arrangement resulting from the collaboration of the author with the program (i.e. one of the two versions of Early One Morning, or Song 2, 3, or 4). The second, called the FSMC-alone accompaniment, is generated by the program alone. That is, a random pitch CPPN and a random rhythm CPPN are provided the same monophonic starting melody as the collaborative accompaniment, and their output is taken as the FSMC-alone accompaniment. Thus the factor that is isolated is the involvement of the human user, who is not involved in the FSMC-alone accompaniment. However, it is important to note that the FSMC-alone accompaniments do not actually sound random: even if the CPPNs are generated randomly, they are still functions of the same scaffold, which tends even in the random case to yield outputs that sound at least coherent (which is the motivation for FSMC in the first place). Thus this study investigates whether the human user is really able to make a creative contribution by leveraging FSMC.

A total of 129 students participated in the study. The full survey is available at http://eplex.cs.ucf.edu/fsmc/iccc2012/survey, but note that in the administered surveys, the order of the MP3s was randomized to avoid any bias. The users were asked to rate each piece with the following question: "Rate MIDI i on a scale of one to ten (1 is the worst and 10 is the best)," where i refers to one of the ten generated works. The idea is that if the user-created arrangements are rated higher than those generated by FSMC alone, the user's own input likely positively influenced the outcome. While this study focuses on the quality of the output, the degree to which FSMC enhances creativity will be addressed in future work.

Results

The generated accompaniments and original scaffold discussed in this section can be heard at http://eplex.cs.ucf.edu/fsmc/iccc2012.

Accompaniments

Samples of the scores for the two arrangements created to accompany Early One Morning are shown in figure 4. The layers are shown in order from top to bottom in both versions (layer 1 is the original melody). Layer 2, which is the same in both versions, is heard as violin II in version 1 and viola in version 2. An important observation is that the violoncello part in version 1 follows the rhythm of the initial starting melody very closely, while the pitch contour differs only slightly. While the viola and double-bass parts differ in both pitch and rhythm over the course of the song, both end phrases and subphrases on the tonic note, F, in many places over the course of the piece, including measure 4 in figure 4a. Version 2, on the other hand, contains many rhythmic similarities (i.e. the eighth-note patterns contained in the keyboard I, viola, keyboard II, and violin II parts), but illustrates distinct pitch contours. Together, the two versions illustrate how a single user can generate different accompaniments from the same initial monophonic starting melody, and how the initial melody exerts its influence both rhythmically and harmonically.
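The listener study results reported next rest on a paired comparison: each participant rates both the collaborative and the FSMC-alone accompaniment. A minimal sketch of that analysis, with made-up ratings rather than the study data, could look as follows (assuming scipy is available):

```python
from scipy import stats

collaborative = [7, 8, 6, 9, 7, 8, 5, 9]   # one rating per listener
fsmc_alone    = [5, 6, 6, 7, 4, 7, 5, 6]   # same listeners, same song

# Paired t-test: the samples are matched by listener, so ttest_rel
# (rather than an independent-samples test) is appropriate.
t_stat, p_value = stats.ttest_rel(collaborative, fsmc_alone)
if p_value < 0.05:
    print(f"significant difference (t={t_stat:.2f}, p={p_value:.3f})")
else:
    print(f"no significant difference (p={p_value:.3f})")
```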
Songs 2, 3, and 4 exhibit a similar effect: rhythmic and harmonic influence from the original melody, yet distinctive and original accompaniment nevertheless. The result is that the overall arrangements sound composed even though they are evolved through a breeding process. The next section provides evidence that impartial listeners also appreciate the contribution of the human user.

Figure 4: Early One Morning. The first four measures of versions 1 and 2 of Early One Morning illustrate how a single user with the same monophonic starting melody can direct the accompaniment in two different ways that nevertheless both relate to the initial melody. (a) Version 1 parts: Violin I (Layer 1), Violin II (Layer 2), Violoncello (Layer 3), Double Bass (Layer 4), Viola (Layer 5). (b) Version 2 parts: Keyboard I (Layer 1), Viola (Layer 2), Keyboard II (Layer 3), Electric Bass (Layer 4), Violin I (Layer 5), Violin II (Layer 6). Because the accompaniments share two of their layers, they sound related. However, through timbre selection and the evolution of two and three distinct layers in versions 1 and 2 respectively, the user imparts a different feel.

Listener Study Results

The results of the listener study in figure 5 indicate that all of the collaborative accompaniments are rated higher than those generated with FSMC alone, with three out of five (Song 1 version 2, Song 3, and Song 4) displaying a significant difference (p < 0.05; Student's paired t-test). Taken all together, the collaborative accompaniments sound highly significantly more appealing than those generated with FSMC alone (p < 0.001; Student's paired t-test). These results indicate not only that FSMC provides a structurally plausible search space, but also that it is possible to explore such a space without applying musical expertise. That is, the results suggest that the user input significantly improves the perceived quality of the generated compositions.

Discussion

A key feature of figure 4 is that the collaborative accompaniments generated by users with the assistance of FSMC follow the melodic and rhythmic contours of the original scaffold. Furthermore, the listener study suggests that FSMC helps the user establish and explore musical search spaces that may otherwise have been inaccessible. While the users search this space through IEC, which facilitates the combination of musical ideas and the exploration of the space itself, an interesting property of this search space is its robustness; even FSMC-alone accompaniments, which are created without the benefit of human, subjective evaluation, can sound plausible. However, when coupled with the human user, this approach in effect transforms the user's own internal search space of possible accompaniments into one constrained by functional scaffolding.

While the quantitative data suggest the merit of collaborative accompaniments, music is inherently subjective. Therefore, it is important for readers to judge the results for themselves at http://eplex.cs.ucf.edu/fsmc/iccc2012 to fully appreciate the potential of the FSMC method.

One interesting direction for future work is to explore new interpretations of the output of the pitch functions. Currently, accompaniment pitches are interpreted as discrete note values, a process that limits the instrument to playing the same note each time a given combination of notes occurs in the scaffold.
However, by interpreting the output as a change in pitch (i.e. a horizontal interval) rather than as an absolute pitch, instruments could select any note to correspond to a particular combination depending on where in the piece it occurs. In this way, an even larger space of musical possibilities could be created. Perhaps most importantly, with only a single, monophonic melody, users could compose entire multipart pieces without the need for musical expertise. Even if not at the master level, such a capability opens to the novice an entirely new realm of exploration.

Conclusion

This paper presented an extension to functional scaffolding for musical composition (FSMC) that facilitates a human user's creativity by generating polyphonic compositions from a single, human-composed monophonic starting track. The technique enables creative exploration by helping the user construct and then navigate a search space of candidate accompaniments through a breeding process, akin to animal breeding, called interactive evolutionary computation (IEC). These collaborative accompaniments bred by users were judged by listeners against those composed through FSMC alone. Overall, listeners liked the collaborative accompaniments more than the FSMC-alone accompaniments. Most importantly, a promising potential for creativity enhancement in AI is to open up to the amateur a domain once accessible only to the expert. The approach in this paper is a step in this direction.

Figure 5: Listener Study Results. The average rating (by 129 participants) from one to ten of both the collaborative and FSMC-alone accompaniments is shown side-by-side for Song 1 (v. 1), Song 1 (v. 2), Songs 2-4, and overall, with lines indicating a 5% error bound. The overall results for the listener study indicate that on average the collaborative accompaniments are of significantly higher perceived quality than FSMC-alone.

Acknowledgements

This work was supported in part by the National Science Foundation under grant no. IIS-1002507 and also by an NSF Graduate Research Fellowship.

2012_18 !2012 Soup Over Bean of Pure Joy: Culinary Ruminations of an Artificial Chef Richard G. Morris, Scott H. Burton, Paul M. Bodily, and Dan Ventura Computer Science Department Brigham Young University rmorris@axon.cs.byu.edu, sburton@byu.edu, norkish@gmail.com, ventura@cs.byu.edu

Abstract

We introduce a system for generating novel recipes and use that context to examine some current theoretical ideas for computational creativity. Specifically, we have found that the notion of a single inspiring set can be generalized into two separate sets used for generation and evaluation, respectively, with the result being greater novelty as well as system flexibility (and the potential for natural meta-level creativity); that explicitly measuring artefact typicality is not always necessary; and that explicitly defining an artefact hierarchically results in greater novelty.

1 Introduction

As a relatively new sub-field of artificial intelligence (AI), computational creativity is currently wrestling with many issues similar to those with which AI struggled several decades ago. Many questions similar to those originally asked of AI are now being asked in the context of computational creativity, including foundational questions such as "What is creativity?" Within computational creativity, there is an ongoing movement to define a theoretical foundation that can provide a level of maturity to the field.
For example, Wiggins gives the following definition of computational creativity that closely mirrors definitions of intelligence accepted by many AI researchers (Wiggins 2006): "The study and support, through computational means and methods, of behaviour exhibited by natural and artificial systems, which would be deemed creative if exhibited by humans."

As another example, Ritchie provides a level of formalism by supplying a framework for evaluating a creative system (Ritchie 2007). Assuming that a creative system's purpose is to produce creative artefacts, Ritchie's framework evaluates the creativity of the system in terms of the typicality and quality of generated artefacts in relation to some inspiring set of known artefacts. Taking some of Ritchie's ideas one step further, Gervás proposes that creative systems must be able to consistently generate creative artefacts, producing artefacts that are also novel with respect to the system's own previous work (Gervás 2011). Gervás shows this can be accomplished by splitting the inspiring set (as discussed by Ritchie) into a reference set (used to determine the novelty of generated artefacts) and a learning set (used in the generation of artefacts). We modify this idea by splitting the inspiring set into one set used in the generation of artefacts and one used for evaluating generated artefact quality. Note that this does not address the idea of a reference set at all, but neither does it preclude the use of one (let us say the two ideas are orthogonal and likely complementary).

Evaluation of a creative system is both clearly important and inherently difficult. In a recent comprehensive survey of published creative systems, Jordanous found that only half of the papers give details on an evaluation of their system (Jordanous 2011). Despite the difficulty of measuring creativity, quality, and typicality, greater attempts must be made to evaluate them if the field is to gain maturity. In an attempt to do so, we provide an explicit measure of quality used during the artefact generation process. We also show that an explicit measure of typicality is not necessary if it is built in to the generation process. In addition, we present an explicit measure of novelty (rare n-grams). We also show that explicitly defining a hierarchy for elements of our artefacts is beneficial to the creative system. We compare a hierarchical version of our system with one lacking any hierarchy and demonstrate greater novelty in the artefacts produced. The hierarchical version also gives a natural method to implicitly model typicality in the system without inhibiting novelty.

Novel perspectives on the developing theory of computational creativity are provided by concrete applications of the theory in diverse areas. Creative systems have been produced for a wide variety of artefacts, including poetry (Gervás 2000; Gervás 2001), literature (Pérez y Pérez and Sharples 2001; Pérez y Pérez 2007), music (Jordanous 2010; Lewis 2000; Monteith et al. 2011), theorem proving (Ritchie and Hanna 1984; Colton 2002), humor (Stock and Strapparava 2005; Binsted and Ritchie 1997), metaphor (Veale and Hao 2007), and art (Cohen 1999; Colton 2008; Norton, Heath, and Ventura 2011). The distinctive context of each of these concrete applications provides a novel perspective on the developing field of computational creativity. Further exploration of new domains provides additional viewpoints to help the theory mature.
To this end, we present a creative system for recipe generation. While work on recipes has been done in the field of artificial intelligence, to our knowledge a recipe generation system whose focus is creativity has not yet been developed (or even attempted).

Figure 1: High-level view of the system architecture. Inspiring set recipes are taken from online sources and inform the evaluator and generator. Recipes are created through an iterative process involving both generation and evaluation. Eventually, generated recipes with the highest evaluation are fed to the presentation module for rendering and may be published online.

These other AI recipe generators use case-based reasoning to plan out a recipe, in the case of CHEF (Hammond 1986), or a meal, in the case of Julia (Hinrichs 1992). These approaches maximize the quality of a presented recipe without considering novelty, often preferring prior success to exploring new possibilities. The goal of our system is not only to produce a good recipe, but also to produce a creative one. This requires high quality as well as the development of novel artefacts.

2 PIERRE

Recipe generation is a complicated task that requires not only precise amounts of ingredients, but also explicit directions for preparing, combining, and cooking the ingredients. To focus on the foundational task of choosing the type and amount of ingredients, we restrict our attention to recipes (specifically soups, stews, and chilis) that can be cooked in a crockpot. Crockpot recipes simplify the cooking process to essentially determining a set of ingredients to be cooked together. We introduce a novel recipe generation system, PIERRE (Pseudo-Intelligent Evolutionary Real-time Recipe Engine), which, given access to existing recipes, learns to produce new crockpot recipes. PIERRE is composed primarily of two modules, for handling evaluation and generation, respectively. Each of these components takes input from an inspiring set and each is involved in producing recipes to send to the presentation module, as shown in Figure 1. In addition, the system interacts with the web, both acquiring knowledge from online databases and (potentially) publishing created recipes.

2.1 Inspiring Set

The inspiring set contains 4,748 soup, stew, and chili recipes gathered from popular online recipe websites (www.foodnetwork.com and www.allrecipes.com). From these recipes we manually created both a list of measurements and a list of ingredients in order to parse recipes into a consistent format. This parsing enabled 1) grouping identical ingredients under a common name, 2) grouping similar ingredients at several levels, and 3) gathering statistics (including min, max, mean, variance, and frequency) about ingredients and ingredient groups across the inspiring set. Recipes in the inspiring set are normalized to 100 ounces.
The database of ingredients was explicitly partitioned into a hierarchy in which similar ingredients were grouped at a sub-level and these ingredient groups were further grouped at a super-level. For example, as shown in Figure 2, the super-group Fruits and Vegetables is composed of the sub-groups Beans, Fruits, Leafy Vegetables, and others. The sub-group Beans includes many different types of beans, including Butter Beans, Red Kidney Beans, Garbanzo Beans, and others. Statistics are kept for each ingredient, including minimum, maximum, average, and standard deviation of the amount of the ingredient, as well as the probability of the ingredient occurring in an inspiring set recipe. These statistics are also aggregated at the sub- and super-group level, enabling comparison and evaluation of recipes at different levels of abstraction. In addition, gathering statistics at the group level provides for smoothing amounts for rare ingredients. Each statistic $\sigma$ (min, max, mean, standard deviation, or frequency) for ingredients occurring less than a threshold number of times in the set is linearly interpolated with the corresponding statistic of the sub-group, according to the following:

\[
\sigma = \begin{cases}
\left(\dfrac{\alpha}{\alpha+\beta}\right) x + \left(\dfrac{\beta}{\alpha+\beta}\right) \xi & \text{if } \alpha < \theta \\[6pt]
x & \text{if } \alpha \ge \theta
\end{cases}
\]

where $x$ is the statistic of the ingredient, $\xi$ is the statistic of the sub-group, $\alpha$ is the number of times the ingredient occurs in the inspiring set, $\beta$ is the number of times any of the sub-group ingredients occur in the inspiring set, and the threshold $\theta$ is set to 100.

The inspiring set is used differently for generation than it is for evaluation. During artefact generation (Section 2.2) the inspiring set determines the initial population used for the genetic algorithm. During artefact evaluation (Section 2.3) the inspiring set determines which recipes and ratings are used as training examples for the multi-layer perceptron (MLP). Since the inspiring set is used in multiple ways, employing a different inspiring set for generating artefacts than the one used to evaluate artefacts can have useful effects.
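A minimal sketch of the smoothing rule above (the function name and example numbers are illustrative, not from the paper):

```python
def smooth_statistic(x, xi, alpha, beta, theta=100):
    """x: the ingredient's own statistic (e.g. mean amount in ounces);
    xi: the same statistic aggregated over the ingredient's sub-group;
    alpha: occurrences of the ingredient in the inspiring set;
    beta: occurrences of any ingredient from the same sub-group;
    theta: rarity threshold below which smoothing applies."""
    if alpha >= theta:
        return x                   # common ingredient: trust its own data
    w = alpha / (alpha + beta)     # weight grows with ingredient evidence
    return w * x + (1 - w) * xi

# A bean seen only 5 times borrows heavily from the Beans sub-group:
print(smooth_statistic(x=6.2, xi=4.0, alpha=5, beta=495))  # ~4.02
```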
2.2 Generation

PIERRE generates new recipes using a genetic algorithm acting on a population of recipes, each composed of a list of ingredients. The population is initialized by choosing recipes uniformly at random from the inspiring set, and the fitness of each recipe is evaluated using the MLP evaluator described in Section 2.3.

Figure 2: Above, a view of the ingredient hierarchy, showing the super-group (left), sub-group (middle), and ingredient (right) levels of abstraction. The Fruits & Vegetables super-group is expanded to show its sub-groups, including Beans, which is expanded to show its ingredients. Below, an example recipe is shown as it would appear at each level of abstraction.

To produce each generation, a number of new recipes are generated equal to the number of recipes in the population. For each new recipe, two recipes are selected, with probability proportional to their fitness, for genetic crossover. The crossover is performed by randomly selecting a pivot index in the ingredient list of each recipe, thus dividing each recipe into two sub-lists of ingredients. A new recipe is then created by combining the first sub-list of the first recipe with the second sub-list of the second recipe. After crossover, each recipe is subject to some probability of mutation. If a mutation occurs, the type of mutation is selected uniformly from the following choices (a sketch of these operators follows this list):

• Change of ingredient amount. An ingredient is selected uniformly at random from the recipe, and its quantity is set to a new value drawn from a normal distribution that is parameterized by the mean and standard deviation of that ingredient's amount as determined from the inspiring set.

• Change of one ingredient to another. An ingredient is selected uniformly at random from the recipe, and is changed to another ingredient from the same super-group, chosen uniformly at random. The amount of the ingredient does not change.

• Addition of ingredient. An ingredient is selected uniformly at random from the database and inserted into a random location (chosen uniformly) in the recipe's ingredient list. The amount of the new ingredient is determined by a draw from a normal distribution parameterized by the mean and standard deviation of the ingredient amount as determined from the inspiring set.

• Deletion of ingredient. An ingredient is selected uniformly at random and removed from the recipe.

At the completion of each iteration, evolved recipes are re-normalized to 100 ounces for equal comparison to other recipes.
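The following is a minimal sketch of these operators under stated assumptions: a recipe is a list of (ingredient_name, ounces) pairs; `stats` maps each ingredient to the (mean, std) of its amount in the inspiring set; `same_supergroup` returns a random ingredient from the same super-group. The helpers and names are hypothetical stand-ins for PIERRE's ingredient database.

```python
import random

def crossover(a, b):
    """Single-pivot crossover: head of recipe a + tail of recipe b."""
    i, j = random.randrange(1, len(a)), random.randrange(1, len(b))
    return a[:i] + b[j:]

def mutate(recipe, stats, same_supergroup, all_ingredients):
    recipe = list(recipe)
    op = random.choice(["amount", "swap", "add", "delete"])
    if op == "amount":                   # redraw one ingredient's amount
        k = random.randrange(len(recipe))
        name, _ = recipe[k]
        mean, std = stats[name]
        recipe[k] = (name, max(0.0, random.gauss(mean, std)))
    elif op == "swap":                   # replace within same super-group
        k = random.randrange(len(recipe))
        name, amt = recipe[k]
        recipe[k] = (same_supergroup(name), amt)
    elif op == "add":                    # insert a random new ingredient
        name = random.choice(all_ingredients)
        mean, std = stats[name]
        k = random.randrange(len(recipe) + 1)
        recipe.insert(k, (name, max(0.0, random.gauss(mean, std))))
    else:                                # delete a random ingredient
        del recipe[random.randrange(len(recipe))]
    return normalize(recipe)

def normalize(recipe, total=100.0):
    """Re-scale amounts so the recipe sums to `total` ounces."""
    s = sum(amt for _, amt in recipe)
    return [(name, amt * total / s) for name, amt in recipe]
```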
The next generation is then selected by taking the top 50% (highest fitness) of the previous generation and the top 50% of the newly generated recipes; the rest of the recipes are discarded, keeping the population size constant. Recipes 1 and 2 were generated using this process and were among those prepared, cooked, and fed to others by the authors. To produce these recipes, a population of 150 recipes was allowed to evolve for 50 generations with a mutation rate of 40%.

2.3 Evaluation

To assess the quality of recipes, PIERRE uses an interpolation of two MLPs. Taking advantage of the (online) public user ratings of the recipes in the inspiring set, these MLPs perform a regression of the user rating based on the amounts of different ingredients. The two MLPs are trained at different levels of abstraction within our ingredient hierarchy, with one operating at the super-group level and the other at the sub-group level. Thus, the model at the higher level of abstraction attempts to learn the proper relationship of the major groups (meats, liquids, spices, etc.), and the other model works to learn the correct amounts of the divisions within those groups. Because we assume any recipe from the online websites is of relatively good quality, regardless of its user rating, we supplemented the training set with randomly constructed recipes given a rating of 0. These negative examples enabled the learner to discriminate between invalid random recipes and the valid ones created by actual people.

Each MLP has an input layer consisting of real-valued nodes that encode the amount (in ounces) of each super-group (or sub-group, respectively), a hidden layer consisting of 16 hidden nodes, and a single real-valued output node that encodes the rating (between 0 and 1). The MLP weights are trained (with a learning rate of 0.01) until there is no measurable improvement in accuracy on a held-out validation data set (consisting of 20% of the recipes) for 50 epochs. The set of weights used for evaluating generated recipes are those that performed best on the validation data set.

Recipe 1: Divine water with sirloin
Ingredients: 2.35 cups water; 2.07 cups yellow onion; 1.76 cups black bean; 1.43 cups stewed tomato; 10.71 ounces steak; 10.68 ounces ground beef; 0.72 cup salsa; 0.66 cup chicken broth; 3.01 tablespoons emeril's southwest essence; 0.87 ounce veal; 1.22 tablespoons white onion; 1.22 tablespoons diced tomato; 1.17 tablespoons red kidney bean; 2.79 teaspoons sambal oelek; 0.22 clove garlic; 2.28 teaspoons white bean; 1.83 teaspoons corn oil; 0.29 ounce pancetta; 1.67 teaspoons mirin; 1.51 dashes tom yam hot and sour paste; 1.46 dashes worcestershire; 0.12 ounce bologna.
Directions: Combine ingredients and bring to boil. Reduce heat and simmer until done, stirring occasionally. Serve piping hot and enjoy.

Recipe 2: Exotic beefy bean
Ingredients: 2.2 cups pinto bean; 1.09 pounds ground beef; 1.6 cups white onion; 1.16 cups diced tomato; 1.13 cups water; 1.11 cups chicken broth; 0.77 cup vegetable broth; 0.63 cup chile sauce; 2.74 ounces pork sausage; 4.51 tablespoons salsa; 3.39 tablespoons stewed tomato; 1.43 ounces chicken thigh; 2.5 tablespoons olive oil; 1.09 ounces hen; 0.34 whole red bell pepper; 1.25 tablespoons lentil; 1.16 tablespoons chopped tomato; 2.87 teaspoons red onion; 2.03 teaspoons garbanzo bean; 1.65 teaspoons cannellini bean; 0.26 slice bacon.
Directions: Combine ingredients and bring to boil. Reduce heat and simmer until done, stirring occasionally. Serve piping hot and enjoy.
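A minimal sketch, under the same assumptions as the operator sketch above, of one generation of this loop: fitness-proportional parent selection, crossover, probabilistic mutation, and survival of the top halves of the old and new populations. Here `evaluate`, `crossover`, and `mutate` are passed in as callables (the MLP fitness function and the operators sketched earlier).

```python
import random

def next_generation(population, evaluate, crossover, mutate,
                    mutation_rate=0.4):
    weights = [evaluate(r) for r in population]
    children = []
    for _ in range(len(population)):     # as many children as parents
        a, b = random.choices(population, weights=weights, k=2)
        child = crossover(a, b)
        if random.random() < mutation_rate:
            child = mutate(child)
        children.append(child)
    # Keep the top 50% of the previous generation and the top 50% of the
    # new recipes, so the population size stays constant.
    old = sorted(population, key=evaluate, reverse=True)
    new = sorted(children, key=evaluate, reverse=True)
    half = len(population) // 2
    return old[:half] + new[:half]
```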
2.4 Presentation

Colton (2008) has suggested that perception plays a critical role in the attribution of creativity. In other words, a computationally creative system could (and possibly must) take some responsibility for engendering a perception of creativity. In an attempt to help facilitate such a perception of its artefacts, PIERRE contains a module for recipe presentation. First, the module formats the recipe for human readability. Ingredient quantities are stored internally in ounces, but when rendering recipes for presentation, the ingredients are sorted by amount and then formatted using more traditional measurements, such as cups, teaspoons, dashes, and drops. Recipes are presented in a familiar way, just as they might appear in a common cookbook. Second, the presentation module generates a recipe name. Standard recipes always have a name of some sort. While this task could be a complete work by itself, we implemented a simple name generation routine that produces names in the following format: [prefix] [ingredients] [suffix]. This simple generation scheme produces names such as "Homestyle broccoli over beef blend" or "Spicy chicken with carrots surprise." The components of the name are based on prominent recipe ingredients and the presence of spicy or sweet ingredients. This simple approach creates names that range from reasonable to humorous.

3 EmPIERREical Results

To our knowledge, no other creative system has been designed to work in the recipe domain. As such, traditional concepts are highlighted in a new context. This new perspective admits additional analysis of the merits and nuances of theoretical ideas that have become generally accepted by the community. Here we evaluate the system with different combinations of inspiring sets, with and without a direct measure of typicality, and with and without the hierarchical definition of an artefact.

We measure novelty in a recipe by counting new combinations of (known) ingredients, n-grams. An n-gram is a combination of n ingredients; for example, a 2-gram would be water-garlic. A rare n-gram is an n-gram that does not occur in the inspiring set and does not contain a rare (n−1)-gram as a sub-combination (e.g., 4-grams containing rare 3-grams or, recursively, rare 2-grams are not included in the count of rare 4-grams). We define the rare n-gram ratio $\rho_r^n$ for a specific recipe $r$ as

\[
\rho_r^n = \frac{\varepsilon_r^n}{\tau_r^n}
\]

where $\tau_r^n$ is the total number of n-grams in $r$ and $\varepsilon_r^n$ is the number of those n-grams that are rare.

As another view of novelty, we consider a graph of ingredient amounts, which creates a visual profile of the type of recipes generated by the system. This comparison of visual profiles was inspired by Faria and de Oliveira's use of a similar method in measuring aesthetic distances between document templates and generated document artefacts (Faria and de Oliveira 2006), and we found that it was easy to compare the outputs of the system based on the profiles that it generated.

3.1 Different Inspiring Sets for Evaluation

As mentioned, PIERRE can have different inspiring sets for artefact generation and artefact evaluation. Thus the artefact initially generated would be inspired by one set of artefacts, but its fitness would be determined by a fitness function inspired by a different set of artefacts.
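A minimal sketch of the rare n-gram ratio defined above, with ingredient combinations treated as unordered sets; the treatment of single ingredients (a 1-gram is rare only if unseen, so combinations involving unknown ingredients are excluded) is our reading of the definition, not stated code from the paper.

```python
from itertools import combinations

def ngrams(ingredients, n):
    return {frozenset(c) for c in combinations(sorted(ingredients), n)}

def rare_ngram_ratio(recipe, inspiring_recipes, n):
    """recipe: set of ingredient names; inspiring_recipes: list of sets."""
    seen = {k: set() for k in range(1, n + 1)}
    for r in inspiring_recipes:
        for k in range(1, n + 1):
            seen[k] |= ngrams(r, k)

    def is_rare(gram):
        k = len(gram)
        if gram in seen[k]:
            return False          # occurs in the inspiring set: not rare
        if k == 1:
            return True
        # a rare n-gram must not contain a rare (n-1)-gram
        return not any(is_rare(frozenset(s))
                       for s in combinations(gram, k - 1))

    total = ngrams(recipe, n)
    rare = [g for g in total if is_rare(g)]
    return len(rare) / len(total) if total else 0.0
```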
Using a combination of inspiring sets in the generative process hints at an idea that Buchanan identifies as "transfer" or knowledge sharing (Buchanan 2001), which refers to the notion that where two problems have simple, heterogeneous representations, greater creativity can be achieved by transferring knowledge from one problem area to another. Although developing recipes from different inspiring sets may not constitute different problems in the sense intended by Buchanan, the concepts and methods used by humans to develop recipes in one inspiring set may differ greatly from the concepts and methods used to develop recipes in a different culinary genre. Thus the knowledge used in the composition of artefacts in one inspiring set is introduced in the generation of new artefacts in a different domain, resulting in potentially greater creativity.

We experimented with various combinations of two inspiring sets. The first inspiring set included the 4,748 soup, stew, and chili recipes crawled from the web (referred to as the "full" inspiring set). The second set is a subset of the first, including only the 594 chili recipes. The chili recipes were longer on average than the full recipes (13.97 ingredients as compared to 11.88 ingredients). We found no significant results from varying the generator's inspiring set; therefore, all reported experiments were conducted with a generator trained with the full inspiring set.

We found that the recipes produced using the chili inspiring set to train the evaluator (hereafter referred to as the "chili evaluator") had a higher ratio of rare 2-grams and 3-grams (see blue lines in Figure 3) than those produced using the full inspiring set to train the evaluator (hereafter referred to as the "full evaluator"; see red lines in Figure 3), and a relatively lower ratio of rare 4-grams and 5-grams. Because the system is using different inspiring sets to generate and evaluate recipes, it alters the original recipes to look more like the recipes found in the evaluator's inspiring set. In this context, generic soups or stews are being modified to look more like chilis. The resulting chilis retain some of the characteristics of the generic soups and stews, resulting in more novel combinations of ingredients and flavors (for chilis). Systems that trained the evaluator with chili recipes produced recipes with a "chili" profile, as evidenced by more meat and vegetables and less dairy and liquids (see blue lines in Figure 4). Systems that trained the evaluator with full recipes produced recipes with a marked "full" profile (red lines). This discovery suggests that a system's creativity can be guided through the use of different inspiring sets. Combining the use of different inspiring sets could introduce different flavor profiles and allow the system to explore new parts of the recipe space.

Figure 3: Average (over r) rare n-gram ratio for various values of n. Higher ratio values indicate increased novelty, with the chili evaluator producing the most novelty. Omitting the hierarchy noticeably reduces novelty, whereas including the distance metric has little effect.
3.2 Elimination of Explicit Typicality Metrics in the Fitness Function

We tested PIERRE with and without an explicit distance metric that essentially models a Wundt curve (Saunders and Gero 2001), promoting the generation of recipes that are neither too novel nor too typical. Although the theory can be interpreted to require an explicit evaluation of typicality (Ritchie 2007), in our experiments we found that removing the distance metric from our evaluation has no significant effect on the typicality or the novelty of our recipes (see the dotted lines in Figures 3 and 4). Explicitly measuring typicality is not necessary if typicality is implicitly modeled in the artefact generation process. In our system, ingredient quantities and ingredient counts were generated based on statistics found in the inspiring set. In addition, typicality is [...]

Figure 4: Ingredient-group profiles (average amount per group) of recipes produced by the chili and full evaluators, with and without the distance metric.

[...] like a ___? They are both ___", where the following word relations must hold:

• Y3-0 SoundsLike Y3
• X3 ConceptuallyRelatedTo Y3
• Y3 ConceptuallyRelatedTo X3
• Y3 PartOf X3
• X6 ConceptuallyRelatedTo Y3-0
• X6 IsA Y3-0
• Y3-0 ConceptuallyRelatedTo X6

From the example above we can see that the notion of a template in T-PEG is equivalent to the conflation of a schema, description rule, and template in STANDUP. The constructed templates are then filtered based on a graph-connectedness heuristic, i.e. if the variables are nodes and the word relationships are edges, a template must form a connected graph to be deemed a valid template. Once the template has been extracted, generation proceeds in a similar fashion to STANDUP. In this paper we are less interested in the generation aspect of T-PEG, as it is largely similar to JAPE and STANDUP, and more interested in its ability to automatically learn or extract templates. Specifically, we are interested in assessing the ability of T-PEG to correctly identify the necessary and sufficient conditions for generating punning riddles.

Hong and Ong (2009) report a manual evaluation in which a subset of the extracted templates was assessed by a linguist, whose job was to determine whether the extracted templates were able to capture the 'essential word relationships in a pun'. The evaluation criteria are based on the number of incorrect relationships as identified by the linguist, including missing relationships, extra relationships, and incorrect word pairings. A scoring system from 1 to 5 is used, where 5 means there is no incorrect relationship, 4 means there is one incorrect relationship, and so on. They report an average score of 4.0 out of a maximum of 5.

Automatic evaluation of template extraction

There are two issues concerning the manual evaluation of template extraction presented in Hong and Ong (2009). Firstly, we believe this evaluation is rather subjective. Although punning riddles are relatively simple and straightforward to analyse, the linguists were not the original authors of the jokes, and thus there is room for misinterpretation or incorrect emphasis. Furthermore, it is unquestionable that relying on the manual judgment of a linguist is both time-consuming and costly.
The manual evaluation reported in Hong and Ong (2009) was carried out on 27 templates generated from 27 jokes, which is a rather small sample from which to draw any conclusion. Our observation is that if one had access to a large corpus of punning riddles that had somehow been annotated with the 'correct' word relationships underlying each joke, one could assess T-PEG's template extraction functionality by comparing the resulting templates against the reference word relationships. Unfortunately, we know of no such annotated resource in existence. However, we can use an existing punning riddle generator such as STANDUP to produce an approximation of such a resource, since we can access the underlying data structure of a generated punning riddle. In the joke generation module of STANDUP, the JokeGraph object of a generated punning riddle provides full access to the underlying lexical relationships.

Another issue we attempt to address is the fact that T-PEG makes no attempt at generalization of the extracted templates. Given fifty exemplar punning riddles, it will attempt to construct fifty templates. Hong and Ong state that this is beneficial 'to increase coverage'. However, we contend that if we are interested in building systems that computationally model the mechanisms of linguistic humor, coverage is not enough. A creative generative system should be able to generate artefacts from a limited set of symbolic rules. Thus, T-PEG should be able to carry out some abstraction over the extracted templates, to yield a set of highly-productive patterns. These two goals form the rationale of our proposed setup, which we discuss in the next section.

Proposed setup

As discussed above, the purpose of our experiment is to automatically evaluate the ability of T-PEG to correctly extract the templates that underlie a collection of punning riddles. The proposed setup is as follows. Firstly, STANDUP is used to generate a large number of punning riddles. For each riddle, we note the actual rules used by STANDUP to generate it; these are used during the evaluation phase. The riddles are then given to T-PEG, which yields a template for each riddle. These templates are then organized into clusters using agglomerative clustering, calculating the similarity between templates with a structural similarity metric based on the semantic similarity evaluation function presented in Manurung, Ritchie, and Thompson (2012). We then apply a simple majority rule to label the clusters, and evaluate the template extraction process using the widely-used notions of precision and recall. To achieve this, T-PEG first had to be modified by replacing its lexical and conceptual resources with those used in STANDUP, thus ensuring that the template extraction module would be able to identify the original lexical relationships in STANDUP.

Template clustering

The agglomerative clustering process starts with all templates belonging to singleton clusters. The distance of each cluster to all other clusters is then computed. The distance between two clusters is defined as the average distance between each pair of elements contained within the two clusters, also known as average linkage clustering. The two clusters with the shortest distance are then merged together. This process is repeated until k clusters remain, where k is provided as an input parameter.
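A minimal sketch of this average-linkage procedure (not the authors' code); `dist(a, b)` stands in for the structure-mapping distance between two extracted templates, defined next.

```python
def agglomerative_cluster(templates, dist, k):
    clusters = [[t] for t in templates]           # start with singletons

    def linkage(c1, c2):                          # average pairwise distance
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        # find and merge the pair of clusters with the shortest distance
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```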
In defining the notion of distance between two templates, we turn to the structure-mapping work of Love (2000) and Falkenhainer, Forbus, and Gentner (1989), which has defined a computational model of semantic similarity in terms of conceptual and structural similarity. Structural similarity measures the degree of isomorphism between two complex expressions. Conceptual similarity is a measure of relatedness between two concepts. More specifically, we use the semantic similarity evaluation function used in Manurung, Ritchie, and Thompson (2012), which implements a greedy algorithm based on Gentner's structure mapping theory (Falkenhainer, Forbus, and Gentner 1989). It takes two complex expressions, in our case two T-PEG extracted templates, and attempts to 'align' them in an optimal manner. It then applies a scoring function based on Love's computational model of similarity (Love 2000) to compute a score based on various aspects of the alignment. This function yields a distance of zero between two conceptually and structurally identical templates, and greater distances for increasingly different template pairs.

Cluster labeling

We then automatically label the clusters using a simple majority rule. First, we define the underlying schema of a template to be the label of the STANDUP schema that was used to generate the punning riddle from which the template was extracted. A cluster is then automatically labelled by identifying the underlying schemas of all its member templates and choosing the schema that created the majority of templates within that cluster. If several underlying schemas produced the same number of templates in a cluster, one is randomly selected. As an example, if a cluster contains 10 templates whose underlying schema is lotus, and 6 templates whose underlying schema is newelan1, then that cluster is labelled as representing the lotus schema.

Precision and recall

Using these cluster labels, we can compute measures that correspond to the widely-used notions of precision and recall in pattern recognition. In classification tasks, these measures are defined as follows:

\[
\text{Precision} = \frac{tp}{tp + fp} \qquad \text{Recall} = \frac{tp}{tp + fn}
\]

where in our case, given a cluster c with label l, tp (true positives) is the number of extracted templates that appear as members of c and whose underlying schema is l, fp (false positives) is the number of templates in c whose underlying schema is not l, and fn (false negatives) is the number of templates not in c but whose underlying schema is l. Precision and recall can be computed for each cluster, or as an aggregate measure over all resulting clusters.

Experimental setup

The experimental setup is as follows. Firstly, STANDUP is used to generate 20 jokes for each of 10 schemas, namely bazaar, lotus, doublepun, gingernutpun, rhyminglotus, newelan1, newelan2, phonsub, poscomp, and negcomp, resulting in a collection of 200 exemplar jokes. These jokes are then analysed by T-PEG, which yields 200 joke templates. We then apply agglomerative clustering until 10 clusters are formed (since 10 STANDUP schemas are used). Our hypothesis is that for T-PEG to be deemed successful in extracting templates, it should be able to correctly organize the 200 templates into 10 clusters that correspond to the 10 STANDUP schemas. The precision and recall metrics should provide appropriate quantitative measures of this goal.
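A minimal sketch of the majority-rule labeling and per-cluster precision/recall defined above (assumed, not the authors' code; `schema_of` maps a template to the STANDUP schema that generated its source riddle; the paper breaks ties randomly, whereas this sketch takes the first):

```python
from collections import Counter

def label_cluster(cluster, schema_of):
    counts = Counter(schema_of(t) for t in cluster)
    return counts.most_common(1)[0][0]     # majority schema (ties: first)

def precision_recall(cluster, label, all_templates, schema_of):
    tp = sum(1 for t in cluster if schema_of(t) == label)
    fp = len(cluster) - tp
    fn = sum(1 for t in all_templates
             if t not in cluster and schema_of(t) == label)
    return tp / (tp + fp), tp / (tp + fn)
```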
Results and discussion

Table 1 shows the results of applying agglomerative clustering to the 200 templates to obtain 10 clusters. The first column indicates the cluster number. The second and third columns specify the cluster membership, indicating the number of templates with a given underlying STANDUP schema found within that cluster. The fourth column indicates the cluster size. The last column indicates the label assigned to that cluster using the majority rule described above. For example, cluster 1 contains 21 templates, 19 of which have rhyminglotus as their underlying schema and 2 of which have bazaar as their underlying schema; accordingly, it is assigned the label rhyminglotus.

Table 1: Results of agglomerative clustering

No.  Membership (schema: count)                   Total  Label
1    rhyminglotus: 19; bazaar: 2                  21     rhyminglotus
2    bazaar: 11                                   11     bazaar
3    bazaar: 3                                    3      bazaar
4    lotus: 20; newelan1: 19; gingernutpun: 17;   89     lotus
     newelan2: 16; doublepun: 13; bazaar: 4
5    doublepun: 7; newelan2: 4; gingernutpun: 3   14     doublepun
6    rhyminglotus: 1                              1      rhyminglotus
7    newelan1: 1                                  1      newelan1
8    phonsub: 20                                  20     phonsub
9    poscomp: 20                                  20     poscomp
10   negcomp: 20                                  20     negcomp

Table 2 shows the precision and recall values computed for the clustering results. The first two columns indicate the cluster numbers and assigned labels, which correspond to the information in Table 1, and the last two columns indicate the precision and recall values computed for each cluster. The last row indicates the overall precision and recall; this aggregate result takes the different cluster sizes into account through weighted averages. Note that we collapsed clusters 1 and 6 because they were both labeled rhyminglotus, and similarly clusters 2 and 3 because they were both labeled bazaar.

Table 2: Precision and recall measures

No.     Label         Precision       Recall
1 & 6   rhyminglotus  20/22 = 0.91    20/20 = 1
2 & 3   bazaar        14/14 = 1       14/20 = 0.7
4       lotus         20/89 = 0.225   20/20 = 1
5       doublepun     7/14 = 0.5      7/20 = 0.35
7       newelan1      1/1 = 1         1/20 = 0.05
8       phonsub       20/20 = 1       20/20 = 1
9       poscomp       20/20 = 1       20/20 = 1
10      negcomp       20/20 = 1       20/20 = 1
Overall               0.61            0.763

Finally, Table 3 shows the confusion matrix of how the templates are classified according to their underlying schema. The rows indicate the original underlying schemas, and the entries indicate the cluster labels. For example, of the 20 templates extracted from punning riddles generated using the bazaar schema, 14 are correctly found within a cluster labeled bazaar, 2 are found in a cluster labeled rhyminglotus, and 4 are found in a cluster labeled lotus.

Table 3: Confusion matrix of clustering results (rows: underlying schema; entries: number of that schema's templates found in clusters with each label)

bazaar:        bazaar 14, rhyminglotus 2, lotus 4
rhyminglotus:  rhyminglotus 20
lotus:         lotus 20
doublepun:     lotus 13, doublepun 7
gingernutpun:  lotus 17, doublepun 3
newelan1:      lotus 19, newelan1 1
newelan2:      lotus 16, doublepun 4
phonsub:       phonsub 20
poscomp:       poscomp 20
negcomp:       negcomp 20

From these results, we can see that only templates with phonsub, poscomp, and negcomp as their underlying schemas are perfectly identified. Templates with the underlying schemas rhyminglotus and lotus are correctly clustered together, but their clusters suffer some impurities, with templates from other schemas also being deemed to belong to them. Most notably, the cluster labeled lotus contains a very large number of templates from other schemas, such as bazaar, doublepun, gingernutpun, newelan1, and newelan2. The cluster labeled newelan1 contains only one template. No clusters were labeled gingernutpun or newelan2. A purely random baseline, in which the 200 extracted templates are randomly assigned to 10 different clusters, would yield an expected precision and recall of 0.1. Whilst this is an artificially low baseline, the overall precision of 0.61 and recall of 0.763 suggest that T-PEG is able to extract some salient information regarding the underlying lexical relationships of a pun. However, certain underlying schemas are very difficult to distinguish from each other.
Upon further analysis, we can see that the problems arise from the fact that the templates extracted by T-PEG conflate the various rules in STANDUP, i.e. schemas, description rules, and canned text templates.

Schema       | bazaar | rhyminglotus | lotus | doublepun | gingernutpun | newelan1 | newelan2 | phonsub | poscomp | negcomp
bazaar       |   14   |      2       |   4   |           |              |          |          |         |         |
rhyminglotus |        |     20       |       |           |              |          |          |         |         |
lotus        |        |              |  20   |           |              |          |          |         |         |
doublepun    |        |              |  13   |     7     |              |          |          |         |         |
gingernutpun |        |              |  17   |     3     |              |          |          |         |         |
newelan1     |        |              |  19   |           |              |    1     |          |         |         |
newelan2     |        |              |  16   |     4     |              |          |          |         |         |
phonsub      |        |              |       |           |              |          |          |   20    |         |
poscomp      |        |              |       |           |              |          |          |         |   20    |
negcomp      |        |              |       |           |              |          |          |         |         |   20

Table 3: Confusion matrix of clustering results (rows: underlying schemas; columns: cluster labels; blank cells are zero)

To illustrate the issues, observe the following two jokes, both generated using the lotus schema, and their resulting extracted templates:

Joke 1: What do you call a cross between a firearm and a first step? A piece initiative.

The resulting template is: "What do you call a cross between a [X8] and a [X11 X12]? A [Y1] [Y2]", with the following word relationships:

• IsCompoundNoun(X11, X12)
• IsCompoundNoun(Y1-0, Y2)
• Synonym(X8, Y1)
• Synonym(Y2, X11:X12)
• Hypernym(X11:X12, Y1-0:Y2)
• Homophone(Y1-0, Y1)

From the joke we can see that the instantiations for the regular variables are X8=firearm, X11=first, X12=step, Y1=piece, and Y2=initiative. Furthermore, the similar-sound variable Y1-0 is bound to peace, because in WordNet, "peace initiative" is an instance of an initiative. Thus, the word relationships state that "first step" and "peace initiative" are compound nouns, firearm is a synonym of piece, initiative is a synonym of "first step", which in turn is a hypernym of "peace initiative", and that, lastly, peace and piece are homophones.

Joke 2: What do you call a cross between an l and a correspondent? A litre writer.

The resulting template is: "What do you call a cross between an [X8] and a [X11]? A [Y1] [Y2]", with the following word relationships:

• IsCompoundNoun(Y1-0, Y2)
• Synonym(X8, Y1)
• Synonym(X11, Y1-0:Y2)
• Homophone(Y1-0, Y1)

From the joke, we can see that the instantiations for the regular variables are X8=l, X11=correspondent, Y1=litre, and Y2=writer. Furthermore, the similar-sound variable Y1-0 is bound to letter, because in WordNet, "letter writer" is a synonym for "correspondent". Thus, the word relationships state that "letter writer" is a compound noun, l is a synonym of litre, correspondent is a synonym of letter writer, and litre and letter are deemed to be homophones.

From these two jokes and their resulting extracted templates, we can make several observations. Firstly, T-PEG correctly extracts the core lexical preconditions stated for the lotus schema, in that the ‘punchline' must contain a compound noun Y1-0:Y2 ("peace initiative" and "letter writer", respectively), where the first word is replaced with a homophone, Y1 (in the first joke, piece, and in the second, litre). However, the two jokes use different description rules for the question part. Whereas the former joke used a synonym (firearm) and a hypernym (first step) to describe the punchline, the latter used two synonyms, namely l and correspondent. Since T-PEG makes no distinction between word relationships arising from schemas or description rules, the choice of description rule, which is a somewhat trivial linguistic variation, leads the agglomerative clustering to falsely conclude that two jokes from different underlying schemas in fact use the same pattern. Additionally, T-PEG extracted ‘noisy' word relationships that play no part in the joke construction.
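To see how little the two templates share once description-rule relations are included, they can be encoded as sets of relation tuples. This is an illustrative representation, not T-PEG's actual data structure:

```python
# Hypothetical encoding of the two extracted templates as sets of
# (relation, arguments) tuples; illustrative only, not T-PEG's internals.
joke1 = {
    ("IsCompoundNoun", ("X11", "X12")),
    ("IsCompoundNoun", ("Y1-0", "Y2")),
    ("Synonym", ("X8", "Y1")),
    ("Synonym", ("Y2", "X11:X12")),
    ("Hypernym", ("X11:X12", "Y1-0:Y2")),
    ("Homophone", ("Y1-0", "Y1")),
}
joke2 = {
    ("IsCompoundNoun", ("Y1-0", "Y2")),
    ("Synonym", ("X8", "Y1")),
    ("Synonym", ("X11", "Y1-0:Y2")),
    ("Homophone", ("Y1-0", "Y1")),
}

shared = joke1 & joke2     # the schema-level core: 3 relations
differing = joke1 ^ joke2  # description-rule and 'noisy' relations: 4 relations
print(sorted(shared))
print(sorted(differing))
```

Under this encoding, the shared core corresponds to the lotus preconditions (the compound-noun punchline and the homophone substitution), while everything that differs stems from description rules or extraneous relations.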
Whereas the word relationships of the extracted template for the second joke capture exactly the necessary and sufficient conditions, in the former joke, nothing hinges on the fact that "first step" is a compound noun, nor that initiative is a synonym of "first step". Such extraneous word relationships further pull the templates into wrong clusters. From this experiment, we can conclude that although T-PEG may be successful at learning some joke templates given sample punning riddles, it is still making faulty assertions as to the constraints that specify what makes the riddle ‘work' as a joke. However, we speculate that much work can be done to repair such errors. The data redundancy contained within the specific templates that are clustered together, for instance, can be further analysed to form core relationships that are at the heart of the punning riddle structure. This is an avenue of future work that we intend to explore.

Summary

The manually constructed rules of STANDUP are specifically designed to maximize generative power whilst retaining a fairly limited set of symbolic rules. T-PEG, on the other hand, tries to address the issue of having to handcraft rules by automatically extracting templates from example jokes. However, the evaluation of this functionality has been fairly limited, given the difficulty and cost of manual evaluation. This paper has attempted to carry out a fairly novel methodology of automatically evaluating the performance of one creative system (T-PEG) using another creative system (STANDUP) to produce sample data with complete underlying annotations to compare against. Although the results are far from conclusive, they corroborate the results of the manual evaluation in (Hong and Ong 2009), which claims that T-PEG was successful in extracting templates from sample jokes. The experiment also sheds light on where T-PEG is still lacking in its ability to extract the underlying generative rules of punning riddles. Moreover, the benefit is twofold: the template clustering process proposed in this work has also attempted to address the generalizability of T-PEG. It is not sufficient to say that a huge number of templates will ensure maximum coverage. For a generative system to be deemed creative, it should be able to generate a high ratio of good quality artefacts from a limited set of symbolic rules. By breaking down the patterns into schemas, description rules, and templates, and by stating the conditions under which they can be composed together, STANDUP is able to produce a very wide range of different jokes, and in some sense can "explain" the craftsmanship behind its joke production facilities, as each individual component represents a specific function of the joke. T-PEG's templates, on the other hand, are monolithic structures that cannot be broken down into their functional components. This distinction is to be expected, given that the rules found within STANDUP were manually constructed, whereas T-PEG rules are automatically extracted. Nevertheless, this points towards a promising direction of future work in the automatic extraction of rules from creative artefacts. As stated in the previous section, we believe that the template clustering process opens up the possibility of future work, i.e. by further leveraging the data redundancy contained within the resulting clusters.
Further still, given that the clustering process makes use of the notion of distances between templates and clusters, one could imagine a technique that selects the template closest to the centroid of a cluster as the most representative template for further generation. Finally, we would like to explore more sophisticated generalization techniques that would enable us to tease out the distinctions between component rules such as schemas, description rules, and canned text templates.

Acknowledgments

We would like to thank the creators of the STANDUP and T-PEG systems for making their software available for further experimentation and development. We would also like to thank the anonymous reviewers for their feedback and suggestions.

2012_21 !2012 Evaluating Musical Metacreation in a Live Performance Context
Arne Eigenfeldt, Contemporary Arts, Simon Fraser University, Vancouver, BC, Canada, arne_e@sfu.ca
Adam Burnett, Cognitive Science, Simon Fraser University, Burnaby, BC, Canada, ajb14@sfu.ca
Philippe Pasquier, Interactive Arts and Technology, Simon Fraser University, Surrey, BC, Canada, pasquier@sfu.ca

Abstract

We present an evaluation study of several musical metacreations. An audience that attended a public concert of music performed by string quartet, percussion, and Disklavier was asked to participate in a study to determine its success: 46 complete surveys were returned. Ten compositions, by two composer/programmers, were created by five different software systems. For purposes of validation, two of these works were human-composed, while a third was computer-assisted; the audience was not informed which compositions were human-composed. We briefly discuss the different systems, and present the artistic intent of each work, the methodology used in gathering audience responses, and the interpreted results of our analyses.

Introduction

The Musical Metacreation project1 is an ongoing research collaboration between scientists and composer/musicians at Simon Fraser University that explores the theory and practice of metacreation - the notion of developing software that demonstrates creative behaviour (Whitelaw 2004). The objectives include not only developing software, but producing and presenting artistic works that use the software, and validating their musical success. The research team includes a composer of acoustic and electroacoustic music who has created music composition and performance systems for over twenty years, an artificial intelligence researcher whose specialty includes multi-agent systems and cognitive modeling (and who is himself a creative artist in the field of computer music, sound design, audio and media arts), and several research assistants who are composers and/or scientists. The field of musical metacreation revolves around two central tasks:

• The composition task: the aim of this task is to produce music in the form of a symbolic representation, often a musical score. If the system takes existing compositions as input, it is said to be corpus-based.
• The interpretation task: given some symbolic musical notation, this task consists of generating an acoustic signal.

Sometimes, these two tasks coincide. For example, in electroacoustic music (in which we include electronica), an acoustic signal is directly generated as the output of the composition task. In the case of improvised music, composition and interpretation can be seen to happen simultaneously. The systems described in this paper, along with their evaluation, all address the composition task.
The creative systems produced by our research team have already been described in conference proceedings and journals, while the music produced has been presented in public concerts and festivals. On the surface, therefore, we could state that our work has already been validated; however, there are deeper issues involved that we discuss in this paper. In considering how a metacreative system might be validated, there are at least five potential viewpoints that can be considered:

1. The designer: the designer of the system accepts the output as artistically valid;
2. The audience: the work is presented publicly, and the audience accepts the work;
3. The academic experts: the system is described in a technical peer-reviewed paper and accepted for conference or journal publication;
4. The domain experts: the system receives critical attention through the media or non-academic artists via demonstration;
5. Controlled experiments: the system is validated through scientifically accepted empirical methods, using statistical analysis of the results in order to accept or reject the hypothesis made about the system.

In the first instance, any artwork created by a human and publicly presented conceivably requires the artist to consider it complete, successful, and representative of the artist's aesthetic vision. Similarly, metacreative works have, so far and to our knowledge, reflected the artistic sentiment of their designers. According to this viewpoint, the system evaluation is made directly by the designer. In our case, our metacreative systems have produced works that we find artistically interesting. The second step reflects an artist's desire to share their work with the public. Whether the audience accepts, appreciates, or enjoys the work is, unfortunately, often difficult to ascertain, as many audiences will politely applaud any work. One could include more quantitative measures, such as audience counts, album sales, or online downloads. The third case involves peer review, albeit of a description of the system in technical terms. Different criteria are in place here, dependent less upon the artistic output and more upon the technical contribution of the system in its novelty and usefulness. Often, the evaluation is also an evaluation of the originality and soundness of the process encoded in the system in regard to the computational creativity literature (Colton, 2008). Both metacreation software and its output can be discussed in the media. Journalists and critics are different from the regular audience, in that their opinion will be further diffused to the audience: this may influence the audience's judgment, and the work can gain or lose notoriety as a consequence. Lastly, empirical quantitative or qualitative validation studies can be undertaken that involve methods long supported by the research community for generating knowledge within the hard and soft sciences. While the computational creativity literature has started investigating these (Pearce and Wiggins, 2001; Pease et al., 2001; Ritchie, 2007; Jordanous, 2011), a great deal remains to be done. While most previous work regarding the evaluation of musical metacreation (and computationally creative software in general, for that matter) has focused on dimensions 1, 3 and 5, this paper presents an experimental study realized in the context of the public presentation of artworks in a concert setting (mixing dimensions 2 and 5).

1 http://www.metacreation.net/
Also, there are very few instances of evaluation studies that consider more than one metacreative system at a time; our study is a comparative study of five different systems for computer-generated or computer-assisted composition. The remainder of this paper discusses our evaluation study and the results we received, but also the questions that were raised. We first describe the different software systems involved, as well as the artistic intent of the compositions produced. We then present the methodology used in gathering audience responses to the compositions, as well as the results garnered from these responses. Finally, we posit our conclusions, as well as potential future work in this area.

Description

The public presentation of the metacreative software systems described in this paper took place as a public concert in December, 2011. The audience included members of the general public, as well as some students of the first and third authors. Ten compositions, by two composers, were performed by a professional string quartet, percussionist, and Disklavier (a mechanized piano equipped to interpret MIDI input). The music was produced by five different software systems, designed and coded individually by the two composers. For comparison purposes, two of the pieces were composed without software - in other words, composed entirely by a human - and a third was computer-assisted. The audience was informed beforehand that at least two of the works were human-composed, but was not informed as to which pieces these were; however, the program notes made it rather obvious that fundatio and experiri were, at most, computer-assisted. See Table 1 for a list of compositions.

The Systems and Compositions

In Equilibrio was generated by a real-time multi-agent system, described in (Eigenfeldt, 2009b). The system is concerned with agent interaction and negotiation towards an integrated melodic, harmonic, and rhythmic framework; its final output is MIDI events. The generated MIDI data was sent to a Yamaha Disklavier; no effort was made to disguise the fact that the performance was by a mechanical musical instrument. Along with the Disklavier and some high-level performance control by the composer, this system was responsible for both the "live" composition and its interpretation. One of the Above consists of three movements for solo percussion. The music is notated by a system described in (Eigenfeldt and Pasquier, 2012). This system uses multiple evolutionary algorithms, including genetic algorithms, to control how a population of musical motives is presented in time, and how it is combined with other populations of motives. Intended for solo percussionist, the composition is a concentrated investigation into the development of rhythmic motives. Each movement of the composition was presented separately, and treated as a unique composition within the evaluation. One additional movement, composed with the same intentions as the other three in this series, is human-composed (for reasons discussed in the Evaluation section). Dead Slow / Look Left is a notated composition for string quartet and percussion, by a system that employs the harmonic generation algorithm described in (Eigenfeldt and Pasquier, 2010). The composition consists of a continuous overlapping harmonic progression generated using a harmonic analysis of 87 compositions by Pat Metheny, and a third-order Markov model based upon this analysis. In this corpus-based system, durations, dynamics, playing style, range, and harmonic spread were determined using patterns generated by a genetic algorithm. These continuous harmonies were interrupted by contrapuntal sections that interpret tendency masks (Truax 1991), which define such parameters as sequence length, number of instruments, subdivisions, playing style, number of playing styles, dynamics, and the number of gestures in a section.
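A third-order Markov model of this kind conditions each new chord on the preceding three. The sketch below is a minimal, hypothetical illustration of the technique, with placeholder chord symbols and corpus; it is not the system's code, nor derived from the Metheny analysis.

```python
import random
from collections import defaultdict

def train(progressions, order=3):
    """Collect the observed continuations of every `order`-chord context."""
    model = defaultdict(list)
    for chords in progressions:
        for i in range(len(chords) - order):
            model[tuple(chords[i:i + order])].append(chords[i + order])
    return model

def generate(model, seed, length=16, order=3):
    """Extend a seed (at least `order` chords) by sampling continuations."""
    chords = list(seed)
    while len(chords) < length:
        continuations = model.get(tuple(chords[-order:]))
        if not continuations:  # dead end: no observed continuation
            break
        chords.append(random.choice(continuations))
    return chords

# Placeholder corpus of analysed progressions (hypothetical chord symbols).
corpus = [
    ["Dm7", "G7", "Cmaj7", "Am7", "Dm7", "G7", "Cmaj7", "Fmaj7"],
    ["Am7", "Dm7", "G7", "Cmaj7", "Fmaj7", "Dm7", "G7", "Cmaj7"],
]
model = train(corpus)
print(generate(model, seed=["Dm7", "G7", "Cmaj7"]))
```

The higher order trades variety for stylistic coherence: longer contexts reproduce more of the corpus's idiom but offer fewer continuations per context.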
Other, Previously was generated by a system described generally in (Eigenfeldt, 2009a), while the composition is described more fully in (Eigenfeldt, 2012b). A corpus of MIDI files - in this case, 16 measures of the traditional Javanese ensemble composition Ladrang Wilugeng - was analysed, and generative rules regarding rhythmic construction were derived from the corpus. These rules were used by a genetic operator to create a population of ever-evolving melodies and rhythms that the system reassembled in a multi-agent environment over a rotating harmonic field. The real-time output was transcribed in a music notation program, and performed by string quartet. The end result is a piece of notated music that reflects many of the tendencies of the original corpus material, without direct quotation. The composer's role was limited to dynamic markings, orchestration, and assembling sections. Gradual was generated by an extension of the system used to generate One of the Above, with an additional module to control pitch aspects integrated into the system. The final output was a notated work for marimba, violin, and Disklavier. While the system achieved the composition on its own, the interpretation was mixed: humans played the marimba and violin, while the system operated the Disklavier.

No. | Composition               | Instrumentation               | Expert      | Novice      | Combined
 1  | In Equilibrio [c]         | Disklavier                    | 3.17 (0.99) | 2.71 (1.23) | 2.90 (1.14)
 2  | One of the Above #1 [h]   | Solo percussion               | 4.00 (1.00) | 3.36 (1.19) | 3.67 (1.13)
 3  | Dead Slow / Look Left [c] | String quartet and percussion | 4.16 (0.90) | 3.08 (1.15) | 3.51 (1.16)
 4  | One of the Above #2 [c]   | Solo percussion               | 3.68 (0.67) | 3.16 (1.07) | 3.42 (0.93)
 5  | fundatio [h]              | String quartet                | 4.29 (0.80) | 4.24 (0.83) | 4.24 (0.81)
 6  | experiri [c-a]            | String quartet                | 4.47 (0.61) | 4.36 (0.86) | 4.40 (0.76)
 7  | One of the Above #3 [c]   | Solo percussion               | 3.39 (0.76) | 3.12 (1.20) | 3.22 (1.04)
 8  | Other, Previously [c]     | String quartet                | 4.31 (0.75) | 4.50 (0.59) | 4.40 (0.66)
 9  | One of the Above #4 [c]   | Solo percussion               | 3.63 (1.16) | 2.71 (1.00) | 3.10 (1.16)
10  | Gradual [c]               | Violin, marimba, Disklavier   | 4.05 (0.85) | 3.88 (0.95) | 3.93 (0.89)

Table 1. Individual composition engagement score means (out of 5), by audience experience level. Standard deviations appear in parentheses. [c] = computer-composed; [h] = human-composed; [c-a] = computer-assisted.

fundatio and experiri were created by composer and software designer James Maxwell, with the help of his generative composition software, which rests on a cognitive model of music learning and production. This software, ManuScore, is partially described in (Maxwell et al. 2009, 2011). ManuScore is a notation-based, interactive music composition environment. It is not a purely generative system, but rather a system which allows the composer to load a corpus and proceed with their compositional process while receiving recommendations from the system of possible continuations, as suggested by the model.
fundatio was written using the commercial music notation software Sibelius, following the compositional process used by the composer for many years, while experiri was written using ManuScore. Although this latter work remains clearly human-composed, the formal development of the music, and much of the melodic material used, were both directly influenced by the software. Performances of the compositions can be viewed here:

In Equilibrio: http://youtu.be/x5fIdHbqEhY
Other, Previously: http://youtu.be/gaQfyhOiRio
One of the Above #2: http://youtu.be/gAIjQOiMG54
One of the Above #3: http://youtu.be/bUYr7T7DKGs
One of the Above #4: http://youtu.be/cQNQKinbJ-s
Gradual: http://youtu.be/HZ2_Pr35KyU
experiri: http://youtu.be/Gr5E7UVUoE8
fundatio: http://youtu.be/rNXt8b-kLMQ

Evaluation Study

The public concert was meant to serve two purposes: firstly, to present the artworks of the metacreative systems to the public, and secondly, to explore the idea of conducting evaluation in concert settings. The opportunity for serious validation prompted the first composer to write an additional work, separate from the metacreative systems, with the same musical goal. The purpose was not to fool the audience by making them guess which piece was not composed by machine, but rather to add human-composed material to the comparative study. While we hope that audiences will, one day, accept machine-generated music without bias, Moffat and Kelly (2006) suggest this is not yet occurring. In our case, given three works for solo percussionist, composed in a particularly modernist style, it would be difficult to ascertain whether an audience's appreciation - or lack thereof - was due to the musical style, the restricted timbral palette, the lack of melodic and harmonic material, or any failings of the metacreative system. The human-composed piece exhibited the same above-mentioned aspects, yet was written by the system designer rather than generated by the system. If the audience's ratings of the human-composed piece were statistically similar to those of the metacreative works, it would suggest that the audience's preferences were based upon style, rather than musical creativity and/or quality.

Methods

Participants were 46 audience members from the general public (rather than only students) who attended a paid concert put on by Simon Fraser University. A program distributed to each audience member explicitly indicated that "machine-composed and machine-assisted musical compositions" would be performed. Each audience member also received an evaluation card on which they were encouraged to provide feedback. Audience members were asked to indicate, on a Likert scale from 1 to 5, their level of familiarity with contemporary music, followed by ten similar 5-point Likert scales regarding how "engaging" they found each piece to be. Audience members were also asked to indicate which three pieces they felt were the most directly human-composed. Audience members were also given space to write in their own comments. See Table 1.

Hypotheses

We hypothesized that the machine-generated and computer-assisted works were sufficiently similar in quality and style to the human-composed pieces that audience members would show no preference for the timbrally similar human-composed pieces (the null hypothesis). Any preference would be indicated by audience members' ratings of how "engaging" they found each piece.
Analysis

In order to avoid the alpha inflation that arises from multiple comparisons, statistical tests were made using post-hoc Bonferroni-corrected alpha levels of .005 (.05/10). For part of the analysis, the 46 audience members were divided into novice and expert groups depending on the score they indicated for the "familiarity with contemporary music" question. The "novice" group consisted of audience members who gave a score of 1, 2, or 3 out of 5 on the familiarity scale (N = 25). The "expert" group consisted of the remaining audience members, who gave a 4 or 5 (N = 19). Two audience members failed to provide a familiarity score, so their data were excluded from group comparisons.

The audience did not seem to discriminate among the percussion pieces. A comparison of the average engagement scores for the human-composed solo percussion piece One of the Above #1 (M = 3.59, SD = 1.15) with the average scores for the machine-composed One of the Above #2 through #4 (M = 3.28, SD = 1.02) was not significant, t(44) = 1.43; p = .16 ns, leaving us unable to suggest that participants were able to discriminate between the human- and machine-composed percussion pieces.

The audience did not "recognize" which piece was not computer-made. Assuming participants would find human-composed pieces more engaging, participants' engagement ratings of the individual pieces were interpreted as an indication of whether participants could implicitly distinguish human-composed from machine-composed pieces. Tests comparing expert listeners' engagement scores for the human-composed One of the Above #1 (M = 4.00, SD = 1.00) against the machine-composed alternatives (M = 3.57, SD = 0.88) were not significant, t(18) = 1.68; p = .11 ns. Similarly, novice listeners' scores for One of the Above #1 (M = 3.33, SD = 1.20) compared to the alternatives (M = 3.01, SD = 1.08) demonstrated no significant preference for the human-composed piece, t(23) = 0.96; p = .34 ns. Comparisons between the expert listener engagement ratings for the two string quartet pieces, the human-composed fundatio (M = 4.29, SD = 0.81) and the machine-assisted experiri (M = 4.47, SD = .61), were non-significant, t(18) = 1.00; p = .33 ns. Novice ratings for fundatio (M = 4.24, SD = 0.83) and experiri (M = 4.36, SD = 0.86) were similarly non-significant, t(24) = .72; p = .48 ns. This also failed to show that the audience discriminated between the computer-assisted composition made using ManuScore and the human-made composition by the same composer. Together, these results do not support the hypothesis that audience members were able to implicitly pick out which pieces were human-composed.

There was no difference between expert and novice choices. To determine whether audience members' ability to explicitly pick out the human-composed piece could depend on one's familiarity with contemporary music, a chi-square test compared novice and expert listeners' three "most directly human-composed" choices. The results of this test were non-significant, χ2(9, N = 113) = 14.17; p = .51 ns. This result fails to support the hypothesis that expert and novice listeners differ in their ability to explicitly discriminate human-composed pieces from machine-composed pieces.
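The comparisons above are t-tests judged against the Bonferroni-corrected alpha. A minimal sketch of how one such test could be run follows, with hypothetical score arrays rather than the study's raw data, and with a paired test assumed since each participant rated every piece:

```python
from scipy import stats

# Hypothetical per-participant engagement scores (1-5); not the study's data.
human_scores   = [4, 5, 3, 4, 4, 3, 5, 4, 2, 4]  # e.g. One of the Above #1
machine_scores = [3, 4, 3, 3, 4, 2, 4, 3, 3, 3]  # e.g. mean of #2 through #4

# Paired t-test across participants, judged against the
# Bonferroni-corrected alpha of .005 (.05 spread over 10 comparisons).
t, p = stats.ttest_rel(human_scores, machine_scores)
alpha = 0.05 / 10
print(f"t = {t:.2f}, p = {p:.3f}, significant: {p < alpha}")
```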
Discussion

In addition to the above results, several further remarks can be made. Overall, the evaluation results were encouraging, showing both a rather high level of engagement from the audience and a good range, with rating means varying from 2.7/5 to 4.5/5. The audience did not discern computer-composed from human-composed material, which seems to give credit to the five systems presented above. More precisely, this might just mean that the systems were successful in portraying the goals, aesthetics, and styles of the two composers who developed them. One further general observation that can be made is that while an evaluation in a concert setting allows us to capture the audience's reaction to musical output in its "natural" presentation environment, it also introduces many variables that take us out of the usual controlled experimental setting. The experimental protocol is also more difficult to follow. On the other hand, controlled experiments are not the traditional setting in which a musical artwork is presented, and this does introduce a number of biases in this type of evaluation. While these are well known, and solutions exist to circumvent them, our goal was to conduct an evaluation study in a live concert setting. We were concerned that conducting an evaluation in a concert setting would risk upsetting the audience's appreciation of the artwork. To our surprise, this did not seem to be the case, and the feedback forms were welcomed. The whole process triggered a longer-than-expected question and answer session at the end of the show. It is to be noted that very few audience members left before the end of the Q&A session.

Conclusions and Future Work

Finally, the whole process shed some light on the difficulty of evaluating computational creativity (and creativity in general). Artificial intelligence addresses the problem of emulating intelligence by having the computer achieve tasks that would require intelligence if achieved by humans. These tasks are usually formalized as well-formed problems. Rational problem solving is then evaluated by comparison to some optimal solution. If the optimal solution is theoretical and not attainable, optimization and approximation techniques can be used to get closer to the optimal, or at least improve the quality of the solution according to some metrics. Computational creativity is faced with the dilemma that, while creative behavior is intelligent behavior, such notions of optimality are not defined. It is often unclear which metrics should be used to track progress in the area. As demonstrated by this paper, this is at least an issue for the evaluation of composition systems. Musical success is subjective in nature. This is why we resort to a comparative study capturing relative levels of success, rather than absolute ones. In the absence of formal metrics, we used human subjects to evaluate musical metacreation. However, creativity is a process (Boden, 2003). When evaluating a musical composition system, one particularly challenging aspect is that the system is capable of generating numerous pieces, with possibly varying levels of success: designing methodologies to measure that variability is an inherent challenge of the area. This is especially true when one has to use human subjects, since getting average relative evaluations of the average system production makes the experimental design particularly challenging. To our knowledge, this paper is the first to report on an evaluation experiment of machine-generated material conducted in a real-world public situation. Besides the findings presented above, the research instrument discussed here is a contribution in itself.
As the systems presented are musical metacreations, validation and evaluation of such a system's output is itself a relatively novel and challenging research area. Our future work will continue to investigate and evaluate methodologies for doing so. Meanwhile, besides the findings presented above, the paper raises a number of concerns and questions that will likely need further consideration in future work.

Acknowledgements

This research was funded by a grant from the Canada Council for the Arts, and the Natural Sciences and Engineering Research Council of Canada.

2012_22 !2012 Critical issues in evaluating freely improvising interactive music systems
Adam Linson, Chris Dobbyn and Robin Laney
Faculty of Mathematics, Computing and Technology, Department of Computing, The Open University, Milton Keynes, UK
{a.linson, c.h.dobbyn, r.c.laney}@open.ac.uk

Abstract

As freely improvised music continues to be performed, it also continues to be implemented in interactive computer systems. For the scientific study of such systems to be possible, it is important to ensure the fitness for purpose of available evaluation methods. This paper will review several approaches to evaluating interactive computer music systems. It will also examine the uncritically accepted assumption that quantitative evaluation invariably yields significant data, irrespective of context. Ultimately, it will be argued that, for some interactive computer systems, such as those designed for freely improvised music, qualitative evaluation by experts is the most appropriate evaluation method.

Introduction

Freely improvising computer systems, modelled on an established musical practice that has been called "non-idiomatic" (Bailey 1980/1993), have been around since at least the 1990s (see Rowe 1993; Lewis 1999). There has been a significant amount of academic writing on the topic, including a chapter in Machine Musicianship (Rowe 2001), and an entire book, Hyperimprovisation (Dean 2003), dedicated to the topic of its subtitle, "computer-interactive sound improvisation". As freely improvised music continues to be performed, it also continues to be implemented in interactive computer systems (see, for example, Blackwell and Young 2004; Hsu 2005; Collins 2006). For the scientific study of such systems to be possible, it is important to ensure the fitness for purpose of available evaluation methods. A significant amount of research is conducted on dominant forms of instrumental and computer music, which has led to a number of evaluation methods and technologies. For example, music with well-defined style-based rules that constrain melodic, harmonic, and/or rhythmic constructs lends itself to generation and analysis techniques based on traditional musical notation. However, less widely studied forms of music, such as freely improvised music, have different evaluation criteria, and thus pose unique problems for widely adopted approaches to musicological and computational analysis. In particular, for music such as free improvisation, formalisable musical rules and symbolic notation fail to account for the fundamental aspects of the musical practice. Defining the practice of freely improvised music is not trivial. As MacDonald, et al. (2011) point out, "while there is no generally accepted single definition of improvisation, most accounts highlight the spontaneously generated nature of the musical material and the real-time negotiation of unfolding musical interactions".
This characterisation is explicitly extended to cover contemporary improvisation practices, including free improvisation. In Clarke's examination of creativity in performance (2005a), he refers to an empirical study on freely improvised music showing that the "interweaving of social and structural factors" plays a central role in such music. (For those unfamiliar with freely improvised music, the artists and recordings mentioned, for example, in Bailey 1980/1993 and Smith and Dean 1997 may provide a useful starting point.) In considering research into computer music systems that have been developed to perform freely improvised music, it is important to find an appropriate method of evaluation that is well-suited to the context. When computer music systems for free improvisation are assessed according to inappropriate criteria, it can have a potentially stifling effect on the development of new approaches to such systems, as well as potentially devaluing existing effective systems. This paper will review several approaches to evaluating interactive computer music systems. It will also examine the uncritically accepted assumption that quantitative evaluation invariably yields significant data, irrespective of context. Ultimately, it will be argued that, for some interactive computer systems, such as those designed for freely improvised music, qualitative evaluation by experts is the most appropriate evaluation method.

Evaluation methods in computer music research

Computer music researchers generally acknowledge the need to determine an evaluation method appropriate to their specific research. It is not always apparent, however, to what extent these methods apply to other research in the wider field of computer music. Stowell, et al. (2009) consider a number of quantitative and qualitative approaches to evaluating "live human-computer music-making", although they do not consider generative systems. Collins (2008), on the other hand, considers approaches to evaluating generative systems, finding promise in approaches that take into account the relationship of software to musical output. These authors find musical improvisation significant enough to merit acknowledgement, although they do not engage with the evaluation issues unique to "player-paradigm" interactive improvising systems, that is, systems with "a musical voice which may be related to, but is still in an audible way distinct from, the performance of a human partner" (Rowe 1996). Notably, Stowell, et al. (2009) favour studies of expert performers for the evaluation of interactive digital musical instruments that are under performer control, although they do not mention a logical extension of this view, namely, that the same approach can be extended to interactive systems that are not under performer control. Similarly, Pearce and Wiggins (2001) find experts to be capable evaluators of music with enumerable rules (such as period harmonisations), but they do not address the evaluation of music without enumerable rules, such as freely improvised music. Among these researchers, there is a clear recognition that expert human analysis has something to offer, although pragmatic concerns lead to the consideration of alternatives to using human experts, especially computational approaches.
But while computational approaches to evaluation can be expected to yield appropriate results in some research contexts, in others, computer-based evaluation techniques may be in principle incapable of discovering evidence that is relevant to the investigation. In support of this claim, Collins (2008) acknowledges that computational analysis, in failing to address emergent features of complex musical output, may have a destructive effect on the (musical) object of study.

A quantitative approach to evaluating improvised music

Pressing (1987), in a comprehensive study of quantitative analysis and improvisation, concludes that while "idiomatic" (Bailey 1980/1993) improvisation such as jazz lends itself well to both macro- and microstructural quantitative analysis, in freely improvised music "the musical meaning is not well described" by the same quantitative analytical approach. To clarify Pressing's terminology, "macroanalysis uses the full panoply of devices from traditional music theory" (primarily those generally found in musicological analysis of composed works), and microanalysis addresses parameters more likely to be found in perception studies of expressivity, such as "interonset and duration distributions", dynamic contours, and "legatoness". Pressing devised a specialised model to account for some of the general structural features of improvised music, which he validated in quantitative empirical studies. In further studies, he found that while his model functioned effectively for analysing improvised jazz music, it could not be effectively extended to freely improvised music without an arbitrary (and thereby subjective) partitioning of "polyphonically overlapping phrase structures". When comparing a jazz improvisation and a free improvisation - both subjectively regarded by Pressing as aesthetically successful - he found that the jazz improvisation contained extensive quantitative evidence of "micro-micro" and "micro-macro" correlations (which thus appears to validate his subjective assessment); in the free improvisation, both types of quantitative correlation were "nearly completely absent" (Pressing 1987). His findings suggest that even when quantitative analysis succeeds in apparently similar musical contexts, it is not trivial to extend such analysis to evaluating freely improvised music, whether human- or computer-generated.
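As a small illustration of the microstructural parameters Pressing describes, inter-onset intervals can be computed directly from note onset times. The sketch below uses hypothetical onset data, not material from any performance discussed:

```python
import numpy as np

# Hypothetical note onset times in seconds (not from any performance discussed).
onsets = np.array([0.00, 0.48, 0.97, 1.52, 1.99, 2.51, 3.31, 3.55, 4.02])

# Inter-onset intervals (IOIs): the gaps between successive note onsets.
iois = np.diff(onsets)

# Summary statistics of the IOI distribution, one of the microstructural
# parameters Pressing examines; low variance suggests a steady pulse.
print(f"mean IOI = {iois.mean():.3f} s, std = {iois.std():.3f} s")
```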
Music and qualitative analysis

In some performances, musical features that are apparently insignificant become significant in the course of an analysis. This poses a difficulty for approaches to analysing data that screen for features whose relevance has been determined in advance. A computer-based quantitative analysis cannot overcome this problem, despite other strengths in detecting specific correlations, statistical significance, or other quantitative constructs such as self-similarity. Thus, computational quantitative analysis is limited in what it can discover; it is often confined to providing answers about whether or not a given data set complies with a given rule set. While computers are a powerful tool to rapidly sift through vast amounts of data, computational analyses are notoriously bad at picking out long-term dependencies or large-scale structures from a body of time-series data, in contrast to human experts. Consider, for example, an analysis by composer, musician, and historian Gunther Schuller (1958) of three Sonny Rollins saxophone solos, all from within a single performance of Rollins' piece "Blue 7" on the album "Saxophone Colossus" (Prestige LP 7079). Part of Schuller's argument for the merits of Rollins' solos includes the assessment that they are not merely following the fixed harmonic chord progression, nor are they merely variations on a melodic theme, nor do they merely fulfil both of these (quantitatively measurable) criteria, which are typically used to determine that a jazz solo is (formally) allowable. Rather, Schuller points out the musical significance of a number of creative decisions made during the solos. His argument, based on his background expertise in the field, is fundamentally qualitative, and is nonetheless extensively backed up with (in some cases, quantitatively measurable) material evidence (i.e., musicological specificity about particular pitches, phrases, rhythms, etc.). The structural features he identifies in his analysis stand in sharp contrast to those that could be discovered by rule-based quantitative approaches: he identifies semantic information in the particular configuration of musical elements, thus extending his analysis beyond the quantitative measurement of compliance with musical rules. (Another example of this distinction can be found in Clarke's analysis of Jimi Hendrix's "Star Spangled Banner"; Clarke 2005b, Chapter 2.) Other quantitative approaches, such as conducting a survey of listeners' opinions, require large enough sample sizes to find statistical significance, and thus may be applicable when studying the capacities of a given listener population. By determining what is relevant to an analysis, a given analytic approach not only investigates but also characterises the object of study. Schuller's (1958) expert analysis contains a wide range of assessments that illustrate the strengths of qualitative analysis over quantitative analysis, computational or otherwise. Take, for example, his assertion that a musical phrase introduced by Rollins "at first, seems gratuitous", whereas later in the piece, it "becomes apparent that [the phrase] was not at all gratuitous or a mere chance result, but part of an overall plan". Or, for example, the notion that the final restatement of an initial theme "is drained of all excess notes" and that the "rests [in the original statement of the theme] are filled out by long held notes," serving both to end the piece and "sum up all that came before". His analysis even briefly isolates the Max Roach drum solo, pointing out that two musical ideas, a triplet figure and a snare roll, are built up through permutations and alternations into a complex solo; then, eleven bars after the drum solo has ended, the drummer interestingly and meaningfully re-uses these two elements "in an accompanimental capacity". These examples can be viewed as arguments for the significance of specific musical decisions - what was chosen, or, in some cases, not chosen - among an allowable range of options. For instance, several possible notes may fit a given chord, but there may be a significance to the particular note that is played, such as a long-term dependency that is outside the scope of a computational analysis. Furthermore, a particular note may be chosen over another because of a social connotation, in principle irreducible to a quantitative framework.
In general, assessments of musical significance are relative to listener knowledge and expectation, as well as being strongly affected by listening context (for an extended discussion of this point, see Clarke 2005b). Furthermore, differences in listeners' accounts may extend beyond traditional musicology, and new concepts may be introduced that were not built into the initial evaluation framework. This is not possible when a computer has been limited in advance to a particular analytic framework. Also, in contrast to a quantitative approach, differing assessments of the same material need not contradict each other. In Clarke's Hendrix example, three listeners assess the significance of a particular arpeggiation: Clarke hears a destructive melodic rupture verging on dissolution, another hears the bugle of a military funeral, and yet another hears a pattern of fingerboard traversal. For Clarke's example, an imagined computational analysis would run the risk of shifting the framework of significance to only what can be discovered computationally, potentially excluding a priori the three listener assessments. It is difficult to imagine what a computational or other quantitative approach could contribute in this case, beyond support (confirm that it is an arpeggiation; identify the statistical likelihood for the presence and location of the arpeggiation within the melody; confirm its similarity to a given bugle call; investigate melodic possibilities constrained by fingerboard layout). And even if in principle computational analysis could discover any item of significance, the necessity of making prior decisions as to what counts as significant is a profound limitation. Clarke's engagement with musical meaning finds support in the empirical listener perception studies conducted by Deliege, et al. (1997). These studies identify two primary types of perceived musical cues: those that can be confirmed by consulting the musical notation - "'objective' cues (themes, registral usages, etc.)" - and, in contrast, "'subjective' cues, which have psycho-dynamic functions (impressions, for example, of development, or of commencement) which may be experienced differently from one listener to another and are not necessarily identifiable in the score" (Deliege et al. 1997). This account of cues highlights specific, narrowly-defined observations (such as development and commencement), as opposed to the broader semantic framework of Clarke. But both accounts point to the fact that different listeners experience the same musical material in different ways, underscoring the fact that human listeners may be sensitive to information that could otherwise be obscured by more constrained assessments of the same material. It is not currently possible to computationally model the entirety of human listening possibilities. Thus, when a particular research question is framed to empirically validate a computational model of human listening, the boundaries of listening are constrained, for example, to investigate melodic or harmonic expectations. But for research questions that seek, for example, to uncover the inherent polysemy of a given guitar solo, the diversity of embodied cultural expertise captured by multiple qualitative accounts is no less scientific, and likely more relevant to the question at hand, than a quantitative study.

The role of experts

Expertise is not necessarily confined to an unworkably small set of specialists.
With respect to Clarke's example, the ability to recognise a particular bugle call or guitar fingering can be considered forms of expertise that are shared by many. In practice, these recognitions eluded, and thus enhanced, his own musicologically astute account of melodic dissolution. Returning to the topic of improvisation, Smith and Dean, in their extensive investigation of improvisation in the arts, suggest that with an improvised work, "the possibility of finite interpretation is not to be expected, or even desirable," and "the ideas of improvisors themselves are very interesting sources for the analysis and understanding of improvisation" (Smith and Dean 1997). The substance of their study is found in the differing perspectives of practising improvisors who are regarded as experts. As Clarke (2005a) states, "the boundaries between the mundane, the creative, and the unacceptably idiosyncratic are constantly shifting, and [...] their position and evaluative significance is a function of judgements made within a shifting cultural and historical context". If we define experts as those with significant experience operating within the given cultural and historical context of a musical practice, it follows that such individuals are better equipped to make effective evaluations about the practice being studied. Especially in light of the aforementioned centrality of the "interweaving of social and structural factors" in freely improvised music, an experienced improvisor is well-suited to serve as an expert qualitative evaluator, capable of attunement to both subtle and complex emergent criteria. Although some aspects of freely improvised music are amenable to various quantitative criteria (such as those that borrow from compositional analysis, especially melodic and harmonic information), the unique aspects of the music being studied do not necessarily reside in such criteria (see Lehmann and Kopiez 2010). To identify shared features across classical compositions by a single composer, a quantitative analysis would likely suffice, because the melodic and harmonic information comprise much of what constitutes the compositions. On the other hand, with freely improvised music, Smith and Dean (1997) find that "a multiplicity of semiotic frames can be continually merging and disrupting during a ‘free' [...] improvisation," which they find to be an essential characteristic of such music. This represents at least one finding that is more effectively discovered by qualitative human expertise. Furthermore, in their elaborate taxonomy of improvisation, Smith and Dean refer to what they term "stipulated" improvisation, which describes a type of improvisation that derives structure and characteristic style from stipulated aesthetic parameters that are internalised by a community of performers. According to their account, the "stipulated" approach does not fully exploit improvisation because it does not permit the "breaking, remoulding and rebreaking of such ‘parameters'", as does freely improvised music, which fundamentally allows for the possibility of "reformulating the parameters on each occasion" (Smith and Dean 1997). Thus, for some complex objects of study, expert qualitative analysis should be recognised as fulfilling an essential role that, at times, can be empirically supported by quantitative means, but never entirely replaced by those means.
Research context and conclusion

Quantitative approaches certainly have independently useful scientific functions (such as examining physical mechanics or features of perception). Yet expert qualitative analysis has the potential to offer a set of results that may, in fact, be more relevant to the particular research being conducted. Unfortunately, qualitative study is often assumed to diminish scientific rigour, despite the well-known criticisms of quantitative studies concerning test bias, determination of statistical significance, and assumptions implicit in classifications and standardised procedures (Hammersley 2009). Generally speaking, for empirical study, the research question ought to be the determinant of experiment design and evaluation. Among the varieties of computer music research, there are some computer music systems that are not interactive, such as systems designed to output rule-based compositions. In many of these cases, quantitative computational analysis may be the most practical approach to evaluating whether or not a given computer system is successful in achieving its aims, such as rule compliance. When listener surveys are used to evaluate system success, it may be appropriate to use discrimination tests of fixed musical material (for more on discrimination tests, see Ariza 2009). For computer systems that generate widely divergent musical material, studies that focus on the underlying software may offer results more relevant to some research questions (Collins 2008). Alternatively, for studies of interactive computer music systems, the human-computer interaction, rather than the music, may be at the centre of the research. For these studies, the relation between performer intention and system responsiveness is one area of investigation that benefits from both quantitative and qualitative study, such as looking into actual and perceived timing issues (Stowell, et al. 2009). However, when considering interactive computer systems that are not under direct performer control, there is no well-established evaluation method that is widely recognised in the literature. For some studies of such player-paradigm systems, the focus may be on idiomatic music, for which the evaluation approaches mentioned for generative composition systems are found to be applicable (Pachet 2002). But for studies of interaction experience with player-paradigm systems, it is essential to use expert qualitative analysis to avoid the danger of "measurement that fails to ensure that the assumptions built into measurement procedures correspond to the structure of the phenomena being investigated" (Hammersley 2009). It is a common aim of many studies of computer systems to iteratively improve a system based on assessments of its strengths and weaknesses. In the case of player-paradigm systems, expert qualitative evaluation can be used to identify even broadly defined - or potentially undefinable - weaknesses, such as whether or not (and why) a human musical interaction with a system is, for example, "boring". Qualitative expert analysis in this context, though not widely acknowledged, is not entirely disregarded. For example, in Collins' brief account of a "free improvisation simulation" (2006), expert interview data is the primary source of evaluation. It has been argued here that using qualitative data from experts is one way to approach the problem of evaluating a freely improvising computer music system.
This approach is especially relevant for determining whether or not a player-paradigm system itself performs at the level of a human expert. Accounts of interaction experiences, such as interview data, can be collected, correlated, and analysed, with the aim of applying the data to improve the system. In practice, as part of a longer research program, qualitative data can function in the same manner as quantitative data: after identifying a system's strengths and weaknesses, a second iteration of the system can be built, and a follow-up study can determine what aims have been achieved. In this way, despite the predominance of quantitative evaluation in computer music, qualitative expert analysis can be a viable means of investigating phenomena, and qualitative studies can ultimately serve in making novel contributions to the research field.

2012_23 !2012 Towards a Mixed Evaluation Approach for Computational Narrative Systems
Jichen Zhu, Drexel University, Philadelphia, PA 19104 USA, jichen.zhu@drexel.edu

Abstract

Evaluation is one of the major open problems in computational creativity research. Existing evaluation methods, either focusing on system performance or on user interaction, do not fully capture the important aspects of these systems as cultural artifacts. In this position paper, we examine existing evaluation methods in the area of computational narrative, and identify several important properties of stories and reading that have so far been overlooked in empirical studies. Our preliminary work recognizes empirical literary studies as a valuable resource for developing a more balanced evaluation approach for computational narrative systems.

Introduction

Evaluation is one of the major open problems in computational creativity research. A set of well-designed evaluation methods not only is instrumental in informing the development of better creative computational systems, but also helps to articulate overarching research directions for the field overall. However, research in creative systems has encountered tremendous difficulties in defining suitable evaluation methods and metrics, both at the level of individual systems and across systems. A recent survey of 75 creative systems shows that only slightly above half of the related publications give details on evaluation; among those, there is a lack of consensus on both the aim of evaluation and the suitable evaluation criteria (Jordanous 2011). Traditionally, methods for evaluating intelligent computational systems have been mainly developed in two areas: artificial intelligence (AI) and human-computer interaction (HCI). Following the scientific/engineering tradition, evaluation in AI typically relies on quantitative methods to measure the system's performance against a certain benchmark (e.g., system performance, algorithmic complexity, and the expressivity of knowledge representation). A salient example is the measure of "classification accuracy" in machine learning, where new algorithms are evaluated by being compared to standard ones over the same sets of data. Whereas the AI community is primarily concerned with the operation of the system itself, HCI concentrates on the interaction between the user and the system. Borrowing from psychology, human factors, and other related fields, HCI has developed a set of quantitative and qualitative user study methods to understand the usability of a system along such principles as learnability, flexibility, and robustness (Dix et al. 2003).
Although these existing approaches offer useful insights into creative systems as functional and useful products, they do not fully capture a crucial property of creative systems: they are, and they produce, cultural artifacts such as stories, music, and paintings. In these areas, there has not been an established tradition of formal evaluation. When we combine artistic expression and system building, evaluation becomes an issue. As Gervás observes in the context of computational narrative, "[b]ecause the issue of what should be valued in a story is unclear, research implementations tend to sidestep it, generally omitting systematic evaluation in favor of the presentation of hand-picked star examples of system output as means of system validation" (2009). We argue that the difficulty of establishing an evaluation methodology in computational creativity research reflects the cultural clash between the scientific/engineering and the humanities/arts practices. Aligned with Snow's notion of the two cultures (1964), researchers working at the intersection of the two communities have observed the conflict of different and sometimes opposing value systems and axiomatic assumptions (Mateas 2001; Sengers 1998; Manovich 2001; Harrell 2006; Zhu and Harrell 2011). One of the differences is what Simon Penny (2007) calls the "ontological status of the artifact" between electronic media arts practice and computer science research. For an artwork, the effectiveness of the immediate sensorial effect of the artifact is the primary criterion for success. As a result, most if not all effort is focused on the persuasiveness of the experience, which is built on specificity and complexity. In computer science, the situation is reversed. The artifact functions as a "proof of concept" and hence its presentation can be overlooked; the real work is inherently abstract and theoretical. These differences, Penny argues, illustrate that the insistence upon "alphanumeric abstraction," logical rationality, and the desire for generalizability in science is fundamentally at odds with the affective power of artwork. In the context of evaluation, this conflict takes the form of the clash between the productivity- and value-based methodologies adopted by both the AI and HCI communities, and the general resistance to empirical studies in the arts. In this position paper, we present our initial work on developing a more balanced evaluation approach that takes into account both the system and cultural aspects of creative systems, focusing on computational narrative systems and their output. Our work is not intended to replace the function of literary criticism and close reading with empirical studies and statistical analysis. Simplistic attempts to reproduce art as a scientific experiment without an in-depth understanding of the former's tradition and value systems are short-sighted (as discussed in Ian Horswill's panel presentation at the Fourth Workshop on Intelligent Narrative Technologies, Palo Alto, 2011) and counter-productive to the long-term goal of computational creativity research. At the same time, we also believe that evaluation is a critical process to inform the development of creative systems and to deepen the understanding of computational creativity. Therefore, more research and discussion about evaluation is needed.
In the rest of the paper, we examine existing evaluation methods in the area of computational narrative, and identify several important properties of stories and reading that have so far been overlooked in existing evaluation methods. Our preliminary work suggests that empirical literary studies can be a valuable resource for developing a more balanced evaluation approach for computational narrative systems.

Existing Work on Narrative Evaluation
Broadly speaking, discussions of evaluating creative systems have taken place at two levels. At the level of computational creativity in general, researchers have attempted to come up with domain-independent evaluation criteria to measure a system's level of creativity, both in terms of its process and its output. For example, Colton (2008) and Jordanous (2011) proposed standardized frameworks to empirically evaluate system creativity. The importance of these approaches is that, in addition to evaluating specific systems, they also allow potential cross-domain comparison between systems. At the level of specific creative domains, evaluations are conducted to validate a specific creative system and its output in that domain. For instance, the recent work by Vermeulen et al. in the IRIS project (2011) proposed a list of standardized, systematic assessment criteria for interactive storytelling systems using concepts that "play a key role in users' responses to interactive storytelling systems." This section provides an overview of existing evaluation methods in the area of computational narrative. Our main focus is on the evaluation of story generation systems and their output, but some of our observations can also be applied to (non-generative) interactive digital storytelling systems. Recent examples of evaluating the latter type can be found in (Thue et al. 2011; Schoenau-Fog 2011). Although we do not specifically deal with high-level constructs such as 'novelty' and 'value,' we believe that more comprehensive evaluation criteria at the domain-specific level can indirectly contribute to the recognition and formulation of these high-level creativity constructs at the first level. Based on our survey of major text-based story generation systems, existing evaluation methods can be categorized into three broad approaches.

System Output Samples
As Gervás pointed out above, providing sample generated stories is one of the most common approaches for validating the system as well as the stories it generates. This approach started with the first story generation system, Tale-Spin (Meehan 1981), where sample stories (translated from the logical propositions generated by the system into natural language by the system author) are provided to demonstrate the system's capabilities. In addition to successful examples, Meehan also picked different types of "failure" stories to illustrate the algorithmic limitations of the system for future improvement. Similarly, many later computational narrative systems such as BRUTUS (Bringsjord and Ferrucci 2000) and ASPERA (Gervás 2000) use selected system output for validation. Besides the lack of established specific evaluation metrics, the reason for the wide appeal of this approach is that it aligns with the tradition in literary and art practice where the final artifact should stand on its own without formal evaluation. However, simply showing "successful" and/or "interesting" output without explicitly stating the system author's selection criteria can be potentially problematic.
Some recent work in this approach has attempted to make this selection process more transparent. For example, in the evaluation of the GRIOT system, Harrell (2006) evaluates the generated poems based on the quality and novelty of the metaphors they invoke. When the system generates "my world was so small and heavy," the author evaluates it by the metaphor it evokes — "Life is a Burden." Similarly, the Riu system (Ontañón and Zhu 2011) automatically assesses the generated stories by measuring the semantic distances of the analogies in the stories based on the WordNet knowledge base.

Evaluating the System's Process
The second approach is to evaluate the system primarily based on its underlying algorithmic process. Among the three evaluation approaches, this one is most aligned with traditional AI evaluation methods. Cognitive systems often use this approach to show that the system's underlying processes are cognitively sound. For instance, the evaluation of the Universe system (Lebowitz 1985) included fragments of the system's reasoning trace, along with the corresponding story output. It is intended to illustrate the system's capability to expand its plot-fragment library by generalizing from given example stories. Although the sample output and the process are relatively simple compared to those of the previous approach, Lebowitz intends to show, especially through the system processes, that the learning process is a necessary condition for creativity. In a more complex example, the Minstrel system (Turner 1993), presented as a model of the creative process and storytelling, is evaluated in two ways. First, Turner evaluates the system by comparing it to related work in psychology, creativity, and storytelling. Minstrel's process is contrasted with existing AI models of creativity both in the similar domain of narrative (e.g., Tale-Spin and Universe) and in different ones (e.g., AM (Lenat 1976)). Second, Minstrel is empirically studied in terms of its plausibility and its quality as a test bed for evaluating different hypotheses of creativity. Specifically, plausibility is evaluated based on 1) the quantity of possible output stories, by testing the system in different domains, and 2) the quality of output stories through a series of user studies (details in the next section). In the evaluation of the "test bed" criterion, Turner studies why some TRAMS (i.e., problem-solving strategies) were added, removed, etc., to demonstrate that one can experiment with different models of creativity. For instance, to test its model of "boredom," defined by how many repeated elements appear in the stories, Minstrel was asked to generate stories about the same topic four times. The differences and similarities between these stories are analyzed to evaluate how boring the stories are.

User Studies
Evaluating the system's process alone, however, does not provide insights into the quality of the output. For systems that are more geared towards seeing narrative as a goal in its own right, user studies provide a way to assess the output story without relying solely on the author's own intuition. As a result, user studies have been increasingly adopted both as a standalone evaluation method and as a complement to other approaches. For example, the MEXICA system (Pérez y Pérez and Sharples 2001) is evaluated through an Internet survey.
The users rated seven stories by answering a set of 5-point Likert scale questions over five factors (i.e., coherence, narrative structure, content, suspense, and overall experience). Among these seven stories, four were generated by MEXICA using different system configurations (with or without certain modules). Two stories were generated by other computational narrative systems (i.e., GESTER and MINSTREL). The last story was written by a human author using "computer-story language." The scores each story received are used to determine MEXICA's level of "computerised creativity" (c-creativity) in reference to human writers and other similar systems. In a more complex example, in addition to the methods mentioned above, the stories generated by Minstrel are evaluated through a series of independent user studies. In the first user study, users were given the generated stories without being told that they were generated by a computer. They were then asked to answer questions regarding their impressions of the author and the stories. In the second study, a different group of users repeated the above test, except that the generated stories were rewritten by a human writer for better presentation, with improved grammar and more polished prose. In the third study, the users were presented with an unrelated story written by a 12-year-old and asked to answer the same set of questions. User studies of narrative systems do not always adopt some form of the Turing Test. In the Fabulist system (Riedl 2004), the system author conducted two quantitative evaluations without using human writers as a benchmark. The first study evaluates plot coherence, measured on the assumption that unimportant sentences decrease plot coherence: a group of users independently rates the importance of each sentence in the generated story and hence the coherence of the plot. Second, character believability is evaluated by asking users to rate the difference in characters' motivation in stories generated by two configurations of the system.

What is Missing
Computational narrative is still at an early stage, both in terms of the depth and the breadth of the narrative content. This is especially true when we compare these generated stories with what we typically conceive of as literary text produced by human authors. In this regard, the different methods described in the previous section are arguably adequate for the current state of these systems. As argued above, however, evaluation methods play an important role not only in assessing existing systems, but also in informing what kind of future systems should be built. In this regard, waiting for narrative systems to mature before starting to develop suitable evaluation criteria is detrimental to the research community. As computational narrative research moves forward, a set of more comprehensive evaluation methods can help to reduce the gap between computer-generated stories and traditional literature. Our position is that many important lessons from literary criticism and communication theory are by and large overlooked in computational narrative. We argue that they can be instrumental in developing evaluation methods that focus not only on the algorithmic and usability aspects of narrative systems, but also on the expressiveness of the generated stories as cultural artifacts. Below is our preliminary work in identifying some crucial elements that are missing from many existing evaluation methods.
It is not intended to be a comprehensive list, but rather an initial step towards incorporating fundamental knowledge and concerns from related fields in the arts and the humanities.

Different Modes of Reading
Reading is a complex activity. Depending on the setting, the purpose of the reading, and the background of the reader, different aspects of the text are highlighted. Vipond and Hunter (1984) distinguished among point-driven, story-driven, and information-driven orientations for reading. As shown by recent studies in Reader Response theory (Miall and Kuiken 1994), ordinary readers typically adopt the story-driven approach, that is, they read for plot. They contemplate what characters are doing, experience the stylistic qualities of the writing, and reflect on the feelings that the story has evoked. This mode is adopted when we read for pleasure. By contrast, the point-driven orientation is the foundation for literary criticism. Experts perform informed close reading — a complex act of interpretation at the linguistic, semantic, structural, and cultural levels — in order to understand the "point" of plot, setting, dialogue, etc. Point-driven reading assumes that the text is a purposeful act of communication between the author and the reader, and the "points" in the story have to be constructed through the reader's careful examination of the text. Finally, in the information-driven orientation, a reader is more concerned with extracting specific knowledge from the text. We adopt this orientation when, for example, following a recipe or checking facts in an encyclopedia. Information-driven reading places a strong emphasis on the coherence and informativeness of the text. This orientation is less common in computational narrative. Different reading orientations place different emphases on evaluation methods. As story-driven reading is primarily concerned with creating a "lived-through experience" for the reader, compatible evaluation needs to focus on the immersiveness of the story world. In computational narrative, most existing evaluation criteria presume the story-driven reading orientation and center on the interestingness, presence, and engagement of the stories (e.g., plot coherence and character believability). Additionally, this orientation requires the participants of the evaluation to be close to an "average reader." A point-driven evaluation requires participants, usually experts, to perform a more in-depth reading of the text beyond the surface plot. The effectiveness of different literary techniques, such as thematic structures, linguistic patterns, and points of view in the story, can then be evaluated in ways similar to traditional literary criticism. To the best of our knowledge, there have been no attempts at point-driven evaluation in the context of computational narrative. There are many complex reasons for this. Some may argue that computational narrative, at its current stage, is too simple for this level of close reading. However, work in electronic literature (e-lit) has demonstrated that less algorithmically complex systems can still produce rich meanings. Establishing these evaluation criteria would help to develop a wider range of computational narrative.

Authorial Intention
Contrary to the tradition of literary criticism, the evaluation of computational narrative systems has by and large ignored the intention of the authors.
If we subscribe to the assumption that storytelling is a form of communication between the author and the reader, authorial intention should play a role in evaluating how effective these stories are. For instance, a user's report of unpleasantness may be positive or even desirable, if the system author intends to use her stories to challenge the reader's belief system, in ways similar to Duchamp's urinal. A more balanced evaluation needs to differentiate this scenario from unpleasantness caused either by a poorly written story or by an unintuitive user interface. Similarly, intentional ambiguity in the story can be a powerful device, leaving something undetermined in order to open up multiple possible meanings. In the history of literature, intentionally ambiguous works such as Henry James's 1898 novel The Turn of the Screw have triggered many distinctive interpretations and vigorous debates.

Mixed Methods
A large percentage of the evaluations we surveyed gravitate towards quantitative methods, with qualitative methods as a supplement, if present at all. Through surveys and experiments, numerical data is collected and then analyzed statistically to provide an average user response. Although these methods have the clear advantage of being relatively easy to collect and analyze, they filter out the specificity and contextualization that is crucial to cultural artifacts. Several research projects have attempted to address this issue. Mehta et al. (2007) devised an empirical study for the Façade system, which was intended by its authors to evoke a rich exchange of meanings. Mehta et al. acknowledge that the standard quantitative criteria in the conversational system research community (e.g., task success rate, turn correction ratio, concept accuracy and elapsed time) are not adequate because they assume a task-based philosophy, where conversational interaction is framed as a simple exchange of clear, well-defined meanings. As a result, they made a deliberate choice to use more in-depth but less statistically significant ethnographic methods to study a small group of users' perceptions and interpretations of their conversations with non-player characters. Using video recording and retrospective interviews, their study found that participants created elaborate back-stories to make sense of character reactions in order to fill in the gaps of AI failures, an insight difficult to capture with purely quantitative methods. The limitation of quantitative methods is echoed in Höök, Sengers and Andersson's user study of their digital art project (Höök, Sengers, and Andersson 2003). They observed, "[g]rossly speaking, the major conflict between artistic and HCI perspectives on user interaction is that art is inherently subjective, while HCI evaluation, with a science and engineering inheritance, has traditionally strived to be objective. While HCI evaluation is often approached as an impersonal and rigorous test of the effects of a device, artists tend to think of their system as a medium through which they can express their ideas to the user and provoke them to think and behave in new ways." In response, their interpretive methods (open-ended interviews) focus on giving the artists a grounded feeling for how the interactive system was interpreted and whether their message was communicated. Despite the sentiment against user studies in the interactive arts community, some artists involved in the project acknowledged that laboratory evaluations can help artists to uncover problems in interaction design.
Because of these limitations, we believe that a mixed-methods approach may be more suitable for evaluating computational narrative outputs. In addition to closed-ended questions and surveys, qualitative methods such as phenomenology, grounded theory, ethnography, and case studies can better capture the plurality of meanings interpreted by different readers and the complexity of such readings. In literary studies, a group of researchers has started developing methods to empirically study readers' responses to literature. Given the field's predisposition towards point-driven interpretation, these methods offer a good example of balancing expert interpretation with ordinary readers' responses to and experience of the stories under evaluation. For example, Miall (2006) identified four kinds of empirical literary studies. First, studies that manipulate a literary text to isolate a particular effect. Second, studies that use an intact text in which the researchers hypothesize that intrinsic features of the text influence the reader; instead of manipulating a text, each text itself provides a naturally varying level of foregrounding from high to low. A third kind of study involves comparison of two or more texts. Fourth, readers are asked to think aloud about a text during or after reading it. All of these can be further explored and potentially incorporated into the evaluation of computational narrative systems.

Conclusion
In this position paper, we discussed the challenge of designing evaluation methods for creative systems due to their dual status. Focusing on the area of computational narrative, we surveyed existing evaluation approaches in story generation systems and identified crucial aspects of computational narrative, as a potential form of cultural artifact, that have so far been downplayed. Penny warned us of the danger of the "unquestioned axiomatic acceptance of the concept of generality as being a virtue in computational practice especially when that axiomatic assumption is unquestioningly applied in realms where it may not be relevant" (Penny 2007). We suggest that work in empirical literary study research can offer valuable insights for developing more interdisciplinary and more balanced evaluation methods.

2012_24 !2012 A Creative Improvisational Companion Based on Idiomatic Harmonic Bricks¹ Robert M. Keller, August Toman-Yih, and Alexandra Schofield, Harvey Mudd College, Claremont, CA, USA (keller@cs.hmc.edu, August_Toman-Yih@hmc.edu, aschofield@hmc.edu); Zachary Merritt, University of Central Florida, Orlando, FL, USA (zbmerritt@gmail.com)

Abstract
We describe an improvisational companion based on the concept of harmonic bricks, as articulated by Cork and others. Our companion is software that can play background for, and trade melodies with, a human soloist. While exhibiting creativity itself, its greater purpose is to improve the creativity of its user. Bricks, originally intended for memorization of chord progressions, are used here as a structuring device for improvised melodies within a tune, as a basis for interaction, and as a means of learning new grammars for the purpose of generating melodies by the companion. A user interface for a partially implemented system is presented.

Introduction
Jazz musicians often make use of play-along audio tracks to practice their improvisations. Such tracks feature a recorded rhythm section (e.g. drums, bass, and piano, organ, or guitar), with the solo part omitted.
A well-known example is the Aebersold (1967) series, comprising over 130 volumes and still growing. One performance aspect helpful in practice is that of trading, wherein different soloists alternate playing over consecutive four- or eight-measure segments of a tune. Our interest here is a computational companion in which the roles of the rhythm section and all but one improviser are played by the computer program. Such a companion does not require the assembly of other musicians and is essentially tireless. It can also be used as a basis for improving its user's understanding of the underlying theory and tune structure. Because playing and listening alternate in small, manageable chunks, trading can be a valuable learning device for the jazz musician. This paper proposes that trading can be structured using the concept of a brick, a shorthand term for an idiomatic harmonic phrase. While there are thousands of tunes that comprise the jazz literature, fewer than one hundred bricks suffice to describe most of them. Thus significant intellectual economy is achieved by working on melodic lines for bricks rather than for entire tunes. We describe a preliminary implementation, with an outline of how bricks can also be used for machine learning, thereby improving the quality of the improvisational companion over time.

¹ The authors thank the NSF (CNS REU #0753306) and Harvey Mudd College for their generous support.

Background and Related Work
Jazz solo improvisation most commonly consists of a soloist spontaneously creating and playing a melody over a chordal and rhythmic background provided by the rhythm section. The harmony typically consists of a chord progression from a standard tune. In the vernacular of the jazz musician, the progression is referred to as "the changes", i.e. the transitions from one chord to another. Negotiating the changes while playing is one of the aspects of jazz that provides both pleasure and challenge for the soloist, and ideally pleasure for the listener as well. Certain idiomatic chord sub-sequences, such as cadences, have long been used by improvisers, but Cork (1988, 2008) was one of the main proponents of providing informal labels for a significant variety of these sequences, which he called "LEGO bricks", after the well-known toy building-block system. We will simply call them bricks here. Later, Clark (2007) and Elliott (2009) analyzed a much larger set of standard songs and extended the set of bricks suggested by Cork. Elliott's work and analysis representation, which he called "road maps", further extended and refined the set of analyses. Our focus is on the computational aspects of using bricks in an improvisational companion. The work of the aforementioned authors emphasized the chordal aspects of bricks. The present paper redirects the focus toward melody, with the intent that bricks are also a useful organizational concept in learning to improvise. Aebersold (1967) took a similar approach, mentioning a few common progressions. Berg (1992) provided theoretical underpinnings, but did not use the brick terminology. He focused primarily on cadences, to which he applied the term "turnarounds", targeting chords diatonic to major or minor keys. A cognitive discussion of creativity in jazz improvisation was provided by Johnson-Laird (2002), who observed: "The cognitive problem for jazz musicians is to create a novel melody that fits the harmonic sequence and the metrical and rhythmic structure of the theme."
He also cited Cork (1988) as recognizing the possibility of modulation between arbitrary keys in standard tunes, which Cork called "joins". Most notable for computer performance, Biles (1994) used a genetic algorithm to generate jazz licks. Walker (1997) and Thom (2000) each researched and prototyped their own concept of an improvisational companion. Although neither system is currently accessible, both are compatible with our specific suggestion, which focuses on a particular basis for learning.

Bricks
This section provides a few examples of bricks that occur in standard tunes, such as those found in the jazz music literature. The illustrations are taken from the graphical user interface of our improvisational companion. Each brick consists of three rows: the top row is the inferred key of the brick, the second row is the name of the inferred brick, and the third row is the input chord sequence of the brick. The inference algorithm is described in a separate paper (under review). Definitions of bricks are specified as text, in a grammar file accessible by the user. The first example is the most common type of jazz cadence in a major key, classically described as a ii-V7-I cadence. The brick name is straight cadence, to distinguish it from other types of cadence. We use m7 to denote a minor seventh chord.

Figure 1: Straight cadence

The second example extends the first by adding two more chords to the front. Termed a long cadence, one of its functions is to prolong the tension leading up to the final resolution.

Figure 2: Long cadence

The third example extends the long cadence even further by adding two more chords. This is called a starlight cadence, in reference to the tune Stella by Starlight, by Victor Young, which contains a typical instance of this cadence.

Figure 3: Starlight cadence

There are other bricks in addition to cadences. For example, a turnaround in our terminology is a brick that takes the progression from a chord of a given function (the tonic chord by default) toward a chord of another function (also the tonic by default). Figure 4 illustrates the POT (Plain Old Turnaround) brick (Aebersold, 1979), while Figure 5 illustrates the Ladybird Turnaround, after the Tadd Dameron tune Ladybird, which entails making tritone substitutions (cf. Berg, 1992) for the non-tonic chords in the POT.

Figure 4: POT (Plain Old Turnaround)

Figure 5: Ladybird Turnaround

Turnarounds do not always target the tonic. For example, the dropback (Cork, 2008) targets the ii chord. There are many variations of dropback, all ending with the secondary dominant for the ii chord (V7 of ii = VI7). Two of them are shown in Figures 6 and 7. TTFA stands for "turnaround to further away", with "further away" (from the tonic) being the phrase Cork used for the pre-dominant ii chord. TINGLe (Elliott, 2009) stands for There is No Greater Love, a tune by Isham Jones, which begins with this brick.

Figure 6: TTFA Dropback

Figure 7: TINGLe Dropback

Figure 8 illustrates a roadmap produced by the companion, representing the analysis of a complete tune, in this case Confirmation by Charlie Parker, into bricks. The input that produced the roadmap consists of a text file with the chord symbols, bar markers, and section markers. In this case there are four sections, one per line. Most of the brick types in this tune have been defined above. Ones that haven't are major on, sad approach, and straight launcher. The word "on" simply means the tonic chord in the key of the moment.
In this tune, FM7, which stands for F major seventh, is the on chord. An approach is the part of a cadence not including its resolution, and a launcher is an approach that resolves at the start of a new section. One can also notice in Figure 8 the presence of joins (Cork 2008), which the software shows as small tag boxes below some bricks. Joins represent transitions between bricks. For example, there are six sidewinder joins in this tune. The sidewinder join is often used to signal a transition from a major key to its relative minor, for example F major to D minor at the start of the tune. Although joins are important to understanding the tune as a whole, and practice should over time take into account all twelve join possibilities (one for each chromatic interval), they do not play a major role in the current exposition. Not every transition has an identifiable join, revealing a gap in Cork's method.

Automatic Brick Analysis
The feasibility of our proposal is enhanced by the development and implementation of an algorithm for analyzing the chord progression of a tune into bricks. The implementation underlies our user interface, starting with a lead sheet as input and resulting in a roadmap as output. Because the algorithm is described in another paper submitted for publication, we will take it for granted here. One of the challenges for such an algorithm is that a given chord sequence can be interpreted as more than one brick sequence. The algorithm uses section sub-divisions of the tune to help reduce the ambiguity, and produces a final unique parse based on a cost assignment, representing user-specifiable precedence levels for various brick types. It can also be noted in Figure 8 that some of the starlight cadences end in Bb7_. The underscore tells the program that this Bb7 chord functions as a tension tonic (Cork, 2008), rather than as a dominant, its function by default.

Interaction Modes
Once the brick roadmap for a tune is made available, an improviser can make use of the bricks for practice. Our user interface allows the repeated play ("looping") of bricks. There are various modes of looping:
• Simple looping mode plays only the automatically generated background. The user can play over it as long as desired.
• Trading mode has the companion play melody every other iteration. The intention is that the user will play when the companion is not playing the melody. The melodies can be generated in two ways:
  o Generated on the fly by a grammar.
  o Selected from a pre-composed database.
• Recording mode can be used with either of the above modes. Whatever the user plays is recorded.
• Learning mode augments recording, with the program learning a grammar from the melodies played by the user.

Figure 8: Screen capture of the algorithmically produced roadmap of a complete 32-measure tune, Confirmation

Figure 9: Screen capture of a melody played by the improvisation companion over a starlight brick

Figure 10: Edited screen capture of a response played by a user over the starlight brick

The tradeoff between grammar- and database-based creation of melodies is that grammars can be much more creative, generating a wider variety of material, and do not require the laborious process of predetermining the database. However, use of a database allows one to avoid sub-standard melodies.
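To make the data shapes concrete, the sketch below shows one plausible way to represent a brick dictionary and greedily label a chord sequence with brick names. It is a toy illustration only: the actual system stores bricks in a user-modifiable grammar file with inferred keys, and resolves ambiguous parses using section boundaries and a cost assignment, as described above. The Roman-numeral encoding and the example patterns are assumptions, not the authors' format.

    # Toy sketch: bricks as named Roman-numeral chord patterns (an assumed
    # encoding; the real system uses a grammar file with keys and costs).
    BRICKS = {
        "straight cadence": ["IIm7", "V7", "I"],
        "POT":              ["I", "VIm7", "IIm7", "V7"],
    }

    def label_bricks(chords):
        """Greedily label a Roman-numeral chord list with brick names,
        preferring longer bricks; unmatched chords pass through as-is."""
        out, i = [], 0
        patterns = sorted(BRICKS.items(), key=lambda kv: -len(kv[1]))
        while i < len(chords):
            for name, pat in patterns:
                if chords[i:i + len(pat)] == pat:
                    out.append(name)
                    i += len(pat)
                    break
            else:
                out.append(chords[i])
                i += 1
        return out

    print(label_bricks(["I", "VIm7", "IIm7", "V7", "IIm7", "V7", "I"]))
    # -> ['POT', 'straight cadence']

A greedy left-to-right match like this cannot, of course, reproduce the unique cost-based parse of the real algorithm; it only illustrates why ambiguity arises (the final IIm7-V7 here could also have been read as the start of another brick).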
At this writing, recording and learning have not been completely implemented, although a sufficient technology base exists to establish their feasibility. Recording from a MIDI instrument, such as a keyboard or EWI (Electronic Wind Instrument), is relatively straightforward and available. Recording from audio will require the addition of a module for audio-to-MIDI transcription. Software tools such as Smart Music (2012) and Intelliscore (2012) establish the feasibility of recording and learning from audio. Figure 9 shows a screen capture of a companion-generated melody over the starlight cadence bricks in Figure 8, while Figure 10 shows a possible user response, played on a MIDI instrument at 100 beats per minute. Figure 10 was edited for readability to account for swing feel in the user's playing. Automation of the reversal of swing feel for visualization purposes is a solvable problem still to be implemented in our system. Our user interface provides additional features to enhance its applicability:
• The various play modes are not limited to just single bricks. Any contiguous combination of bricks can be played, allowing gradual range expansion.
• A variety of different styles, such as swing, bossa, rock, etc., can be generated as background for the bricks. Tempos can be varied to suit the player.
• Large bricks can often be broken down hierarchically into sub-bricks, as bricks in general are defined using a grammar-like notation in a user-modifiable dictionary. This feature can facilitate incremental learning by the user.
• The user can drag bricks together to create the harmony of a totally new composition, then save the result as a lead sheet.

Figure 11 shows a screen capture of the full roadmap interface as it currently stands, including the brick dictionary, from which the user can select bricks.

Using Bricks to Improve Grammar Learning
Gillick et al. (2010) demonstrate a method for automating the learning of grammars for generating melodies over a sequence of chords, using a set of transcriptions of solos as input. The set of solos can be as small as a single solo, or part of a solo. This learning method, as used in Impro-Visor (2012), involves scanning the solo using a fixed-length moving window one or two measures long. Each segment scanned in the moving window is converted, based on the underlying chords, into an abstract melody wherein the notes are not actual pitches, but rather categories (chord tones, color tones, approach tones, etc.). The melodic contours are represented by a series of slopes, groups of notes that uniformly rise or fall, with parametric bounds on the number of semitones between notes. Once the segments have been extracted by the moving-window method described above, they are clustered by similarity. Then a small set of representatives is chosen for each cluster, and the representatives are chained probabilistically using a Markov chain, which is conveniently representable as productions in the overall probabilistic context-free grammar. The transition probabilities are derived empirically from occurrence frequencies in the transcribed solos. The grammar is used to generate melodies by instantiating the representative abstract melodies into actual melodies. Due to the manner of abstracting and chaining, the results exhibit stylistic characteristics of the original solos without being rote copies of them. Probabilistic selection of grammar productions is the root source of apparent creativity emerging from the algorithm.
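As a rough illustration of the abstraction step just described, the sketch below maps concrete pitches to note categories relative to the underlying chord. It is a simplification under assumed category definitions: the pitch-class sets are illustrative guesses, and Impro-Visor's actual categories, slope encoding, clustering, and grammar machinery are not reproduced here.

    # Sketch of melody abstraction: pitches become categories relative to
    # the underlying chord, so learned material transposes across tunes.
    CHORD_TONES = {"CM7": {0, 4, 7, 11}}   # pitch classes C, E, G, B
    COLOR_TONES = {"CM7": {2, 9}}          # e.g. D (9th), A (13th)

    def abstract_melody(notes, chord):
        """notes: list of (midi_pitch, duration). Returns one category
        per note: C = chord tone, L = color tone, X = other."""
        out = []
        for pitch, dur in notes:
            pc = pitch % 12
            if pc in CHORD_TONES[chord]:
                cat = "C"
            elif pc in COLOR_TONES[chord]:
                cat = "L"
            else:
                cat = "X"   # approach tones would need melodic context
            out.append((cat, dur))
        return out

    print(abstract_melody([(60, 1.0), (62, 0.5), (63, 0.5)], "CM7"))
    # -> [('C', 1.0), ('L', 0.5), ('X', 0.5)]

It is these category sequences, not literal pitches, that would be clustered and chained; at generation time a category sequence is re-instantiated with pitches that fit the chords of the current brick.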
Our new contribution is to use bricks as the windows, rather than a fixed-size window as in Gillick et al. An advantage of using bricks is that they are already harmonically coherent units. As one can expect that a solo performed by a professional player will have melodic segments that conform to the harmonic units (Johnson-Laird, 2002), such a correspondence represents the transcribed soloist's understanding of the harmonic flow of the tune. Another advantage of using bricks as windows is that the Markov chaining used for sequencing small fixed-length segments can be eliminated. Chaining might still find uses in connecting brick-based melodies themselves; however, we expect that this macro-level chaining will not be extraordinarily useful in jazz, where large-scale coherence tends to be the exception.

Learning from the User
A set of recorded melodies produced by a user can be a viable basis for grammar learning. Such learning can either take place off-line or, if the processing demands are not too great, while the companion's melody is being played. Learning from the user's own melodies as a basis for a grammar can provide positive or negative reinforcement, as the resulting grammar is fundamentally an embodiment of the kinds of melodies being played by the user.

Assessment
As our companion is proposed as a learning vehicle, it is worthwhile to consider what kinds of mechanisms might provide feedback for the user. Coloration of user-played notes to indicate chord tones, color tones, approach tones, and others, as employed by Impro-Visor, provides one means of judging how well the tones being played by the user conform to the underlying harmony. This contrasts with the note coloration used by, e.g., Smart Music (2012), which informs the user whether a note is the one specified in the score. With improvisation, there is no single correct note, so Impro-Visor's note categorization is more appropriate in this context. Another capability that could be added concerns timing. Smart Music can inform the user whether a note is played early or late. But what is desired for jazz is to inform the user whether or not his or her timing swings, an aspect difficult to capture, and one that remains a topic of future research.

Conclusion
We have proposed an approach toward a jazz improvisation companion based on the idea of harmonic bricks. The latter were suggested by Cork (1988, 2008) as a means of remembering jazz chord progressions. Our suggestion is to also use bricks as a basis for improvising melodies. Toward that end, we have prototyped a software improvisation companion based on this idea. Although the learning aspects of the tool are work in progress, the implementation has progressed far enough that we are confident the approach will be useful in an educational setting to help enhance the creativity of its users.

2012_25 !2012 Automatic Composition from Non-musical Inspiration Sources Robert Smith, Aaron Dennis and Dan Ventura, Computer Science Department, Brigham Young University. 2robsmith@gmail.com, adennis@byu.edu, ventura@cs.byu.edu

Abstract
In this paper, we describe a system which creates novel musical compositions inspired by non-musical audio signals. The system processes input audio signals using onset detection and pitch estimation algorithms. Additional musical voices are added to the resulting melody by models of note relationships that are built using machine learning trained with different pieces of music.
The system creates interesting compositions, suggesting merit for the idea of computational "inspiration".

Introduction
Musical composition is often inspired by other musical pieces. Sometimes the new music closely resembles the inspiring piece, perhaps being an intentional interpretation or continuation of its themes or ideas. Other times the connection between the pieces is not identifiable (or even conscious). And such sources of inspiration are, of course, not limited to the musical realm. A composer can be inspired by the sight of a bird, the smell of industrial pollution, the taste of honey, the touch of rain or the sound of a running stream. Since this is the case, an interesting question for the field of computational creativity is whether a similar mechanism can be effected in computational systems. If so, new, interesting mechanisms for the development of (musical) structure become viable. Many attempts have been made at computational composition. These attempts use mathematical models, knowledge-based systems, grammars, evolutionary methods and hybrid systems to learn music theory, specifically whatever music theory is encoded in the training pieces applied to the algorithms (Papadopoulos and Wiggins 1999). Some of these techniques have been shown to be capable of producing music that is arguably inspired by different music genres or artists (Cope 1992). Some computational composers focus on producing melodies (Conklin and Witten 1995), but most focus on producing harmonies to accompany a given melody (Chuan and Chew 2007; Allan and Williams 2005). Ames (1989) and others have described training Markov models on existing artists or styles and generating similar-sounding melody lines. No system that we have found models the idea of artistic inspiration from non-musical sources. We present a computational system which implements a simple approach to musical inspiration, limiting our focus to (non-musical) audio inspirational sources. Our system can autonomously produce a melody and harmonies from non-musical audio inputs, with the resulting compositions being novel, often interesting, and exhibiting some level of acceptable aesthetic.

Methodology
Our approach to automatic composition from non-musical inspirational sources is composed of four steps: (1) audio input and melody generation, (2) learning voice models, (3) harmony generation and (4) post-processing.

Audio Input and Melody Generation
Inspirational audio input was selected from various sources. Our samples included baby noises, bird chirpings, road noises, frog croakings, an excerpt from Franklin Delano Roosevelt's "A Date Which Will Live in Infamy" speech, and an excerpt from Barack Obama's 2004 DNC speech. The melody generator takes an audio file (.wav format) as input and produces a melody. The input signal typically contains many frequencies playing simultaneously and continuously, and the generator's job is to produce a sequence of non-concurrent notes and rests that mimics the original audio signal. To do so, it uses an off-the-shelf, free audio utility called Aubio to detect the onset of "notes" in the audio file (as well as to estimate their duration) and to extract the dominant pitch at each of these onset times. Aubio is intended for analyzing recordings of musical pieces in which actual notes are played by instruments; however, in our system it is used to analyze any kind of audio signal, which means Aubio extracts "notes" from speeches or recordings of dogs barking or anything else.
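As a rough sketch of this step, the following uses the aubio library's Python interface to collect (onset time, pitch) pairs from a file. The file name, hop/window sizes, and the pitch band are illustrative assumptions rather than the authors' settings; loudness filtering is omitted, so the final filter only approximates the thresholding step described next.

    # Minimal sketch of melody extraction with aubio; parameters assumed.
    import aubio

    hop, win = 512, 1024
    src = aubio.source("inspiration.wav", 0, hop)     # 0 = keep file samplerate
    onset_o = aubio.onset("default", win, hop, src.samplerate)
    pitch_o = aubio.pitch("yin", win, hop, src.samplerate)
    pitch_o.set_unit("midi")

    notes = []
    while True:
        samples, read = src()
        if onset_o(samples):
            # record (onset time in seconds, estimated MIDI pitch)
            notes.append((onset_o.get_last_s(), float(pitch_o(samples)[0])))
        if read < hop:
            break

    # Stand-in for the thresholding step: keep a plausible pitch band
    # (36..84 is an assumed range; loudness is not checked here).
    notes = [(t, p) for (t, p) in notes if 36 <= p <= 84]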
A thresholding step discards generated notes that are too soft, too high, or too low. The result is a collection of notes, extracted from the raw audio, composing a melody.

Learning Voice Models
To produce harmonization for the generated melody, we employ a series of voice models, Mi, learned from a collection of MIDI files representing different musical genres and artists. Each such model is trained with a different set of training examples, constructed as follows. First, because there is no restriction on the time signature of the input or output pieces, note durations are converted from number of beats to seconds. Second, to identify the melody line of the training piece (and later to identify the melody line of the output piece), we use a simple heuristic assumption that the highest-pitched note at any given time is the melody note. Third, for each melody note, we find the k + 1 nearest neighbor notes using the distance function (see Figure 1):

d(n1, n2) = √(wt · dt(n1, n2)² + wp · dp(n1, n2)²)

where n1 and n2 are notes, and the weights wt and wp allow flexibility in how chordal or contrapuntal the training data will be. dt and dp compute the absolute difference in onset time and pitch, respectively, so

dt(n1, n2) = |onset(n1) − onset(n2)| and dp(n1, n2) = |pitch(n1) − pitch(n2)|

Figure 1: Finding neighbor notes. The top center note (circled in red) is the current melody note. In this case, k = 3, and, assuming wp = wt, the k closest neighbors are the two notes surrounding the melody note on the top staff and the first note on the bottom staff (circled in dark red). dt refers to the distance in time between the melody note and neighbor, and dp refers to the change in pitch. The (k + 1)th note is the rightmost note on the bottom staff (circled in green).

Training instances are constructed from a musical piece's melody notes and their k + 1 closest notes. The training inputs are the melody note and its k nearest neighbors, while the (k + 1)th closest note is used as the training output (see Figure 1). The melody note is encoded as a 2-tuple consisting of the note's pitch and duration. The neighbor notes and the output note are encoded using a 3-tuple consisting of the time (dt) and pitch (dp) differences between the neighbor note and the melody note, and its duration (see Figure 2). When building the training set for voice model Mi (with i indexed from 0), k = i + 2. So, after training, voice model Mi computes a function Mi : R^(3i+8) → R^3.

Figure 2: Training the voice models. For each melody note m of each training piece, a training instance is created from the melody note and the k + 1 closest neighboring notes (n1, ..., nk+1). The k closest neighbors are used, along with m, as input, and, as training output, the (k + 1)th closest neighbor is used. The melody note is represented as a pitch and a duration. Each of the other notes is represented as a 3-tuple consisting of dt, dp, and duration, where dt and dp refer respectively to the differences in start time and pitch between the neighbor note and the melody note.

Harmony Generation
The harmony generator is applied iteratively to add notes to the composition. Each pass adds an additional voice to the composition as follows. For iteration 0, k = 2 and voice model M0 is used with the melody as input. Each note, in turn, is used as the melody note, and it and its two nearest neighbors are used as input to the model, which produces an output note to add to the harmonizing voice.
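The sketch below restates the neighbour distance and the instance encoding in code, assuming each note is an (onset seconds, MIDI pitch, duration) triple and that the encoded dt/dp offsets are signed (the paper defines dt and dp as absolute values for the distance only; signed offsets in the encoding are an assumption). Note that with k = i + 2 the input tuple has length 2 + 3k = 3i + 8, matching the dimensions given above.

    # Sketch of the neighbour distance and training-instance encoding;
    # notes are assumed to be (onset_seconds, midi_pitch, duration) triples.
    from math import sqrt

    def dist(m, n, wt=1.0, wp=1.0):
        dt = abs(m[0] - n[0])             # absolute onset-time difference
        dp = abs(m[1] - n[1])             # absolute pitch difference
        return sqrt(wt * dt ** 2 + wp * dp ** 2)

    def training_instance(melody_note, other_notes, k, wt=1.0, wp=1.0):
        """Input: melody note plus its k nearest neighbours;
        output: the (k+1)th nearest neighbour, relatively encoded."""
        ranked = sorted(other_notes, key=lambda n: dist(melody_note, n, wt, wp))
        rel = lambda n: (n[0] - melody_note[0],   # dt (signed, assumed)
                         n[1] - melody_note[1],   # dp (signed, assumed)
                         n[2])                    # duration
        x = (melody_note[1], melody_note[2])      # melody pitch and duration
        for n in ranked[:k]:
            x += rel(n)
        return x, rel(ranked[k])                  # len(x) == 2 + 3k == 3i + 8

The same ranking drives harmony generation: at each step the current melody note and its k nearest neighbours are packed into x and fed to the trained model, whose (dt, dp, duration) output is decoded back into a concrete note.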
This does not imply that each harmony note is produced to occur at the same time as its associated melody note. For each melody note the model produces output values for dt, dp, and duration; the harmony note will only start at the same time as the associated melody note if dt = 0. When all melody notes have been used as input, the additional harmonic voice is combined with the original melody line and the first iteration is complete. For iteration 1, k = 3 and voice model M1 is used with the new two-voice composition as input, and the process is repeated, with the following caveat: we use the "melody" notes of the current piece (that is, the highest-pitched notes) instead of the original melody notes (along with their k neighbors) as input to the model. This allows the melody notes to change from iteration to iteration, since the system can output notes that are higher than the (current) melody. The end result is another harmonic voice that is combined with the two-voice composition to produce a three-part musical composition (see Figure 3). This process is repeated for v iterations, so that the final composition contains v + 1 voices in total. Empirically, we found that v = 3 resulted in the most pleasing outputs. With v < 3 there was not enough variation to distinguish the output from the original melody. For higher values of v, the output became less musical and more cluttered.

Figure 3: Adding voices. The harmony generator is applied iteratively over the melody line and generated harmony lines, using successively complex voice models. These iterations add successive voices to a composition.

Post-processing
After the output piece has been composed, the composition is post-processed in two ways, which we call snap-to-time and snap-to-pitch (and to which we refer collectively as snap-to-grid). Due to the beat-independent durations of the generated notes, the note onsets in the composition can occur at any time during the piece, which can result in unpleasant note timings. To correct this, we implement a snap-to-time feature. To do so, we first analyze the melody line to determine the shortest time, Δmin, between any two (melody) note onset times. Then each composition note onset is shifted so that it is an integer multiple of Δmin from the onset of the first note in the composition (see Algorithm 1). In other words, each note is snapped to an imaginary time grid whose unit measure is Δmin, with the result being music with a more regular and rhythmic quality.

Algorithm 1 Snap-To-Time. This algorithm adjusts note start times in the final composition to compensate for the lack of uniform timing across input and training pieces. First, Δmin, the minimum difference in start time between any two distinct notes in the melody, is calculated. Each note is then shifted so that its start time is an integer multiple of Δmin from the start time of the composition's initial note.
  Δmin ← ∞
  for all notes n1 do
    for all notes n2 ≠ n1 do
      Δ ← |onset(n1) − onset(n2)|
      if Δ < Δmin then Δmin ← Δ end if
    end for
  end for
  for all notes n do
    Δ ← ⌊onset(n)/Δmin + 0.5⌋ · Δmin − onset(n)
    onset(n) ← onset(n) + Δ
  end for
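A compact runnable version of Algorithm 1 follows, assuming notes are dicts with an "onset" key in seconds and that the grid is anchored at the composition's first onset, as in the prose description (the pseudocode above implicitly anchors at time zero).

    # Sketch of snap-to-time: quantize every onset to the grid defined by
    # the smallest onset gap in the melody, anchored at the first note.
    def snap_to_time(all_notes, melody_notes):
        onsets = sorted(n["onset"] for n in melody_notes)
        d_min = min(b - a for a, b in zip(onsets, onsets[1:]) if b > a)
        t0 = min(n["onset"] for n in all_notes)
        for n in all_notes:
            steps = int((n["onset"] - t0) / d_min + 0.5)   # floor(x + 0.5)
            n["onset"] = t0 + steps * d_min

    notes = [{"onset": 0.0}, {"onset": 0.52}, {"onset": 0.97}, {"onset": 1.6}]
    snap_to_time(notes, notes[:3])
    # melody gaps 0.52 and 0.45 give d_min = 0.45, so onsets snap to
    # multiples of 0.45: [0.0, 0.45, 0.9, 1.8]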
Because each voice is generated independently, there is no explicitly enforced (chordal) relationship between notes which occur at the same time. The voice models may provide some of this indirectly; however, this implicit relationship is not always strong enough to guarantee pleasing harmonies—there exists the possibility of discordant notes. To remedy this, we implement the snap-to-pitch algorithm. If two notes occur at the same time, the difference in their pitches is computed. The pitches are then adjusted until the pitch interval between the notes is acceptable (here, for simplicity, acceptable means one of {major third, perfect fourth, perfect fifth, major sixth}). See Algorithm 2.

Algorithm 2 Snap-To-Pitch. The notes n1 and n2 start at the same time. If the interval between them is not one of {major third, perfect fourth, perfect fifth, major sixth}, snap-to-pitch modifies the pitch of n2 so that it is.
  Δ ← pitch(n1) − pitch(n2)
  if Δ > 0 then
    if Δ < 4 then
      Δ ← 4
    else
      while Δ ∉ {4, 5, 7, 9} do Δ ← Δ − 1 end while
    end if
  else if Δ < 0 then
    if Δ > −3 then
      Δ ← −3
    else
      while |Δ| ∉ {3, 5, 7, 8} do Δ ← Δ + 1 end while
    end if
  end if
  pitch(n2) ← pitch(n1) − Δ

(When n2 lies above n1, the target intervals {3, 5, 7, 8} semitones are the inversions of the listed ones.) As a summary, Algorithm 3 gives a high-level overview of the entire compositional process.

Algorithm 3 Algorithmic Overview of the System. A melody is generated by detecting the pitch, onset, and duration of "notes" in an inspirational audio sample. Additional voices are added by creating increasingly complex voice models and iteratively applying them to the composition. The entire composition is then post-processed so that it incorporates a global time signature of sorts and to improve its tonal quality.
  composition ← extractMelody(inspirationAudio)
  for i = 0 to v do
    k ← i + 2
    trainset ← ∅
    for all training pieces t do
      trainset ← trainset ∪ extractInstances(t, k)
    end for
    trainModel(Mi, trainset)
    composition ← addVoice(Mi, composition)
  end for
  composition ← snapToTime(composition)
  composition ← snapToPitch(composition)

Results
Musical results are better heard than read. We invite the reader to browse some of the system's compositions at http://removedforblindcopy. In some cases the melody generator produces melody outputs which are readily identifiable with their inspirational source audio files. Examples include compositions inspired by a speech by President Obama and by a bird's song. In both cases, the resulting melody line synchronises nicely with the original audio when both are played simultaneously. In contrast, other compositions sound very different from their inspirational source. Examples include a recording of a frog's repetitive croaking and a monotonous recording of road noise in a moving car. In the case of the road noises one would expect an output melody that is monotonous, mirroring the humanly-perceived characteristics of the input audio file. However, the melody generator composes a low-pitched, interesting, and varied melody line when given the road noise audio file, making it hard to identify how the melody relates to its source.

Figure 4: Snap-to-grid. The first graph shows the layout of an output composition based on CarSounds without snap-to-grid post-processing. The second graph shows another CarSounds output with snap-to-grid. Note the change in the pitch scale that reflects the increase in pitch range which is a result of adjusting concurrent notes to an aesthetically pleasing interval.

In all outputs there is a general lack of traditional rhythm and pitch patterns. This is, of course, not surprising given that our audio sources for inspiration are not required to be in any particular musical key or to follow traditional key changes, nor do they have any notion of a time signature. Additionally, we do not restrict our training sets in either of these traditional ways.
As a consequence, it is likely that in any given training set there will be instances which are in different keys and/or time signatures from the melody. In light of these conditions, it is to be expected that the output will not be traditional music.

Training   wt  wp  Percent Chords
TwoDance   1   1   83
TwoDance   1   3   44
TwoDance   3   1   80
TwoBlues   1   1   67
TwoBlues   1   3   47
TwoBlues   3   1   71

Table 1: The effect of the weights wt and wp. The input was the FatFrog audio file and voice models were trained using either two songs from the Dance genre or two songs from the Blues genre. Generally, as wp increases (with respect to wt), the number of chords produced in the output composition decreases.

The snap-to-grid feature is helpful. We have posted audio examples on the web comparing outputs with and without snap-to-grid. An example graph of each is given for visual comparison in Figure 4. Snap-to-time doesn't significantly change the landscape of the pieces, but it proves to be essential in synchronizing notes which were composed as chords but are not easily recognized as such because of the high precision of start times. Snap-to-pitch has a dramatic effect on the pitch of certain notes but is limited to those notes which occur at the same time. We explored several values for wt and wp (see Table 1), and, as expected, when wp > wt there are fewer chordal note events relative to single notes than when wp < wt. Interestingly, the baseline wt = wp = 1 for the case of voice models trained with two Dance songs is slightly more chordal even than wt = 3, wp = 1. We could not detect any significant difference in effect when using different genres or artists for training the voice models. No distinguishable qualities of dance music were discernible in the outputs composed using models trained only on dance music. No distinguishable qualities of Styx songs were discernible in the outputs composed using models trained only on songs by Styx. In short, each variation of the training input successfully introduced novel variations in the output compositions in an untraceable way. The choice of training pieces did not produce a predictable pattern of aesthetic quality. The fact that our (admittedly simple) voice models failed to capture the distinct qualities of certain artists or genres suggests that our methods for encoding the musical qualities of training pieces are less effective at capturing such information than they are at capturing interesting note combinations and timings (see Figure 5). As described, the standard system uses the k + 1 closest neighboring notes of each melody note for training the voice models, and this works. However, as a variation on this approach, randomly sampling k + 1 notes from the 4k closest notes adds some extra variation to the composition and can lead to more aesthetically pleasing outputs. Snap-to-grid proved to be very useful in contributing to the aesthetic quality of the compositions. Compositions without snap-to-grid have more atonal and discordant chords which play at undesirable intervals. Using snap-to-grid allows a compromise between the uniqueness of the compositional style and regular timing intervals and chordal structure.

Figure 5: Composition sample. These two measures are taken from one of the compositions produced by our system. The system produces interesting rhythms with varying chordal texture.

Future Work
At this point, our system is quite simple and many of the techniques it employs are somewhat naïve musically.
Future Work

At this point, our system is quite simple and many of the techniques it employs are somewhat naïve musically. Some of this naïveté is for convenience at this early stage of system development, and some of it reflects design decisions that allow for greater variety in system output. The snap-to-grid processing is a post-hoc attempt to impose some level of musical "correctness" on the system's output. Given the unconstrained nature of the inspirational input, it is an interesting question how one might naturally incorporate useful aspects of music theory directly into the melody generation process while still allowing significant effect from the source. Also, it is natural to suggest incorporating more traditional and mature harmonization schemes for the generated melodies. Finally, to this point, only the melody has been (directly) affected by the inspiring piece; it would be interesting to develop methods for using the inspirational source to directly influence other musical characteristics such as harmonization, style, texture, etc.

However, all of these necessary improvements are relatively minor compared to the real open issues. The first of these is the development of an evaluation method for judging aesthetic and other qualities of the compositions. To this point, our measure of "interestingness" has been only our own subjective judgment. The development of more principled, objective metrics would be useful as a filtering mechanism and, at a more fundamental level, as feedback for directing the system to modify its behavior so that it produces better (novel, interesting, and surprising) compositions. In addition, such results may also be vetted in various kinds of human subject studies. The second of these is the development of a mechanism for autonomously choosing which inspirational sources the system will use as input. This requires the development of some type of "metric" for inspiration. Or, perhaps another way to think about this problem is to ask the question, "what makes a sequence of sounds interesting (or pleasing, or arousing, or calming, or ...)?" Is this quantifiable, or at least qualifiable in some way? Some potential starting points for this type of investigation might include work on identifying emotional content in music (Li and Ogihara 2003; Han et al. 2009) as well as work on spectral composition methods (Esling and Agon 2010). This, in turn, introduces further considerations, such as in which quality or qualities the system might be interested and how those interests might change over time. An additional consideration is that of a second level of inspiration: rather than the system being inspired by the aural qualities of the input alone (as it is at present), is it possible to construct a system that can be inspired by metaphors those aural qualities suggest? And is it then possible for the system to communicate the metaphor to some degree in its output?

2012_26 !2012 Creativity in Configuring Affective Agents for Interactive Storytelling

Stefan Rank¹, Steve Hoffmann², Hans-Georg Struck³, Ulrike Spierling², Paolo Petta¹
¹ Austrian Research Institute for Artificial Intelligence (OFAI), Austria, stefan.rank/paolo.petta @ ofai.at
² Hochschule RheinMain, University of Applied Sciences, DCSM, Germany, Steve.Hoffmann/Ulrike.Spierling @ hs-rm.de
³ Independent Screenwriter, georgstruck @ foni.net

Abstract. Affective agent architectures can be used as control components in Interactive Storytelling systems for artificial autonomous characters.
Creative authoring for such systems then involves configuring these agents, which translates part of the creative process to the system's runtime, necessarily constrained by the capabilities of the specific implementation. Using a framework for presenting configuration options based on a literature review, a questionnaire evaluation of authors' preferences for character creation, and a case study of an author's conceptualisation of the creative process, we categorise available and potential methods for configuring affective agents in existing systems with regard to creative exploration. Finally, we present work-in-progress on exemplifying the different options in the ActAffAct system.

Introduction

Interactive Digital Storytelling (IDS) is concerned with the creation of a new media art form that allows for real-time interaction with a developing narrative. In the terminology of (Boden and Edmonds 2009), the aim is a form of CI-art or VR-art, i.e., computer-generated and responsive to audience interaction, possibly in the form of virtual reality. Creating new methods for adaptivity, generativity, and interactivity is seen as the prime method for advancing beyond traditional linear media. While there are recent examples of technical approaches that are close to video-based media, e.g. video recombination (Porteous et al. 2010), and of conceptual approaches to story generation, e.g. based on analogy-mapping (Ontañón and Zhu 2011), a large part of the research has focused on enabling interactive storyworlds inhabited by synthetic characters (Rank 2005; Si, Marsella, and Pynadath 2005; Louchart and Aylett 2007). The assembly of conversational actors endowed with some degree of autonomy poses significant integration challenges (Gratch et al. 2002), including the development of authoring methodologies that support the creative process inside the boundaries of an IDS system. We take the point of view of authors with IDS experience to find ways beyond the current disparity between the needs of authors and the capabilities and interfaces of existing systems. In order to examine the support for creative authoring in these systems, we look at methods for configuring one crucial element for translating parts of the creative process to runtime: affective agents. After introducing affective agent architectures and a framework for presenting their configuration options to authors that draws on a literature review, the evaluation of authors' preferences for character creation using a questionnaire, and a case study of an author's conceptualization of the creative process, we examine available and potential methods for configuring affective characters in existing systems. Finally, we report on work-in-progress translating these options to the ActAffAct system.

Affective Characters

Affective agents are a specialization of intelligent agents for domains in which emotional and related phenomena are important. A key dimension for agent control architectures of synthetic characters is affective competence: believable portrayal of emotional reactions and the capability of selecting appropriate expressiveness; variability within the consistent boundaries perceived as personality (Ortony 2003); and the recognition of subjective relevance in the agent's (social) environment (Marsella, Gratch, and Petta 2010). The origins of such architectures often lie with scenarios of use (Rank and Petta 2006) that target other application areas, cognitive modeling, or the modeling of psychological theories.
Their configuration is therefore not necessarily suited directly for the authoring of synthetic characters. In the context of IDS, synthetic characters translate parts of the creative decisions of authoring to the runtime system. Affective competence helps to ensure the emotional aspect of character portrayal, as well as of the causal connections in a story, down to the fine-grained level of audience interaction. Here, we focus on techniques that model characters and their behaviour explicitly in order to achieve levels of motivational and behavioral autonomy that facilitate the generativity and interactive flexibility that IDS strives for. On the spectrum from 'strong autonomy' to 'strong story' (Mateas and Stern 2000; Swartjes 2010), this places the approach towards the autonomy end, more compatible with the idea of emergent narrative, which nevertheless requires purposeful authoring (Louchart et al. 2008), in addition to affective and situated competencies, to be successful. Even in a strongly story-based interactive system, autonomously competent agents are valuable if they can be configured to act 'in character' during episodes that are not directly controlled by a story-based framework, or if the system explicitly represents emotional links between characters as part of the authoring process (Pérez y Pérez 2007). For the context of this work, the top level of an IDS system, drama management (Roberts and Isbell 2008), is not considered: we intentionally focus on single characters and their autonomous behaviour: character goals rather than author goals (Riedl 2009). The nature of interactive systems entails that the judgment of creativity cannot focus on the novelty and value of static artifacts, as it can for systems that clearly separate a generative part; see also (Gervás 2009). The use of affective agents carries aspects that are clearly separable for static artifacts, such as chronology, causality, and the distinction between fabula and discourse (i.e. what is told and how it is told), into the runtime of a system. Rather, we try to map the conceptual spaces that are established by affect models, as well as the ways of exploring and potentially transforming them. This approach is similar to work in the area of game design (Smith and Mateas 2011). In (Spierling and Hoffmann 2010), the authors state that creative authorship is far from obsolete in the context of IDS. The creative output consists to a substantial degree of the specific configuration of the control system. Abstractions at the creative conceptual level are seen to be distinct from the more formal abstractions (Crawford 2004) required for implementation. The relationship between authorial conceptual abstractions and the more technical, implementation-specific abstractions, and the special kind(s) of dedicated support an IDS system must provide for this transformation, needs to be investigated for each case.

Configuration Options for Affective Agents

We use a framework for the configuration of affective characters as seen from the author's perspective (Rank et al. 2012), based on feedback received from a questionnaire study performed during an authoring workshop1, as well as a case study of one author's practice, strongly rooted in drama theory. The questionnaire used free-text feedback and a set of Likert-scale questions intended to gauge the relative importance of different strategies for creating characters for a story-world.
The free-text feedback reflected a wide range of approaches to character creation, including placing the focus on events, conflicts, or personality-specific feelings that happen to the character; relying on known characters or on personal experience as a starting point; and picking the background and underlying goals of characters as the central element. At the same time, the reported results point to the complementarity of different approaches to character creation, while the evaluation of preferences for different approaches showed no significant differences at that level of investigation (Rank et al. 2012).

As a second source for authors' viewpoints on the problem of configuring synthetic characters in an effective way, we draw on a case study of the authoring experience of one of the co-authors (Struck 2005) with both traditional linear narratives (i.e., script writing) and interactive storytelling systems. The central tenet of this author's experience is the conceptualization of narrative as a sequence of emotional effects. Underlying this approach are character-centered drama models based on work such as that of Frank Daniel (see (Howard and Mabley 1995)) and the notion of a conceptual space (Struck 2005). Furthermore, the practice corresponds to the cyclic process of engagement and reflection that has been proposed as a model of creativity in writing (Sharples 1999). As a quick way to concretize a character's role in a narrative, it is proposed to answer the question: What does the character want the most? Motivations and aversions of a character are then considered subordinate to this main desire. The notions of aspirations, vocations, and goals are crucial to derive a character's fears. As an example, consider a character that wants to see the world: potential candidates for suitable high-impact fears would then be fear of flying or fear of crossing open water. A narrative is then seen to pitch a character's goals against obstacles involving the need to risk high stakes. Anything lacking a connection to a character's hopes and fears is omitted. Conversely, everything that is shown relates to this backstory of the character. Note that such constraints quickly extend beyond individual characters to comprise their social stances and interrelationships (Spierling and Hoffmann 2010). In addition, this selection principle contributes to perceived believability by allowing for the inference and prediction of motivations and intentions through observations, e.g. (Riedl and Young 2010), p. 220.

1 The IRIS authoring summer school: http://iris.interactivestorytelling.de/Summerschool

Figure 1: The levels of presenting configuration of characters in IDS.

Based on this investigation of the authors' viewpoint, and considering available systems, four different non-distinct levels of presenting configuration are distinguished, as illustrated in Figure 1:
1. Direct changes to initial inner states of agents as motivating factors for the character's behaviour.
2. Parameter settings that influence the inner workings of an agent in correspondence to a theoretically persistent characteristic of the agent as a whole: traits.
3. Complete stock characters with a particular personality that can be used as a basis for customization.
4. Configuration based on the selection of backstory experiences that influence background beliefs and emotional parameters.

Levels 1 and 2 define a conceptual space that can be explored exhaustively in theory, though not in practice for most systems.
Levels 3 and 4, while relying on the same methods of character control, present configuration as a transformational process starting from exemplars2. As mentioned, these levels are not distinct and can be seen as complementary, progressing towards a higher level of abstraction.

2 See (Rook and Knippenberg 2011) on the influence of the quality of exemplars on creativity and imitation, depending on the regulatory focus of authors.

Configuration in Existing Systems

In the following, we review current systems and their support for exploring configuration in terms of inner states and traits, as well as their potential for supporting stock characters and backstory experiences. An important biasing factor for the selection of example systems was the open or confidential availability of source code. As mentioned above, the inner states of existing systems are often strongly tied to the origin of the agent architecture and not geared towards authoring. Many affective agent architectures can be seen as extensions of belief-desire-intention (BDI) agent models based on ideas about resource-bounded practical reasoning (Bratman, Israel, and Pollack 1988). These 'BDI+E' architectures rely on: beliefs that represent what an agent holds to be true, desires (or goals) that an agent tries to fulfill, and a representation of what the agent is capable of doing, often in terms of a plan library. The third name-giving element, intention, refers to capabilities activated in pursuit of a current goal. The main influences on agent behaviour that all these architectures share, and that are directly amenable to exploration, are the relative importance of different desires, i.e. their utility, and the set of capabilities available to a specific agent. Examples of architectures that add an operationalization of affective appraisal are FAtiMA (Dias and Paiva 2005) and ActAffAct (Rank 2005). The addition of affective appraisal results in further parameters for exploring configurations: the relative importance of standards, i.e. the evaluation of different types of behaviour; the initial relative importance of the actors and objects in the storyworld; and the creation thresholds and decay rates for different types of emotions. Further, such architectures have been extended to consider mood as a meta-level effect, i.e. as an aggregate of previous emotions. An important additional parameter in this respect is the number of emotions considered. One extension of FAtiMA (Doce et al. 2010) applies the so-called OCEAN or five-factor model of personality. An individual's personality is expressed as values of five personality traits that can be used to explore the possible resulting behaviours: openness, conscientiousness, extraversion, agreeableness, neuroticism. The values for the five OCEAN factors influence the appraisal process in terms of thresholds and decay rates for emotion instances, but also coping and planning, as well as expressivity in an animation system. Both the BDI framework and personality theory frame the configuration of agents in terms that are still close to the authorial conceptualization presented above. For other mechanisms, the matching of explorable settings is less direct. In implementations based on PSI theory, such as MicroPSI (Bach 2003) or ORIENT (Lim et al. 2012), another extension of FAtiMA, emotions are described as sets of modulators that influence processing directly: arousal, i.e. the propensity for action; resolution level, i.e. the accuracy of internal processing; and selection threshold, i.e.
resistance to changing the current intentions. Exploring the effect of different situations on the values for these modulators opens a space for an agent's personality. On a more general level, PSI theory introduces the individual settings of so-called motivators (affiliation, integrity, energy, certainty, and competence), homeostatic variables with an influence on behaviour based on the deviation from set-points. Further affective architectures are built on top of cognitive architectures that in themselves provide a wide range of possibilities for configuring individual differences in terms of inner states. An example is Soar and the emotion models based on it, such as EMA (short for emotion and adaptation) (Marsella and Gratch 2009) or PEACTIDM (Marinier and Laird 2006). Soar itself provides a general processing cycle that can be used in different styles, which in turn results in a wide range of configuration options. Corresponding to the BDI approach, the utility of goals and the availability of operators form the core of any configuration. Similarly, in Thespian (Si, Marsella, and Pynadath 2005; Si, Marsella, and Pynadath 2009), goals, policies, and beliefs about self and others are the determining factors of single-agent behaviour. In addition, in support of authorial control, characters can be configured by specifying multiple story-paths that are used to deduce their goals. Thereby, this approach employs a strong-story element to parameterize a system that is originally autonomy-driven. In EMA, the overall affective assessment is based on a causal interpretation of the current state of the world. On a conceptual level, the granularity of this representation forms an important part of configuring the personality of an agent. One focus of affective architectures in general is coping activities, defined as the inverse operation of appraisal. Coping thus involves the identification and influencing of the believed causes of the currently significant state. Different coping strategies can be available to a single agent, and the selection of these strategies represents a new level of potential configuration and a suitable candidate for a high-level exploration of the conceptual space of affective agents. Further, the availability and relative priority of different coping strategies can be linked to backstory experiences. In the VirtualStoryteller framework (Swartjes, Kruizinga, and Theune 2008), late commitment is used for the autonomous characters to determine the values of internal parameters and the state of the storyworld at runtime rather than beforehand at authoring time. To inform these delayed decisions, an assessment of the benefit of available options for story development is computed, thus reducing options for direct authoring of character states in favour of more global traits of the character in the story context. Planning and scheduling techniques add further implementation-specific parameters for configuring individual differences that are comparatively far removed from the author's perspective. As an example, planning algorithms can use quality measures, e.g., time and cost, and resource constraints to decide between alternative paths of action. Configuration of these mechanisms involves the relative weighting of different quality measures.

Table 1: Configuration options for inner states and traits.

  Inner States: beliefs, granularity of internal representation; availability of capabilities; utility of goals; standards, evaluation of types of behaviour; thresholds and decay rates for emotion types; availability and priority of coping strategies.
  Traits: history considered for meta-level mood; qualities/constraints for planning/scheduling; openness, conscientiousness, extraversion, agreeableness, neuroticism; arousal, resolution level, selection threshold; importance of affiliation, integrity, energy, certainty, competence.

Table 1 summarizes relevant options for the configuration of affective behaviour, distinguishing options related to inner states from traits of the agent as a whole.
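As an illustration only, the options of Table 1 could be collected into a single configuration object presented to an author. The following Python sketch uses field names we distilled from the survey above; it does not reproduce the schema of any of the systems discussed.

```python
from dataclasses import dataclass, field

@dataclass
class AffectiveAgentConfig:
    # Inner states (level 1: direct changes to initial states)
    goal_utilities: dict = field(default_factory=dict)       # desire -> utility
    capabilities: set = field(default_factory=set)           # available plans/actions
    emotion_thresholds: dict = field(default_factory=dict)   # emotion type -> threshold
    emotion_decay_rates: dict = field(default_factory=dict)  # emotion type -> decay
    coping_priorities: dict = field(default_factory=dict)    # strategy -> priority
    # Traits (level 2: persistent characteristics of the agent as a whole)
    ocean: dict = field(default_factory=dict)                # five-factor personality
    mood_history_length: int = 10                            # emotions aggregated into mood

# Level 3 (stock characters) then amounts to shipping named presets that
# authors customize; the values below are invented for illustration.
VILLAIN = AffectiveAgentConfig(
    goal_utilities={"dominate": 0.9, "help_others": 0.1},
    ocean={"agreeableness": 0.1, "neuroticism": 0.7},
)
```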
Practical and intentional examples for stock characters and for configuration based on backstory experiences are rare. However, most architectures were designed for a specific purpose, and therefore a set of characters is available, at least in principle. To the best of our knowledge, the explicit use of backstory experiences to influence individual character behaviour has not been implemented in any system so far.

Extending Configuration of ActAffAct

ActAffAct (Rank 2005) is a proof-of-concept system that relies solely on the affective competences of individual characters and their configuration in terms of beliefs and desires to generate interesting but very simple plots within an interactive storyworld setting comprising a hero, a villain, a victim, and a mentor, as well as simple props such as a sword, a rope, or a bouquet of flowers. As a system with a BDI background and a practical reasoning system at its core, the modification of inner states is the direct way of exploring character designs. The use of a mood system allows for the modification of traits: the influence factors of different emotion types on mood and the decay rate of the mood state, which affect the character as a whole. Our work-in-progress considers the support for stock characters and backstory experiences. Due to the use of "cliché" story characters, archetypes, as agents in the original system, a set of stock characters can be derived directly. Most interesting, though, is the realisation of backstory experiences: the implementation of the appraisal process includes the selection of coping styles and the relative weighting of different types of emotions. These two elements lend themselves to being configured by a selection from automatically generated episodes: every episode shows the reactions of a character in the possible combinations of encounters with other characters and objects in the storyworld. Evaluation of the resulting authoring possibilities is planned, including a questionnaire study.

Conclusion

In this paper, we used a conceptualization of character creation from the author's viewpoint to review the different levels of configuration that current affective agent architectures provide. A mapping of notions of the author's creative process to configuration options is not straightforward. Rather, due to the roots of many affective agent architectures in areas other than IDS, the parameters offered for exploration are often far removed from the author's perspective on character creation. On the other hand, parameters that stem from theoretical and practical considerations in agent architectures potentially provide new sources of creative inspiration for authors, both in terms of details of modeling and in terms of additional factors of influence that implementation-specific configurations expose for exploration.
Overall, the conceptualization helps to frame the support for creativity in authoring IDS systems, and points to future extensions of approaches for character configuration. Finally, based on the review of configuration options in related systems, we presented ongoing work on extending the ActAffAct system with new ways of configuring autonomous characters.

Acknowledgements

This work is partially supported by the European Commission under grant agreement IRIS (FP7-ICT-231824). The Austrian Research Inst. for AI is supported by the Austrian Federal Ministry for Transport, Innovation, and Technology.

2012_27 !2012 A Meme-Based Architecture for Modeling Creativity

Shinji Ogawa, Nagoya, Aichi, Japan, perfectworld@nyc.odn.ne.jp
Bipin Indurkhya and Aleksander Byrski, AGH University of Science and Technology, Cracow, Poland, {bipin, olekb}@agh.edu.pl

Abstract. This research is a collaborative work between a visual artist, a computer scientist, and a cognitive scientist, and focuses on the creative process involved in connecting two pictures by painting another picture in the middle. This technique was used in four Infinite Landscape workshops conducted at art museums in Japan and Europe over the last five years. Based on the artist's verbal recollection of the ideas that occurred to him as he drew each of the connecting pictures, we identify the micro-processes underlying these ideas, and propose a meme-based, evolutionary-inspired architecture for modeling them.

Introduction

Research in recent years has revealed that though creativity may involve an aha moment with a gestalt shift or a sudden perceptual or conceptual reorganization, this moment is typically preceded and followed by several micro-processes that play a role just as important as the aha moment itself (Dunbar 1997; Sawyer 2006). These micro-processes can occur within a cognitive agent itself, or across different agents within a group or society. Our goal in this research is to study and model these micro-processes.

Infinite Landscape Workshops

This research is a collaborative effort between a visual artist [henceforth the Artist], a computer scientist, and a cognitive scientist. Over the last five years, the Artist conducted four workshops at art museums in Japan and in Europe with the common theme of connecting different spaces. In each workshop there were 15-19 participants, all children (8-14 years), except for one workshop that also included six adults. All the workshops followed the same modus operandi. In the first step, the children were shown about 20 photographs of scenery from around the world, and were then asked to draw imaginary landscapes using the buildings, people, animals, etc. in these pictures as they liked. In the second step, the Artist brought the children's imaginary landscapes to his studio and drew one picture to be inserted between every two children's pictures, so that all three pictures form a seamless scene. One such trio of pictures is shown in Fig. 1: scenes 9 and 10 were drawn by participants, and the Artist drew S9 to connect the two.

Figure 1: First strip

In the third and final step, all the pictures were connected in a ring without a beginning or an end, and the completed ring was suspended from the ceiling of the museum where the workshop was held. The ring was placed with the paintings on the inner side, so that the viewer is surrounded by the work while viewing it.
Overview of the Project and Methodology

Specifically, our goal in this project is to model the micro-processes involved in creating the connecting picture. Our methodology is as follows. In the first step, the Artist recorded the various ideas that occurred to him as he drew each of the connecting pictures. In the second step, we analyze these records to identify and classify the underlying processes. In the third step, we outline a model for implementing these processes. Finally, we would like to experiment with the implemented system and evaluate the results. In the current paper, we report our observations from analyzing the data from the workshop conducted at the Meguro Museum of Art, Tokyo (Japan) on 2 August 2007. The Meguro workshop differed from the other three workshops in that the participants were given only pencil and paper; there was no color, so the focus was on forms, shapes, and space. Also, this workshop included six adults among its nineteen participants; the remaining thirteen were children (8-14 years). Based on our observations, we identify various micro-processes and how they interacted with each other to create the macro-level connecting pictures. Finally, we propose a meme-based, multi-agent architecture for modeling the underlying cognitive process, and discuss future research directions.

Observations on the 'Connecting' Process

We analyzed data from ten connecting pictures that the Artist drew for this workshop. Here we present the Artist's self-reflection on the genesis of the ideas that led to the creation of the connecting pictures. We include here seven of the more interesting cases. (The original comments were in Japanese. Translation and slight editing is by one of the other authors of this paper.)

We start with the Artist's observations on connecting 9 and 10 (Fig. 1): "These two had completely different atmospheres from each other. Sketch 9, drawn by an adult participant, is a scene set at dusk; a person looking at the artist is drawn wearing a sad expression. Sketch 10 has a bright atmosphere with flowers, fountains, buildings on a hill, and a horse. Moreover, each picture had an important character in the bottom left. The idea for connecting these sketches came to me while looking at the wonderful horse in 10. I thought of putting a parent horse running nearby. Because the background color of 9 and the body color of the horse in 10 were the same, I transformed the background of 9 into the parent horse in S9, which became a nested image structure. Then I extended the baby horse and the hill with the buildings."

On connecting 11 and 12 (Fig. 2): "There was the ground and the sky in the left one-third of 11, but the sea covered the remaining part on the right. In 12, a vast meadow was drawn with rich pictorial detail. Here my attention was drawn to the connection between the color of the giant bridge in 11 and the color of the sky in 12. In S11 I drew the enlarged bridge of 11 and connected it with the picture in 12, which resulted in a nested image structure."

Figure 2: Second strip

On connecting 12 and 13 (Fig. 3): "I felt these two could not be connected with the techniques I had used so far. Then I noticed the wall in the top-right corner of 12 and the curved ledge surrounding the fountain in 13. Using these two curves, I drew a large Möbius strip in S12. As this Möbius strip divided S12 into four sections, in each section I extended the adjacent scenery. It felt like pouring in the scenery.
Accordingly, I was able to connect them without blending, and this became the first work with this technique."

On connecting 7 and 8 (Fig. 4): "Because 8 was a richly detailed, realistic presentation, to contrast it with the presentation in 7, I decided to stress dimensionality in the connection. The realistic rocks and the bridge in 8 were rendered in 3-d and were connected with the bridge in 7, which was extended in 2-d. To make this connection smoother and give an accent to the picture, I drew three Russian onion domes from 7 into S7."

Figure 3: Third strip

Figure 4: Fourth strip

On connecting 5 and 6 (Fig. 5): "Sketches 5 and 6 could be naturally connected. However, I had decided to refuse the ordinary, conventional way of connecting things. I got the hint from the composition of 6. Oddly, on the right of 6, everything is drawn tilted towards the bottom left along a vector, but in the middle part, another horizontal vector appears. As a result, the horizon is split into two: one horizontal and another pointing to the bottom left. I further emphasized this split of the horizon, and drew a horizon pointing to the sky where the cow is, and another horizon that is sinking down where the buildings are."

Figure 5: Fifth strip

On connecting 4 and 5 (Fig. 6): "Connect 5 on the right of 4. I was very interested in the row of flags that was hanging in 5 from left to right. On the right edge of 4, there is an upside-down building. What a challenge! I took that challenge and extended the gate of that fort-like building, and turned the top-right part of it into a water surface. I extended that dark water surface to the right, making it narrower, and connected it with the contour of the lake in 5. On top of it, I placed the swans and plants from 5. I left the top-right part of the picture white in order to create a contrast effect with the black space that is extended to the left. In the bottom right, I extended the flags."

Figure 6: Sixth strip

On connecting 3 and 4 (Fig. 7): "I had a strong impression that the participants were expressing their own images instead of sketching by sampling from the photographs of the scenery I had shown. An extreme case of this is 4. At first the sketch was filled in completely black, and then brightened with the eraser. It had no earth and sky, but an ambiguous space from a dark fantasy. I decided to connect this dark picture with 3, which had a child-like pictorial space. However, it would be impossible to connect the two in an ordinary way. Here, I decided to ignore all the meaning in these pictures, and to focus on the pattern of light and dark instead. I said to myself, 'it is just a blotch'. The only connecting point in both pictures was the street in 3 and the bridge on the bottom left of 4. I could connect this street and the bridge. Luckily, the bottom left of 4 looked like the sea, and the bottom right of 3 also looked like a body of water. In S3, I extended the road in 3 in an S-shaped curve and connected it with the bridge in 4. Continuing, I also extended the sea. The problem was what to do on top of this. On the left part of S3, the only possibility was to extend the street-side houses of 3, so I did that in the same touch. Then I gradually changed the color of the houses from gray to black, while introducing spatial distortion, and changing them from solid to liquid. I floated a swan in the dark pond that the buildings were turned into."
Figure 7: Seventh strip

Identifying Micro-Processes in 'Connecting'

Carefully going through all these comments, as well as examining the trios of pictures ourselves, we came up with the following list of micro-processes that played a role in the genesis of connecting two pictures:

Copy elements. This was by far the most common operation. Elements were copied from both the left and right pictures and incorporated into the connecting pictures as they were. One can see examples of this in almost every instance of connection. Among the examples presented above, the swan is copied from 4 to S3, the flags, plants, and swan from 5 to S4, the onion domes and swan from 7 to S7, the small bridge and bull from 11 to S11, and so on.

Copy elements and transform. This is similar to the above, except that the element gets transformed while copying. For example, the rocky peak and the bridge are rendered in 3-d while they are copied from 8 into S7, and the parking sign is turned around as it is copied from 5 into S5.

Copy elements and swap attributes. Here elements that are being copied interact during copying and swap attributes. One example is provided in 6-S6-7 (Fig. 8), where two people are copied from 6 into S6, but their poses and the object one of them is holding are swapped.

Extend elements. An element is continued in the adjacent picture; for example, the sea from 3 into S3, the masonry from 6 into S6, and the meadow from 12 into S12.

Same form (shape, shade, ...) → search for meaning. This is illustrated by 9-S9-10 (Fig. 1), where the same shading for the horse's body in 10 and the background in 9 led to the idea that the background in 9 could be morphed into the mother horse in S9. This process can also be evidenced between S11 and 12 (Fig. 2).

Similar form and semantic association → morph forms. This is evidenced in 3-S3-4 (Fig. 7), where a semantic association between the road and the bridge, and similar forms (notice that they are similar but not the same), led to the idea that they could be joined by morphing one into the other.

Form-based continuation. This is different from extend elements above in that the continuation is based on shape and shade only, and does not involve meaning. This is seen in S3 and 4 (Fig. 7).

Form-contrast → concept-contrast. This is illustrated by 7-S7-8 (Fig. 4). The contrast between a richly detailed sketch (8) and a plain sketch (7) suggested a 3-d vs. 2-d contrast.

Form-similarity → unifying concept. In 12-S12-13 (Fig. 3), form-similarity between the wall on the top right of 12 and the ledge around the fountain on the bottom left of 13 suggested the idea of a Möbius strip.

Emphasize concept. In 5-S5-6 (Fig. 5), different planes (horizons) in 6 were incorporated in S5 and emphasized. This is similar to copy elements and transform, except that the element is a concept rather than a concrete object.

Figure 8: Eighth strip

Meme: A Representation for Ideas

In order to represent all these micro-processes, we propose to use the formalism of the meme, which was popularized by Richard Dawkins in his celebrated The Selfish Gene (Dawkins 1989). Memes are the cultural counterparts of genes, and represent ideas that can be generated, be passed on, get transformed, be combined with each other, and die out. As we observed many similar operations and interactions among the micro-processes in connecting two pictures, we chose the meme as the unit of representation for modeling. In our particular domain, a meme can be an element like a swan, a horse, or a building.
It is a particular element, so it carries specific attributes. In other words, the horse meme that plays a key role in 9-S9-10 (Fig. 1) is not the general concept of a horse, but carries concrete attributes like the shade and the shape of the horse that was drawn in 10. There can also be conceptual memes, for example, 'horizon tilted to top-right', '3-d rendering', or 'dark shade'. Such memes represent specific operations or attributes that can be imparted to an element or a scene. It is possible to have generalized memes and to organize them in a hierarchy. So, for example, there can be a 'horse' meme of which the horse meme of 10 (Fig. 1) would be an instance; or there can be a 'tilted horizon' meme, which would be a parent of the 'horizon tilted to the top-right' meme. But for the time being we are not considering such general memes.

The following actions can be carried out on individual memes:

Copy or replicate. In this case the element is copied as it is, or the concept is applied as it is. So a swan is copied with all its attributes intact, or the horizon can be tilted toward the top-right corner of the picture for a part of the scene that is selected.

Copy with transformation. In this case, the element is copied, but one or more of its attributes are changed along the same dimension. For example, its size can be made bigger or smaller, its color or shade can be changed, its orientation can be changed, and so on. For a concept meme, some of its parameters are changed during application; for example, 'horizon tilted to top-right' can change to 'horizon tilted to top-left'.

Two memes can also interact with each other, and we specify the following four modes of interaction:

Swap attributes. Two memes can swap each other's attributes. We saw an example of this in Fig. 8, where the pose and the 'object-held' attributes of two people were exchanged.

Overwrite attribute. In this case the attribute of one meme overwrites the attribute of the other meme. So, for example, the size or the color of one meme can be rendered according to the other meme. This is illustrated by an instance in the Osaka workshop, where the silhouettes of cliffs were made to conform to the silhouettes of buildings.

Unify. This allows two memes to bond together and act as one meme. Any common attributes of the two become the attributes of the unified meme, and in addition some extra attributes may be created based on the spatial or other relationships between the two. This is similar to the grouping operation in many graphic editors.

Create a new meme. This allows the creation of a new meme with attributes inherited from each of the parent memes.

There are a number of other features that we are not considering at the moment. For example, it may be possible for a meme to activate another meme. We saw an example of this in our observations above, when the Möbius strip idea was suggested by the similarity in form between the wall and the fountain ledge (Fig. 3). However, in order to model this mechanism, we need to have some kind of global associative knowledge network.
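To fix ideas, here is a minimal Python sketch of the meme representation and a few of the operations just listed. The attribute dictionary and the operation signatures are our own illustrative assumptions; the paper does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field

@dataclass
class Meme:
    name: str                              # e.g. "horse" or "horizon tilted to top-right"
    attributes: dict = field(default_factory=dict)

    def copy(self) -> "Meme":
        """Copy or replicate: reproduce the element with its attributes intact."""
        return Meme(self.name, dict(self.attributes))

    def copy_transformed(self, **changes) -> "Meme":
        """Copy with transformation: change attributes along the same dimension."""
        new = self.copy()
        new.attributes.update(changes)
        return new

def swap_attribute(a: Meme, b: Meme, key: str) -> None:
    """Interaction: two memes swap one of their attributes."""
    a.attributes[key], b.attributes[key] = b.attributes.get(key), a.attributes.get(key)

def unify(a: Meme, b: Meme) -> Meme:
    """Interaction: bond two memes so they act as one."""
    return Meme(f"{a.name}+{b.name}", {**a.attributes, **b.attributes})

# Example: the horse of sketch 10 copied into S9 and enlarged into the parent horse.
horse = Meme("horse", {"shade": "pale", "size": "small"})
parent_horse = horse.copy_transformed(size="large")
```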
A Memetic Architecture

We are implementing a meme-based system to model the process of creating the intermediate picture. In particular, our system incorporates the following features: 1) modeling of visual attention to identify prominent elements or areas in the neighboring pictures; 2) specifying memes for spatial relationships among the picture elements; 3) specifying memes for general techniques like extension and continuation; and 4) various heuristics for choosing among competing memes. For lack of space, and also as our system is currently being implemented, we limit ourselves to pointing out that we are exploring two approaches to generating the connecting image for a given pair of images:

• Evolutionary algorithm: the two images are digitized, and potential solutions are generated stochastically from them with the use of crossover and mutation (Michalewicz 1998). The formulation of the fitness function should take into consideration the similarity of a potential solution to both of the images. We also plan to incorporate aesthetic criteria in the fitness function (Norton, Heath, and Ventura 2010). A sketch of this variant is given below.

• Agent-based approach: more complex approaches utilizing multi-agent notions (Byrski and Kisiel-Dorohinicki 2005) bring interesting additions to the process, as autonomous individuals, as agents are, may utilize other means to evaluate the resulting images, and may choose different crossover and mutation operators in an intelligent way to apply them to the current solution.

Both approaches may leverage concepts well known from memetic computation - local search (Moscato and Cotta 2010) - thus applying a number of mutation operators (instead of only one) before final evaluation.
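The following Python sketch shows the evolutionary variant under our own simplifying assumptions: images are same-sized arrays of pixel intensities in [0, 1], and fitness is plain pixel-wise similarity to both neighbours. The authors' actual fitness function, which is also meant to include aesthetic criteria, is still under design.

```python
import random
import numpy as np

def fitness(candidate, left, right):
    """Reward pixel-wise similarity of the candidate to both neighbouring images."""
    return -(np.abs(candidate - left).mean() + np.abs(candidate - right).mean())

def crossover(a, b):
    """Column-wise blend: the left part from a, the right part from b."""
    cut = random.randrange(1, a.shape[1])
    return np.concatenate([a[:, :cut], b[:, cut:]], axis=1)

def mutate(img, rate=0.01):
    """Replace a random fraction of pixels with noise."""
    mask = np.random.rand(*img.shape) < rate
    return np.where(mask, np.random.rand(*img.shape), img)

def evolve(left, right, pop_size=20, generations=100):
    pop = [mutate(crossover(left, right), rate=0.1) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, left, right), reverse=True)
        parents = pop[: pop_size // 2]                     # truncation selection
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=lambda c: fitness(c, left, right))
```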
Relation with Previous Research

Needless to say, the ideas and the architecture presented here build on a number of existing and past research efforts to model different aspects of creativity. The origin of the parallel, competition-cooperation architecture can be traced back to Selfridge (Selfridge 1959). Subsequently, Lesser et al. (Lesser, Fennell, and Reddy 1975) formalized it as the blackboard architecture and used it for speech recognition; and in our earlier research (Indurkhya 1997) we used a similar approach to model creativity in legal reasoning. Hofstadter and his colleagues (Hofstadter 1995) proposed a parallel terraced scan architecture for modeling creativity in analogical thinking, and our approach outlined above is heavily influenced by their work. One key point of difference is that a meme is more like an agent that carries its own data with it, unlike a knowledge source in the blackboard architecture or a codelet in Hofstadter's architecture. The system proposed above also draws on the meme media architecture of Tanaka, Fujima, and Kuwahara (Tanaka and Kuwahara 2008). They have developed the C3W wrapper framework that allows the user to open a web application page, clip some input and output portions as pads, and link them with pads clipped from other web applications. A number of approaches have been developed for applying evolutionary algorithms to generate visual art (Sims 1991; Lewis 2007; Machado, Romero, and Manaris 2007), but their goal is to generate aesthetically pleasing visual objects. In the long run, it may be possible to use some of these techniques by incorporating constraints from the neighboring picture objects to generate novel but related picture objects for the connecting picture. As for systems that generate constrained visual objects or scenes, there has been some research on automatic collage generation (Krzeczkowska 2009) and on completing a partially drawn picture in the intended style (Colton 2008), and some of the techniques developed therein can be exploited in our system as well.

Conclusions and Future Research

We analyzed data from the Artist's verbal recollection of his thoughts as he drew the middle pictures to connect pairs of pictures seamlessly. From this analysis, we identified a number of micro-processes that led to the big-picture idea. We described a memetic approach to formalize these micro-processes, and outlined an evolutionary-inspired approach to support the process of generating the connecting picture. We are also interested in studying the cognitive processes of viewers as they look at the trio of pictures. It has been noted in the past that surface-level perceptual similarities influence how viewers connect pairs of images and relate them conceptually (Indurkhya et al. 2008). It would be interesting to see how this process is affected when there is an intervening picture in the middle. We plan to conduct behavioral and eye-tracking experiments to measure viewers' responses and incorporate those observations into our model.

2012_28 !2012 Creatively Subverting Messages in Posters

Lorenzo Gatti¹, Marco Guerini¹, Charles Callaway¹, Oliviero Stock², Carlo Strapparava²
¹ Trento-Rise, ² FBK-Irst, Via Sommarive 18, Povo, Trento, Italy
{l.gatti, marco.guerini, c.callaway}@trentorise.eu, {stock, strappa}@fbk.eu

Abstract. Creativity is widely used in advertisements, and is meant to be appreciated by people. However, creativity can also be used as a defense. When we walk in the street we are overwhelmed by messages which try to get our attention with any persuasive device at hand. As messages get ever more aggressive, often our basic cognitive defense - trying not to perceive those messages - is not sufficient. One advanced defensive technique is based on transforming the perceived message into something different (for instance irony or hyperbole) from what was originally meant. In this paper we describe an implemented application for smartphones that creatively modifies the linguistic expression in a virtual copy of the poster. The mobile system is inspired by the subvertising practice of countercultural art, and aims at producing an aesthetic pleasure that relaxes the cognitive tension of the user.

Introduction

We are surrounded by linguistic expressions on the walls around us. Whenever we walk along a street, posters, signs, and other similar advertisements are there trying to attract our attention and, in most cases, trying to influence our actions, beliefs, and behavior. We may try to avoid those ads, but it is not easy: even if the characteristics of our perceptive and cognitive system partially help us in being "banner blind" (Pagendarm and Schaumburg 2001; Burke, Gorman, Nilsen and Hornof 2004), pervasive advertising often manages to overcome our barriers (Müller, Alt and Michelis 2011). One strategy to counter messages that forcefully grab our attention is to use our cognitive system to fight back and creatively alter the advertising message itself. This form of "reactive" creativity lies at the root of various phenomena, including some aspects of verbal humor, especially irony.
The psychoanalytical approach to humor (Freud 1905) gives an attractive account of the release of energy that results from overcoming our inner censors through the appreciation of humorous expressions. A similarly liberating process can be attributed to other types of variations of linguistic expressions. From an aesthetic point of view, a given variation is more highly appreciated if the change is limited, as suggested for instance by the optimal innovation theory (Giora 2003). When we humans entertain this creative, reactive modality to defend ourselves, we tend to intervene within our minds. Sometimes people even intervene on the physical object itself, the classic example being the poster: writing over it to correct an expression (or even adding graphic symbols to images, such as moustaches added to a face). In countercultural art this is called subvertising. As far as current technology is concerned, a lot of attention is being devoted to figuring out how to exploit smartphones for advertising. Here, instead, we propose a mainly defensive goal on behalf of the consumer. The aim is to exploit technology for producing linguistic expressions that slightly change the observed advertisement, so as to accommodate a message that is biased in a rather different direction. The system produces a new virtual poster that helps the user relax the cognitive tension produced by the unduly attention-grabbing original message. In particular, we have developed a mobile application that allows users to take a picture of a poster, and then automatically produces a new virtual version with the same layout and visual aspect as the poster, but with a creative variation of the linguistic expression it originally expressed. In our current prototype the user merely needs to point the camera of the smartphone at the poster, and the image, with the same appearance but an altered linguistic expression, is produced in a few short steps. An image analysis and reconstruction component takes care of the graphic aspects, and an underlying program is called to obtain the actual variation of the given expression, which can have several different realizations. In this paper we utilize just one of the functionalities of VALENTINO, an affective valence-shifting program (Guerini, Strapparava and Stock 2011); the creativity involved in the process is a necessary element for the successful impact of this defensive tool.

Background and Relevant Work

The word subvertising is a portmanteau of the words "subvert" and "advertising". Subvertising refers to the practice of making spoofs or parodies of corporate and political advertisements in order to make a statement. This can take the form of a new image, an alteration to an existing image, or a modification/re-contextualization of an existing slogan (sometimes called a "meme hack"). According to AdBusters, a Canadian magazine that is a leading proponent of counter-culture and subvertising, "A well produced 'subvert' mimics the look and feel of the targeted ad, promoting the classic 'double-take' as viewers suddenly realize they have been duped. Subverts create cognitive dissonance." In our work we focus on the creative textual modification task of subvertising. In particular, we want to implement a defense strategy for making the user aware of the subtle presuppositions implicit in advertising messages, by using exaggeration (or hyperbole) of the affective content of the message.
The main resource used to implement such a defensive strategy is the VALENTINO prototype, a tool for the affective modification of existing texts. Affective variations of pre-existing texts have been studied and implemented in various domains, see for example (Mateas, Vanouse and Domike 2000; Guerini, Strapparava and Stock 2008b, 2011), as have similar humorous variations (Stock and Strapparava 2003). The effectiveness of affective variations has also been assessed; in particular, Van Der Sluis and Mellish's (2010) evaluation shows that biased variations of a message work better than the neutral condition. With regard to output quality, Whitehead and Cavedon (2010) demonstrated that adding bigram frequencies for the insertion of valenced modifiers (chosen according to a MAX function) significantly improves the perceived quality of the resulting texts.

Valentino

VALENTINO can modify existing textual expressions towards more positively or negatively valenced versions, given a numeric coefficient that represents the desired valence shift for the final expression. Since the system works in an open domain and without lexical restrictions, VALENTINO's linguistic resources are general purpose, and automatically built from large-scale corpora and English lexical repositories. For the task of modifying single words, we automatically built a resource that gathers these terms in vectors (Ordered Vectors of Valenced Terms, OVVTs). We used the WordNet antonymy relation as an indicator of terms that can be "graded", and built four groups of terms that can be used (one group for each POS). Moreover, we populated the vectors using other specific WordNet semantic relations (the similar_to relation for adjectives, the hyponym relation for verbs and nouns). Finally, the valence of WordNet synsets, taken from SentiWordNet scores (Esuli and Sebastiani 2006), was added to the corresponding lemmata. An example OVVT for the antonymy pair (ugly ↔ beautiful), ordered from most negative to most positive, is:

(hideous ... ugly ... unnatural) ↔ (pretty ... beautiful ... gorgeous)

For the insertion or deletion of words that play the role of downtoners or intensifiers, we created specific OVVTs (which we call Modifier-OVVTs). In this case the words were gathered according to a criterion of contextual, rather than semantic, connection: we used the Google Web 1T 5-Grams Corpus (Brants and Franz 2006) to extract information about co-occurrences. In particular, we created resources connecting terms with their modifiers (according to POS), thus obtaining adjective modifiers for nouns, and adverb modifiers both for adjectives and verbs. An example Modifier-OVVT for the term "dish", ordered from most negative to most positive, is:

(disgusting ... mediocre ... tasty ... delicious ... exquisite)

Strategies

We undertook a preliminary qualitative study with human subjects to understand how people modify the valence of existing texts. The insights gained showed that: (a) people usually modify single words, (b) they sometimes add or subtract words that play the role of downtoners or intensifiers, and (c) they sometimes use paraphrases (Guerini, Strapparava and Stock 2008b). As a first step, VALENTINO performs POS tagging, named entity recognition, morphological analysis, and chunking of the existing constituents (NPs, VPs, ADJPs, and so on). This task exploits the TextPro package (Pianta, Girardi and Zanoli 2008). Subsequently, the strategies described in points a), b), and c) above are applied to the chunks, following some general guidelines.
Minimal variation: texts (chunks) are slanted as much as needed, but the target score should not be exceeded, limiting the variation as much as possible.

Modification of dependents: a constituent is modified considering first the dependents (from left to right) and then possibly the head. Consider the very positive and the slightly negative variations of the following sentence:

"We ate [a very good dish]NP"
"We ate [an incredibly delicious dish]NP" (+)
"We ate [a good dish]NP" (-)

The rationale is that in a constituent the element that bears the greatest part of the meaning is the head, and this decreases the further we move into the constituent.

Candidates selection: the selection of substitutes is a two-step process. Given a term to be modified (e.g. "good" in the example), there can be various candidates for the modification.

• The first step requires filtering out all the terms that do not meet the target score. For example, if the target score is higher than +0.5, all terms from -1 to +0.5 are discarded. Further possible constraints can be taken into account (e.g. if the reasoning is about "good dish", then only the terms similar_to "good" that co-occur with "dish", and with score > 0.5, should be kept).

• Various strategies can then be used for choosing the best candidate: word persuasive impact (Guerini, Strapparava and Stock 2008a), word or n-gram frequency (Whitehead and Cavedon 2010), mutual information, etc. Currently, the measure most used in VALENTINO is the pointwise mutual information score, which yields modifiers specialized for the given term (e.g. "delicious" co-occurs less frequently than "nice" with "dish", but it is more specialized in this context).

As for metrics that help decide the best-quality lexical choice, while we have so far converged on the mutual information measure, we think different measures should be applied in different situations (although this is outside the scope of this paper). Furthermore, specific n-gram patterns - see for example (Veale 2011) on extracting semantically exaggerated variations - are under development. In the present scenario, the critical choice was deciding the suitable degree of affective modification amongst those proposed by VALENTINO, i.e. which one is best for obtaining a defensive effect. In fact, light modifications usually obtain the effect of strengthening the message, while stronger ones can weaken it (Guerini, Strapparava and Stock 2012). Obviously, strengthening the message is not the aim of the present tool, which is why we chose maximum target scores for the affective exaggeration strategy in subvertising.
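As an illustration, the two-step candidate selection could look like the following Python sketch. Here an OVVT is assumed to be a list of (term, valence, PMI-with-head) tuples; this layout, the function name, and all numeric values are our assumptions for exposition, not VALENTINO's internal format.

```python
def select_substitute(ovvt, target_score, positive=True):
    """Step 1: filter by the target valence; step 2: rank by PMI with the head."""
    if positive:
        candidates = [c for c in ovvt if c[1] > target_score]
    else:
        candidates = [c for c in ovvt if c[1] < target_score]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[2])[0]   # most specialized modifier

# A toy Modifier-OVVT fragment for "dish" (valence and PMI values invented):
dish_ovvt = [("mediocre", -0.4, 1.2), ("nice", 0.6, 0.9),
             ("tasty", 0.6, 2.0), ("delicious", 0.8, 3.1)]
print(select_substitute(dish_ovvt, target_score=0.5))   # -> "delicious"
```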
In a typical scenario, the user is walking with friends in a city, perhaps shopping or going to see a movie. When he is interrupted by a poster advertisement that bothers him, he uses his phone to take a snapshot of it, modifies it with SUBVERTISER, and can then show the new ad to his friends.

Algorithm

Behind the scenes, SUBVERTISER performs a number of steps to process both the language and the image of the advertisement. Given the photo taken by the user with the phone's camera (the image can also be chosen from a pre-existing library of images), the user selects the text region he wants to change, containing the advertising message, by moving and resizing a selection rectangle (Figure 1). The image area is then passed to an OCR application on the smartphone itself, which scans for text within that rectangle.1 The OCR both detects the coordinates of the bounding boxes for every individual word and returns the recognized text string of the message. From the bounding box information we obtain the rectangle containing the first line of text, which is then scaled down to 100 pixels in height and uploaded to an online third-party (multi-step) font recognition service2 using dedicated APIs. Meanwhile, the program applies an inpainting algorithm to each bounding box in the original text zone. This step reconstructs the background image that was underneath the original text, providing a blank background where new text can be written (Figure 2). The user is then asked to correct OCR errors, which if left unchanged would lead to linguistic errors in the valenced text, via a text entry box on the smartphone. VALENTINO is queried with the corrected OCR text string, and four valenced sentences are returned, ordered from the most positive to the most negative, and presented to the user (Figure 3) to choose from. Once we know the original text, we also send that information to the font recognition server, which needs to align known letters with the image in order to determine the font, and then responds with that information. Once the user selects one of the slanted messages, an algorithm decides how to divide the slanted text into lines, since VALENTINO typically changes the number of words in the sentence. Then, since we often may not have access to the original font due to limitations of the smartphone, we ask the online font service to generate a new image with the detected font and a transparent background, and the image containing the new text is downloaded to the phone. We next identify the original text color by looking at the first line of text in the original image, and treat that as the color for the whole text (even if the ad is written in multiple colors), to save processing time. This identification is performed by clustering the colors of the pixels into two groups with k-means, then treating the two means as the colors of text and background. Finally the new message is copied inside the bounding boxes of the original text. The image is shown onscreen (Figure 4) and the user can save it to the image library or share it by mail or MMS. SUBVERTISER is currently implemented on the iPhone. An Android version is under development.

1 As much processing as possible is done directly on the mobile to avoid excessive bandwidth usage and associated costs, and to lower the time needed to complete the task. Image processing is done with the OpenCV library, while character recognition is provided by the Tesseract OCR engine; both are open source.

2 URL: www.myfonts.com/WhatTheFont/
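The k-means color-identification step just described can be illustrated with a short OpenCV sketch: cluster the pixels of the first text line into two groups and take the cluster means as the text and background colors. This is a minimal reconstruction under our own assumptions (e.g. that text covers fewer pixels than background), not SUBVERTISER's actual code.

```python
# Minimal sketch of the k-means color identification step described above.
# Assumes OpenCV (cv2) and NumPy; 'line_region' is the cropped BGR image of
# the first line of text. Not SUBVERTISER's actual implementation.
import cv2
import numpy as np

def text_and_background_colors(line_region):
    """Cluster pixel colors into two groups and return the two cluster means.
    The cluster covering fewer pixels is assumed to be the text color."""
    pixels = line_region.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, 2, None, criteria, 5,
                                    cv2.KMEANS_RANDOM_CENTERS)
    counts = np.bincount(labels.ravel(), minlength=2)
    text_idx = int(np.argmin(counts))      # text usually covers fewer pixels
    return centers[text_idx], centers[1 - text_idx]

img = cv2.imread('ad_photo.jpg')           # hypothetical input photo
text_color, bg_color = text_and_background_colors(img[40:80, 10:300])
```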
The only external resources needed are the font recognition service and VALENTINO's server.

Further work

From the technical, NLP processing point of view, the basic capabilities of VALENTINO are currently being expanded. First, we are now starting to take into account the sentence structure, so that the intervention can be better focused. An analysis of the rhetorical aspects of a given sentence will also benefit the quality of the intervention. As for metrics that help decide the best-quality lexical choice, we have so far converged on the mutual information measure (see above), although we think different measures should be applied in different situations. More generally, we can note that an extended VALENTINO could be parameterized to achieve different goals, including: (a) generic valence shifting; (b) focused biased language to influence the audience's view of one element (e.g. a person or a thing) in the sentence; (c) "cleansing" of biases present in the original expression; or (d) special effects, such as ironic or hyperbolic reconstructions. As for point (b), specifically for evaluative expressions, subjective evaluations can be along different dimensions: the ethical aspect (related to moral values), the epistemological aspect (related to truth), the aesthetic (related to beauty or pleasure), and the utilitarian (related to utility, resources, results); these can be especially reflected in the lexical choice. The aim is also to link a set of preferences and information about the context to the system. For instance, audience preferences can shift the system's behavior in line with the user's attitude; also, independent preferences (e.g. originating from a social institution) might produce expressions that could influence the audience in a specific direction. As for the overall mobile application we have described, some technical improvements, like automatic spell checking of the OCR output, would enhance the app's usability. We would like to mention that additional uses can be envisaged, apart from defense against unwanted advertising expressions. In a sophisticated but not unusual twist of fate, the same technology can be used by the advertisers themselves. A new form of promotion could be based on an active role on the part of their target, which, by adding a creative variation, contributes to the reinforcement of the basic advertising goal. Another prospect is in an artistic direction. For instance, the mobile application could be monitored on a large display by a crowd at an exhibition, where the audience could see posters in the city being continuously and automatically changed by different individuals walking around with their smartphones, so as to counter the messages on the walls and introduce a collective sense of liberation. Another setting is mobile games, where the user may interact with existing linguistic expressions to produce anagrams, wordplay and so on.

Conclusions

Creativity is widely used in advertising, which must appeal to people from all walks of life in every imaginable situation. But advertising also tries to change people's actions, beliefs and behavior, which they rightfully resist as an invasion of their time and attention. The resulting conflict leads to increasingly pervasive, aggressive and frequent advertisements on one hand, and on the other to a conscious refusal to pay attention to those ads, or a profaning transformation of the message.
Inspired by the latter, a system, even if just based on some degree of combinational creativity (Boden 2009), can aid people in defending themselves against elements in their environment. We thus built SUBVERTISER, a creative subvertising system, which assists consumers in proactively "taking back" their daily outdoor routine. SUBVERTISER allows consumers to use the power of satire and virtual profaning to push back at advertisers. By combining the utility and pervasiveness of smartphones with the capability of the VALENTINO affective valencing system, consumers can take a picture of an advertisement in public that offends them, select the wording they want to change, use VALENTINO to supply them with language variations that subvert the intended message, and then modify the advertisement with their chosen variations to look just like the original. By sending the new version to their friends, they can join in a collective release of tension from the perpetual barrage of advertisements.

2012_29 !2012 Crossing the Threshold Paradox: Modelling Creative Cognition in the Global Workspace Geraint A. Wiggins Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary University of London Mile End Road, London E1 4NS, UK geraint.wiggins@eecs.qmul.ac.uk

Abstract. I present a hypothetical global model of everyday creative cognition located within Baars' Global Workspace Theory, based on theories of predictive cognition and specific work on statistical modelling of music perception. The key idea is a proposal for regulating access to the Global Workspace, overcoming what Baars calls the Threshold Paradox. This idea is motivated as a general mechanism for managing the world, and an argument is given as to its evolutionary value. I then show how this general mechanism produces effects which are indistinguishable from spontaneous creative inspiration, best illustrated by Wallas' (1926) "Aha!" moment. I argue that W. A. Mozart's introspective account of compositional experience closely matches the proposed process, and refer to a computational system which will form the basis of an implementation of the ideas, for musical composition.

Introduction

Computational Creativity is mired, practically speaking, in the problem of evaluation. Artefacts created by computer cannot be judged by the computer's aesthetic, for that is obscure, and evaluating them in terms of human aesthetics has been shown to be unreliable due to negative preconceptions (Moffat and Kelly, 2006). One solution to this might be to compensate for that bias statistically, given the necessary models. Another is to avoid the issue of artefact evaluation altogether, and to focus on process, and on building systems that apply it. Colton (2009) catchily entitles this point "Paradigms Lost", making the point that AI sometimes overtheorises, and paints itself into a corner by applying problem-solving methods to a domain in the abstract, instead of getting on and building something that concretely explores it; the subtext may be that this tendency arises from rigour envy. Colton raises a point that benefits from emphasis: he finishes the section with "the production of beautiful, interesting and valuable artefacts", and this occludes the key point in the final sentence: "the need to embrace entire intelligent tasks" (my italics). The vexed question is "how?"
Modelling an entire, novel creative process, evaluation, reflection, and all, in the abstract leads us back to the initial problem: the only way to judge it from outside is in terms of its outputs (Ritchie, 2001). The rest of the paper is structured as follows. First, I explain the theoretical methods used, and make an important distinction between what I shall call inspiration and creative reasoning; the current proposal addresses only the first of these. Next, I describe some of the extended background to the thinking presented here, in terms of the surrounding and supporting cognitive theory, including an apparent inherent paradox identified by its creator. Then I present the evolutionary argument for the theoretical stance taken here, and derive from it the (simple) principles on which my proposal is based. Next, turning to implementation, I summarise earlier modelling work, explain its connection with the current proposal, and describe what is necessary to extend it into the model proposed here. The technical contributions of the paper are a variant notion of AI agent, based on prediction from sense data, rather than on sensing, and a mechanism for deployment of that agent in a particular kind of reasoning system. The key philosophical contribution is the fact that, once this mechanism is deployed, the kind of creativity that is addressed here, inspiration, is explained within the basic reasoning, and needs no further explanation.

Methodology

To overcome the methodological problem introduced above, my approach is to attempt to replicate an existing creative process. The only existing creative process readily available for inspection is that of humans; these have the built-in advantage, mostly, of being able to explain what (they thought) they did, and elegant paradigms exist to empirically deconstruct the majority of aspects of human behaviour of which introspective reports are unreliable. I therefore aim to apply cognitive modelling theory and technology to the human creative process, and then to evaluate the success of the enterprise not only with respect to the outputs of the computational systems produced, but also by comparing the various aspects of their operation with human creators. While this approach solves only part of the general problem of computational creativity, it is an area where refutable hypotheses can be made, and so demonstrable progress in a research programme (Lakatos, 1970) may take place. For this attempt to succeed in a scientific sense, before one even considers the artefacts that the replicant creative system may produce, the theory and its associated computational system must conform to at least the following constraints, to be said to model human creative cognition.

1. Falsifiability. The system must not behave in ways which are arguably or demonstrably different from human creators while it is operating. Since we cannot, currently, know how human creators create, this is the strongest falsifiability constraint that can be applied.

2. Evolutionary context. There must be an account of the evolutionary advantage conferred by the mechanisms proposed, a corresponding order of development, and an analysis of their appearance in successive species over evolutionary time. This account cannot be verifiable, but the lack of one leaves the biological development of the proposed solution unavailable to scientific scrutiny.

3. Learning capability. The system must be capable of learning its creative domain.
Learning should be appropriate to the domain: for example, in music, perceptual aspects should be implicit, that is, teaching or supervision should not be required; however, in some domains, such as mathematics, minimal supervision is evidently unavoidable, because of the need to know the meaning of symbols, to give semantics to what is being learned.1

1 To ask the system to learn the semantics of the symbols to which it is exposed from context is not, in principle, unreasonable, as there is every evidence that humans do so. However, to require the system to do so when the scientific research focus is creativity seems unnecessarily difficult.

4. Production capability. The system must be able to produce artefacts that are demonstrably within its creative domain, whether or not they are of quality comparable with a human creator's output. While the judgement of whether an artefact is or is not a particular kind of thing is subjective, it is not as difficult as the subjectivity of quality. For the purposes of experiment, restricted domains with clear tests must be set up, using appropriate theory from the corresponding human-creative domain.

5. Reflection. The system must be capable of reflecting on its behaviour, modifying it, and explaining it, where necessary via indirect indicators such as those used for understanding the behaviour of humans.

In this paper, I present a hypothetical, but partly implemented, computational model of a particular kind of human creativity, and suggest that it conforms to criteria 3-4, and partly to criterion 2, though further research is required to provide more evidence against criterion 1. Criterion 5, Reflection, is conferred by the location of the model within Baars' (1988) Global Workspace Theory, whose focus is consciousness; so it falls beyond the scope of the present proposal.

Background

Creativity: Inspiration and Reasoning

Wiggins (2012) introduces a distinction between two kinds of creativity: on one hand, inspiration and, on the other, creative reasoning. Respectively, these terms are intended to distinguish what appears spontaneously in consciousness, the "Aha!" moment that Wallas (1926) suggests follows "incubation", from what is produced by the deliberate application of creative method. The spectrum between the two allows us to make distinctions between the conscious creation in the deliberate planning of a formalist composer, the semi-spontaneous but cooperative and partly planned creation of the jazz improviser in a trio, and entirely spontaneous singing in the shower. Note that a non-polar position on this spectrum necessarily entails a combination of explicit technique and implicit imagination: there is not a smooth transition in kind between the two, but rather a mixture containing some of each in varying proportion. Having made this point, I reserve creative reasoning for future work, not least because it entails that we address consciousness, which is difficult, but also because Baars' theory already provides a framework in which it may be considered, given a mechanism for inspiration. This is not to dismiss the deliberate end of the scale, nor to suggest that it does not exist, but merely to focus the current work on a separable aspect of the complex.

Global Workspace Theory

Bernard Baars (1988) introduces a theory of conscious cognition called the Global Workspace Theory. There is not space to describe this wide-ranging and elegant theory here, so I summarise the relevant important points.
The theory posits a framework within which consciousness can take place, based around a multi-agent architecture (Minsky, 1985) communicating via something like an AI blackboard system (Corkill, 1991), but with particular constraints, which I outline below. The approach taken is to avoid Chalmers' "hard" question of "what is conscious?" (Chalmers, 1996) and instead ask "what is it conscious of, and how?" This is especially appropriate in cases such as the current paper, where consciousness is not the central issue, but presentation of information to it is. Baars casts the non-conscious mind as a large collection of expert generators (not unlike the multiple experts in Minsky's Society of Mind, 1985), which perform tasks by applying algorithms to data in massively parallel fashion and compete for access to a Global Workspace via which (and only via which) information may be exchanged; crucially, information must cross a notional threshold of "importance" before it is allowed access. The Global Workspace is always visible to all generators, and contains the information of which the organism is conscious at any given time. However, it is capable of containing only one "thing" at a time, though the scope of what that "thing" might be is variable. The Global Workspace is highly contextualised, and meaning contained therein is context-sensitive and structured; contexts can contain goals, desires, etc., of the kind familiar from broader AI. Aside from further discussion of the "threshold" idea, below, this is all that is needed to understand the purpose of the competition mechanism proposed here. Baars mentions the possibility of creativity within this framework in passing, implicitly equating the entry of a generator's output into consciousness with the "Aha!" moment (Wallas, 1926). However, he does not develop this idea further beyond noting that a process of refinement may be implemented as the cycling of information into the Workspace and out again; that process may be equivalent to my creative reasoning.

Figure 1: Illustration of Baars' Threshold Paradox. Generators generate, but need a means of recruiting support for their outputs. Individuals cannot break in; they must recruit coalitions, as shown. The only way to do so is via the Global Workspace, but before they can do so, they need the support they are trying to recruit, and therein lies the paradox.

To the best of my knowledge, however, creativity in the Global Workspace has not been addressed elsewhere in the related literature. In the later developments of the theory, Baars proposes that information integration may take place in stages, via something that one might (but he does not) call local workspaces, which integrate information step by step in a sequence, rather than all in one go as it arrives in the Global Workspace. This information integration approach has been extended by Tononi and Edelman (1998), who propose information-theoretic measures of information integration as a measure of the consciousness of an information-processing mechanism. Baars has embraced the information-theoretic stance, too, and the three authors have jointly proposed to begin implementing a conscious machine (Edelman, Gally, and Baars, 2011) based on their ideas. The current work may contribute to this endeavour, though probably at a level more abstract from neurophysiology than these authors intend.

The Threshold Paradox

Baars (1988, pp. 98-99) addresses what he acknowledges is a problem for his theory.
He proposes a threshold for input access to the Global Workspace, the crossing of which is thought of in terms of recruiting sufficient generators to produce information that is somehow coordinated, or synchronised, between them: it must be metaphorically "loud" enough to be "audible" in the Workspace. However, in terms of the Global Workspace alone, there is no means of doing this: generators can only be coordinated (whatever that means) via the Global Workspace, and so the generators are faced with the beginning artist's dilemma: you have to be famous to show your work, but you have to show your work to become famous. This form of the Workspace is illustrated in Figure 1. Baars presents two possible solutions to the paradox, which is the motivation of the current paper, but both are presented somewhat half-heartedly, leaving a gap in the theory. Here, I present a possible solution, in terms of the evolutionary argument required by my criterion 2, above.

Perception, Anticipation, and Evolution

Reaction vs. Anticipation

I now present a mechanism for managing the competition between generators in Baars' system. This mechanism may be implemented either directly or indirectly (that is, by means of some other effect); the difference is immaterial at the current theoretical level. The key distinctions are (a) between the information content and entropy (defined below) of various stimuli; and (b) between organisms that react and organisms that anticipate. The design of this mechanism is motivated by evolutionary thinking: that is, by consideration of the evolutionary advantage conferred by the resulting behaviours, in humans and other animals. Thus, the evolutionary argument presented here is part of the design, not merely an example. Russell and Norvig (1995), in their well-known AI textbook, define an AI agent (of which an AI creative agent is presumably an instance) as a program or robot with a behaviour cycle that consists of perceiving the world and then acting on the perceptions. It seems not unreasonable to present this as a model of lower organisms, such as insects, which seem to do nothing more than react to environmental conditions, coping poorly when their evolved reactive program is interrupted. However, to model higher cognitive development, one can propose a more predictive system, in which an organism continually predicts, from a learned model of previous sensory data, what is likely to come next, and compares this with current sensory input. Doing so gives a simple but effective mechanism for spotting what is unusual and what, therefore, constitutes a potential new opportunity or threat and deserves cognitive resource, or attention. In the simplest case, the anticipatory agent can in principle avoid a threat before it becomes apparent, while the reactive one has to experience the threat in order to respond.

The consequence of sequence: managing uncertainty with expectation

The most important feature of an autonomous agent is not, as sometimes supposed in AI, that it is able to identify or categorise a situation from available data. What gives it the edge is that it can, in some sense, imagine what is to come next, and react, or perhaps preact, in advance.
Of course, the word "imagine" is loaded, and suggests the involvement of consciousness and even volition; I use it here deliberately, to draw attention to the point that consciousness need not be implicated in this process, which can be described in completely mechanistic terms, of prediction alone. In order to predict usefully in a changing world, it is necessary for an organism to learn. It must be able to learn not just categorisations (to understand what something is), but also associations (to associate the co-occurrence of events with reward or threat), and, crucially here, sequence. However, a simple statistical learning mechanism is not subtle enough (Huron, 2006). Since evolutionary success entails that an organism breeds, a mechanism which allows that organism to learn only from potentially fatal consequences does not suffice: if the organism dies as the result of an experience, it does not benefit from the experience (or, at least, not for long). An effective strategy here lies at a metalevel with respect to a learned body of experience: if an organism is aware that it is in circumstances that it cannot predict reliably, it can behave more cautiously, its metabolism can be aroused to prepare for flight, and it can devote more attention than normal to its surroundings; thus, the effective strategy is also affective. Huron convincingly argues that this process is exapted to produce part of the aesthetic effect of music; however, for the purposes of the current section, the mere adaptation suffices: self-evidently, there is a mechanism that allows uncertainty to affect behaviour in humans and other animals, and that mechanism does not rely on explicit reasoning. Indeed, the converse is the case: we feel nervous in uncertain situations, and the feeling serves to make us wonder why, as well as to heighten our attention to appropriate sensory inputs and to prepare for flight. This mechanism, and the associated affective response, is not the same as fear, but can lead there in extremis. Finally, any kind of learning of this nature is inadequate unless it includes generalisation. It is necessary to be able to generalise from both co-occurrence and sequence, so that similar consequences are expected to arise from similar events, encounters, etc. Without this, mere tension cannot lead to the fear that is appropriate at the sight of the bared fangs of a previously unexperienced large animal. This accords with proposals such as that of Gärdenfors (2000), that perceptual learning systems are motivated by the need to understand similarities and differences between perceived entities in the world, and to place observations at the appropriate point between previously experienced referents.

Prediction, Prioritisation and Selection

Given a model of the world, suitably subcategorised into types, situations, etc., one can imagine a set of generators using the model, with recent and current perceptual inputs matched against the precursors of sequential associations, to make predictions on a basis that is stochastic and conditioned by the model. Making such predictions quickly, one at a time, would be valuable but, given the nature of brains, slow; multiple predictions, in parallel, are a more likely candidate for evolutionary success, and the more the better, as in Baars' proposal. But this begs a question: arbitrarily many predictions occurring simultaneously will be an impossible, incomprehensible babble, so how will useful candidates for prediction be selected?
Baars' solution is the problematic threshold, described above. Another shortcoming of the Global Workspace Theory is unclarity about precisely what the notion of generators "recruiting" one another means. The effect is something like an additive weight: the more generators that are "recruited", the greater the impact of their output. In my proposal, we will avoid answering this question by approximating the effect of the recruitment rather more simply. I return to this below; in the argument that follows, I will use the analogy of sound volume to refer to this property: "loud" predictions come from many generators, "quiet" ones do not. My proposal here is based on statistical, frequentist notions of learning, and so my reasoning is couched in terms of statistical models; however, I do not think that the reasoning is in principle exclusive to such models, and it should not be supposed that the proposal is restricted in this way. In this view of the system surrounding the Global Workspace, there are many independent subsystems, which are making multiple predictions by biased sampling from a predictive statistical model of (assumedly) reasonable quality. It is also appropriate to assume imperfect models: each of these generating subsystems will have a fragmentary, partial view of its world and its predictions, as to model everything all the time in massive parallel would be prohibitively expensive. It follows from the use of frequentist models that the more expected occurrences are the more likely ones to be predicted: the commonest predictions will be the most expected ones. This means there are relatively "loud" groups of contributions, reinforcing each other. Conversely, extremely unlikely predictions will be proposed by only a very small number of generators, and as such will never be "audible". In a model of prediction and action based solely on this frequentist principle, an organism will tend to do the commonest thing, even when inappropriate, and therefore will be doomed to failure: it will not "imagine" unlikely and surprising situations, and will not therefore prepare itself against necessary eventualities. To see this, consider a territorial animal, on patrol, and let it be of a high enough species to learn its reactions. Today, our animal senses the things it usually senses, and the vast majority of things in the world today are the same as they were the last time it passed this way. One tiny difference is a scent that it does not recognise, that it has not experienced before. Since this difference is small in comparison to the rest of the data in the world, and it has not been experienced before, in purely frequentist terms it will be ignored: it is unlikely, it has no known consequences, and it determines little or no probability mass. In Baars' theory, the pure frequentist approach, where the most likely outcome is chosen, corresponds with multiple generators in coalition generating that outcome. The likelihood of each generator predicting an outcome is proportional to the "volume" of that outcome across the set of generators. Therefore, we can neatly draw a veil over the mechanistic gap left by Baars' idea of coalition formation, and simply use the likelihood of the outcome, p, to model its "volume". In reality, though, we know well that to carry on as normal will not be the reaction of an animal in these circumstances: it will experience Huron's proposed affective response, described above. Therefore, it is necessary to hypothesise a mechanism to cause that response.
In our current simple context of abstracted statistical modelling, the obvious choice for such a mechanism is the notion of entropy, as formalised by Shannon (1948). MacKay (2003) makes a distinction between information content, h, which is defined as an estimate of the number of bits required to describe an event, e, given a context, c (that is, its unexpectedness):

h(e \mid c) = -\log_2 p(e \mid c),

and entropy, H, which is defined as an estimate of the uncertainty inherent in the distribution of the set of events E from which that e might be selected, given the context, c:

H(c) = \sum_{e \in E} p(e \mid c)\, h(e \mid c) = -\sum_{e \in E} p(e \mid c) \log_2 p(e \mid c).

H is maximised when all outcomes are equally likely, and minimised when a single outcome is certain. Both h and H are useful to our hypothetical animal. First, consider h_t, the unexpectedness of a partial model of the actual on-going experience in a particular state, t. If the experience is likely (in particular, if it is readily predictable from what has gone before), it is not unexpected, and therefore h_t is low; if it is unlikely, it is unexpected, and so h_t is high. An experience such as encountering a new scent is maximally unlikely, in frequentist terms. To model this, I propose that individual generators are sensitive to their own h_t value, and decrease their notional "volume" when it is low. Thus, the likelihood of models of the experience in which the new scent is included being heard in the theatre is positively related (possibly in a non-trivial way) to its unexpectedness. I call this the recognition-h case. It may explain why unexpected things are noticed. Now consider h_{t+1}, the unexpectedness of a predicted situation. It is maximally unlikely that a prediction will be made including a scent that has not been encountered before, and, as above, we would therefore expect h_{t+1} to be very high, causing alarm. An excess of such predictions, or the repeated occurrence of a single one, would lead to a state of constant anxiety.2 I call this the prediction-h case. It may explain why surprising predictions are more likely to draw attention than prosaic ones. Of course, in a simplistic frequentist account, predictions introducing new percepts or concepts cannot arise, because they entail the creation of new symbols. This is why it is necessary to include generalisation and/or interpolation in the theory (see above). Gärdenfors (2000) presents a theory that explicates the symbolic representations more commonly used in statistical AI modelling in terms of an underlying, sometimes continuous, geometrical layer, and, at least at perceptual levels, places cognitive semantics at the centre of mind. In particular, an outline mechanism is supplied whereby previously unencountered stimuli may be assigned first non-symbolic, and then symbolic, representations. It is important to understand that the semantics in these theories are internal to the organism experiencing them, and have no definition in terms of the external world; rather, they have external associations, which can serve to allow intersubjective meaning, but they themselves are ineffable. The problem of over-active prediction-h is mitigated by the mechanism supplied above, in which prediction is probabilistic and (broadly) additive across predictors, modelled by p. There are two opposing forces here, one of which changes inversely relative to the other, and because they are co-occurrent, their effects should (broadly) multiply.

2 Indeed, some humans who suffer from anxiety, in the clinical sense, report intrusive, repetitive thoughts predicting problems or worries of one sort or another, the anxiety being aroused by fear of what might happen. Their situation would be explicable in terms of a breakdown of this mechanism.
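As a worked illustration of the two quantities just defined, the sketch below computes h and H for a toy conditional distribution modelled on the patrol example; the event set and probabilities are invented.

```python
# Worked illustration of MacKay's information content h and entropy H,
# exactly as defined above, for a toy conditional distribution p(e | c).
# The event set and probabilities are invented for illustration.
import math

def h(p_e_given_c):
    """Information content (unexpectedness) of one event: -log2 p(e|c)."""
    return -math.log2(p_e_given_c)

def H(dist):
    """Entropy of the whole outcome distribution: sum_e p(e|c) * h(e|c)."""
    return sum(p * h(p) for p in dist.values() if p > 0)

# Context: a patrol of familiar territory; 'new_scent' is very unlikely.
dist = {'usual_scents': 0.90, 'rival_marking': 0.09, 'new_scent': 0.01}

print(h(dist['usual_scents']))  # ~0.15 bits: expected, low unexpectedness
print(h(dist['new_scent']))     # ~6.64 bits: surprising, high unexpectedness
print(H(dist))                  # ~0.52 bits: overall, the context is predictable
```

Note how the rare event carries high h even though the context's overall H is low: this is precisely the recognition-h situation in which the unexpected percept deserves attention.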
Therefore, the overall outcome audible in the global workspace

2012_3 !2012 A Creative Analogy Machine: Results and Challenges Diarmuid P. O'Donoghue Department of Computer Science, National University of Ireland Maynooth, Co. Kildare, Ireland. diarmuid.odonoghue@nuim.ie Mark T. Keane Department of Computer Science and Informatics, University College Dublin, Ireland. mark.keane@ucd.ie

Abstract. Are we any closer to creating an autonomous model of analogical reasoning that can generate new and creative analogical comparisons? A three-phase model of analogical reasoning is presented that encompasses the phases of retrieval, mapping and inference validation. The model of the retrieval phase maximizes its creativity by focusing on domain topology, combating the semantic locality suffered by other models. The mapping model builds on a standard model of the mapping phase, again making use of domain topology. A novel validation model helps ensure the quality of the inferences that are accepted by the model. We evaluated the ability of our tri-phase model to re-discover several h-creative analogies (Boden, 1992) from a background memory containing many potential source domains. The model successfully re-discovered all creative comparisons, even when given problem descriptions that more accurately reflect the original problem, rather than the standard (post hoc) representation of the analogy. Finally, some remaining challenges for a truly autonomous creative analogy machine are assessed.

Introduction

Analogy has a long and illustrious history within creativity, particularly in scientific and intellectual contexts (Brown, 2003). Many episodes of scientific creativity are driven by analogical comparisons (Dunbar and Blanchette, 2001), often involving image-related analogies (Clement, 2008). Much progress has been made in cognitive science on modeling this analogical reasoning process (see below), prompting the following questions. Are we any closer to creating an autonomous model of analogical reasoning that can generate new creative analogies? What progress has been made towards such a creative analogy model? What are the main challenges that lie ahead? In this paper we envisage a creative process that can take any given target description and, using a pre-stored collection of domain descriptions, identify potentially creative source domains with which to re-interpret the given problem. This paper explores and evaluates the potential for a model of analogy to act as a creativity engine. While Boden (1992) argues that analogy is effectively the lowest form of creativity (improbable), we argue that analogical creativity should be seen as part of a cohesive human reasoning system. If the inferences mandated by an analogy contradict a fundamental belief, especially one that has accrued many consequent implications, then resolving this contradiction might well involve the "shock and amazement" of transformational creativity. As such, it appears that analogies may drive creativity at any of Boden's levels of creativity. Our creativity model is domain independent and does not include a pragmatic component or domain context.
So, as our model does not use domain-specific knowledge, arguably it cannot be easily cast as improbable, exploratory or transformational creativity (Boden, 1992). The current work was driven by three main aims. Firstly, we wished to assess the creative potential of a three-phase model of analogy. Secondly, we wished to assess the impact of using differing knowledge bases on the creative potential of our analogy model. Finally, we wished to assess the wider implications of analogical models for computational creativity. Is a three-phase model either necessary or sufficient to function as an engine of creativity? Can such a model re-discover analogies considered to be creative by people? Since people often overlook analogies (Gick and Holyoak, 1980) even when they are present, will such a model uncover many creative analogies, or are creative analogies, in some way, different and rare? We see the current model as being potentially useful in three distinct ways, but for now we do not commit to using it in one particular manner. Firstly, it could be used as a simple model of creativity, yielding creative interpretations for a presented problem. Secondly, it could be used as a tool to assist human creativity, suggesting source domains to people to enable them to re-interpret a given target problem. Finally, it could be used as one possible model of how people analogize in a creative manner. The paper is structured as follows: first we describe the Kilaza1 model for generating creative analogies, briefly illustrating its operation on the famous atom:solar-system analogy. Then we present results that reflect the model's ability to re-generate some well-known h-creative analogies (Boden, 1992). Finally, the implications of these results are assessed and some remaining challenges are discussed.

1 Kilaza is not an acronym.

Analogy as an Engine of Creativity

An analogy is a conceptual comparison between two collections of concepts, a source and a target (Gentner, 1983), such that the source highlights particular aspects of the target, possibly suggesting some new inferences about it. In creative analogies, a productive source domain conjures up a new and revolutionary interpretation of the target domain, triggering novel inferences that help explain some previously incongruous phenomena or that help integrate some seemingly unrelated phenomena (Boden, 1992; Eysenck and Keane, 1995). Creative analogies differ from "ordinary" analogies primarily in the conceptual "distance" between the source and target domains (i.e., these two domains may never have been linked before) and in the usefulness of the resulting comparison. Both creative and mundane analogies appear to use the same analogical reasoning process, as described in the following section, but differ in their inputs and outputs. Kekulé is famous for his analogy between the carbon-chain and a snake biting its own tail. But this analogy could have been triggered by many alternative and more mundane source domains, from tying his own shoe-lace to buckling his belt. While many source domains could have generated the creative carbon-ring structure, Gick and Holyoak (1980) have shown that most people (including Kekulé) frequently fail to notice many potential analogies. This highlights one potential advantage of a computational model: a model can tirelessly explore all potential analogies, returning only the most promising comparisons to a user for more detailed consideration.
Thus, computational models could potentially act as tools helping people overcome one barrier, namely their failure to perceive analogies when they are present.

Kilaza Analogical Creativity Engine

Keane (1994) presented a five-phase model of the analogical reasoning process, which recognises the distinct phases of representation, retrieval, mapping, validation and induction. While other authors describe slightly different subdivisions of this process, there is broad agreement on these phases. Our computational model encompasses the three central phases of analogy (see Figure 1). We highlight that Wallas and Hadamard subdivide creativity into the phases of preparation, incubation, illumination and verification (Boden, 1992), which is reminiscent of several multi-phase models of analogy. The heart of our creativity model is the central mapping phase, and this borrows heavily from Keane and Brayshaw's (1988) IAM model (see also Keane, Ledgeway and Duff, 1994). Our model of the retrieval phase attempts to overcome the semantic bias suffered by many previous models, improving the diversity of the source domains that are returned. It was intended that this diversity might address the quality of novelty (Ritchie, 2001) associated with creativity, retrieving more "unexpected" and potentially creative sources. Finally, our model of the validation phase attempts to filter out invalid inferences, addressing the quality factor (Ritchie, 2001) associated with computational creativity. Ritchie (2001) identifies the essential properties of creativity as being directed, novel and useful. We argue that our model is directed in that it focuses on re-interpreting a given target domain. Our model addresses the novelty property through its ability to retrieve potentially useful but semantically distant, even disconnected, source domains. Finally, the useful property is addressed through a validation process that imposes a quality measure on the inferences that are accepted by the model.

Figure 1: Kilaza is a three-phase model of Analogy

Analogical Retrieval Phase-Model

Existing models of analogical retrieval suffer from limitations in the range of possible retrievals because they either (i) focus exclusively on domain semantics (like MAC/FAC; Forbus, Gentner and Law, 1995) or (ii) focus primarily on domain semantics (like HRR; Plate, 1998). Other models, such as ARCS (Thagard et al., 1990) and Rebuilder (Gomes et al., 2006), supplement domain representations by elaboration from external sources (like WordNet) to widen the net to include more semantically non-identical sources. However, all of these approaches arguably over-constrain retrieval for the purposes of creativity. We argue that a creative retrieval process must allow semantically distant and even semantically disconnected sources to be retrieved, ideally without overwhelming the subsequent phase-models with irrelevant domains.

Figure 2: Topology is a key characteristic in retrieving creative source domains. (a) Love Triangle; (b) Unrequited Love.

Gentner (1983) notes that two specific qualities are required of analogical comparisons: semantic similarity and structural similarity. The model presented in this paper performs retrieval based exclusively on structural similarity, that is, on the graph structure (or topology) of each domain description.
This design decision was taken to overcome the semantic narrowness that constrains existing models, in the hope that this would increase the possibility of retrieving surprising and creative source domains. As the example in Figure 2 illustrates, semantics and domain topology are often intertwined. Each domain description is mapped onto a location in an n-dimensional structure space (Figure 3), where each dimension represents a particular topological quality of that domain. Structure space is somewhat akin to feature vectors (Yanner and Goel, 2006; Davies, Goel and Yanner, 2008). Image-related analogies are often involved in creative comparisons (Clement, 2008), and a variety of image-based analogy models have been developed, focusing on specific topics such as geometric proportional (IQ-type) analogies (Evans, 1967; Bohan and O'Donoghue, 2000), geo-spatial comparisons (O'Donoghue et al., 2006), spatial representations of conceptual analogies (Davies et al., 2008; Yanner et al., 2008) and reasoning about sketch diagrams (Forbus et al., 2011). Our model performs a single retrieval process for each presented target, in contrast to the iterative retrieval and spreading-activation phases employed by KDSA to retrieve semantically distant sources (Wolverton and Hayes-Roth, 1994). Specific topological features used by our retrieval model include the number of objects and predicates (first-order and higher-order), the number of root predicates, etc. Thus, the representation in Figure 4 might be mapped onto the location (4 0 2 2 0 0 1) in structure space: 4 object references, 0 high-order predicates, 2 unique first-order relations, 2 first-order relations and 2 root predicates, etc. The distinction between unique and non-unique relations, for example, distinguishes between domains that repeatedly use a small number of relations and domains that typically have one instance of each relation in their description. One advantage of this scheme is that the distance between domains is not affected by the number of domains contained in memory, so the retrieval system should scale reasonably well. For the retrieval results presented later in this paper a maximum retrieval distance of 10 is imposed, and only candidate sources inside this threshold are considered.

Figure 3: Displacing the Locus of Retrieval within a 3D representation of n-dimensional Structure Space. Only source domains within the displaced boundary are retrieved and passed to the remaining phases of analogy.

Topologically similar (i.e., homomorphic as well as isomorphic) domains are mapped onto similar locations within this topology-based structure space (O'Donoghue and Crean, 2002). To account for the inferences that are sought from any inspiring source domain, the locus of retrieval is slightly offset to allow for this additional source domain material. Included in this offset is the desire for sources containing additional first-order relations and higher-order relations. However, this offset has relatively little impact on the final results.

(heavier nucleus electron) (attracts nucleus electron)

Figure 4: Simplified Model of Rutherford's Problem

Analogical Mapping Phase-Model

The model for the mapping phase is based on the Incremental Analogy Machine (IAM) model (Keane and Brayshaw, 1988; Keane et al., 1994). It consists of the three subprocesses of root-selection, root-elaboration and inference generation.
Mapping proceeds as a sequence of root-selection and root-elaboration activities, gradually building up a single inter-domain mapping. Typically a domain description will consist of a small number of root predicates, each controlling a large number of (partly overlapping) lower-order predicates.

Root selection. Root selection identifies "root predicates" within a representation, which are typically the controlling causal relations in that domain. Each root predicate lies at the root of a tree of predicates, and each root can be seen as "controlling" the relations lower down the tree. In our implementation of IAM, the root-selection process examines the "order" of each predicate. Objects are defined as order zero, and first-order relations that connect two objects are defined as order one. The order of a causal relation is defined as one plus the maximum order of its arguments. Mapping begins with the highest-order relations and maps any unmapped low-order root-predicates last.

Root elaboration. Root elaboration extends each root-mapping, placing the corresponding arguments of these relations in alignment. If these arguments are themselves relations, then their arguments are mapped in turn, and so on until object arguments are mapped. Items are only added to the inter-domain mapping when they conform to the 1-to-1 mapping constraint (Gentner, 1983).

Inference generation. Each analogical comparison is passed to the inference generation sub-process. Analogical inferences are generated using the standard algorithm for pattern completion, CWSG: Copy With Substitution and Generation (Holyoak et al., 1994). In effect, additional information contained in the source domain is carried over to the target, creating a more cohesive understanding of that target problem.

Analogical Validation Phase-Model

The third part of our tri-phase model is focused on analogical validation. Validation attempts to ensure that the analogical inferences that are produced are correct and useful. O'Donoghue (2007) discusses the accuracy of this validation process, using human raters to assess the goodness of inferences that were rated as either valid or invalid. However, that paper did not assess the model's ability to discover creative analogies. Phineas (Falkenhainer, 1990) is a multi-phase model of analogy that incorporates a post-mapping verification process. To achieve this, Phineas incorporates a model of the target domain (qualitative physics simulation), illustrating the power of embedding an analogy model within a specific problem domain. However, this qualitative-simulation process effectively limits Phineas to reasoning only about physical and physics-related analogies. The validation model presented in this paper is relatively simple, aimed at rejecting those predicates that are deemed invalid, rather than guaranteeing the validity of those inferences that are accepted. This approach helped maximise the creative potential of the model, by resisting the rejection of potentially plausible inferences. Of course, a more complex validation process could make use of problem-specific domain knowledge (where available). In the absence of such domain-specific knowledge, verification and validation of the analogy could be carried out using user feedback, employing Kilaza in a tool-like way. The validation phase-model is composed of two main parts.
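Before describing those two parts, here is a compact sketch, under our own simplifying assumptions about the predicate representation, of the mapping machinery just outlined: computing predicate order for root selection, and CWSG-style inference generation by substituting mapped source terms. The nested-tuple format is illustrative, not Kilaza's actual representation.

```python
# Sketch of two pieces of the mapping phase described above: predicate "order"
# (used for root selection) and CWSG-style inference generation. The nested-
# tuple predicate format is our own simplification, not Kilaza's actual format.

def order(term):
    """Objects have order 0; a relation has order 1 + max order of its args."""
    if isinstance(term, str):
        return 0
    _, *args = term
    return 1 + max(order(a) for a in args)

def cwsg(source_pred, mapping):
    """Copy With Substitution and Generation: copy a source predicate into the
    target, substituting mapped items; unmapped source items are carried over
    ('generated') as new target entities."""
    if isinstance(source_pred, str):
        return mapping.get(source_pred, source_pred)
    rel, *args = source_pred
    return (mapping.get(rel, rel), *(cwsg(a, mapping) for a in args))

# Source: the solar system. The cause relation has order 2, so it is a root.
source = ('cause', ('attracts', 'sun', 'planet'), ('revolves', 'planet', 'sun'))
print(order(source))  # -> 2

# Inter-domain mapping produced by root selection/elaboration (1-to-1).
mapping = {'sun': 'nucleus', 'planet': 'electron'}
print(cwsg(source, mapping))
# -> ('cause', ('attracts', 'nucleus', 'electron'),
#              ('revolves', 'electron', 'nucleus'))
```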
The first performs validation by comparing the newly generated inference to predicates already stored somewhere in memory. The second mode of validation is more general, and is driven in part by the functionally relevant attributes that play a role in analogical inference (Keane, 1985).

Validation by Predicate Comparison. The validation process compares newly inferred predicates (produced by CWSG) to the previous contents of memory. Inferences are first compared to predicates in memory, with the agent and patient roles potentially being validated independently. This validation mechanism thus has access to the entire contents of memory, accessing predicates from any of the domains stored in that memory. This model of validation has the advantages of simplicity and generality, but it does of course mean that dependencies between arguments are not captured. This limitation was deemed acceptable within the context of our desire for a creativity engine. While many simple inferences were validated by this mechanism, many creative inferences were not. This may be partly attributed to the relatively small number of predicates contained in memory and to the novelty associated with creative inferences. To address this shortcoming, validation using functional attributes was introduced.

Validation with Functional Attributes. Functional attributes specify necessary attribute requirements for each role of a predicate, being inspired by the functionally relevant attributes of Keane (1985). Functional attributes are intra-predicate constraints that ensure each predicate appears to be a plausible combination of a relation coupled with each of its arguments. It should be pointed out that functional attributes have only been used with first-order predicates, those whose arguments are objects. Validating higher-order (causal) relations might make use of the spatio-temporal contiguity associated with causality, but this cannot be relied upon (Pazzani, 1991) and is not enforced by our model. Thus our model treats all causal inferences as implicitly valid. Functional attribute definitions connect each role of a predicate directly into an attribute hierarchy, whereby arguments filling those roles must conform to these attribute constraints. Kilaza stores functional attributes for the agent and patient arguments of each relation independently. More general relations (part-of, next-to) typically have few functional attributes, whereas more specific relations (hit, eat) possess a greater number of attribute restrictions. For example, the agent role of hit might require the hitter to be a physical object, whereas the agent of an eat relation might have to be a living organism or an animal. Relations that are more specific are more amenable to the validation process, while their more general counterparts are more difficult to validate accurately. In addition, functional attributes have also been used to support a form of inference adaptation. This allows an inferred relation to be adapted to a semantically similar relation that better suits the arguments that pre-existed in the target domain. Adaptation uses the functional attributes to conduct a local search of the taxonomy, to identify a more semantically suitable relation that better fits the given arguments.

Data Sets

Three datasets were used to conduct experiments with the described model. These are referred to as the Professions dataset, the Assorted dataset and the Alphanumeric dataset.
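Before turning to the datasets, the functional-attribute check just described can be sketched as follows; the small attribute taxonomy and the per-relation role constraints are invented for illustration.

```python
# Sketch of functional-attribute validation as described above. The attribute
# taxonomy and the role constraints for 'hit' and 'eat' are invented examples.

# Toy is-a taxonomy: each entity/attribute names its parent.
ISA = {'dog': 'animal', 'animal': 'living_organism',
       'living_organism': 'physical_object', 'rock': 'physical_object'}

def has_attribute(entity, attribute):
    """Walk up the taxonomy to see whether entity falls under attribute."""
    while entity is not None:
        if entity == attribute:
            return True
        entity = ISA.get(entity)
    return False

# Functional attributes: required attribute for (agent, patient) of a relation.
CONSTRAINTS = {'hit': ('physical_object', 'physical_object'),
               'eat': ('living_organism', 'physical_object')}

def validate(inference):
    """Accept a first-order inference only if both roles meet their constraints;
    relations with no stored constraints are not rejected (creativity-friendly)."""
    rel, agent, patient = inference
    if rel not in CONSTRAINTS:
        return True
    agent_req, patient_req = CONSTRAINTS[rel]
    return has_attribute(agent, agent_req) and has_attribute(patient, patient_req)

print(validate(('eat', 'dog', 'rock')))   # True: a dog is a living organism
print(validate(('eat', 'rock', 'dog')))   # False: a rock cannot be an eater
```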
The datasets contained a total of 158 domains, and our creativity engine attempted to find creative source analogues for a given number of target problems. It was hoped that the differing natures of these collections would provide reasonable grounds on which to evaluate the computational model, and to assess its potential to act as a creativity engine.

Professions Dataset consists of descriptions of fourteen professions, including accountant, butcher, priest and scientist. These are rather large domain descriptions created by Veale (1995), ranging in size from 10 to 105 predicates (M=55.4, SD=29.3). One important feature of the Professions dataset is its reliance on many different instances of a small number of relational predicates, including control, affect, depend, and part. The domains range from one using just 6 distinct relational predicates (ignoring duplicates) to the most diverse domain, which uses 15 (M=8.9, SD=2.2). Another important feature is that this dataset does not appear to use a set of clearly identifiable high-order relations (such as cause, result-in or inhibit) between first-order predicates.

Assorted Dataset consists of a large number of smaller and more varied domain descriptions, including many of the frequently referenced domains in the analogy literature, such as the solar-system, atom, heat-flow and water-flow domains. It also includes an assortment of other domains describing golf, soccer and story-telling. The 81 domains of the Assorted dataset use 108 distinct (i.e. non-repeated) relations. Each of these domains contains between 1 and 15 predicates (M=4.16, SD=2.9). The average number of distinct relational predicates in each domain is M=3.48, indicating that most relational predicates are used just once in each of the Assorted domains.

Alphanumeric Dataset One final dataset contained 62 semantically constrained domains. However, these domains contained a great deal of topological diversity. It was hoped that this mixture of topologies might support some novel comparisons and inferences, providing a counterpoint to the semantic richness of the other domains.

Example: p-Creative Re-Discovery of Rutherford's Analogy

Before presenting detailed results, we will first see how Kilaza can re-discover Rutherford's famous solar-system:atom analogy. We highlight that this is a test of the p-creativity (Boden, 1992) of our model, though not necessarily a model of how Ernest Rutherford actually conducted his own reasoning. The traditional representation of this analogy (Figure 5) is heavily based on a post hoc description of the domains involved. These descriptions are heavily influenced by the analogy itself. We shall first look at the traditional representation of this domain, before examining how our model can also deal with a more realistic version of how Rutherford might have thought of the target problem before arriving at his famous comparison.

Figure 5: Traditional representation of Rutherford's Solution

First, the semantically impoverished target problem (Figure 4) is mapped onto its location in structure space. We highlight that the "locus of retrieval" is slightly displaced from the target's original location to account for the additional information that one expects to find in a useful source domain. In this instance the desired source was retrieved at a distance of just over 6 "units" in structure space; a sketch of this retrieval step is given below.
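The following sketch illustrates structure-space retrieval as described above. Domains appear directly as topological feature vectors of the kind illustrated earlier (object references, high-order predicates, unique first-order relations, first-order relations, and so on, ending with root predicates). The offset values and the memory coordinates are our own illustrative choices; only the feature idea, the displaced locus and the distance threshold of 10 come from the text.

```python
# Sketch of structure-space retrieval. Feature vectors follow the example
# location (4 0 2 2 0 0 1) given earlier; the offset and the coordinates in
# 'memory' are invented for illustration.
import math

def distance(p, q):
    """Euclidean distance between two points in structure space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def retrieve(target_vec, memory, threshold=10.0):
    """Displace the locus of retrieval towards richer sources (extra first-order
    and high-order relations), then return all sources within the threshold."""
    offset = (0, 1, 0, 2, 0, 0, 0)           # illustrative displacement only
    locus = tuple(t + o for t, o in zip(target_vec, offset))
    hits = [(name, distance(locus, vec)) for name, vec in memory.items()
            if distance(locus, vec) <= threshold]
    return sorted(hits, key=lambda pair: pair[1])

# Rutherford's target problem, at the location given in the text.
target = (4, 0, 2, 2, 0, 0, 1)
memory = {'solar-system': (5, 2, 4, 5, 1, 0, 2),    # invented coordinates
          'accountant':   (40, 0, 9, 30, 2, 0, 6)}
print(retrieve(target, memory))   # solar-system within range; accountant not
```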
The desired source (the solar-system domain) and all other candidate sources near the locus of retrieval were passed in turn to the mapping and validation phases of the model. In total, 10 other retrieved candidate source domains also generated inferences, most yielding only one inference each. Three domains generated more than one candidate inference, but all three were different versions of the solar-system domain. We point out that our semantically "free" retrieval process can trigger identification of the same source even when it is represented in a number of alternate ways (O'Donoghue, 2007). Our mapping model successfully generated the correct inter-domain mapping and CWSG generated the desired inferences without adaptation.

Representation Issues in de novo Discovery of p-creative Analogies We argue that the traditional presentation of Rutherford's analogy is a simplified pedagogical device (Figure 5). This description of the target problem effectively removes much of the complexity of the real discovery task as encountered by Rutherford. The description of the target problem uses terminology specifically designed to accentuate the semantic (and structural) similarity that is the result of Rutherford's comparison, and should not be treated as an input when re-creating this creative episode. This distinction between the problem domain as it would have existed before the creative analogy and its subsequent representation after discovering that analogy is a serious problem, and one that is easily overlooked. Any model that attempts to re-discover known creative analogies must address the original problem, not just the representation that accentuates the desired similarity. Differences in domain terminology and topology are central to the distinction between elaborating a given analogy and the much more difficult task of generating a novel h-creative (or p-creative) analogy (Boden, 1992). We argue that generating Rutherford's analogy using the representation in Figure 6 is a far better test of a model's creative ability than the usual post hoc representation in Figure 5.

Terminological differences are particularly prevalent in distant between-domains analogies, as the first-order relationships describing the problem domains originate in different disciplines. When modeling analogical creativity, we must expect to encounter these differences in terminology, and our models of retrieval, mapping and validation must be able to overcome these problems. Ernest Rutherford would most likely have thought of the target relation between the nucleus and electron as electromagnetic-attraction, and not the more generic attracts relation. The corresponding relationship between the source's sun and planet is gravitation. It is only after he found the analogy (which involved mapping electromagnetic-attraction onto gravitation) that these relationships can be generalized to a common super-class like attracts (Gentner, 1983). We point out that our model can operate successfully on either the simplified or the more realistic domain descriptions. This is primarily the result of our retrieval and mapping models using domain topology, rather than identicality (or similarity) between the predicates in the two domains.

Figure 6: More realistic representation of Rutherford's Analogy
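The inference step that follows a successful mapping can be sketched as pattern completion via CWSG: source predicates that found no target counterpart are copied across, with mapped entities substituted. The function and data below are an illustrative sketch, not the Kilaza implementation.

    # A minimal sketch of CWSG-style inference generation (Python).
    def cwsg(source_preds, aligned_preds, entity_map):
        """Copy unmapped source predicates into the target, substituting
        mapped entities and generating placeholders for unmapped ones."""
        inferences = []
        for pred in source_preds:
            if pred in aligned_preds:
                continue  # already aligned with a target predicate
            relation, *args = pred
            new_args = tuple(entity_map.get(a, f"(generated {a})") for a in args)
            inferences.append((relation,) + new_args)
        return inferences

    solar_system = [("gravitation", "sun", "planet"),
                    ("more-massive-than", "sun", "planet"),
                    ("revolves-around", "planet", "sun")]
    # Predicates that the mapping phase aligned with target counterparts:
    aligned = {("gravitation", "sun", "planet"),
               ("more-massive-than", "sun", "planet")}
    entity_map = {"sun": "nucleus", "planet": "electron"}

    print(cwsg(solar_system, aligned, entity_map))
    # [('revolves-around', 'electron', 'nucleus')]

Because the entity map comes from topology, the same completion works whether the aligned relations are labelled attracts/attracts or electromagnetic-attraction/gravitation.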
Results of Individual Phase Models We shall first briefly examine the performance of the retrieval and validation models in isolation, before looking at their combined performance in the next section. We shall also briefly examine the results of the mapping model, but our focus will remain on the inferences it produced. Results were produced from a memory containing the three previously described datasets.

Retrieval Results Retrieval was performed in structure space. The distance between domains in structure space varied from 2.645 to 230 (M=80, SD=57.3), with a large number of domains being given a unique structural index in this space. A small number of locations contained multiple domains; these mostly involved small domains of just a few predicates from the Assorted dataset.

Retrieval and Mapping A broad tendency was identified between structure-based retrieval and the size of the resulting inter-domain mapping, although the correlation was low. A range effect was identified between distance in structure space and the size of the resulting mapping: larger distances between domains in structure space tend to produce smaller inter-domain mappings. This indicates a weak connection between structure-based retrieval and the size of any resulting mappings.

Validation Results Although the validation model was very simplistic, it proved surprisingly effective. For example, with the inferences generated on the Professions dataset, the average (human) rating awarded to predicates that Kilaza categorized as valid was M=2.62 (SD=2.09), while the average rating awarded to the invalid predicates was M=1.57 (SD=1.23). As ratings were given between 1 and 7, with 7 representing clearly valid inferences, this indicates that many of the generated inferences were of rather poor quality.

Adaptation Results In addition, 24 inferences were passed to the adaptation process and 20 of these were adapted. While we cannot realistically assess whether these adapted inferences matched what was "intended" by our analogy model, we did assess the validity of these inferences using two human raters. Looking at the human ratings for the 20 adapted predicates before and after adaptation, we see that the average rating was increased by the adaptation process from 1.57 (SD=1.23) to 2.57 (SD=1.70). The average rating of the adapted predicates was broadly in line with the predicates from Kilaza's valid category above (M=2.62, SD=2.09). Before adaptation, 18 of the 20 (90%) predicates were rated as invalid; after adaptation just 12 (60%) were rated as invalid. Thus, adaptation has a distinct influence on improving the ratings of rejected inferences. It may well be argued that this adaptation process is itself somewhat creative, identifying new relations that better fit the available target arguments. In contrast to the top-down nature of the creative analogy approach, predicate adaptation is very much a bottom-up process that is motivated by the detection of a potential analogical comparison.

Creativity Test Results To assess the creative potential of our model, we assess its performance at the p-creative task of re-creating some well-known h-creative analogies (Boden, 1992). These include some famous examples of creative analogical comparison, including Rutherford's solar-system:atom analogy, the heat-flow:water-flow analogy and the tumour:fortress analogy.
Our descriptions are based on the standard representation of these domains as found in the analogy literature.

Creative Retrieval We now examine the performance of our model on the creative retrieval task. We presented our model with the target domain of each of 10 creative analogies, together with a memory of 158 source domains. From this memory of 158 potential sources, the retrieval model selected a number of domains as candidate sources. Only the selected candidate sources were passed to the mapping and validation phase-models. Evaluating only the selected source domains was necessary in order to avoid an exhaustive search through all possible analogical comparisons. While computationally feasible in this instance, an exhaustive search would be impractical on a larger collection of domains. Before looking at the results, we point out that many comparisons did not generate a viable inter-domain mapping. Furthermore, most analogies did not generate any valid inferences. The following results ignore these unproductive comparisons and focus only on the productive analogies.

All of the desired creative sources were among the candidate sources retrieved by the model. This gives our retrieval model a recall value of 100% on this creative retrieval task. While a large number of other candidate sources were also retrieved, this was still a pleasantly surprising result. The distance within structure space between the target and the creative sources ranged from 3.1 to 7.9, suggesting that structure-based retrieval was reasonably accurate in locating candidate sources. The precision of the retrieval process is summarised in Figure 7. As can be seen, precision was above 0.2 for two problems, showing that few other sources were located near the structural index of those targets. However, precision was much lower for most problems, indicating that the desired source was merely one of a larger number of candidate sources that had to be explored.

Figure 7 - Precision of retrieval for 10 Creative Analogies

Creative Inferences Next we summarise the inferences generated by each of these comparisons (Table 1). These results implicitly encompass a productive inter-domain mapping between the target and each candidate source in turn. Kilaza generated and validated the correct inferences for 9 (90%) of the creative analogies. The cycling:driving analogy correctly generated no inferences.

    Target                                          Correct      Validated
                                                    Inferences   Inferences
    Atom : Solar-System                             y            4
    Atom-Falkenhainer : Solar-System-Falkenhainer   y            3
    General : Surgeon                               y            4
    Heat-flow : Water-flow                          y            4
    Leadbelly : Caravaggio                          y            4
    Love-triangle : Triangle-Directed               y            0
    Requited-love : Love-triangle                   y            3
    Fish : Bird                                     y            4
    Vampire : Banker                                y            3
    Cycling : Driving                               n            0

Table 1 - Number of inferences generated by different analogies

One of these analogies also required one inference to be adapted. The bird:fish analogy generated the inference (flies-through fish water), which was correctly adapted to (swim fish water).
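For clarity, the retrieval metrics reported above can be stated as a small calculation. The per-target formulation (one desired creative source per target) is our reading of the setup, not a formula given by the authors.

    # A minimal sketch of per-target retrieval precision and recall.
    def precision_recall(retrieved, desired):
        """retrieved: set of candidate source names; desired: the one
        known creative source for this target."""
        hit = desired in retrieved
        recall = 1.0 if hit else 0.0
        precision = 1.0 / len(retrieved) if hit else 0.0
        return precision, recall

    print(precision_recall({"solar-system", "atom-v2", "fortress"},
                           "solar-system"))
    # (0.333..., 1.0): the desired source was one of 3 candidates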
Conclusion We presented a three-phase model of analogy, adapting it to function as a tool for discovering creative analogies. This model encompasses the three central phases of analogy, namely retrieval, mapping and validation. We argue that a model encompassing these three core phases of analogy is the minimum required to be considered a model of analogical creativity. Our retrieval model overcomes the semantic bias of previous retrieval models, helping to retrieve new and surprising source domains. This helps to improve the novelty of the source domains identified by our creativity engine. Our model of the post-mapping validation phase attempts to filter out any clearly invalid inferences, thereby improving the quality of the analogies identified as creative. We note that novelty and quality are two attributes strongly associated with creativity (Ritchie, 2001).

Our three-phase model of analogy successfully rediscovered 10 examples of creative analogies, including the heat-flow:water-flow and solar-system:atom analogies. In doing so, the model retrieved the correct source from a large memory of potential sources. It then developed the correct mapping and successfully validated (and adapted) the resulting inferences. We point out that these analogical comparisons, if produced by a human analogizer, would be considered creative. Our focus on creative analogies rather than the more normal (or pedagogical) analogies had a far-reaching impact on the model. Terminological differences are particularly prevalent in creative between-domains analogies, as the first-order relations describing each domain originate in different disciplines. When modeling analogical creativity, we must expect to encounter these differences and cannot rely heavily on the presence of identical relations. Our model successfully created Rutherford's famous solar-system:atom analogy, even when the target was represented in a more realistic and challenging form.

Our model shows that very significant progress has been made towards an autonomous creativity machine, re-discovering many creative analogies. We briefly outline three remaining challenges to analogical creativity, beginning with the issue of knowledge representation. Our results illustrate a trade-off between the specificity and the generality of domain descriptions. Overly specific representations make comparisons more difficult to discover, but overly general representations appear too profligate and can overwhelm the validation (and subsequent) processes. Perhaps multiple representations of each domain might offer a useful avenue for progress. Multiple representations might also help explain why experts are more fluent in their use of analogy within their own domains (Dunbar and Blanchette, 2001). Our model does not currently include an explicit re-representation process highlighting "tiered identicality" (Gentner and Kurtz, 2006).

It seems that the greatest challenge to computational analogizing might lie with the post-mapping phases. Challenges include assessing analogical inferences for validity, evaluating the significance of an analogy and considering the implications of creative comparisons. Surprisingly little attention has been given to this phase, partly because of its ultimate dependency on the target problem domain. Phineas (Falkenhainer, 1990) and Rebuilder (Gomes et al., 2006) showed that integrating analogy and case-based reasoning within the target domain can have very positive effects. While tight integration of all target domains into an analogy model seems most unlikely, Kilaza has shown that a generic validation model can play a part in improving the quality of the inferences that are accepted. Overall, the results presented in this paper highlight that a three-phase model of analogical reasoning can operate successfully as a model of analogical creativity.
Our results highlight the improbability of finding a suitable source domain to re-interpret a given target in a creative manner. Extending this model will necessitate a tighter integration of the analogy process with other facets of intelligence.

2012_30 !2012 Brainstorming in Solitude and Teams: A Computational Study of Group Influence Ricardo Sosa Singapore University of Technology and Design 20 Dover Drive, Singapore 138682 ricardo_sosa@sutd.edu.sg John S. Gero Krasnow Institute for Advanced Study and Volgenau School of Engineering George Mason University john@johngero.com

Abstract Early studies of creative ideation showed that individuals brainstorming in isolation tend to generate more and better ideas than groups. But recent studies depict a more complex picture, reinforcing the need to better understand individual and group ideation. Studying group influence is one way to address the complex interplay between ideas in different brainstorming scenarios. We define group influence as the degree to which individuals are influenced by ideas coming from other team members. This paper presents results from a multi-agent simulation of the role of group influence in brainstorming groups. The results from the simulations indicate that the findings from previous laboratory studies tend to be misinterpreted, and that both isolation and teamwork present opportunities and challenges for creativity.

Introduction Is it better to generate new ideas in solitude or in teams? Creativity research has shown that this distinction is not trivial. Early studies showed that individuals working in isolation tend to generate superior results along three criteria: total number of ideas, number of unique ideas, and quality of ideas [1]. But a more complex picture is portrayed by subsequent studies, reinforcing the need to better understand the interplay between individual and group ideation as well as the importance of facilitation dynamics [2]. The term ‘brainstorming' refers to the method of problem solving based on timed sessions where participants are instructed to address a problem by freely generating a large number of ideas irrespective of their apparent value [3]. The aim of brainstorming sessions is to generate as many different alternative solutions to a given problem as possible. Whilst many variants of brainstorming have been proposed, the basic premises are: a) to maximize the number and the originality of ideas, b) to combine or improve the ideas suggested, and c) to avoid critical evaluation of ideas [4]. Individual brainstorming consists of engaging subjects in idea generation sessions isolated from others. Group or team brainstorming refers to the more typical scenario where individuals interact to generate and evaluate possible solutions to a common problem. Following the literature, we use the term nominal group to refer to the former condition and interactive group to refer to the latter [2]. Recent studies of idea fluency in brainstorming show that nominal groups outperform non-facilitated interactive groups in both gross and net fluency of ideas, but are considerably outperformed by facilitated interactive groups [2]. As with other factors related to team dynamics, such as diversity and leadership, group influence as a construct and its effects on creative ideation are yet to be fully understood. This is a relevant topic in the still incipient research stream on multi-level approaches to team creativity [5].
The general process by which individuals in isolation consistently surpass group creativity has been explained as ‘ideational productivity loss' and appears to have a series of likely cognitive and group-level causes [6]. Cognitive factors that may interfere with ideational productivity include production blocking, interruptions, forgetting ideas, and distraction by task-irrelevant processes. A higher cognitive load is also often cited as a source of ideational loss, typically caused by attending to others' ideas. Group factors that may account for productivity loss include team structure and diversity, turn-taking, awareness of public evaluation, disposition to converge with others' judgments, lower motivation due to shared responsibility, and a tendency to free-ride [7]. Multi-level approaches are required to understand, for instance, the appropriate degree of accessibility to others' ideas when brainstorming in teams, so that individuals are able to build both upon their own ideas and upon the ideas of their teammates.

Teamwork in creativity enables the important process of sharing ideas; however, this freedom may have two different effects on creative ideation. One possibility is that teammates generate a wide range of diverging ideas, thus obstructing the connection and refinement of coherent ‘trains of thought'. The noise generated in this imagined scenario would more likely produce incomplete and incompatible ideas of low quality, not to mention dissatisfaction among the participants. A second possibility is that teammates rapidly converge in agreement around one or just a few dominant ideas without exploring other alternatives. Group influence can be one way to address this interplay between ideas in brainstorming. We define group influence in this paper as the degree to which individuals are influenced by ideas coming from other team members. Here, group influence is a group-level rather than an individual construct. Groups with high influence levels are those where all ideas by all participants are always available to every group member. Groups with low influence levels are those where individuals are exposed only to their own ideas. Between these extremes, group influence indicates the ratio of ideas available to brainstormers.

In this paper we present results from a multi-agent simulation of the role of group influence in brainstorming groups. Our aim here is to model the interactions between agents engaged in a simple task of divergent reasoning in order to inspect the beneficial and detrimental effects of different team structures on idea generation. In defining this model, we follow the distinctions between ideas, agents and societal factors of the IAS framework for the computational modeling of creativity and innovation, as explained below [5]. The rest of this paper is organized as follows: the next section presents precedent work on the computational modeling of group brainstorming, the following section introduces our own modeling approach to group influence in brainstorming, then the simulation results are presented, and the paper concludes with a discussion of the results and their implications for computational creativity research.

Models of Group Brainstorming This paper presents an approach to the study of creativity using computational social science [7] in order to inspect the mechanisms behind the apparent paradox of ideational productivity loss in brainstorming groups.
Computational social science utilizes multi-agent simulations that are useful for exploring hypotheses, testing assumptions and understanding fundamental issues in complex social systems. Such simulations are also useful for generating predictions for future laboratory experiments or case studies.

Semantic and Social Models Iyer et al. [8] propose a connectionist framework of idea generation in order to inspect experimental data from laboratory studies on ideation and idea priming. In particular they explore the interaction between ‘irrelevant primes' and context familiarity; irrelevant cues are defined as sets of ideas of which only a fraction are related to the task at hand, while context familiarity is given by the pre-existing classification of ideas defined in the system. With this model, the researchers emulate the laboratory results and provide hypotheses as to why even irrelevant primes can increase idea quality and fluency. By manipulating the degree of familiarity between contexts, they show that when irrelevant primes are used between two completely unfamiliar contexts there is no benefit, whilst irrelevant priming is useful only when partial information about semantic relationships is shared between search contexts. In this vein, the authors suggest future experimental studies on the creative capacity to create ‘short-cut linkages' between features, concepts or semantic categories that are typically not related. In an extension of this work, Paulus et al. propose an approach to modeling group creativity by vertically integrating neural and social networks [9]. They define agents as simplified versions of the connectionist model described above, and account for individual differences in semantic contexts, idea association, domains, cognitive strategies and responses to cues. Through what they define as a parameterized interaction protocol (PIP), their proposed model accounts for turn-taking between agents and, more relevant to our approach, the accessibility of ideas by either the entire group or a selected few. With this model still under development and testing, the authors aim to address a range of research questions, including the efficiency of certain interaction structures and scheduling protocols for group ideation.

Group Influence in a Design Task From the perspective of computational social science, creative systems are modeled as multiple generators and evaluators of ideas linked in a social system. In such systems, creativity is explained as an emergent outcome, i.e. a global effect that ‘grows' from simple local interactions [10]. The model presented here is defined using the channels of interaction specified in the IAS framework (ideas, agents, society) [10]. Agents (A) engage in a simple designing task that constitutes the agent-idea channel (Ai), where the resulting designs belong to the set of Ideas (I); social structures (S) determine the availability of ideas (Si); ideas are used by agents (Ia) to build design concepts (Aa) that are further applied in the design of new ideas (Ai'). In this model, Ai is implemented as a shape search process starting from an initial set of polygons and affine transformations, I is the set of final shape representations produced by the agents, S is the arrangement of agents in groups, Si is the experimental variable of group influence, Ia is a transmission mechanism of ideas to agents, and Aa is modeled as an inference process over design concepts.
At the moment, this model is limited to only four of the nine channels of interaction in the IAS framework, namely Ai, Si, Ia and Aa. In the future, we plan to integrate and examine more IAS processes in this model, including leadership styles (As), compliance with the group majority (Sa), group agreement to adjust idea influence (Is), etc.

A description of the simplified design task implemented in this system can be formulated as: "within a fixed time period, generate as many different shape compositions as possible by combining a set of initial shapes". Shape compositions are defined as arrangements of n final shapes created from the combination of fewer than n initial shapes. New shapes are created by the superposition of existing shapes, which leads to the identification of new vertices at the intersections of line segments. This enables the emergence of new shapes as the set of paths {LM} between the start and end points of figures L and M that lead through each intersection point, traversing each segment no more than once [11]. This shape arithmetic task provides a relative quantitative measure with which to compare two or more sets of results. A quality criterion is defined for this task as a function of the total number of new shapes created and their number of sides. Figure 1 illustrates one composition created by this generative program. Further details on the complexity of this type of task are found elsewhere [13]. This two-dimensional shape representation is used to model divergent visual reasoning and is similar to those typically used in brainstorming research [6]. Whilst this design task is fairly straightforward to implement in a computer system, the results are varied enough to capture some of the key properties of design situations, such as open-ended problem formulations with many appropriate solutions, and incremental development of solutions. A measure of task difficulty is defined by the number of initial shapes and the number of sides of these initial shapes. In this paper we present results using two initial shapes of three sides each.

Figure 1. A random composition of 2 initial shapes where overlapping triangles are detected, forming 3 emergent subshapes of 3, 4 and 5 sides, respectively. Shape compositions with more subshapes, and subshapes with more sides, are ranked higher.

The task is used to study group brainstorming by implementing a multi-agent system where agents are automated shape generators that search for new solutions, derive design concepts and interact in this process over a fixed time period. Agent behavior in this simplified model of brainstorming consists of an exploration function (random shape drawing and transformation), an evaluation function (concept formation from topology relationships of shapes), and an exploitation function (shape drawing and transformation by application of learned concepts). Shape exploration in this program can be considered potentially creative inasmuch as emergent shape semantics "exists only implicitly in the relationships of shapes, and is never explicitly input and is not represented at input time" [12]. A design concept is defined here as a topology relationship between the initial shapes, associated with the fitness of the final shape composition. More details are provided below. After a designer agent has generated one or more concepts, it can use them to generate new shapes. Exploitation strategies consist of random variations to existing design concepts.
New compositions can then be obtained by applying the modified rule and evaluating whether its outcome yields a new shape composition. The following pseudo-code shows the algorithm to generate initial shapes and new emergent subshapes in this task (exploration function):

    for (initialShapes) {
        select n random (x, y) points
        connect all pairs of points with lines
        build a polygon with resulting lines
    }
    for (every polygon) {
        for (every line i of every polygon) {
            find intersection point(line_i, line_n)
            store all vertices in a set
        }
    }
    for (all vertices in set) {
        build all subshapes via graph search (Dijkstra)
        store new subshape in a set
    }
    eliminate duplicate subshapes

The following pseudo-code shows the algorithm to assign a quantitative measure to shape compositions (first part of evaluation function):

    for (finalShapes) {
        fitness += (sides of subshape * finalShapes)
    }

The following pseudo-code shows the algorithm to build design concepts in this task (second part of evaluation function):

    for (all initialShapes s) {
        s.insideVertex     += (vertex is within boundaries of shape s+1)
        s.outsideVertex    += (!vertex is within boundaries of shape s+1)
        s.inLine           += (vertex intersects line of shape s+1)
        s.coincidentVertex += (vertex is coincident with vertex of shape s+1)
    }
    designConcept = {{insideVertex, outsideVertex, inLine, coincidentVertex}, fitness}
    store designConcept in a set

The exploration and exploitation mechanisms used here are inspired by the classic notions of divergent or ‘horizontal' and convergent or ‘vertical' thinking processes [3]. During brainstorming sessions, one may assume that exploration enables the discovery of new types of solutions, whilst exploitation allows for the generation of alternatives or new instances - a kind of tradeoff in a multi-armed bandit problem. In this system, designer agents start with exploration and transition to exploitation at a point defined by the experimenter. The following pseudo-code shows the algorithm for selecting between exploration and exploitation:

    for (designerAgents) {
        if (timeStep < exploreLength)
            strategy = "exploration"
        else {
            strategy = "exploitation"
            select a random designConcept from set of concepts
            switch (designConcept) {
                case (insideVertex):     initialShapes(insideVertex)
                case (outsideVertex):    initialShapes(outsideVertex)
                case (inLine):           initialShapes(inLine)
                case (coincidentVertex): initialShapes(coincidentVertex)
            }
        }
    }

Group influence γ is defined as a sharing ratio of concepts: in the extreme case where γ = 0, agents have no access to the concepts generated by other agents; for cases γ > 0, agents have access to a fraction of the concepts generated by other agents, up to γ = 1, where all agents have access to all concepts generated in the group. This experimental variable γ enables the modeling of both nominal and interactive groups from the brainstorming research literature, as well as scenarios similar to computer-mediated brainstorming, where the researcher can control the level of interaction between participants [2]. Group influence γ is implemented in two sections of the code. First, agents store new design concepts in a shared team pool of concepts with probability γ. Second, γ is also used in turn-taking on each simulation step. This accounts for the differential conditions in which nominal and interactive teams operate: when individuals work alone, turns are in effect allocated in parallel, while teammates work in sequential turns. In this paper we inspect four γ scenarios: γ = 0, 0.33, 0.66 and 1.0.
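A minimal sketch of how γ can gate both contribution to and access to the shared pool follows. The Agent class and its method names are our invention for illustration; the pseudo-code above remains the authoritative description of the system.

    # A minimal sketch of concept sharing under group influence γ (Python).
    import random

    class Agent:
        def __init__(self):
            self.own_concepts = []

        def store_concept(self, concept, team_pool, gamma):
            self.own_concepts.append(concept)
            if random.random() < gamma:   # contribute with probability γ
                team_pool.append(concept)

        def accessible_concepts(self, team_pool, gamma):
            # γ = 0: own concepts only; γ > 0: the shared pool as well
            if gamma == 0:
                return list(self.own_concepts)
            return self.own_concepts + team_pool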
Exploration length ε is defined as the fraction of total simulation time during which agents activate exploration behavior. This variable is used to model the timing at which brainstorming participants switch from exploration to exploitation behaviors. Although we acknowledge that this transition may follow more complex patterns in real brainstorming sessions, in this paper we adopt a parsimonious approach as a foundation for future models. Exploration lengths ε = 0.2 to 1.0 are inspected in this paper in 0.2 increments. In this paper we present and discuss results for four- and sixteen-member groups, where group influence γ and exploration length ε are the experimental variables and both the quantity and quality of generated ideas are the dependent variables. Gross fluency refers to the total number of design concepts generated during a simulation, while net fluency refers to the number of original or unique design concepts produced. The impact of varying the level of group influence on idea fluency at different stages of a brainstorming session is likely to provide a possible explanation of the mechanisms behind the well-documented yet poorly understood phenomenon of ‘ideational productivity loss' in group brainstorming.

Results All results are mean values of 30 runs for every experimental condition. Controlled random-generator seeds are used in order to compare the effects of the independent variables. The trend is clear: as the scope of influence of ideas increases, fluency decreases across all exploration lengths. Table 1 shows the results for all 20 experimental conditions in gross and net fluency in four-member teams. When ε = 1.0, agents activate the exploration strategy during 100% of the simulation; therefore no advantage from exploitation behavior is possible.

Table 1. Results in gross and net fluency from varying group influence γ in teams of 4 agents across a range of exploration lengths ε.

    ε      γ       Gross fluency   Net fluency
    0.2    0       40.9            19.63
    0.2    0.33    53.86           19.46
    0.2    0.66    41.23           16.9
    0.2    1       18.4            9.2
    0.4    0       48.46           22.76
    0.4    0.33    63.46           23.4
    0.4    0.66    47.66           19.23
    0.4    1       23.26           11.63
    0.6    0       56.96           25.9
    0.6    0.33    69.46           25.43
    0.6    0.66    47.16           19.6
    0.6    1       29.13           14.56
    0.8    0       57.56           25.03
    0.8    0.33    64.86           23.03
    0.8    0.66    44.9            18.26
    0.8    1       27.6            13.8
    1.0    0       44.36           14.93
    1.0    0.33    44.43           14.13
    1.0    0.66    33.9            13
    1.0    1       22.06           11.03

Figure 2. Group influence γ has a negative effect on net fluency across all exploration lengths ε. The differential effects of γ are higher when ε is low, and net fluency is higher across all γ when ε is medium. With high ε, the effects of γ are less significant.

Group influence γ has a clear effect on the generation of unique design concepts, or net fluency (Figure 2). In this model, agents brainstorming in isolation do produce more original ideas than the same agents brainstorming in teams. These results are consistent across team sizes from 4 to 16 members in our model (Figure 3).

Figure 3. In large teams (N = 16), group influence γ has a more significant effect on net fluency across all exploration lengths ε.

When group influence is zero, agents contribute no solutions to the common pool of team concepts; they only store and have access to their own concept pool. Gross fluency is the total sum of individual concepts, while net fluency is the count of original concepts in this set.
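Under these definitions, the two measures reduce to a total count and a distinct count. The representation of concepts as hashable values is an assumption made for the example, not a detail given by the authors.

    # A minimal sketch of the two dependent measures (Python).
    def gross_fluency(concept_lists):
        # total number of concepts generated by all agents
        return sum(len(concepts) for concepts in concept_lists)

    def net_fluency(concept_lists):
        # number of distinct concepts across the whole group
        return len({c for concepts in concept_lists for c in concepts})

    group = [["a", "b", "c"], ["b", "c", "d"]]   # two agents' concepts
    print(gross_fluency(group), net_fluency(group))   # 6 4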
Consistently across different team sizes, when group influence is zero, net fluency is highest, indicating that agents in isolation generate more unique solutions than when they share solutions with others. Although this outcome is consistent with the brainstorming literature, it still seems counter-intuitive: how can teams of agents in this model be less efficient than same-size groups of separate individuals? This result may seem paradoxical particularly when we consider the amplifying effects that the exploitation strategy has in this model, as evidenced by scenarios where agents explore for the entire simulation time span (ε = 1.0). If exploitation is so productive (particularly when balanced with similar rates of exploration in this model), where does the advantage of isolated agents come from, despite the fact that they have access to smaller solution sets during exploitation? In other words, one could expect teams of agents in this model to be more productive, given that each individual agent has access to a larger pool of concepts from which it can retrieve a higher diversity of solutions in order to build more concepts. In contrast, we observe that as group influence increases and agents contribute more to, and have more access to, a larger pool of solutions, both gross and net fluency decrease.

The gap between the net fluency of nominal and interactive groups varies in this model as a function of exploration length, i.e., how early or late exploitation is activated during the simulated brainstorming session. In larger groups the effects of group influence are more significant. Agents in large teams appear rather inefficient in high group influence conditions: their net fluency is equivalent to that of teams four times smaller. In this respect, it would be tempting to conclude that working in isolation is more efficient for creative ideation. However, there is a fundamental distinction that is made clear in this model, and which has largely remained implicit across studies that compare the performance of nominal versus interactive teams: the total output of these two types of groups is incommensurable. The key is turn-taking; the comparison is inadequate when measured in minutes or hours rather than in number of turns. The difference is that in isolation, although in theory the same number of individuals are generating and recording ideas, the number of turns is in fact n times higher than in interactive groups, since turn-taking occurs in parallel. In principle, no idle time exists for individuals in nominal groups. In contrast, teams follow some type of sequential order (skewed or not) by which all team members except one are idle at every turn or intervention. Therefore, this natural ‘bottleneck' in team interaction (production blocking) is a sufficient cause for the relatively poor performance of teams when compared to the aggregate results of individuals in isolation. In order to account for this inequality, turn allocation is manipulated in our model to ensure that all agents in nominal and interactive groups have access to an equal number of turns over the simulated time. The result is an increase in gross fluency as group influence increases (Figure 4).

Figure 4. Teams outperform the same number of individuals in gross fluency as group influence γ increases (ε = 0.2).
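The turn-count asymmetry behind this correction can be made explicit with a small calculation; the step counts below are arbitrary illustrations, not values from the simulations.

    # A minimal sketch of production blocking as a turn-count asymmetry.
    def turns_per_member(session_steps, n_members, interactive):
        if interactive:
            # one member generates per step; the others are blocked
            return session_steps // n_members
        return session_steps   # nominal: members work in parallel

    steps, n = 120, 4
    print(turns_per_member(steps, n, interactive=True))    # 30 turns each
    print(turns_per_member(steps, n, interactive=False))   # 120 turns each

Equalising turn allocation, as done for Figure 4, amounts to comparing the two conditions at the same turns per member rather than the same session length.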
Discussion Is it better to generate ideas in solitude or in teams? The work presented in this paper suggests that each condition may present certain advantages, and that judging performance merely by measuring output is a limited approach. Although no definite answer can be expected from this simple model of brainstorming, it does capture interesting observations related to one of the key causes associated with ideational productivity loss, i.e. production blocking in groups. Within its limitations, this model supports a number of insightful hypotheses to consider:

The balance between divergent and convergent thinking in a brainstorming session is important, and the time at which ideation is switched between these two modes of thinking is likely to have an important effect on the productivity of brainstorming groups.

Individuals brainstorming in isolation are more productive than teams over similar time periods as a result of their increased intensity of participation. Teams present a ‘bottleneck' in the form of sequential turn-taking, which is avoided by individuals in isolation who are, in principle, constantly active in generating new ideas and building on previous ideas.

The increased fluency of isolated brainstormers over teams may be a feature of easy tasks. It is possible that in difficult tasks, group diversity is more advantageous than individual ideational intensity. If this is the case, then transformative creativity may be more appropriate for brainstorming in solitude, whilst combinatorial creativity may be a more suitable objective for teams.

Turn allocation can be optimized via facilitation techniques or technological means so that an adequate balance exists between having access to others' ideas and avoiding interruptions. This balance may turn out to be a key factor in the performance of brainstorming groups.

The work presented here focuses on the effect of influence over ideas; it is natural to expect a more complex picture that includes individual diversity and other situational conditions. Nonetheless, our results can be cautiously compared to those from laboratory studies. For instance, a widely cited study of 4-person groups in the same two assessment conditions shows a productivity gain of around 60% from interactive to nominal groups [14]. On the other hand, another study where the total number of ideas is considered in 4-person groups, but in a simpler task, reports a mean difference of 38% between nominal and interactive groups [15]. In our system these differences range between 40% and 100% depending on certain task factors. Nevertheless, the aim of this system is not to replicate a particular task or set of results, but rather to demonstrate the nature and effects of production blocking in teams or interactive groups. In addition, these findings provide a possible explanation as to why people may enjoy working in groups more than in isolation [16, 17]. Apart from a number of social reasons, in terms of idea generation our experiments suggest that individuals find it easier to operate in groups, as they have access to a large number of ideas generated by others. Namely, significantly less individual effort is required to generate solutions.
If the results of this experiment could be generalized, then facilitators of brainstorming sessions should consider the aim of the session in relation to the type of demands imposed on the search for solutions, the degree of transformative or combinatorial creativity required, the social influence of the group (as a sum of paired influences between team members), and the resulting hierarchical interactions between brainstormers. Brainstorming has been treated in general as a ‘black box' method of problem solving: people are allocated into teams and are expected to come up with solutions in a period of time, with the general rule that they generate ideas without constraints. The importance of these simple computational experiments is that they show that the results of brainstorming sessions can be qualitatively different between independent individuals and groups, and also between different types of groups. Further modeling will be necessary in order to formulate and evaluate research-based instructions for adequate brainstorming sessions [18]. Future work with this model will account for individual agent diversity.

Acknowledgements This research is supported in part by the National Science Foundation under Grant Nos. NSF IIS-1002079 and NSF SBE-0915482. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

2012_31 !2012 Representational affordances and creativity in association-based systems Kazjon Grace Faculty of Architecture, Design and Planning Sydney University NSW, Australia kazjon@arch.usyd.edu.au John Gero Krasnow Institute for Advanced Study George Mason University Fairfax, VA, US john@johngero.com Rob Saunders Faculty of Architecture, Design and Planning Sydney University NSW, Australia rob.saunders@sydney.edu.au

Abstract This paper describes ongoing research into association-based computational creative systems. The necessary components for developing association-based creative systems are outlined, and the challenges in measuring the creativity of such a system are discussed. An approach to operationalising creativity metrics for association-based systems, based on representational affordances, is described. This approach is then demonstrated through an analysis of results produced by a system for constructing associations between visual designs.

Association-based creative systems Association, or the construction of a new relationship between two concepts or ideas, is a cognitive process at the heart of many creative endeavours. Its presence is most obviously felt in analogy and metaphor, but associative reasoning is also a component of complex similarity judgement, recognition and simplification tasks (Markman and Gentner 1993; Balazs and Brown 1998; Goel 2008) that are critical to the appreciation of creative works. Given the creative potential of analogical processes (Goel 1997; Hofstadter and the FARG 1995; Kuhn 1962) and the importance of understanding and appreciating creative works (Jennings 2010; Wiggins 2006; Colton 2008) to a computational creative system, it is clear that an operationalised understanding of the process of association that underlies these and other acts is of value to the field of computational creativity.
Furthering that understanding is a twofold endeavour: computational models of association that are general, extensible and powerful must be developed, and metrics by which the creativity of those models can be assessed must be devised. Association involves representing objects in a manner that enables a new relationship between them. A mapping is then constructed between the objects which embodies that relationship. These two component processes, representation and mapping, cannot be modelled serially or discretely, as representation depends on mapping and mapping depends on representation (Kokinov 1998). This complex relationship between the mapping and the representations used in mapping creates a ‘chicken-or-egg' problem that must be addressed by any computational model. Not only must a computational model of association possess representational flexibility, but the search for representations must be informed by feedback from the ongoing search for mappings, just as the search for mappings is influenced by the construction of new representations.

Notably, the process of association does not incorporate the use or evaluation of the relationships that it constructs. This is the primary addition of processes like analogy that extend association: analogy adds the transfer of knowledge between the associated objects, the use of that knowledge to achieve some goal, and the evaluation of the analogy in terms of its utility at achieving that goal (French 2002). Association-based similarity judgement also extends association, in this case by evaluating mapped and unmapped attributes to construct a notion of similarity between the objects and then using that similarity in some categorisation or comparison task (Markman and Gentner 1993). Models of association must be capable of supporting this variety of applications. This research has developed the notion of interpretation-driven search as a general framework for computational association. We investigate this approach for its potential to exhibit creative behaviours.

Interpretation-driven association Donald Schön (1983) proposed a theory, ‘reflection-in-action', to explain the cyclical interactions of evaluation and synthesis processes that had been observed in studies of designers. Schön suggests that designers change the design representations with which they are working, then observe and reflect on the effects of those changes. As a result of that reflection, the designer again acts to change the emerging design representation. This iterative interaction is enabled by the designer's ability to interpret a representation in a new way after it has been produced. Schön posits that the designer's ability to see things in an emerging design that were not consciously put there is the core of the creative design process. The framework for computational association developed in this research draws a parallel between Schön's theory of creative design and Boden's (1990) notion of creativity as exploring (and potentially transforming) a conceptual space. The actions taken by a designer to modify their design may translate that design to a new position within the designer's conceptual space, or they may transform the space itself, reformulating the designer's understanding of the problem and producing a novel and surprising design. The genesis of both exploratory and transformative reformulations is the reconceptualisation of the representation the designer had constructed previously.
This produces a new interpretation of the design through which previously impossible actions are rendered possible. Schön sees the process of reflection-in-action as itself being based on analogical reasoning (Schön and Wiggins 1992), but this research inverts that relationship, putting forward a framework for association that is based on Schön's iterative cycle of reflection and action. This framework is referred to as interpretation-driven search. While inspired by the design process, the interpretation-driven search approach can be generalised beyond design tasks to any domain in which potentially creative associations are constructed. Interpretation-driven association uses iterative transformation and exploration of the objects being associated to produce a representation that enables a new mapping to be constructed. An interpretation is a transformation of the representation of the objects being associated. These transformations affect the object representations and enable potential mappings between them to be explored. In this approach, interpretations are explicitly represented elements of system knowledge, allowing them to be constructed, evaluated, stored and retrieved. The interpretation process iteratively interacts with the process of searching for mappings and operates in parallel with it. Interpretation influences mapping search, and mapping influences the construction, application and evaluation of interpretations.

A model of association that implements these principles can broadly be viewed as consisting of three processes: Representation, Interpretation and Mapping. Representation produces the ‘original' representations of the objects that are then iteratively searched, transformed and mapped by the Interpretation and Mapping cycle. This framework can be seen in Figure 1. The benefits of this parallel, interactive approach are discussed in Grace et al. (2012), along with a more detailed elaboration of the framework.

Figure 1: Interpretation-driven search, a high-level framework for computational association.

The creativity of associations The definition of, and criteria for, creativity have been the subject of considerable debate. One broad definition that has attained some consensus is that of creativity as the union of novelty and value (Sternberg and Lubart 1999). Novelty is a metric based on the difference between the artefact and other, existing artefacts in the same domain. Value is a metric based on the artefact's performance at whatever tasks it is applied to, compared to the performance of existing artefacts. Both of these qualities are highly contextualised, as novelty can only be assessed from the perspective of a viewer and usefulness can only be assessed in the context of an application. Some challenges arise in applying this pair of creativity criteria to the domain of computational association. Firstly, the novelty of an association is on some level guaranteed, as by definition an association must be a new relationship that did not exist previously. Recalling a relationship of which a system was already aware is a memory task, not an association task. This makes an association always at least P-novel (novel to the system itself, as defined by Boden (1990)). A significant challenge in applying the "novelty and value" framework for evaluating creativity to a model of association is in assessing an association's value.
Association does not necessarily incorporate an evaluative component, and it is not necessary that an association be constructed to serve some purpose. We refer to this goal-agnostic form of association as ‘free' association, which may incorporate evaluative components but in which the associations are not used to accomplish some purpose. Evaluation and purposefulness are instead features of association-derived processes that incorporate additional components. This does not mean, however, that ‘free' association has no effect on the system that constructed it, and therefore an alternative assessment of value can be derived. Different associations produce different transformed and mapped representations of the associated objects, and their value can be assessed based on the degree to which those representations go on to affect the system. This research focusses on this representational affordance model of association value as a way by which the model of association that has been developed could be further developed into an association-based creative system.

Representational affordance as a utility metric The "affordances" of an object or environment were first defined by the psychologist James Gibson (1979), referring to the opportunities it offers to a user. As applied to the design of objects (Norman 2002), affordances refer to the possibilities for action that a user perceives when interacting with an object. Affordances do not require instruction; they emerge implicitly from the interaction of an object, its user and the situation (Maier and Fadel 2009). A representation is an internal surrogate that encapsulates knowledge about an entity, enabling the agent or system to reason about that thing (Davis, Shrobe, and Szolovits 1993). In any system that permits the construction of different representations of an object, those representations will facilitate the performance of different actions by that system. Different representations of objects within a system open up different action possibilities for that system. Gero and Kannengeisser (2012) refer to this as representational affordance: the cognitive actions that are enabled by representing an object in a particular way. During the design process a representation may afford the construction of a new representation with its own, different set of affordances. Gaver (1991) refers to this as "sequential affordance", and it is consistent with the notion of reflection-in-action (Schön 1983).

Representations can provide affordances based on their syntax or based on their semantics. The structure of a graph representation can provide syntactic affordances, such as path following or matching. However, a representation can also provide semantic affordances based on how its content can interact with other system knowledge. This paper focusses on semantic representational affordances. In modelling the creativity of an association, a key question arises: what is the value of a new mapping and the representations that underlie it? We define an association's value in terms of what activities the possession of that association enables the system to perform. An association can be said to be of value if the interpretation of the associated objects it contains provides the system with different representational affordances than it previously possessed. Furthermore, associations can be compared and contrasted by the affordances they provide.
Value can be defined using representational affordances in the absence of any specific objectives or purpose of the association construction process, making it apt for use in a general model of creative association. In the case of an analogy-making system built on a model of association, the affordances that would be most relevant are those that enable acts of knowledge transfer between the object domains. By contrast, in a model of design style the most relevant affordances would be those that permit the detection of new patterns connecting stylistically similar objects. A model of ‘free' association that does not extend the process to incorporate a use for the mappings it constructs can also be assessed using the representational affordance metric for value. If the goal of a free association system is to construct as many different associations as possible, then valuable associations are those that afford the possibility of future, different associations. This kind of sequential affordance of association is made possible by association models that incorporate the effects of a system's past experiences in constructing associations into new association tasks. In this research we use the notion of representational affordances as a value metric for association models to discuss the potential creativity of results from a computational model of association.

Experimenting with interpretation-driven association A computational model based on the interpretation-driven framework for association has been developed. An implementation of that model which constructs ‘free' associations (in that the associations it constructs are not used for any explicit goals) between ornamental designs is described here. The structure of the model and its prototypical implementation are presented, along with selected association results produced by the system. The potential representational affordances of the results presented are discussed as a first step towards extending this model into an association-based creative system.

Computational model Interpretation-driven search builds on the model of analogy as Structure Mapping (Gentner 1983), in which the relationships within two objects are mapped, rather than their features. The search for these relationship mappings is integrated with an iterative process of re-representation. The model of interpretation-driven association (see Grace et al. (2012) for a detailed description) comprises five processes. The first three processes, concept formation, relation formation and graph construction, collectively form the "representation" process of the interpretation-based framework, while the latter two processes, mapping and interpretation, are direct implementations of that framework. The system begins with an image-based representation of the objects, extracts a set of features to describe them and then categorises those features into concepts. Relationships between these features within each object are then constructed based on both topological information (such as relative size, bearing or symmetry) from the feature sets and typological information from the conceptual categorisation (such as conceptual similarity or conceptual sameness). The features and relationships are then compiled into a graph representation that serves as the basis for the iterative mapping and interpretation, as sketched below.
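The following toy sketch reduces several of the authors' design decisions to bare stand-ins: graphs are dicts from feature-pair edges to relationship labels, the mapping search is collapsed into a label-overlap score, and new interpretations are sampled at random rather than extrapolated from failed candidate mappings, as the real system does.

    # A minimal, self-contained sketch of the parallel
    # interpretation/mapping cycle (Python).
    import random

    def apply_interpretation(graph, interp):
        # an interpretation substitutes one relationship label for another
        return {edge: interp.get(label, label) for edge, label in graph.items()}

    def mapping_score(g1, g2):
        # crude stand-in for subgraph search: count edges of the first
        # graph whose labels also occur in the second
        common = set(g1.values()) & set(g2.values())
        return sum(label in common for label in g1.values())

    def associate(g1, g2, labels, iterations=100):
        best_interp, best_score = {}, -1
        interp = {}                      # the 'current' interpretation
        for _ in range(iterations):
            score = mapping_score(apply_interpretation(g1, interp),
                                  apply_interpretation(g2, interp))
            if score > best_score:
                best_interp, best_score = dict(interp), score
            # propose a new interpretation: treat one label as another
            interp = {random.choice(labels): random.choice(labels)}
        return best_interp, best_score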
These subgraphs represent regions of the two images that possess a consistent relational structure. The transformations applied by the interpretation process affect the structure or content of the object graph representations. Implementations of this model could utilise a variety of transformational approaches, such as transforming the graph objects directly, transforming the features or concepts and then re-constructing the graphs, or even transforming the process by which one or more representational stages are constructed. At any given time a single transformation is applied to the graph representations; this is referred to as the ‘current' interpretation. This interpretation changes the structure of the graphs, altering the trajectory of the mapping search operating on those graphs. The mapping search process produces candidate mappings as it searches, and these are used to construct new interpretations. New interpretations are constructed by examining which feature-to-feature mappings in those candidates cannot currently be successfully mapped, and extrapolating what transformations would be necessary for them to succeed. Implementation The implementation of the model uses vector images as its input, calculating object features from the minimal closed shapes formed by vector lines. The kinds of relationship implemented in the system are ‘same concept', ‘similar concept', ‘relative scale', ‘linear distance', ‘horizontal distance', ‘vertical distance', ‘relative orientation', ‘bearing', ‘contains', ‘reflection of', ‘shared vertex' and ‘shared edge'. The implementation is provided with the knowledge necessary to detect these relationships and categorise them into groups such as "slightly smaller than" or "120 degrees of difference in orientation". Instantiations of these relationships form the edge labels on the graph representations of each object being associated. Mapping search is implemented as a genetic algorithm that searches for subgraph isomorphisms between the graph representations of the two objects. Each individual in the population of the genetic algorithm is a set of mappings between a feature in one object and a feature in the other. The fitness of an individual is the size of the largest contiguous subgraph that can be constructed out of those feature-to-feature mappings in both objects. This use of a powerful, general search algorithm reflects the fact that we are not attempting to implement association in a biologically or cognitively plausible way; rather, we are demonstrating the feasibility of the interpretation-driven approach. The interpretation process is implemented as the substitution of relationships between features. Replacing relationships effectively causes the system to perceive two disparate relationships as being alike. Interpretations in this system can be expressed as "in this situation, relationship X in the first object is the same as relationship Y in the second object". An interpretation is therefore a set of rules for replacing relationships, where relationships are represented as edge labels in the graphs. The interpretation applied to the objects can change at every iteration, providing the parallelism between mapping and interpretation that characterises the interpretation-based framework.
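To make the mapping-and-interpretation machinery concrete, the following is a minimal Python sketch of the two core operations just described: applying an interpretation (a set of edge-relabelling rules) to a graph representation, and scoring a candidate feature-to-feature mapping. This is not the authors' code; the graph encoding, the function names and the simplified edge-count fitness (standing in for the largest-contiguous-subgraph fitness used by the real system) are our assumptions.

```python
# Sketch only: a graph is a dict mapping (feature, feature) pairs to
# relationship labels; an interpretation maps labels in object 1 onto
# labels in object 2 ("relationship X is the same as relationship Y").

def apply_interpretation(graph, interpretation):
    """Relabel edges according to the interpretation's substitution rules."""
    return {edge: interpretation.get(label, label)
            for edge, label in graph.items()}

def mapping_fitness(mapping, graph1, graph2, interpretation):
    """Count edges preserved by a candidate feature-to-feature mapping.

    mapping: dict from features of object 1 to features of object 2.
    Simplification: the real fitness is the size of the largest
    contiguous common subgraph; here we just count consistent edges.
    """
    g1 = apply_interpretation(graph1, interpretation)
    preserved = 0
    for (a, b), label in g1.items():
        if a in mapping and b in mapping:
            if graph2.get((mapping[a], mapping[b])) == label:
                preserved += 1
    return preserved
```

Under these assumptions, an interpretation such as {"rot50": "rot20"} would make a "~50 degrees of rotation" edge in object 1 count as matching a "~20 degrees of rotation" edge in object 2, exactly the kind of imposed equality discussed in the results below.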
Methodology A total of 31 ornamental designs were input into the system as part of a series of experiments to demonstrate the application of interpretation-based association. Objects were drawn from a broad variety of design domains, including symbols, architectural ornamentation, decorations and object designs, and from a variety of cultures and historical periods. From this library of designs a subset of objects was selected for which interesting associations could be produced and the capabilities of the system could be documented. A set of associations constructed by repeatedly associating a single pair of objects is presented. These associations are presented as a demonstration of the interpretation-based model, but also as a starting point from which the use of representational affordances as a metric for utility in association-based creative systems can be discussed. The two objects associated here are presented in Figure 2. Object 1, on the left, is a Hittite sun symbol, while Object 2, on the right, is a Japanese floral symbol. Both are vector line drawings produced manually from black and white images by the authors. For the purposes of this experiment the system has been restricted so that the only type of relationship which connects the features of these two objects is relative orientation. Figure 2: The two objects used in the example associations, (a) Object 1 and (b) Object 2. The minimal closed shapes extracted from the images by the system are numbered f1 to f36. The relative orientations of these features are the relationships relevant to these examples. Designs sourced from (Humbert 1970). Results Three associations between the two objects in Figure 2, along with the interpretations used to produce them, are shown in Figures 3, 4 and 5. All three associations constructed between these two objects utilised the ‘relative orientation' relationship type, but each involved a different interpretation. These differing interpretations permitted the construction of different mappings. In each of these figures the associated objects are presented side by side, with the features involved in the mapping highlighted. The mapping between features in one object and features in the other is shown as solid lines joining the two images. The common set of relationships between the features within each object is shown as thick dashed lines. Only the relationships that are used in the mapping are shown; pairs of features can have many relationships connecting them. Each of these relationships is labelled with its uninterpreted description. Interpretations are an imposed equality between different labels and are shown at the bottom of each image. Mappings can be constructed between sets of features that share patterns of relationships after this interpretation is applied. The first association, seen in Figure 3, is constructed without the use of an interpretation. Within the representations of the two objects there exists a pattern of seven nodes in each that share a consistent pattern of relationships. All seven features in both objects, in the order indicated by the thick dashed lines, are consecutively separated by approximately 150 degrees of orientation. This relationship is present between every second point in the seven-pointed star in Object 1, starting from Feature f4 and proceeding twice around the star to Feature f9.
The same relationship of relative orientation is present between every eighth petal in Object 2 (that is, between each petal and the petal one spot to its left in the floret to its left), starting from Feature f35 and proceeding in a spiralling pattern twice around the design to f29. The ‘null' interpretation i∅ is shown, as this mapping can be constructed from the base representations produced by the system without any transformation. This association is included to demonstrate the capability of the representation construction and mapping search elements of our model of association. Without the use of interpretations the system is capable of producing representations of visual objects comprised of networks of abstract relationships and features. Figure 3: An association constructed between the two objects without the use of an interpretation. All the relationships incorporated into this mapping (depicted by thick dashed lines) are of the type "~150 degrees of difference of orientation". These relationships join the seven points of the star in Object 1 (by traversing the star twice) and seven of the petals in Object 2 (in a spiral pattern joining every eighth petal). The empty or "null" interpretation means that these relationships are present in the default object representations the system constructs. These representations can then be searched for common patterns of relationships, allowing the features which those relationships join to be mapped. Without the ability to transform representations, this association (and others trivially different from it) is all that the system can construct. With mappings limited to those relationships already present in both objects, the potential for constructing an association with relevant affordances is slim. The only way to give a system with a single representation of each object the ability to construct additional associations is to incorporate more information into those representations. In this case that information would take the form of additional types of relationship between features other than differences of orientation. Utilising the capacity to reinterpret object representations, the system is not limited to "literal" associations such as that seen in Figure 3. Figure 4 shows the mapping produced by an association between the same two objects that was produced through interpretation-driven search. During the search for mappings between the objects, the system constructed (initially by chance) a fragment of a mapping similar to the one shown in Figure 4. This mapping candidate would not have been successful in the absence of an interpretation. The candidate was selected by the interpretation construction process, which reverse-engineered one or more interpretations from it. Interpretations are generated that would increase the size of the largest common subgraph of the mapping specified by the candidate. Such an interpretation is then likely to become the "active" one if the mapping search reaches a point where the current interpretation is significantly outperformed by it. The search for mappings thus influences the construction of interpretations, and those interpretations in turn influence mapping.
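The reverse-engineering step just described can be sketched in a few lines. This is a hedged illustration under the same encoding assumptions as the earlier sketch, not the paper's implementation: given a candidate mapping whose edges fail to match, it proposes the label substitutions that would make them match.

```python
def propose_interpretations(candidate, graph1, graph2):
    """Extrapolate relabelling rules from a failed candidate mapping.

    candidate: dict from features of object 1 to features of object 2.
    Returns (label1, label2) rules stating that treating label1 in
    object 1 as equal to label2 in object 2 would let the currently
    unmappable parts of the candidate succeed.
    """
    rules = set()
    for (a, b), label1 in graph1.items():
        if a in candidate and b in candidate:
            label2 = graph2.get((candidate[a], candidate[b]))
            if label2 is not None and label2 != label1:
                rules.add((label1, label2))
    return rules
```

Each proposed rule set could then be scored with a mapping fitness like the one sketched earlier, and adopted as the "active" interpretation only when it significantly outperforms the current one, mirroring the trigger behaviour described in the text.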
The mapping expressed in Figure 4 is based on an interpretation that effectively treats the orientation difference between adjacent points on the star in Object 1 as the same as the orientation difference between adjacent petals in a floret in Object 2. Thick dashed lines are shown representing the "approximately 50 degrees of orientation difference" relationship in Object 1 and the "approximately 20 degrees of orientation difference" relationship in Object 2. The resultant mapping connects each feature in the star in Object 1, starting with f4 and proceeding sequentially to f10, with each feature in a floret in Object 2, starting with f14 and proceeding sequentially to f20. This mapping was constructed from low-level relationships extracted from a visual representation of these objects, which were then interpreted to make those relationships situationally alike. Figure 4: An association constructed between the two objects through the application of an interpretation which equates the relationship "~50 degrees of difference of orientation" in Object 1 to the relationship "~20 degrees of difference of orientation" in Object 2. These relationships join the adjacent points of the star in Object 1 with eight adjacent petals in one of the florets in Object 2. The interpretation that enabled this association is shown in the box beneath the objects. The notion of representational affordances as a tool for assessing value in association models can be applied to the association shown in Figure 4. The interpreted representations of the two objects are useful to the extent that the new interpretations more aptly afford actions to the system. In the case of a free association model like this implementation, the only actions available are the construction of further associations, and therefore the value of an association can only be defined by the degree to which it enables that. In this system interpretations are remembered and re-used, which causes the system's past experiences to affect future interpretations. This provides a mechanism by which this association can guide the system's future actions. Representational affordances provided by the association in Figure 4 could be used to define the value of that association if the system were extended to perform purposeful association construction. If associations were being constructed for use by an ornamental design classification system, then the transformed representation may afford the possibility of classifying Object 2 as being based on radial adjacency, which a naive classification system would not have been able to do. If instead associations were being constructed for use in an analogy-based design system, then the mapping of f4 through f10 to f14 through f20 may afford the transfer of knowledge about f11 (the centre circle which all of the mapped features in Object 1 touch) to f13, which could then be considered the "centre" of Object 2. The actions permitted by the representational affordances of the association are used by the system to achieve its goals, so those affordances can be said to be of value because of that use. Figure 5 shows a different association constructed by the system. This time the interpretation equates the difference in orientation between every third point in the star in Object 1 with the difference in orientation between the edge petals in Object 2.
Viewing the "edges" of a compound object such as Object 2 as being part of a rotating sequence may be a valuable affordance for a creative system. The association system implemented in this research was able to find a broad variety of such associations using just two objects and considering just one type of relationship. Different combinations of the relationships that were mapped in Figures 4 and 5 were also found, such as mapping every adjacent point on the star in Object 1 to the outermost petals of the three florets in Object 2. Interpretation construction provided the system with the ability to produce a variety of divergent mappings from a single association problem. Figure 5: An association constructed between the two objects through the application of an interpretation which equates the relationship "~150 degrees of difference of orientation" in Object 1 with the relationship "~120 degrees of difference of orientation" in Object 2. These relationships join every third point of the star in Object 1 with the two outermost petals in each floret in Object 2. As there are only six such petals in Object 2, only six of the seven points in Object 1 have been mapped. The interpretation that enabled this association is shown in the box beneath the objects. Discussion The experiments described here demonstrate that it is possible to use the interpretation-driven search approach to construct associations between real-world design objects. The representation construction processes used to produce the graphs on which the iterative mapping and interpretation processes operate have been shown to be viable. Associations were produced based on interpretations that were constructed by the system, which transformed graph representations that had also been constructed by the system from features and concepts extracted from low-level visual input. These results serve as an initial proof of concept of the interpretation-driven model of association. The associations presented in this paper could not have been constructed from the information that was provided to the system without the ability of the system to transform its representations through the interpretation process. These associations could have been constructed without the use of interpretation if the system were provided with additional information about the relationships between the objects, but this reduced representational autonomy would have a deleterious effect on novelty. Assessed in the context of a hypothetical society of individuals with access to the same information and possessed of comparable perceptual abilities, associations produced using interpretation will be P-novel to any individual that has not constructed the same interpretation, while associations produced using additional information would be apparent to any other individual with access to that information. The P-novelty of an association is guaranteed, as the system by definition did not know of the relationship expressed in an association before its construction. However, the interpretations used in the model of association described in this research may not themselves be novel, as they can be learnt and re-used. It is possible to apply a known interpretation to a different object and still produce a novel representation, as a known transformation can produce a novel result.
It is also possible for a P-novel association to be constructed from a representation that has been used before. For example, if the representation of Object 1 present in Figure 4 (and the interpretation used to produce it) were associated with some other, different object, then the resulting association would be P-novel. The known representation of Object 1 is novel in the current circumstances, but is not novel to the system as a whole. This is referred to as "situational" or "S" novelty, after the definition of "S-creativity" in Suwa et al. (1999). Figure 3 demonstrates an inherent weakness of a model of creative association that does not incorporate the ability to reinterpret its object representations. The mapping it expresses is quite likely H-novel, in that it is not expected that the majority of human observers would have identified that mapping. However, any sufficiently capable pattern-matching system provided with the two graph representations used by the system would come to the exact same conclusion. The mapping used in this association may be H-novel in a society composed of individuals with human perceptual biases, but to a hypothetical society composed of individuals with a perceptual system like that of this model, the mapping is obvious. The use of an interpretation in the construction of associations acts to redress this weakness of models of creative association based on static representations. The association expressed in Figure 4 could not be constructed from the graph representations the system built without the use of a representation transformation process. In a hypothetical society of individuals with the same perceptual biases as this system, the mapping used in this association would be P-novel to any individual that had not constructed the same interpretation. If this same mapping were instead constructed by incorporating into the default representations a new relationship type that made the mapping possible without interpretation ("radial adjacency", for example), then the resultant mapping would, like the one in Figure 3, be trivially deducible by any other pattern-matching system with access to the same information. The utility component of evaluating creativity can be aided by the use of representational affordances. In the interpretation-driven approach to association, mappings are produced between transformed representations, and these can reveal structures and connections that were not apparent in the original representations. The utility of those mappings can then be assessed by what actions those new structures enable the system to take. For example, the view of the outermost petals in Object 2 as a sequence of rotated features seen in Figure 5 was not expressed in the uninterpreted representation. The notion of representational affordance also allows the value of an association to be defined in the abstract for models of association that do not use the associations they construct to accomplish any objectives. Maher (2010) frames the evaluation of a creative artefact as requiring three criteria: not just novelty and value but also unexpectedness. Unexpectedness (also referred to as ‘surprisingness') is a metric based on how different the artefact is from what was expected to be the next artefact produced. Unexpectedness, writes Maher, differs from novelty in that it relates to the expected trajectory of the domain or field in which the artefact is being produced, which is distinct from the existing set of artefacts within that domain.
Assessing the unexpectedness of an association using Maher's criteria requires identifying abstract patterns in the sequence of recently constructed associations that can be used to project a trajectory of expected associations. Associations can certainly be surprising in a variety of ways: much humour depends on setting the recipient up to expect that a certain association is being proposed, then subverting that expectation and instead constructing a very different association. However, unexpectedness as defined by Maher specifically refers to identifying emerging trends in the output of a system over multiple iterations. This is a challenging task in the domain of association, as it is difficult to define a similarity metric that could then be used to find patterns in association output. The interpretation-based model of association could provide a method by which the unexpectedness of associations could be assessed. Interpretation-based associations could be characterised by the interpretations used to construct them, which are significantly more generalisable than the mappings that comprise those associations. The explicit representation of interpretations permits further investigation of expectation and unexpectedness in computational models of association. This would address a challenging issue in the development of creative models of association and methods by which they can be evaluated. The representational affordance framework for assessing the value of an association allows us to consider the associations constructed by an interpretation-based system as creative artefacts. The utility of such an artefact is the degree to which it has an effect on the system that constructed it. In a "free" association system where experience plays a role in future associations, that effect can be defined as the influence an association has on the construction of future associations. 2012_32 !2012 How Did Humans Become So Creative? A Computational Approach Liane Gabora Department of Psychology University of British Columbia, 3333 University Way Kelowna BC, CANADA, V1V 1V7 liane.gabora@ubc.ca Steve DiPaola Department of Cognitive Science / SIAT Simon Fraser University, 250-13450 102 Ave Surrey BC, CANADA, V3T 0A3 sdipaola@sfu.ca Abstract This paper summarizes efforts to computationally model two transitions in the evolution of human creativity: its origins about two million years ago, and the ‘big bang' of creativity about 50,000 years ago. Using a computational model of cultural evolution in which neural network based agents evolve ideas for actions through invention and imitation, we tested the hypothesis that human creativity began with the onset of the capacity for recursive recall. We compared runs in which agents were limited to single-step actions to runs in which they used recursive recall to chain simple actions into complex ones. Chaining resulted in higher diversity, open-ended novelty, no ceiling on the mean fitness of actions, and greater ability to make use of learning. Using a computational model of portrait painting, we tested the hypothesis that the explosion of creativity in the Middle/Upper Paleolithic was due to the onset of contextual focus: the capacity to shift between associative and analytic thought. This resulted in faster convergence on portraits that resembled the sitter, employed painterly techniques, and were rated as preferable.
We conclude that recursive recall and contextual focus provide a computationally plausible explanation of how humans evolved the means to transform this planet. Introduction To gain insight into the mechanisms underlying creativity, one might start by testing people's creative abilities, perhaps using technologies such as fMRI, or by dissecting the brains of people who were known to be particularly creative during their lifetimes. However, to gain insight into the evolution of creativity, these options do not exist. All that is left of our prehistoric ancestors are their bones and artifacts, such as stone tools, that resist the passage of time. Thus to understand the evolution of creativity, computational modeling is virtually the only scientific tool we have. Humans are not only creative; we put our own spin on the inventions of others, such that new inventions build cumulatively on previous ones. This cumulative cultural change is referred to as the ratchet effect (Tomasello, Kruger, & Ratner, 1993), and it has been suggested that it is uniquely human (Donald, 1991). A mathematical model of two transitions in the evolution of the cognitive mechanisms underlying creativity has been put forward (Gabora & Aerts, 2009). Computational models of these mechanisms have also been developed (DiPaola & Gabora, 2007, 2009; Gabora, 1995, 2008a,b; Gabora & Leijnen, 2009; Leijnen & Gabora, 2009, 2010; Gabora & Saberi, 2011). However, these efforts used different modeling platforms, and because the aims underlying them have been part scientific and part artistic, their relevance to each other, and to an overarching research program, has not previously been made clear. The goal of this paper is to explain how, together, they constitute an integrated effort to computationally model the evolution of human creativity. First Transition: The Earliest Signs of Creativity The minds of our earliest ancestors, Homo habilis, have been referred to as episodic because there is no evidence that they deviated from the present moment of concrete sensations (Donald, 1991). They could encode perceptions of events in memory, and recall them in the presence of a cue, but had little voluntary access to memories without environmental cues. They were therefore unable to shape, modify, or practice skills and actions, and unable to invent or refine complex gestures or vocalizations. Homo erectus lived between approximately 1.8 and 0.3 million years ago. The cranial capacity of the Homo erectus brain was approximately 1,000 cc, about 25% larger than that of Homo habilis, at least twice as large as that of living great apes, and 75% that of modern humans (Ruff et al., 1997). This period is widely referred to as the beginnings of cumulative culture. Homo erectus exhibited many indications of enhanced intelligence, creativity, and ability to adapt to the environment, including sophisticated, task-specific stone hand axes; complex, stable seasonal home bases; long-distance hunting strategies involving large game; and migration out of Africa. This period marks the onset of the archaeological record, and it is thought to be the beginning of human culture. It is widely believed that this cultural transition reflects an underlying transition in cognitive or social abilities. Some have suggested that Homo erectus owed these achievements to the onset of theory of mind (Mithen, 1998) or the capacity to imitate (Dugatkin, 2001).
However, there is evidence that other species possess theory of mind and the capacity to imitate (Heyes, 1998), yet do not compare to modern humans in intelligence and cultural complexity. Evolutionary psychologists have suggested that the intelligence and cultural complexity of the Homo line is due to the onset of massive modularity (Buss, 1999, 2004; Barkow, Cosmides, & Tooby, 1992). However, although the mind exhibits an intermediate degree of functional and anatomical modularity, neuroscience has not revealed vast numbers of hardwired, encapsulated, task-specific modules; indeed, the brain has been shown to be more highly subject to environmental influence than was previously believed (Buller, 2005; Byrne, 2000; Wexler, 2006). A Promising and Testable Hypothesis Donald (1991) proposed that with the enlarged cranial capacity of Homo erectus, the human mind underwent the first of three transitions by which it, along with the cultural matrix in which it is embedded, evolved from the ancestral, pre-human condition. This transition is characterized by a shift from an episodic to a mimetic mode of cognitive functioning, made possible by the onset of the capacity for voluntary retrieval of stored memories, independent of environmental cues. Donald refers to this as a self-triggered recall and rehearsal loop. Self-triggered recall enabled information to be processed recursively with respect to different contexts or perspectives. It allowed our ancestors to access memories voluntarily and thereby act out1 events that occurred in the past or that might occur in the future. Thus not only could the mimetic mind temporarily escape the here and now, but by miming or gesture it could communicate similar escapes in other minds. The capacity to mime thus ushered forth what is referred to as a mimetic form of cognition and brought about a transition to the mimetic stage of human culture. The self-triggered recall and rehearsal loop also enabled our ancestors to engage in a stream of thought. One thought or idea evokes another, revised version of it, which evokes yet another, and so forth recursively. In this way, attention is directed away from the external world toward one's internal model of it. Finally, self-triggered recall allowed for voluntary rehearsal and refinement of actions, enabling systematic evaluation and improvement of skills and motor acts. Computational Model Donald's hypothesis is difficult to test directly, for if correct it would leave no detectable trace. It is, however, possible to computationally model how the onset of the capacity for recursive recall would affect the effectiveness, diversity, and open-endedness of ideas generated in an artificial society. This section summarizes how we tested Donald's hypothesis using an agent-based computational model of culture referred to as ‘EVOlution of Culture', abbreviated EVOC. Details of the modeling platform are provided elsewhere (Gabora, 2008b, 2008c; Gabora & Leijnen, 2009; Leijnen & Gabora, 2009). 1 The term mimetic is derived from "mime," which means "to act out." The EVOC World. EVOC uses neural network based agents that (i) invent new ideas, (ii) imitate actions implemented by neighbors, (iii) evaluate ideas, and (iv) implement successful ideas as actions. Invention works by modifying a previously learned action, using learned trends (such as that more overall movement tends to be good) to bias the invention process. The process of finding a neighbor to imitate works through a form of lazy (non-greedy) search.
An imitating agent randomly scans its neighbors, and adopts the first action that is fitter than the action it is currently implementing. If it does not find a neighbor that is executing a fitter action than its own, it continues to execute the current action. Over successive rounds of invention and imitation, agents' actions improve. EVOC thus models how descent with modification occurs in a purely cultural context. Agents do not evolve in a biological sense (they neither die nor have offspring) but do in a cultural sense, by generating and sharing ideas for actions. Following Holland (1975) we refer to the success of an action in the artificial world as its fitness, with the caveat that, unlike its usage in biology, here the term is unrelated to number of offspring (or ideas derived from a given idea). The fitness function (FF) was originally chosen because it allows investigation of biological phenomena such as underdominance and epistasis in a cultural context (see Gabora, 1995); the one used here is one of several used in EVOC (see Gabora, 2008 for others). The FF rewards head immobility and symmetrical limb movement. Fitness of actions starts out low because initially all agents are entirely immobile. Soon some agent invents an action that has a higher fitness than doing nothing, and this action gets imitated, so fitness increases. Fitness increases further as other ideas get invented, assessed, implemented as actions, and spread through imitation. The diversity of actions initially increases due to the proliferation of new ideas, and then decreases as agents hone in on the fittest actions. We used a toroidal lattice with 100 nodes, each occupied by a single, stationary agent, and a von Neumann neighborhood structure (agents only interacted with their four adjacent neighbors). During invention, the probability of changing the position of any body part involved in an action was 1/6. On each run, creators and imitators were randomly dispersed. Chaining. Chaining gives agents the opportunity to execute multi-step actions. For the experiments reported here with chaining turned on, if in the first step of an action an agent was moving at least one of its arms, it executes a second step, which again involves up to six body parts. If, in the first step, the agent moved one arm in one direction, and in the second step it moved the same arm in the opposite direction, it has the opportunity to execute a three-step action. And so on. The agent is allowed to execute an arbitrarily long action so long as it continues to move the same arm in the opposite direction to the direction it moved previously. Once it does not do so, the chained action comes to an end. The longer it moves, the higher the fitness of this multi-step chained action. Where n is the number of chained steps, the fitness Fc of a chained action is calculated as Fc = Fnc + (n - 1), where Fnc denotes the fitness of the corresponding non-chained (single-step) action. The fitness function with chaining provides a simple means of simulating the capacity for recursive recall.
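The chained fitness is simple enough to state directly in code. The sketch below is illustrative, assuming the single-step fitness Fnc is given and that the repeatedly moved arm's directions are encoded as +1/-1; the full EVOC body representation is omitted.

```python
def chained_fitness(f_nc, arm_moves):
    """Fc = Fnc + (n - 1), where n is the length of the chain.

    f_nc: fitness of the corresponding single-step action.
    arm_moves: directions (+1 or -1) of the repeatedly moved arm.
    The chain continues only while the arm alternates direction.
    """
    n = 1
    for prev, cur in zip(arm_moves, arm_moves[1:]):
        if cur == -prev:      # same arm, opposite direction: chain grows
            n += 1
        else:                 # chain ends at the first non-alternating move
            break
    return f_nc + (n - 1)

# Example: an action whose arm moves +1, -1, +1 chains three steps,
# so its fitness is the single-step fitness plus 2.
```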
‘Origins of Creativity' Results As shown in Figure 1, the capacity to chain simple actions into more complex ones increases the mean fitness of actions in the society. This is most evident in the later phase of a run. Without chaining, agents converge on optimal actions, and the mean fitness of actions reaches a plateau. With chaining, however, there is no ceiling on the mean fitness of actions. By the 100th iteration it reached almost 15, indicating a high incidence of chaining. Figure 1. Mean fitness of actions in the artificial society with chaining versus without chaining. As shown in Figure 2, chaining also increases the diversity of actions. This is most evident in the early phase of a run, before agents begin to converge on optimal actions. Although in both cases there is convergence on optimal actions, without chained actions this is a static set (thus mean fitness plateaus), whereas with chained actions the set of optimal actions is always changing, as increasingly fit actions are found (thus mean fitness keeps increasing). Figure 2. Mean number of different actions in the artificial society with chaining (continuous line) versus without chaining (dashed line). Recall that agents can learn trends from past experiences (using the knowledge-based operators), and thereby bias the generation of novelty in directions that have a greater than chance probability of being fruitful. Since chaining provides more opportunities to capitalize on the capacity to learn, we hypothesized that chaining would accentuate the impact of learning on the mean fitness of actions, and this too turned out to be the case (Gabora & Saberi, 2011). Second Transition: The ‘Big Bang' of Human Creativity The European archaeological record indicates that a truly unparalleled cultural transition occurred between 60,000 and 30,000 years ago at the onset of the Upper Paleolithic (Bar-Yosef, 1994; Klein, 1989; Mellars, 1973, 1989a, 1989b; Soffer, 1994; Stringer & Gamble, 1993). Considering it "evidence of the modern human mind at work," Richard Leakey (1984:93-94) describes the Upper Palaeolithic as follows: "unlike previous eras, when stasis dominated, ... [with] change being measured in millennia rather than hundreds of millennia." Similarly, Mithen (1996) refers to the Upper Paleolithic as the ‘big bang' of human culture, exhibiting more innovation than in the previous six million years of human evolution. We see the more or less simultaneous appearance of traits considered diagnostic of behavioral modernity. It marks the beginning of a more organized, strategic, season-specific style of hunting involving specific animals at specific sites; elaborate burial sites indicative of ritual and religion; evidence of dance, magic, and totemism; the colonization of Australia; and the replacement of Levallois tool technology by blade cores in the Near East. In Europe, complex hearths and many forms of art appeared, including cave paintings of animals, decorated tools and pottery, bone and antler tools with engraved designs, ivory statues of animals and sea shells, and personal decoration such as beads, pendants, and perforated animal teeth, many of which may have indicated social status (White, 1989a, 1989b). Whether this period was a genuine revolution culminating in behavioral modernity is hotly debated, because claims to this effect are based on the European Palaeolithic record and largely exclude the African record (McBrearty & Brooks, 2000; Henshilwood & Marean, 2003). Indeed, most of the artifacts associated with a rapid transition to behavioral modernity at 40-50,000 years ago in Europe are found in the African Middle Stone Age tens of thousands of years earlier. However, the traditional and currently dominant view is that modern behavior appeared in Africa between 50,000 and 40,000 years ago due to biologically evolved cognitive advantages, and spread, replacing existing species, including the Neanderthals in Europe (e.g., Ambrose, 1998; Gamble, 1994; Klein, 2003; Stringer & Gamble, 1993).
Thus from this point onward there was only one hominid species: modern Homo sapiens. Despite the lack of an overall increase in cranial capacity, the prefrontal cortex, and particularly the orbitofrontal region, increased significantly in size (Deacon, 1997; Dunbar, 1993; Jerison, 1973; Krasnegor, Lyon, and Goldman-Rakic, 1997; Rumbaugh, 1997), and it was likely a time of major neural reorganization (Klein, 1999; Henshilwood, d'Errico, Vanhaeren, van Niekerk, and Jacobs, 2000; Pinker, 2002). Given that the Middle/Upper Palaeolithic was a period of unprecedented creativity, what kind of cognitive processes may have been involved? A Testable Hypothesis Converging evidence suggests that creativity involves the capacity to shift between two forms of thought (Finke, Ward, & Smith, 1992; Gabora, 2003; Howard-Jones & Murray, 2003; Martindale, 1995; Smith, Ward, & Finke, 1995; Ward, Smith, & Finke, 1999). Divergent or associative processes are hypothesized to occur during idea generation, while convergent or analytic processes predominate during the refinement, implementation, and testing of an idea. It has been proposed that the Paleolithic transition reflects a mutation to the genes involved in the fine-tuning of the biochemical mechanisms underlying the capacity to subconsciously shift between these modes, depending on the situation, by varying the specificity of the activated cognitive receptive field. This is referred to as contextual focus2 because it requires the ability to focus or defocus attention in response to the context or situation one is in. Defocused attention, by diffusely activating a broad region of memory, is conducive to divergent thought; it enables obscure (but potentially relevant) aspects of the situation to come into play. Focused attention is conducive to convergent thought; memory activation is constrained enough to hone in on and perform logical mental operations on the most clearly relevant aspects. Support from a Computational Model Again, because it would be difficult to empirically determine whether Paleolithic humans became capable of contextual focus, we decided to begin by determining whether the hypothesis is at least computationally feasible. To do so we used an evolutionary art system that generated progressively evolving sequences of artistic portraits, with no human intervention once initiated. We sought to determine whether incorporating contextual focus into the fitness function would play a crucial role in enabling the computer system to generate art that humans find "creative" (i.e. possessing qualities of novelty and aesthetic value typically ascribed to the output of a creative artistic process). We implemented contextual focus in the evolutionary art algorithm by giving the program the capacity to vary its level of fluidity and control over different phases of the creative process in response to the output it generated. 2 In neural net terms, contextual focus amounts to the capacity to spontaneously vary the shape of the activation function: flat for divergent thought and spiky for analytical. The creative domain of portrait painting was chosen because it requires both focused attention and analytical thought to accomplish the primary goal of creating a resemblance to the portrait sitter, as well as defocused attention and associative thought to deviate from resemblance in a way that is uniquely interesting, i.e., to meet the broad and often conflicting criteria of aesthetic art.
Since judging creative art is subjective, rather than using quantitative analysis, a representative subset of the automatically produced artwork from this system was selected, output to high-quality framed images, and submitted to peer-reviewed and commissioned art shows, thereby allowing it to be judged positively or negatively as creative by human art curators, reviewers and the art-gallery-going public. Our strategy for modeling contextual focus may raise questions about the ability of computers to "truly" be creative, and about the role of the human system designer in the creative output. Several researchers in computational creativity have addressed such questions by outlining different dimensions of creativity and proposing schemata for evaluating the "level of creativity" of a given system, for example (Ritchie, 2007; Jennings, 2010; Colton, Pease, & Charnley, 2011). We are interested in applying such analyses to our portrait system as a possibility for future work; indeed, the mechanics of contextual focus might be clarified by the computational creativity literature. In particular we are interested in further exploring the link between system-modified fitness constraints and the idea of transformational creativity (Boden, 2003; Wiggins, 2006). However, for the purposes of the current paper, it is less important to address the question of designer involvement in system creativity, or to try to quantify the amount of creativity displayed. Rather, we concentrate on the qualitative impact made by the explicit incorporation of contextual focus into the system as a whole, and its ability to elevate the perceived quality and novelty of system output to a level audiences judged reminiscent of successful "artistic, human-style" creativity. Generative Art Systems: Creative evolutionary systems are a class of search algorithms inspired by Darwinian evolution, the most popular of which are genetic algorithms (GA) and genetic programming (GP) (Koza, 1993). These techniques solve complex problems by encoding a population of randomly generated potential solutions as ‘genetic instruction sets', assessing the ability of each to solve the problem using a predefined fitness function, mutating and/or marrying (applying crossover to) the best to yield a new generation, and repeating until one of the offspring yields an acceptable solution. We are not claiming that contextual focus is Darwinian, but simply that, for our computational modeling purposes, genetic programming proved a convenient foundation on which to build our contextual focus fitness function module. Typically these systems allow a human user to pick the individuals that will be mated, making the human the creative judge. In contrast, our system used a function trigger mechanism within the contextual focus fitness function which allowed the process to run automatically, without any human intervention once the process was started. It was not until the evolutionary art process came to completion that humans looked at and evaluated the art. Others have begun to use creative evolutionary systems with an automatic fitness function in design and music, as well as in a creative invention machine (Bentley & Corne, 2002). What is unique in our approach is that it incorporates several techniques that enable it to shift to processing artistic content in a more divergent or associative manner, and it employs a form of GP called Cartesian Genetic Programming (Miller, 2011), detailed in the next section.
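The generic loop just described, with an automatic fitness function replacing the human judge, can be sketched as follows. This is an illustrative skeleton under our own naming assumptions, not the Cartesian Genetic Programming implementation the system actually uses.

```python
import random

def evolve(population, fitness, mutate, crossover, generations=100, elite=10):
    """Generic evolutionary loop with an automatic fitness function."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]                    # keep the fittest
        offspring = [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(len(population) - elite)]
        population = parents + offspring                # next generation
    return max(population, key=fitness)
```

In the system described here, the fitness callable is not fixed: as explained below, triggers inside it shift the weighting between resemblance and painterly rules over the course of a run.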
Implementation: The GP function set has 13 functions which use unitized x and y positions of the portrait image as variables, plus additional parameter variables (noted PM) that can be affected by adaptive mutation. The functions are low-level in nature, which aids in a large ‘creative' search space, and output HSV color space values between 0 and 255. An individual in our population is manifested as one program that runs successively for every pixel in the output image, and that is then tested against our creative fitness function. This allows correlated painterly effects as one moves through the image. Functions 1 through 5 use simple logical or arithmetic manipulations of the positions (low-level functions create a larger ‘creative' search space), whereas functions 6 through 13 use trigonometric or logical functions that are more related to geometric shapes and color graduations. The 13 functions of the function set are: 1: x | y; 2: PM & x; 3: (x + y) % 255; 4: if (x > y) x - y, else y - x; 5: 255 - x; 6: abs(cos(x) * 255); 7: abs(tan(((x % 45) * pi)/180.0) * 255); 8: abs(tan(x) * 255) % 255; 9: sqrt((x - PM)^2 + (y - PM)^2), thresholded at 255; 10: x % (PM + 1) + (255 - PM); 11: (x + y)/2; 12: if (x > y) 255 * ((y + 1)/(x + 1)), else 255 * ((x + 1)/(y + 1)); 13: abs(sqrt(x - PM^2 + y - PM^2) % 255). The contextual-focus-based fitness function varies fluidly from tightly focusing on resemblance (similarity to the sitter image, which in this case is an image of Charles Darwin) to swinging (based on functional triggers) toward a more associative process governed by the intertwining, and at times contradicting, ‘rules' of abstract portrait painting. Different genotypes map to the same phenotype. This allows us to vary the degree of creative fluidity because it offers the capacity to move through the search space via genotype (small, ordered movement) or phenotype (large movement, but still related). For example, in one set of experiments this is implemented as follows: if the fittest individual of a population is identical to an individual in the previous generation for more than three iterations, meaning the algorithm is stuck in analytic mode and needs to open up, other genotypes that map to this same phenotype are chosen over the current non-progressing genotype, allowing divergent, open movement through the landscape of possibilities. The automatic fitness function partly uses ‘portrait to sitter' resemblance. Since the advent of photography (and earlier), portrait painting has not just been about accurate reproduction, but also about using modern painterly goals to achieve a creative representation of the sitter. The fitness function primarily rewards accurate representation, but in certain situations also rewards visual painterly aesthetics using simple rules of art creation as well as a portrait knowledge space. Specifically, the divergent painterly portion of the fitness function takes into account: (1) face versus background composition, (2) tonal similarity over exact color similarity, matched with a sophisticated artistic color space model that weights warm-cool color temperature relationships based on analogous and complementary color harmony rules, and (3) unequal dominant and subdominant tone and color rules, and other artistic rules based on a portrait painter knowledge domain as detailed in (DiPaola, 2009) and illustrated in Figure 3. The system is biased toward resemblance, which gives it structure, but can, under the influence of functional triggers, exhibit artistic flair.
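As an illustration, a few of the thirteen pixel functions can be written directly in Python. We assume, as the text suggests, that x and y are pixel coordinates scaled to the 0-255 range, that PM is the adaptive parameter, and that outputs are HSV channel values in 0-255; the selection of functions and the clamping details follow our reading of the published list.

```python
import math

def f3(x, y, PM):    # (x + y) % 255: simple arithmetic manipulation
    return (x + y) % 255

def f5(x, y, PM):    # 255 - x: channel inversion
    return 255 - x

def f6(x, y, PM):    # abs(cos(x) * 255): trigonometric color graduation
    return abs(math.cos(x) * 255)

def f9(x, y, PM):    # distance from (PM, PM), thresholded at 255
    return min(255.0, math.sqrt((x - PM) ** 2 + (y - PM) ** 2))
```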
Figure 3. The contextual focus fitness function mimics human creativity by moving between restrained focus (resemblance) and more unstructured associative focus (resemblance plus the ambiguous art rules of composition, tonality and color theory). The fitness function calculates four scores (resemblance and the three painterly rules), and then combines them in different ways to mimic human creativity, shifting between unstructured associative focus (rules of composition, tonality and color theory) and restrained focus (resemblance). In its default state, the fitness function uses a more analytic form of processing, specifically a ratio of 80% resemblance to 20% non-proportional scoring of the three painterly rules. Several functional triggers can alter this ratio in different ways, but the main trigger is when the system is "stuck". Within any run, for instance, as long as an adaptive percentage of 80/20 resemblance bias is maintained (the resemblance patriarchs), the system will allow very high-scoring painterly-rule individuals to be accepted into the next population. Those with high painterly scores (weighted non-proportionally, including for a very high score with respect to just one rule) are saved separately, and mated with the current 80/20 population. Unless other triggers exist, their offspring are still tested with the 80/20 resemblance test. System-wide functional changes occur when redundancy triggers affect the default ratio for all individuals. As mentioned previously, when a plateau or local minimum is reached for a certain number of populations, the fitness function ratio switches such that painterly rules are weighted higher than resemblance (on a sliding scale), working in conjunction with redundancy at the input, node, and functional levels. Similarly, but in reverse to the default resemblance situation, high-scoring resemblance individuals can pass into the next population when a percentage of painterly-rule individuals is met. In this more associative mode, high-resemblance individuals are always part of the mix, and when these individuals show a marked improvement, a trigger is set to return to the more focused 80/20 resemblance ratio. As the fitness score increases, portraits look more like the sitter. This gives us a somewhat known spread from very primitive (abstract) all the way through to realistic portraits. Thus in effect the system has two ongoing processes: (1) the ‘most fit' portraits that pass on their portrait resemblance strategies, making for more and more realistic portraits (the family ‘resemblance' patriarchs), and (2) the creative ‘strange uncles': portraits related to the current ‘resemblance fit' but more artistically creative or ‘artistically fit'. This dual evolving technique of ‘patriarchs and strange uncles' mimics the interplay between freedom and constraint that is so central to creativity. Paradoxically, novelty often benefits from the existence of a known framework or reference system to rebel against and innovate from. Creative people use some strong structural rules (as in the templates of a sonnet or tragedy, or in this case a resemblance to the sitter image) as a resource or base from which to elaborate new variants beyond that structure (in this case, an abstracted variation of the sitter image).
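The trigger logic of the contextual-focus fitness function can be summarised in a short sketch. The 80/20 default, the stall-detection window and the divergent weighting below are illustrative stand-ins for the adaptive, sliding-scale behaviour described above, not the system's actual parameters.

```python
def resemblance_weight(best_scores, default=0.8, divergent=0.4, window=3):
    """Slide toward the painterly rules when the run stalls (defocus),
    and return to the analytic 80/20 default otherwise.

    best_scores: best resemblance score per generation so far.
    """
    recent = best_scores[-window:]
    stalled = len(recent) == window and max(recent) == min(recent)
    return divergent if stalled else default

def portrait_fitness(resemblance, composition, tonality, color, w):
    """Combine the four scores: weight w on resemblance, 1 - w on art rules."""
    painterly = (composition + tonality + color) / 3.0
    return w * resemblance + (1 - w) * painterly
```

The real system layers further triggers on top of this (redundancy at the input, node and functional levels, and the separate breeding of high-painterly individuals), but the core mechanism is this fluid re-weighting of resemblance against the art rules.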
‘Big Bang of Creativity' Results The automatic creative output was generated over thirty days of continuous, unsupervised computer use. The images in Figure 4 show a selection of representative portraits produced by the system. While the overall population improves at resembling Darwin's portrait, what is more interesting to us is the variety of recurring, emergent and merged creative strategies that evolved as the programs found, in different ways, how to become better abstract portraitists. Figure 4. These images have been seen by thousands in the last two years and have been perceived as creative art works in their own right by the art public, including (above) at the MIT Museum in Cambridge, MA. Humans rated the portraits produced by this version of the portrait painting program, with contextual focus, as much more creative and interesting than those of a previous version that did not use contextual focus, and unlike its predecessor, the output of this program generated public attention worldwide. Example pieces were framed and submitted to galleries as a related set of work. Care was taken by the author to select representative images of the evolved, unsupervised process; however, creative human bias obviously exists in this editing process. Output has been accepted and exhibited at six major galleries and museums including the TenderPixel Gallery in London, the Emily Carr Gallery in Vancouver, and the Kings Art Centre at Cambridge University, as well as the MIT Museum and the High Museum in Atlanta, all either peer-reviewed, juried or commissioned shows at institutions that typically only accept human art work. A typical gallery installation consisted of 40-70 related portraits produced in time order over a given run. Gallery showings focused on "best resemblances" and those that are artistically compelling from an abstract portrait perspective. This gallery of work has been seen by tens of thousands of viewers, who have commented that they see the artworks as aesthetic pieces that ebb and flow through seemingly creative ideas, even though they were created solely by an evolutionary art computer program using contextual focus. Note that no pure ‘creativity Turing Test' was attempted. Besides the issues surrounding the validity of such a test (Pease & Colton, 2011), it was not feasible in such reputable and large art venues. However, most of the thousands of casual viewers assumed they were looking at human-created art. The work was also selected for its aesthetic value to accompany an opinion piece in the journal Nature (Padian, 2008), and was given a strong critical review by the Harvard humanities critic Browne (2009). While these are subjective measures, they are standard in the art world. The fact that the computer program produced novel creative artifacts, both as single art pieces and as a gallery collection of pieces with interrelated themes, using contextual focus as a key element of its functioning, is compelling evidence of the effectiveness of contextual focus. Discussion and Conclusions Many species engage in acts that could be said to be creative. However, humans are unique in that our creative ideas build on each other cumulatively; indeed it is for this reason that culture is widely construed as an evolutionary process (e.g. Bentley, Ormerod, & Batty, 2011; Cavalli-Sforza & Feldman, 1981; Gabora, 1996, 2008; Hartley, 2009; Mesoudi, Whiten & Laland, 2006; Whiten, Hinde, Laland, & Stringer, 2011). Our creativity is evident in all walks of life.
It has transformed the planet we live on. We discussed two transitions in the evolution of this uniquely cumulative form of creativity, discussed cognitive mechanisms that have been proposed to underlie these transitions, and summarized efforts to computationally simulate them. Using an agent-based computer model of cultural evolution, we obtained support for the hypothesis that the onset of cumulative, open-ended cultural evolution can be attributed to the evolution of a self-triggered recall and rehearsal loop, enabling the recursive chaining of thoughts and actions. Using a generative genetic programming system, we used a computational model of contextual focus to automatically produce a related series of art output that received critical acclaim usually reserved for human art work, supporting the hypothesis that the capacity to shift between analytic and associative modes of thought plays an important role in the creative process. Our results suggest that the evolution of chaining and contextual focus made possible the open-ended cumulative creativity exhibited by computational models of language evolution (e.g. Kirby, 2001). Note that in the chaining versus no-chaining conditions the size of the neural network is the same, but how it is used differs. This suggests that it was not larger brain size per se that initiated the onset of cumulative culture, but that larger brain size enabled episodes to be encoded in more detail, allowing more routes for reminding and recall, thereby facilitating recursive redescription of information encoded in memory (Karmiloff-Smith, 1992) and the tailoring of that information to the situation at hand. Our results suggest that it is reasonable to hypothesize that this in turn is vastly accentuated by the capacity to shift between associative and analytic processing modes. We wish to acknowledge some limitations of this work. Chaining does not work, as in humans, by considering an idea in light of one perspective, seeing how that perspective modifies the idea, seeing how this modification suggests a new perspective from which to consider the idea, and so forth. We are planning a more sophisticated implementation that works more along these lines. Second, there is some irony in using an art program based on the genetic algorithm as a starting point to implement contextual focus, which we have claimed is unique to the cultural evolution of ideas and has no counterpart in biological evolution. Our goal here was to see if contextual focus ‘works' at all; since this was successful, we will now move on to more cognitively plausible implementations. One of the projects currently underway is to implement contextual focus in the EVOC model of cultural evolution that was used for the ‘origin of creativity' experiments. This is being carried out as follows. The fitness function will change periodically, so that agents find themselves no longer performing well. They will be able to detect that they are not performing well and, in response, increase the probability of change to any component of a given action. This temporarily makes them more likely to "jump out of a rut" into a very different action, thereby simulating the capacity to shift to a more associative form of thinking. Once their performance starts to improve, the probability of change to any component of a given action will start to decrease to the base level, making them less likely to shift to a dramatically different action.
Acknowledgements
We are grateful to Graeme McCaig, and for grants from the Natural Sciences and Engineering Research Council of Canada and the Fund for Scientific Research of Flanders, Belgium.

2012_33 !2012 Corpus-Based Generation of Content and Form in Poetry
Jukka M. Toivanen, Hannu Toivonen, Alessandro Valitutti and Oskar Gross
Department of Computer Science and Helsinki Institute for Information Technology, HIIT, University of Helsinki, Finland

Abstract
We employ a corpus-based approach to generate content and form in poetry. The main idea is to use two different corpora: one to provide semantic content for new poems, and the other to generate a specific grammatical and poetic structure. The approach uses text mining methods, morphological analysis, and morphological synthesis to produce poetry in Finnish. We present some promising results obtained via the combination of these methods, and preliminary evaluation results for poetry generated by the system.

Introduction
Computational poetry is a challenging research area of computer science, at the intersection of computational linguistics and artificial intelligence. Since poetry is one of the most expressive ways to use verbal language, computational generation of texts recognizable as good poems is difficult. Unlike other types of texts, both content and form contribute to the expressivity and the aesthetic value of a poem. The extent to which the two aspects are interrelated in poetry is a matter of debate (Kell 1965). In this paper we address the issues of generating content and form using corpus-based approaches. We present a poetry generator in which the processing of content and form is performed through access to two separate corpora, with minimal manual specification of linguistic or semantic knowledge. In order to automatically obtain the world knowledge necessary for building the content, we apply text mining to a background corpus. We construct a word association network based on word co-occurrences in the corpus and then use this network to control the topic and semantic coherence of the poetry we generate. Many issues with the form, especially the grammar, we solve by using a grammar corpus. Instead of using an explicit, generative specification of the grammar, we take random instances of actual language use from the grammar corpus and copy their grammatical structure to the generated poetry. We do this by substituting most words in the example text with ones that are related to the given topic in the word association network. Our current focus is on testing these corpus-based principles and their capability to produce novel poetry of good quality on a given topic. At this stage of research, we have not yet considered rhyme, rhythm or other phonetic features of the form. These will be added in the future, as will more elaborate mechanisms for controlling the content. As a result of the corpus-based design, the input to the current poetry generator consists of the background and grammar corpora, and the topic of the poem. In the intended use case, the topic is directly controlled by the user, but we allow the grammar corpus to influence the content, too. Control over form is exercised indirectly, through the choice of the two corpora. The only directly language-dependent component in the system is an off-the-shelf module for morphological analysis and synthesis.
The current version of our poetry generation system works in Finnish, whose rich morphology adds a further challenge to the implementation. However, we believe that the flexible corpus-based design will be useful in transferring the ideas to other languages, as well as in developing applications that can adapt to new styles and contents. A possible application could be a news service on the web, with a poem of the day automatically generated from recent news, possibly triggering, in the mind of the reader, new views on the events of the world. After briefly reviewing related work in the next section, we describe the corpus-based approach in more detail. We then give some examples of generated poetry, with rough English translations. We have carried out an empirical evaluation of the generated poetry with twenty subjects, with encouraging results. We describe this evaluation and its results, and then conclude by discussing the proposed approach and the planned future work.

Related Work
The high complexity of creative language use poses substantial challenges for poetry generation. Nevertheless, several interesting research systems have been developed for the task (Manurung, Ritchie, and Thompson 2000; Gervás 2001; Manurung 2003; Díaz-Agudo, Gervás, and González-Calero 2002; Wong and Chun 2008; Netzer et al. 2009). These systems vary a lot in their approaches, and many different computational and statistical methods are often combined in order to handle the linguistic complexity and creativity aspects. The state of the art in lexical substitution, though not in a poetical context, is presented, for instance, by Guerini et al. (2011). We next review some representative poetry generation systems. ASPERA (Gervás 2001) employs a case-based reasoning approach. It generates poetry out of a given input text via a composition of poetic fragments that are retrieved from a case-base of existing poems. In the system's case-base, each poetry fragment is annotated with a prose string that expresses the meaning of the fragment in question. This prose string is then used as the retrieval key for each fragment. Finally, the system combines the fragments by using additional metrical rules. In contrast, our "case-base" is a plain text corpus without annotations. Additionally, our method can benefit from the interaction of two distinct corpora for content and form. The work of Manurung et al. (2000) draws on rich linguistic knowledge (semantics, grammar) to generate metrically constrained poetry from a given topic via a grammar-driven formulation. This approach requires strong formalisms for syntax, semantics, and phonetics, and there is a strong unity between content and form; this system is thus quite different from our approach. The GRIOT system (Harrell 2005), for its part, is able to produce narrative poetry about a given theme. It models the theory of conceptual blending (Fauconnier and Turner 2002), for which an algorithm based on algebraic semantics was implemented. In particular, the approach employs "semantics based interaction", allowing the user to affect the computational narrative and produce new meanings. The above-mentioned systems have rather complex structures involving many different interacting components. Simpler approaches have also been used to generate poetry.
In particular, Markov chains (n-grams) have been widely used as the basis of poetry generation systems, as they provide a clear and simple way to model some syntactic and semantic characteristics of language (Langkilde and Knight 1998). However, these characteristics are local in nature, and therefore standard use of Markov chains tends to result in poor sentence and poem structures. Furthermore, form and content are learned from a single corpus and cannot be easily separated.

Methods
We next present our approach to poetry generation. In the basic scenario, a topic is given by the user, and the proposed method then aims to output a novel and non-trivial poem in grammatically good form, with coherent content related to the given topic. A design principle of the method is that explicit specifications are kept to a minimum, and existing corpora are used to reduce the human effort in modeling grammar and semantics. Further, we try to keep the language-dependency of the methods small. The poetry generator is based on the following principles.
1. Content: The topics and semantic coherence of generated poetry are controlled by using a simple word association network. The network is automatically constructed from a so-called background corpus, a large body of text used as a source of common-sense knowledge. More specifically, the semantic relatedness of word pairs is extracted from their co-occurrence frequency in the corpus. In the experiments of this paper, the background corpus is the Finnish Wikipedia.
2. Form (grammatical): The grammar, including the syntax and morphology of the generated poetry, is obtained in an instance-based manner from a given grammar corpus. Instead of explicitly representing a generative grammar of the output language, we copy a concrete instance of an existing sentence or poem but replace its contents. In our experiments, the corpus consists mainly of old Finnish poetry.
3. Form (phonetic): Rhythm, rhyme, and other phonetic features can, in principle, be controlled when substituting words in the original text with new ones. This part has not been implemented yet but will be considered in future work.
The current poetry generation procedure can now be outlined as follows (a minimal sketch of this loop is given after the list):
• A topic is given (or randomly chosen) for the new poem. The topic is specified by a single word.
• Other words associated with the topic are extracted from the background graph (see below).
• A piece of text of the desired length is selected randomly from the grammar corpus.
• Words in the text are analyzed morphologically for their part of speech, singular/plural, case, verb tense, clitics, etc.
• Words in the text are substituted independently, one by one, by words associated with the topic. The substitutes are transformed to the same morphological forms as the original words. The original word is left intact, however, if no word associated with the topic can be transformed to the correct morphological form.
• After all words have been considered, the novelty of the poem is measured by the percentage of replaced words. If the poem is sufficiently novel, it is output; otherwise the process can be re-tried with a different piece of text. For the experiments of this paper, we require that at least one half of the words be replaced. This seems sufficient to make readers perceive the new topic as the semantic core of the poem.
We next describe in some more detail the background graph construction process as well as the morphological tools used.
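To summarize the procedure, here is a minimal runnable sketch of the generation loop. The morphological and word-association helpers are crude stand-ins for the real components (Omorfi and the background graph), included only so the sketch is self-contained:

```python
import random

def get_associated_words(graph, topic):
    """Stand-in: first-level neighbours of the topic in the background graph."""
    return list(graph.get(topic, []))

def analyze(word):
    """Stand-in for Omorfi analysis: a fake 'morphological form' label."""
    return len(word) % 3

def synthesize(word, form):
    """Stand-in for Omorfi synthesis: succeeds only if the forms match."""
    return word if len(word) % 3 == form else None

def generate_poem(topic, grammar_corpus, background_graph,
                  min_novelty=0.5, max_tries=20):
    """Copy a template's grammatical structure, substituting topic words."""
    associated = get_associated_words(background_graph, topic)
    for _ in range(max_tries):
        template = random.choice(grammar_corpus)   # piece of text, as a word list
        poem, replaced = [], 0
        for word in template:
            form = analyze(word)
            # Try topic-related words that can take this word's form.
            candidates = [c for c in (synthesize(w, form) for w in associated)
                          if c is not None]
            if candidates:
                poem.append(random.choice(candidates))
                replaced += 1
            else:
                poem.append(word)                  # keep the original word intact
        if replaced / len(template) >= min_novelty:  # novelty check
            return poem
    return None  # no sufficiently novel poem found
```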
Background Graph
A background graph is a network of common-sense associations between words. These associations are extracted from a corpus of documents, motivated by the observation that (frequent) co-occurrence of words tends to imply some semantic relatedness between them (Miller 1995). The background graph is constructed from the given background corpus using the log-likelihood ratio test (LLR). The log-likelihood ratio, as applied here for measuring associations between words, is based on a multinomial model of word co-occurrences (see, e.g., Dunning (1993) for more information). The multinomial model for a given pair {x, y} of words has four parameters p11, p12, p21, p22, corresponding to the probabilities of their co-occurrence as in the contingency table below.

         x             ¬x              Σ
  y      p11           p12             p(y; C)
  ¬y     p21           p22             1 − p(y; C)
  Σ      p(x; C)       1 − p(x; C)     1

Here, p(x; C) and p(y; C) are the marginal probabilities of word x or word y occurring in a sentence in corpus C, respectively. The test is based on the likelihoods of two such multinomial models, a null model and an alternative model. For both models, the parameters are obtained from relative frequencies in corpus C. The difference is that the null model assumes independence of words x and y (i.e., by assigning p11 = p(x; C)p(y; C), etc.), whereas the alternative model is the maximum likelihood model, which assigns all four parameters from their observed frequencies (i.e., in general p11 ≠ p(x; C)p(y; C)). The log-likelihood ratio test is then defined as

  LLR(x, y) = -2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij} \log\left( p_{ij}^{\mathrm{null}} / p_{ij} \right),   (1)

where k_{ij} is the respective number of occurrences. It can be seen as a measure of how much the observed joint distribution of words x and y differs from their distribution under the null hypothesis of independence, i.e., how strong the association between them is. More complex models, such as LSA, pLSA or LDA, could be used just as well. Finally, edges in the background graph are constructed to connect any two words x, y that are associated with LLR(x, y) greater than an empirically chosen threshold. To find words that are likely to be semantically related to the given topic, first-level neighbours (i.e., words associated with the topic word) are extracted from the background graph. If this set is not large enough (fewer than ten words in the experiments of this paper), we add randomly selected second-level neighbours (i.e., words associated with any of the first-level neighbours). In the future, we plan to use edge weights to control the selection of substitutes, and possibly to run more complex graph algorithms on the background graph to identify and choose content words.
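The association test and graph construction can be illustrated compactly as follows; the threshold value and the count data structures are our own assumptions, and the llr function simply follows equation (1):

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a word pair from its 2x2 contingency counts:
    k11 = sentences with both words, k12/k21 = one without the other,
    k22 = neither. Follows the formulation of Dunning (1993)."""
    n = k11 + k12 + k21 + k22
    px, py = (k11 + k21) / n, (k11 + k12) / n   # marginal probabilities
    # Null model: independence; alternative: observed relative frequencies.
    p_null = [[px * py, (1 - px) * py],
              [px * (1 - py), (1 - px) * (1 - py)]]
    k = [[k11, k12], [k21, k22]]
    score = 0.0
    for i in range(2):
        for j in range(2):
            if k[i][j] > 0:                 # zero counts contribute nothing
                p_obs = k[i][j] / n
                score += k[i][j] * math.log(p_null[i][j] / p_obs)
    return -2 * score

def background_graph(cooc_counts, word_counts, n_sentences, threshold=20.0):
    """Connect word pairs whose LLR exceeds an empirical threshold (assumed)."""
    edges = {}
    for (x, y), k11 in cooc_counts.items():
        k12 = word_counts[y] - k11
        k21 = word_counts[x] - k11
        k22 = n_sentences - k11 - k12 - k21
        score = llr(k11, k12, k21, k22)
        if score > threshold:
            edges[(x, y)] = score
    return edges
```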
Morphological Analysis and Synthesis
Morphological analysis is essential and non-trivial for morphologically rich languages such as Finnish or Hungarian. In these languages, much of the language's syntactic and semantic information is carried by morphemes joined to the root words. For instance, the Finnish word "juoksentelisinkohan" (I wonder if I would run around) is formed out of the root word "juosta" (run). Hence, morphological analysis provides valuable information about the syntax and, to some degree, the semantics. In our current system, morphological analysis and synthesis are carried out using Omorfi1, a morphological analyzer and generator for Finnish based on finite-state automata methodology (Lindén, Silfverberg, and Pirinen 2009).
1 URL: http://gna.org/projects/omorfi
With the help of Omorfi we can thus generate substitutes that have the same morphological forms as the original words. For instance, assume that the topic of the poetry is "ageing" and we want to substitute "juoksentelisinkohan" with a word based on "muistaa" (remember). Omorfi can then generate "muistelisinkohan" (I wonder if I would think back) as a morphologically matching word.

Examples
We next give some example poems generated by the current system, together with the original example texts used to provide structure for these poems. We also give rough English translations, even though we suspect that the poetic aesthetics change somewhat in translation. The first example poem is generated around the topic "(children's) play". We first give the Finnish poem, then the template used to construct it, and then the English translations of both the generated and original poems.

Kuinka hän leikki silloin
uskaliaassa, uskaliaassa
kuiskeessa
vaaleiden puiden alla.
Hän oli kuullut huvikseen,
kuinka hänen kuiskeensa
kanteli helkkeinä tuuloseen.

Template (Uuno Kailas: Satu meistä kaikista, 1925):
kuinka hän leikki kerran
suuressa vihreässä
puistossa
ihanien puiden alla.
Hän oli katsellut huvikseen,
kuinka hänen hymynsä
putosi kukkina maahan,

How she played then
in a daring, daring
whispering
under the pale trees.
She had heard for fun
how her whispering
drifted as jingle to the wind.

Template: how she played once / in a big green / park / under the lovely trees. / She had watched for fun / how her smile / fell down as flowers,

The next poem is generated with "hand" as the topic; the template is shown below the generated poem, and thereafter the translations, respectively.

Vaaleassa kourassa
sopusuhtaisessa kourassa
ovat nuput niin kalpeita
kuvassasi lepää lapsikulta jumala.

Template (Edith Södergran: Metsän hämärä, 1929):
Alakuloisessa metsässä
Hämärässä metsässä
ovat kukat niin kalpeita
Varjossa lepää sairas jumala

In a pale fist
in a well-balanced fist,
the buds are so pale
in your image lies a dear child god.

Template: In a gloomy forest / In a dim forest / flowers are so pale / In the shadow lies a sick god

The final example poem has "snow" as its topic.

Elot sai karkelojen teitä,
lumi ajan kotia,
hiljaa soi kodit autiot,
hiljaa sai armaat karkelot
laiho sai lumien riemut.

Template (Eino Leino: Alkusointu, 1896):
Aallot kulki tuulten teitä,
aurinko ajan latua,
hiljaa hiihti päivät pitkät,
hiljaa hiipi pitkät yöt
päivä kutoi kuiden työt,

Lives got the frolic ways, / snow the home of time, / softly chimed abandoned homes, / softly got frolics beloved / ripening crop got the snows' joys.

Template: Waves fared the wind's ways, / sun the track of time, / slowly skied for long days, / slowly crept for long nights / day wove the deeds of moons

Judging subjectively, the generated poems show quite a wide range of grammatical structures, and they are grammatically well formed. The cohesion of the contents can also be regarded as fairly high. However, the quality of the generated poetry varies a lot. Results from an objective evaluation are presented in the next section.

Evaluation
Evaluation of creative language use is difficult. Previous suggestions for judging the quality of automatically generated poetry include passing the Turing test or acceptance for publication in some established venue.
Because the intended audience of poetry consists of people, the most pragmatic way of evaluating computer poetry is empirical validation by human subjects. In many computer poetry studies, both human-written and computationally produced poetry have been evaluated for qualities like aesthetic appreciation and grammaticality. In this study we evaluated poetry using a panel of twenty randomly selected subjects (typically university students). Each subject independently evaluated a set of 22 poems, of which one half were human-written poems from the grammar corpus and the other half computer-generated ones with at least half of the words replaced. The poems were presented in a random order, and the subjects were not explicitly informed that some of the poems were computer-generated. Each subject evaluated each text (poem) separately. The first question was whether the subject considered the piece of text to be a poem or not, with a binary yes/no answer. Then each text was evaluated qualitatively along six dimensions: (1) How typical is the text as a poem? (2) How understandable is it? (3) How good is the language? (4) Does the text evoke mental images? (5) Does the text evoke emotions? (6) How much does the subject like the text? These dimensions were evaluated on a scale from one (very poor) to five (very good). (The interesting question of how the amount of substituted words affects the subjective experience of topic, novelty and quality is left for future research.) Evaluation results averaged over the subjects and poems are shown in Figures 1 and 2. Human-written poems were considered to be poems 90.4% of the time and computer-generated poems 81.5% of the time (Figure 1). Intervals containing 66.7% of the poems show that there was more variation in the human-written poetry than in the computer-generated poetry. Overall, these are promising results, even though the difference between human-written and computer-generated poetry is statistically significant (the p-value under the Wilcoxon rank-sum test is 0.02).
Figure 1: Relative amounts of texts (computer-generated and human-written poetry) subjectively considered to be poems, averaged over all subjects. The whiskers indicate an interval of 66.7% of poems around the median. Points indicate the best and worst poems in both groups.
Figure 2: Subjective evaluation of computer-generated and human-written poetry along six dimensions: (1) typicality as a poem, (2) understandability, (3) quality of language, (4) mental images, (5) emotions, and (6) liking (see text for details). Results are averaged over all subjects and poems. The whiskers indicate one standard deviation above and below the mean.
The evaluated qualities show a similar pattern (Figure 2): the average difference between human-written and computer-generated poetry is not large, and in many cases there is considerable overlap in the ranges of scores, indicating that a good number of the (best) computer-generated poems were as good as the (worst) human-written ones. Statistically, however, the differences are highly significant (all p-values below 0.001). The biggest drop in quality was in understandability (dimension 2). Somewhat paradoxically, however, the language remained relatively good (dimension 3). An interesting observation is that some of the generated poems were rated as quite untypical, yet their language quality and pleasantness were judged to be relatively high.
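For readers who want to reproduce this kind of comparison, the rank-sum test is available in standard statistics libraries; the sketch below uses SciPy on invented rating vectors, not the study's actual data:

```python
from scipy.stats import ranksums

# Illustrative ratings only (1-5 scale), not the actual evaluation data.
human_scores    = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5, 4]
computer_scores = [3, 4, 2, 3, 4, 2, 3, 3, 4, 2, 3]

# Wilcoxon rank-sum test: are the two samples drawn from the same
# distribution? A small p-value indicates a significant difference.
statistic, p_value = ranksums(human_scores, computer_scores)
print(f"statistic={statistic:.3f}, p={p_value:.3f}")
```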
Discussion
We have proposed a flexible poetry generation system which is potentially able to produce poetry on a wide variety of topics and in different styles. The flexibility is achieved by automating the processes of acquiring and applying world knowledge and grammatical knowledge. We use two separate corpora: a background corpus for mining lexical associations, and a grammar corpus for providing the grammatical and structural patterns on which new poetry is based. We have implemented the system for Finnish, a morphologically rich language, and carried out a preliminary evaluation of the produced poetry, with promising results. It may be questioned whether the current approach exhibits creative behaviour, and whether the system is able to produce poetry that is interesting and novel with respect to the text used as the basis of the new poem. First, the generated poems are usually very different from the original texts (our subjective view, to be evaluated objectively in the future). Second, some of the generated texts were rated as quite untypical, even though they were recognized as poems; the pleasantness and language quality of these poems were still judged to be relatively high. Based on these observations, we think that at least some of the system's output can be considered creative. The system could thus be argued to automatically piggyback on linguistic conventions and previously written poetry to produce novel and reasonably high-quality poems. Our aim is to develop methods that can be applied to other languages with minimal effort. In our current system, morphological analysis and synthesis are clearly the most strongly language-specific components. They are fairly well isolated and could, in principle, be replaced by similar components for some other language. However, it may prove problematic to apply the presented approach to more isolating languages (i.e., those with a low morpheme-per-word ratio), such as English. In agglutinative languages (with a higher morpheme-per-word ratio), such as Finnish, a wide variety of grammatical relations are realized through affixation, and the word order is usually quite free. We are currently considering implementing the system for other languages, in order to identify and test principles that could carry over. So far, we have not considered controlling rhythm, rhyme, alliteration or other phonetic aspects. We plan to use constraint programming methods in the lexical substitution step for this purpose. At the same time, we doubt this will always be sufficient in practice, since the space of suitable substitutes can be severely constrained by grammar and semantics. Another interesting technical idea is to use n-gram language models for computational assessment of the coherence of produced poetry. We consider the approach described in this paper to be a plausible building block for more skillful poetry generation systems. The next steps we plan to take, in addition to considering phonetic aspects, include trying to control the emotions that the poetry exhibits or evokes. We are also interested in producing computer applications of adaptive or instant poetry.
Acknowledgements: This work has been supported by the Algorithmic Data Analysis (Algodan) Centre of Excellence of the Academy of Finland.

2012_34 !2012 Weaving creativity into the Semantic Web: a language-processing approach
Anna Jordanous
Centre for e-Research, Department of Digital Humanities, King's College London, UK
anna.jordanous at kcl.ac.uk
Bill Keller
Department of Informatics, University of Sussex, Brighton, UK
billk at sussex.ac.uk

Abstract
This paper describes a novel language-processing approach to the analysis of creativity and the development of a machine-readable ontology of creativity. The ontology provides a conceptualisation of creativity in terms of a set of fourteen key components or building blocks, and has application to research into the nature of creativity in general and to the evaluation of creative practice in particular. We further argue that the provision of a machine-readable conceptualisation of creativity provides a small but important step towards addressing the problem of automated evaluation, 'the Achilles' heel of AI research on creativity' (Boden 1999).

Introduction
Creativity is a complex, multi-faceted concept encompassing many related aspects, abilities, properties and behaviours. This complexity makes the production of a comprehensive and generally applicable account of creativity problematic. Existing definitions of creativity are often too superficial for use by the research community and may be subject to discipline or domain bias, limiting their application. The need for a comprehensive, multi-dimensional account has been widely recognised (Rhodes 1961; Torrance 1967; Plucker, Beghetto, and Dow 2004; Kaufman 2009). Such an account would assist our understanding of creativity, highlighting areas of common ground and avoiding the pitfalls of disciplinary bias (Hennessey and Amabile 2010; Plucker and Beghetto 2004). The words associated with academic debate about the nature of creativity are strongly linked to our understanding of its meaning and attributes, and analysis of this language provides a sound basis for constructing a sufficiently detailed and comprehensive account of the concept. In the present work, statistical language processing techniques are used to identify words significantly associated with creativity in a corpus of academic papers on the topic. A measure of lexical similarity provides a basis for clustering words and identifying key themes or components of creativity. The set of components yields information about the nature of creativity, based on what we emphasise when we discuss the concept. Within the field of computational creativity, the problem of automatic evaluation remains a significant issue: ‘the Achilles' heel of AI research on creativity' (Boden 1999). Recently, the Semantic Web has emerged as a way to address the troublesome but important issue (Boden 1999) of articulating values, concepts and information in an open and machine-readable format. Linked Data is the term used in the Semantic Web community to describe published data that is machine-readable and connected together using semantically typed links. We take the step of encoding our components in RDF, the current W3C standard for implementing Linked Data.1 The resulting ontology is available to the wider research community as a resource in the Semantic Web, under the permanent URI http://purl.org/creativity/ontology, a form familiar to Semantic Web researchers and also accessible through browsers such as Marbles.2 Currently, most content on the Semantic Web is in the form of ontologies of ‘things': semantically structured collections of factual or objective data on topics as diverse as people, places, narratives, or music.3 To date, little work has been done on specifically defining subjective concepts in an ontology.
However, current work on lexical resources such as WordNet has laid foundations for more definitionally troublesome concepts to be considered in detail; the time is ripe for the development of ontologies of subjective concepts such as creativity.

Components of creativity
We identify a core lexicon consisting of just those words that appear to be highly associated with discussions of creativity in a corpus of academic papers on the topic. Our approach substantially develops and refines work described in Jordanous (2010). A key innovation is the use of a measure of lexical similarity, which allows the words to be clustered automatically to reveal a number of common themes or factors of creativity. Further analysis results in a set of fourteen key components.
1 http://www.w3.org/TR/rdf-syntax-grammar, last accessed 27th January 2012.
2 http://www.w3.org/2001/sw/wiki/Marbles, last accessed 27th January 2012.
3 Example ontologies are available at http://www.foaf-project.org, http://www.geonames.org/ontology, http://www.contextus.net/ontomedia and http://musicontology.com respectively, all last accessed 27th January 2012.

Corpus data
A ‘creativity corpus' was assembled from a sample of 30 academic papers examining creativity from a variety of standpoints (Jordanous 2010). The selected papers cover a wide range of years (1950-2009) and academic disciplines, from psychological studies to computational models. Academic papers were used due to ease of location (e.g. through targeted literature search), accessibility (electronic publication for download), format (ease of conversion to text allows for computational analysis) and availability of citation data (used as a criterion for inclusion of a paper).4 In Jordanous (2010), language use in the creativity corpus was compared to general language use as represented by the British National Corpus (BNC) (Leech 1992). This had the undesired effect of highlighting words that were predominant in academic papers but not necessarily specific to the creativity literature, e.g. ‘et', ‘al'. In the present study, a further corpus of 60 academic papers on topics unrelated to creativity was therefore assembled (a ‘non-creativity corpus'). For each paper in the creativity corpus, we retrieved the two most-cited papers in the same academic discipline5 and with the same year of publication that did not contain any words with the prefix creat- (i.e. creativity, creative, creation, etc.). Each corpus was processed using the RASP natural language processing toolkit (Briscoe, Carroll, and Watson 2006) to perform lemmatisation and part-of-speech (POS) tagging. Lemmatisation allows us to ignore morphological variation so that, e.g., processed and processing are both recognised as forms of process. POS tagging allows us to distinguish between different grammatical usages of the same orthographic form: e.g. process as a noun or as a verb. Two lists of frequency counts were produced: one for all words occurring in the creativity corpus and one for all words in the non-creativity corpus. Only ‘content-bearing' words (i.e. nouns, verbs, adjectives and adverbs) were considered to be of interest. ‘Function words' and other minor categories (pronouns, articles, prepositions etc.) were ignored, as they have little or no independent semantic content and are therefore of limited interest for the present study.

Finding words associated with creativity
A standard statistical measure of association was used to identify words salient to discussions of creativity.
The log-likelihood ratio (or G-squared statistic) is a measure of how well observed frequency data fit a model or expected frequency distribution. The statistic is an alternative to Pearson's chi-squared (χ2) test that has been advocated as a more appropriate measure for corpus analysis, as it does not rely on the (unjustifiable) assumption of normality in word distribution (Dunning 1993). This is a particular issue when analysing relatively small corpora, as in the present case.6 The log-likelihood ratio is also more accurate than χ2 in its treatment of infrequent words in the data, which often hold useful information. Our use of the log-likelihood ratio follows that of Rayson and Garside (2000). Given two corpora (in our case, the ‘creativity corpus' and the ‘non-creativity corpus'), the log-likelihood score for a given word is calculated as

  LL = 2 \sum_{i \in \{1,2\}} O_i \ln(O_i / E_i)   (1)

where O_i is the observed frequency of the given word in corpus i and E_i is its expected frequency in corpus i. The expected frequency E_i is given by

  E_i = N_i \times (O_1 + O_2) / (N_1 + N_2)   (2)

where N_i denotes the total number of words in corpus i.
4 Note that some papers have been published in very recent years and therefore have few citations. In this case selection was based on subjective judgement of influence.
5 As categorised by the literature database Scopus (http://www.scopus.com/), last accessed 27th January 2012.
Following standard statistical practice, any word occurring fewer than five times was excluded. This ensures that the statistics are robust. To identify significant results, we also removed words with a log-likelihood score less than 10.83, representing a chi-squared significance value for p=0.001 (one degree of freedom). To identify words strongly associated with discussion of creativity, it was necessary to select just those words with observed counts higher than expected in the creativity corpus. This resulted in a total of 694 distinctive creativity words: a collection of 389 nouns, 205 adjectives, 72 verbs and 28 adverbs that occurred significantly more often than expected in the creativity corpus. The 20 such words with the highest log-likelihood ratio scores are listed in Table 1. It is important to note that our objective is to identify key themes in the lexical data, not to induce a comprehensive terminology of creativity. Despite the relatively small size of the available corpora, the resulting set of 694 creativity words is sufficiently rich for this purpose.

Identifying components of creativity
In Jordanous (2010) an attempt was made to identify key components by clustering creativity words through inspection of the raw data. In practice this proved laborious and made it impossible to consider all of the identified words systematically. It also raised issues of subjectivity and experimenter bias. Here we address these problems, at least in part, by first clustering all the words automatically according to a statistical measure of distributional similarity (Lin 1998). The more manageable collection of clusters is then inspected manually to identify key components.
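Equations (1) and (2) translate directly into code; the following standalone function is an illustration (the example counts at the end are invented):

```python
import math

def log_likelihood(o1, n1, o2, n2):
    """Rayson & Garside (2000) log-likelihood keyness of a word.
    o1, o2: observed counts of the word in corpus 1 and corpus 2;
    n1, n2: total word counts of the two corpora."""
    e1 = n1 * (o1 + o2) / (n1 + n2)   # expected count in corpus 1
    e2 = n2 * (o1 + o2) / (n1 + n2)   # expected count in corpus 2
    ll = 0.0
    for o, e in ((o1, e1), (o2, e2)):
        if o > 0:                      # a zero count contributes nothing
            ll += o * math.log(o / e)
    return 2 * ll

# Invented example: a word appearing 120 times in a 300K-word corpus
# versus 40 times in a 700K-word corpus.
score = log_likelihood(120, 300_000, 40, 700_000)
print(score, score > 10.83)   # 10.83 = significance cut-off at p = 0.001
```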
Intuitively, words that tend to occur in similar linguistic contexts will tend to be similar in meaning (Harris 1968). For example, evidence that the words concept (LLR=189.90) and idea (LLR=475.74) are similar in meaning might be provided by occurrences such as the following:
1. the concept/idea involves (subject of verb ‘involve')
2. applied the concept/idea (object of verb ‘apply')
3. the basic concept/idea (modified by adjective ‘basic')
6 At around 300K and 700K words respectively, the creativity and non-creativity corpora are very small compared to the British National Corpus (≈ 100M words) and tiny in comparison to recent, web-derived text collections of billions of words.

Word (and part-of-speech tag)    LLR
thinking (N)        834.55
process (N)         612.05
innovation (N)      546.20
idea (N)            475.74
program (N)         474.41
domain (N)          436.58
cognitive (J)       393.79
divergent (J)       355.11
openness (N)        328.57
discovery (N)       327.38
primary (J)         326.65
originality (N)     315.60
criterion (N)       312.61
intelligence (N)    309.31
ability (N)         299.27
knowledge (N)       290.48
create (V)          280.06
experiment (N)      253.32
plan (N)            246.29
agent (N)           246.24

Table 1: The top 20 results of the log-likelihood ratio (LLR) calculations. A significant LLR score at p=0.001 is 10.83.

Word occurrence data of this kind was obtained from an analysis of the written portion of the BNC, which had previously been processed using the RASP toolkit to extract grammatical dependency relations (subj-of, obj-of, modified-by). Each word in the creativity corpus was then associated with a list of all of the grammatical relations in which it participated, together with corresponding counts of occurrence. The distributional similarity of two words is measured in terms of the similarity of their associated lists of grammatical relations. The present work adopts an information-theoretic measure devised by Lin (1998), which has been widely used in language processing applications and shown to perform well against other similarity measures as a means of identifying near-synonyms (Weeds and Weir 2003). Similarity scores were obtained separately for pairs of nouns, pairs of verbs, and so on. For a given set of words, the similarity data is conveniently visualised as a graph or network, where nodes correspond to words and edges are weighted by similarity scores, as in Figure 1.
Figure 1: Graph representation of the similarity of the nouns concept and idea and related words. Words are drawn as nodes linked by weighted edges representing word similarity (maximum similarity is 1.0).
A possible problem with obtaining word similarity data this way would arise if the majority of the creativity words were used with distinctive or technical senses within the creativity corpus. This is unlikely, however: whilst some narrowly specialised usage may be present in our creativity lexicon, most words retain the general senses reflected in the wider BNC data set. The graph clustering software Chinese Whispers (Biemann 2006) was used to automatically identify word clusters in the dataset. This algorithm uses an iterative process to group together graph nodes that are located ‘close' to each other. By grouping words with similar meanings, the number of data items was effectively reduced, and themes in the data could be identified more readily by inspection. Themes discovered through clustering were further analysed in terms of the Four Ps of creativity (Rhodes 1961; Mooney 1963; MacKinnon 1970) to identify alternative perspectives and reveal subtler (but still important) aspects of creativity. From this analysis it was possible to extract a set of fourteen key components of creativity.
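Chinese Whispers itself is simple enough to sketch in a few lines. The following is our own minimal rendering of the algorithm (Biemann 2006), not the tool actually used, operating on a toy similarity graph with invented weights:

```python
import random

def chinese_whispers(graph, iterations=20):
    """graph: {node: {neighbour: similarity_weight}}.
    Returns {node: cluster_label}. Each node starts in its own class;
    nodes then repeatedly adopt the strongest label among their neighbours."""
    labels = {node: i for i, node in enumerate(graph)}
    nodes = list(graph)
    for _ in range(iterations):
        random.shuffle(nodes)                 # process nodes in random order
        for node in nodes:
            if not graph[node]:
                continue
            # Sum edge weights per neighbouring label.
            weights = {}
            for neighbour, w in graph[node].items():
                lbl = labels[neighbour]
                weights[lbl] = weights.get(lbl, 0.0) + w
            labels[node] = max(weights, key=weights.get)
    return labels

# Toy similarity graph around 'concept' and 'idea' (invented weights).
g = {"concept": {"idea": 0.28, "notion": 0.21},
     "idea":    {"concept": 0.28, "notion": 0.19},
     "notion":  {"concept": 0.21, "idea": 0.19}}
print(chinese_whispers(g))   # all three words end up in one cluster
```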
Implementing an ontology of creativity
The fourteen components provide a clear account of the constituent parts of the concept of creativity. Our remaining contribution is to express these components in a machine-readable form. We also want to use Linked Data principles (Heath and Bizer 2011) to connect the individual components to other data sources within the Semantic Web, so that creativity is defined in terms of concepts that have already been defined. To achieve this, we used SKOS (Simple Knowledge Organisation System),7 a W3C standard which provides a model for representing ontological data within the Semantic Web. We also made use of WordNet (Reed and Lenat 2002), a large lexical database of English in which words are grouped by sense and interlinked by lexical and conceptual relations. WordNet has recently been made available as a Semantic Web ontology.8 The SKOS ontology incorporates three main classes: skos:Concept (anything we may want to record information about), skos:ConceptScheme (a set that collectively defines a skos:Concept) and skos:Collection (a collection of semantically-related information). We created an instance of skos:ConceptScheme called CreativityComponents to represent the set of components that defines the skos:Concept of Creativity. Each component is represented as an individual skos:Concept. As RDF is a graph-based model, the resulting encoding can be visualised as in Figure 2. The graph has also been published in serialised format as an RDF/XML text file, made available at http://purl.org/creativity/ontology. The skos:Concept labelled Creativity has the unique URI purl.org/creativity/ontology#Creativity, and any Linked Data that needs to refer to the concept can use this identifier.
7 http://www.w3.org/TR/skos-reference, last accessed 27th January 2012.
8 http://wordnet.rkbexplorer.com/
The distributed nature of Semantic Web research means that the enormous task of defining concepts in a machine-readable form is divided across the research field, rather than being the sole responsibility of one particular research group. This working practice acts as a form of peer review, as ontologies are developed, critiqued, and ultimately judged by the extent to which they are adopted and re-used as points of reference by other researchers. Upper ontologies allow us to link the concepts in our ontology to related ontological work on creativity in the future (even if these future researchers are not aware of our ontological contribution). An upper ontology defines the higher-level vocabularies and concepts necessary to implement ontologies themselves, providing the meta-vocabulary needed to link specific ontologies to more general concepts. The implementation of the WordNet dataset and structure as an ontology provides WordNet as an upper ontology for us to use, linking a lexical string (e.g. "creativity") to various concepts associated with that string, such as its sense, hyponyms, type, ‘gloss' (brief definition) and other related lexical information. Each component in our ontology is comprised of a cluster of keywords. It makes sense, therefore, to link each component back to the appropriate keywords, using the WordNet ontology at http://wordnet.rkbexplorer.com/. In this way, our components are linked into the Semantic Web through the WordNet ontology. This linkage also provides further semantic information on each component via the lexical relations and other information represented in the WordNet hierarchy.
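A minimal sketch of this SKOS encoding, written with the rdflib Python library; the component shown and the exact WordNet entry URI are illustrative assumptions, while the ontology namespace is the published one:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

CREATIVITY = Namespace("http://purl.org/creativity/ontology#")

g = Graph()
g.bind("skos", SKOS)

# The concept scheme that collectively defines Creativity.
scheme = CREATIVITY.CreativityComponents
g.add((scheme, RDF.type, SKOS.ConceptScheme))

# The Creativity concept itself, with its permanent URI.
creativity = CREATIVITY.Creativity
g.add((creativity, RDF.type, SKOS.Concept))

# One illustrative component, modelled as a skos:Concept in the scheme.
component = CREATIVITY.SocialInteractionAndCommunication
g.add((component, RDF.type, SKOS.Concept))
g.add((component, SKOS.inScheme, scheme))
g.add((component, SKOS.prefLabel,
       Literal("Social Interaction and Communication")))

# Link the component to a WordNet keyword entry (URI form is illustrative).
wordnet_entry = URIRef("http://wordnet.rkbexplorer.com/id/word-communication")
g.add((component, SKOS.related, wordnet_entry))

print(g.serialize(format="xml"))   # RDF/XML serialisation of the graph
```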
Finally, following Linked Data principles, we also link our interpretation of creativity as an extension of the representation of the concept in WordNet. In this way, machines (and people) can see the relationship between this general concept of creativity and our more detailed ontological analysis. Discussion and Implications The current work is part of a wider project engaged with the question of the evaluation of computational creativity (Jordanous 2011). The components of creativity have already been applied, both for in-depth expert evaluation and in forming snapshot judgements of the creativeness of a given system. The resulting component-based evaluation yields detailed information about creative strengths and weaknesses. Crucially, the evaluation highlights those components where a system performs poorly, providing insight into areas where improvement in performance is needed. By publishing the ontology in the Semantic Web we ensure that it is freely available to the research community. This has a number of implications. First, it may be freely referred to, extended or amended. Refinement is clearly possible, for example in providing more fine-grained analysis of the components or in articulating the relationships between them. Second, it facilitates the development of creativity-aware applications to support manual evaluation of creativity based on the components. It also represents a step towards the development of methods of automated evaluation. One intriguing possibility is to further exploit language processing techniques to provide automated evaluation by proxy based on textual reviews or descriptions of system performance. This is analogous to the way that sentiment analysis techniques are now used to automatically evaluate attitude and opinion based on reviews of products or services (Pang and Lee 2008). The current work illuminates the sorts of issues that arise in formal modelling of subjective or ‘soft' concepts such as creativity. For example, some of our components appear logically inconsistent with others in the set: e.g. the need for autonomous, independent behaviour (Independence and Freedom) versus the requirement for social interaction (Social Interaction and Communication). Also, creativity clearly manifests itself in different ways across different domains (Plucker and Beghetto 2004) and components will vary in importance, according to the requirements of a particular domain. For example, creative behaviour in mathematical reasoning has more focus on finding a correct solution to a problem than is the case for creative behaviour in, say, musical improvisation (Colton 2008). Questions remain about how such dialectical and fluid aspects might be modelled. We present the set of components as a rather loose collection of dimensions - attributes, abilities and behaviours, etc. - which contribute to our overall understanding of creativity, rather than a unified definition. Concluding remarks This paper has described the development of an ontology of creativity using corpus-based, language processing techniques and its publication as machine-readable, Linked Data in the Semantic Web. The resulting ontology provides a multi-perspective analysis of creativity in terms of a set of fourteen key components and has application to the study and evaluation of computational creativity. 
Weaving the ontology into the Semantic Web has implications for future work on modelling subjective concepts and suggests some interesting directions for future research into the problem of automated evaluation of creativity.

2012_35 !2012 Coming Together: Composition by Negotiation by Autonomous Multi-Agents
Arne Eigenfeldt, School for the Contemporary Arts, Simon Fraser University, Vancouver, Canada, arne_e@sfu.ca
Philippe Pasquier, School of Interactive Arts and Technology, Simon Fraser University, Surrey, Canada, pasquier@sfu.ca

ABSTRACT
Coming Together is a series of computational creative systems based upon the premise of composition by negotiation: within a controlled musical environment, autonomous multi-agents attempt to converge their data, resulting in a self-organised, dynamic, and musically meaningful performance. All of the Coming Together systems involve some aspect of a priori structure around which the agents' negotiation is centered. In the versions demonstrated, the structure presupposes several discrete movements that together form a complete composition of a predetermined length. Characteristics of each movement (density, time signature, tempo) are generated using a fuzzy-logic method for avoiding similarity between succeeding movements. Two versions of Coming Together are described, used in two different musical compositions. The first, for the composition And One More, involves agents interacting in real time, their output being sent via MIDI to a mechanical percussion instrument. This version has nine different agents performing on eighteen different percussion instruments, and includes a live percussionist whose performance is encoded and treated as an additional agent. The second version, for the composition More Than Four, involves four agents whose output is eventually translated into musical notation using MaxScore1, for performance by four instrumentalists. Agent interaction is transcribed to disk prior to performance; at the onset of the performance, a curatorial agent selects previous movements from the database and chooses among them to create a musically unified composition.
1 www.computermusicnotation.com

2012_36 !2012 Continuous Improvisation and Trading with Impro-Visor
Robert M. Keller, Computer Science Department, Harvey Mudd College, Claremont, CA 91711 USA, keller@cs.hmc.edu

Demonstration
Impro-Visor is a free open-source program designed to help musicians learn to improvise. Its main purpose is to help its user become a better improviser. It can exhibit creativity by improvising continuously on its own in a variety of soloist styles. We demonstrate that, in principle, Impro-Visor can continue creating indefinitely, without repeating the same sequence of musical ideas. We also demonstrate how Impro-Visor can alternate ("trade") phrases with the soloist, again continuously, as well as record what the soloist plays on a MIDI device. Related aspects that can be shown are learning an improvisational style through grammar acquisition and using "roadmaps" as a basis for trading. The figure shows a screenshot of Impro-Visor creating phrases in real time and capturing the soloist's input in real time from a MIDI device.

Acknowledgements
The author thanks the NSF (CNS REU #0753306), Impro-Visor co-developers, and Harvey Mudd College for their generous support.

2012_37 !2012 Exploring Everyday Creative Responses to Social Discrimination with the Mimesis System
D. Fox Harrell†+*, Chong-U Lim*, Sonny Sidhu†, Jia Zhang†, Ayse Gursoy†, Christine Yu+
Comparative Media Studies Program† | Program in Writing and Humanistic Studies+ | Computer Science and Artificial Intelligence Laboratory*
Massachusetts Institute of Technology
{fox.harrell, culim, sidhu, zhangjia, agursoy, czyu}@mit.edu

Introduction
We have created an interactive narrative system called Mimesis, which explores the phenomenon of social discrimination through gaming and social networking. Mimesis places players in control of a mimic octopus in its marine habitat that encounters subtle discrimination from other sea creatures. Relevant to computational creativity, Mimesis explores: 1) collective creativity, by constructing game characters algorithmically from collective musical preferences on a social networking site; and 2) everyday creativity, by modeling the diverse creative ways people respond to covert acts of discrimination.
Figure 1: The player character is customized based on the player's musical preferences on Facebook.

Collective Creativity
Building on previous work [2], Mimesis requests access to information from the player's Facebook profile, using music preferences in the player's social network as a stand-in for qualities of individual and social identity. Mimesis generates corresponding moods for each musical artist. By associating the player character with artists' moods such as oblivious, confused, suspicious, or aggressive, players impart these qualities onto the player character (see Figure 1). Within gameplay, moods are mapped to strategies for conversationally responding to microaggressions.

Everyday Creativity
The player character encounters other sea creatures who utter sentences like "Where are you from?" and "You don't seem like the typical creature around here," as shown in Figure 2. While such questions may seem benign, they can also covertly imply the theme "You are an alien in your own land" (as might be encountered by an Asian American in the United States). The player responds using gestural input, such as pinching out for an open/oblivious attitude or pinching in for a closed/aggressive attitude. Each encounter plays out according to a conversational narrative schema based on sociolinguistic studies of narratives of personal experience.
Figure 2: The screen shows the player's character (left) in a microinvalidation encounter with an NPC (right).
These encounters convey aspects of the experience of microaggressions, which are covert acts of discrimination. Researchers Sue et al. identify "microinvalidations" as communications that exclude, negate, or nullify the experiential reality of others. The "alien in your own land" theme is an example of microinvalidation. Microaggressions have been clinically found to have strong cumulative effects on health and happiness, and to restrict understanding between groups [1]. We hope the system is an effective tool for increasing awareness of this subtle form of social discrimination.

2012_38 !2012 Functional Representations of Music
James McDermott*, University College Dublin. April 30, 2012
Music is an interesting domain for the study of computational creativity. Some generative formalisms for musical composition (e.g. Markov chains) achieve plausible music over short time-scales (a few notes) but appear to be "meandering" over longer time-scales. Imposing a sense of teleology or purpose is an important goal in creating valuable music.
In the field of evolutionary computation (EC), researchers draw inspiration from Darwinian evolution to address computational problems, and EC can be applied to aesthetic and creative domains. Although EC is commonly used to generate music, key open issues remain. Formal measurement of the quality of a piece of music in a computational fitness function is an obvious obstacle. A naive representation for music, such as a list of integer values each corresponding directly to a note, will tend to produce disorganised music. In previous work, Hoover et al. [1, and later] showed that a functional representation could impose organisation and a sense of purpose. In that system, music is represented as a function of time. A fixed piece of pre-existing music is used as a "scaffold": the system then evolves functions, i.e. mappings from the scaffold to new accompanying material. Time-series of numerical "control" variables are also proposed as a means of imposing structure on the music. Fitness is judged interactively. The XG project is partly inspired by this work. It discards the "scaffold" but relies on the time-series of control variables (see Figure 1).
Figure 1: Time-series of control variables (left) impose a bar/beat structure and an overall AABA structure. The evolved function (right) maps these variables to numerical outputs, once per time-step. The outputs are interpreted as music.
XG also differs in its internal representation for the mappings (a simple language of arithmetic functions, with special accumulator functions at the outputs to control volume), and in its use of a computational (non-interactive) fitness function. Surprisingly good results arise using this representation in combination with a simple fitness function which rewards variety in the output music; neither the functional representation nor the fitness function alone is capable of producing good results. More details are available in a full paper [2] and online1. A longer-term goal of the XG project is to create large-scale musical works as mappings from pre-existing time series arising in nature and human affairs, and from non-musical artforms such as film or still images with a sequential aspect.
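To make the idea of a functional representation concrete, here is a small illustrative sketch, not the XG system itself: control variables derived from time (bar, beat, and a section index tracing an AABA form) are fed through a fixed stand-in for an evolved arithmetic function, and the outputs are quantised to MIDI pitches:

```python
import math

def control_variables(t, beats_per_bar=4):
    """Derive bar/beat control signals from a time step t (in beats)."""
    bar = t // beats_per_bar
    beat = t % beats_per_bar
    # AABA form: 8-bar sections, with the B section providing contrast.
    section = [0, 0, 1, 0][(bar // 8) % 4]
    return bar, beat, section

def evolved_function(bar, beat, section):
    """Stand-in for an evolved arithmetic mapping (sin^2 nodes etc.);
    a real system would evolve this expression rather than fix it."""
    return (math.sin(0.5 * bar) ** 2
            + math.sin(1.3 * beat + section) ** 2)

def note_at(t, low=48, high=72):
    """Quantise the function's numeric output (in [0, 2]) to a MIDI pitch."""
    x = evolved_function(*control_variables(t))
    return low + int(x / 2 * (high - low))

melody = [note_at(t) for t in range(128)]   # one note per beat, 32 bars
print(melody[:16])
```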
2012_39 !2012 MaestroGenesis: Computer-Assisted Musical Accompaniment Generation
Paul A. Szerlip, Amy K. Hoover, and Kenneth O. Stanley
Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816-2362 USA
{paul.szerlip@gmail.com, ahoover@eecs.ucf.edu, kstanley@eecs.ucf.edu}

Abstract
This demonstration presents an implementation of a computer-assisted approach to music generation called functional scaffolding for musical composition (FSMC), whose representation facilitates creative combination, exploration, and transformation of musical ideas and spaces. The approach is demonstrated through a program called MaestroGenesis with a convenient GUI that makes it accessible even to non-musicians. Music in FSMC is represented as a functional relationship between an existing human composition, or scaffold, and a generated accompaniment. This relationship is represented by a type of artificial neural network called a compositional pattern producing network (CPPN). A human user without any musical expertise can then explore how the accompaniment can relate to the scaffold through an interactive evolutionary process akin to animal breeding.

Composing with MaestroGenesis
MaestroGenesis is a program that helps users create complete polyphonic pieces with only the musical expertise necessary to compose a simple, monophonic melody. Users begin creating accompaniments by establishing a scaffold, a melody that provides the initial rhythmic and harmonic seed for the accompaniment. The accompaniment is then represented as a functional transformation of this original scaffold through a method called functional scaffolding for musical composition (FSMC) (Hoover et al. 2012). FSMC exploits the structure already present in the human-composed scaffold by computing a function that transforms its structure into the accompaniment. These FSMC accompaniments are then bred much as animals might be bred. Once the scaffold is chosen, a population of ten accompaniments is displayed, and each is rated as good or bad by pressing the "thumbs-up" button (figure 1). By rating accompaniments with favorable qualities higher than those without, the next generation of accompaniments tends to possess qualities similar to those of the well-liked parents. Through interactively evolving these accompaniments, they grow to reflect the personal inclinations of the user.
Figure 1: MaestroGenesis Candidate Accompaniments. Accompaniments in MaestroGenesis are evolved through a process similar to animal breeding. Candidate accompaniments are evolved ten at a time in an iterative process in which each subsequent generation inherits traits from the previous population.

Conclusion
MaestroGenesis is a program that facilitates creativity in music composition through functional scaffolding for musical composition (FSMC) (Hoover et al. 2012). Accompaniments are evolved through a process similar to animal breeding. The program is available for download at http://maestrogenesis.org.

Acknowledgements
This work was supported in part by the National Science Foundation under grant no. IIS-1002507 and also by an NSF Graduate Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

2012_4 !2012 Automated Generation of Cross-Domain Analogies via Evolutionary Computation
Atılım Güneş Baydin1,2, Ramon López de Mántaras1, Santiago Ontañón3
1Artificial Intelligence Research Institute, IIIA-CSIC, Campus Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
2Departament d'Enginyeria de la Informació i de les Comunicacions, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
3Department of Computer Science, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104, USA
gunesbaydin@iiia.csic.es, mantaras@iiia.csic.es, santi@cs.drexel.edu

Abstract
Analogy plays an important role in creativity and is extensively used in science as well as art. In this paper we introduce a technique for the automated generation of cross-domain analogies based on a novel evolutionary algorithm (EA). Unlike existing work in computational analogy-making, which is restricted to creating analogies between two given cases, our approach is capable, for a given case, of creating an analogy along with the novel analogous case itself. Our algorithm is based on the concept of "memes", units of culture, or knowledge, undergoing variation and selection under a fitness measure, and it represents evolving pieces of knowledge as semantic networks.
Using a fitness function based on Gentner's structure mapping theory of analogies, we demonstrate the feasibility of spontaneously generating semantic networks that are analogous to a given base network. Introduction In simplest terms, analogy is the transfer of information from a known subject (the analogue or base) onto another particular subject (the target), on the basis of similarity. The cognitive process of analogy is considered to be at the heart of many defining aspects of human intellectual capacity, including problem solving, perception, memory, and creativity (Holyoak and Thagard 1996); it has even been argued, by Hofstadter (2001), that analogy is "the core of cognition". Analogy-making ability is extensively linked with creative thought (Hofstadter 1995; Holyoak and Thagard 1996; Ward, Smith, and Vaid 2001; Boden 2004) and plays a fundamental role in discoveries and changes of knowledge in the arts as well as science, with key examples such as Johannes Kepler's explanation of the laws of heliocentric planetary motion with an analogy to light radiating from the Sun1 (Gentner and Markman 1997); or Ernest Rutherford's analogy between the atom and the Solar System2 (Falkenhainer, Forbus, and Gentner 1989). Boden (2004; 2009) classifies analogy as a form of combinational creativity, noting that it works by producing unfamiliar combinations of familiar ideas. 1 Kepler argued that, just as light can travel undetectably on its way between source and destination, and yet illuminate the destination, so can motive force be undetectable on its way from the Sun to a planet, yet affect the planet's motion. 2 The Rutherford-Bohr model of the atom considers electrons to circle the nucleus in orbits like planets around the Sun, with electrostatic forces, rather than gravity, providing attraction. In this paper, we present a technique for the automated generation of cross-domain analogies using evolutionary computation. Existing research on computational analogy is virtually restricted to the discovery and assessment of analogies between a given pair of base case A and target case B (French 2002); an exception is the Kilaza model of O'Donoghue (2004). On the other hand, given a base case A, the approach that we present here is capable of creating a novel analogous case B itself, along with the analogical mapping between A and B. This capability of open-ended creation of novel analogous cases is, to our knowledge, the first of its kind, and it makes our approach highly relevant from a computational creativity perspective. It replicates the psychological observation that an analogy is not always simply "recognized" between an original case and a retrieved analogous case; the analogous case can sometimes be created together with the analogy (Clement 1988). As the core of our approach, we introduce a novel evolutionary algorithm (EA) based on the concept of the "meme" (Dawkins 1989), where the individuals forming the population represent units of culture, or knowledge, that are undergoing variation, transmission, and selection. We represent individuals as simple semantic networks, i.e. directed graphs of concepts and binary relations (Sowa 1991). These go through variation by memetic versions of EA crossover and mutation, which we adapt to work on semantic networks, utilizing the commonsense knowledge bases of ConceptNet (Havasi, Speer, and Alonso 2007) and WordNet (Fellbaum 1998).
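To make the concept-and-relation representation concrete, the following minimal Python sketch stores a semantic network as a set of (relation, source, target) triples. This is an illustrative stand-in only: the authors state later that their implementation uses linked-list data structures of concept and relation objects, and all names here are ours, not theirs.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticNetwork:
    """A directed graph of concepts joined by labelled binary relations,
    stored here as a set of (relation, source, target) triples."""
    triples: set = field(default_factory=set)

    def add(self, relation: str, source: str, target: str) -> None:
        self.triples.add((relation, source, target))

    def concepts(self) -> set:
        return {c for (_, s, t) in self.triples for c in (s, t)}

    def relations_of(self, concept: str) -> set:
        # All triples in which the concept participates, in either role.
        return {tr for tr in self.triples if concept in (tr[1], tr[2])}

# Example: a fragment of the Solar System base domain discussed below.
base = SemanticNetwork()
base.add("Attracts", "sun", "planet")
base.add("Orbits", "planet", "sun")
```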
Defining a memetic fitness measure using analogical similarity from Gentner's psychological structure mapping theory (Gentner and Markman 1997), we demonstrate the feasibility of generating semantic networks that are analogous to a given base network. In this introductory work, we focus on the evolution of analogies using a memetic fitness function promoting analogies. It is worth noting, however, that with different fitness measures, the proposed representation and algorithm can serve as a generic tool for the generation of pieces of knowledge with any desired property that is a quantifiable function of the represented knowledge. Our algorithm can also act as a computational model for experimenting with memetic theories of knowledge, such as evolutionary epistemology and cultural selection theory. After a review of existing research in analogy, evolution, and creativity, the paper introduces details of our algorithm. We then present results and discussion of using the fitness function based on analogical similarity, and conclude with future work and potential applications in creativity. Background Analogy Analogical reasoning has been actively studied from both cognitive and computational perspectives. The dominant school of research in the field, advanced by Gentner (Falkenhainer, Forbus, and Gentner 1989; Gentner and Markman 1997), describes analogy as a structural matching, in which elements from a base domain are mapped to (or aligned with) those in a target domain via structural similarities of their relations. This approach, named structure mapping theory, together with its computational implementation, the Structure Mapping Engine (SME) (Falkenhainer, Forbus, and Gentner 1989), has been cited as the most influential work to date on the modeling of analogy-making (French 2002). Alternative approaches in the field include the coherence-based view developed by Holyoak and Thagard (Thagard et al. 1990; Holyoak and Thagard 1996), in which analogy is considered as a constraint satisfaction problem involving structure, semantic similarity, and purpose; and the view of Hofstadter (1995) of analogy as a kind of high-level perception, where one situation is perceived as another one. Veale and Keane (1997) extend the work in analogical reasoning to the more specific case of metaphors, which describe the understanding of one kind of thing in terms of another. A highly related cognitive theory is the conceptual blending idea developed by Fauconnier and Turner (2002), which involves connecting several existing concepts to create new meaning, operating below the level of consciousness as a fundamental mechanism of cognition. An implementation of this idea is given by Pereira (2007) as a computational model of abstract thought, creativity, and language. According to whether the base and target cases belong to the same or different domains, there are two types of analogy: intra-domain, confined to surface similarities within the same domain; and cross-domain, using deep structural similarities between semantically distant information. While much of the research in artificial intelligence has been restricted to intra-domain analogies (e.g. case-based reasoning), studies in psychology have been more concerned with cross-domain analogical experiments (Thagard et al. 1990).
Evolutionary and Memetic Algorithms Generalizing the mechanisms of the evolutionary process that has given rise to the diversity of life on earth, the approach of Universal Darwinism uses a simple progression of variation, natural selection, and heredity to explain a wide variety of phenomena; it extends the domain of this process to systems outside biology, including economics, psychology, physics, and even culture (Dennett 1995). In terms of application, the metaheuristic optimization method of evolutionary algorithms (EA) provides an implementation of this idea, established as a solid technique for diverse problems in engineering as well as the natural and social sciences (Coello Coello, Lamont, and Van Veldhuizen 2007). In an analogy with the unit of heredity in biological evolution, the gene, the concept of the meme was introduced by Dawkins (1989) as a unit of idea or information in cultural evolution, hosted, altered, and reproduced in individuals' minds, forming the basis of the field of memetics3. 3 Quoting Dawkins (1989): "Examples of memes are tunes, ideas, catch-phrases, clothes fashions, ways of making pots or of building arches. Just as genes propagate themselves in the gene pool by leaping from body to body via sperms or eggs, so memes propagate themselves in the meme pool by leaping from brain to brain..." Within evolutionary computation, the recently maturing field of memetic algorithms (MA) has experienced increasing interest as a method for solving many hard optimization problems (Moscato, Cotta, and Mendes 2004). The existing formulation of MA is essentially a hybrid approach, combining classical EA with local search, where the population-based global sampling of EA in each generation is followed by an individual learning step mimicking cultural evolution, performed by each candidate solution. For this reason, this approach has often been referred to under different names besides MA, such as "hybrid EA" or "Lamarckian EA". To date, MA has been successfully applied to a wide variety of problem domains such as NP-hard optimization problems, engineering, machine learning, and robotics. The potential of an evolutionary approach to creativity has been noted from cultural and practical viewpoints (Gabora 1997; Boden 2009). EA techniques have been shown to emulate creativity in engineering, for example genetic programming (GP), introduced by Koza (2003) as being capable of "routinely producing inventive and creative results"4, as well as in visual art, design, and music (Romero and Machado 2008). In psychology, there are studies providing support for an evolutionary view of creativity, such as the behavioral analysis by Simonton (2003) inferring that scientific creativity constitutes a form of constrained stochastic behavior. The Algorithm Our approach is based on a meme pool comprising individuals represented as semantic networks, subject to variation and selection under a fitness measure. We position our algorithm as a novel memetic algorithm because (1) it is the units of culture, or information, that are undergoing variation, transmission, and selection, very close to the original sense of "memetics" as it was introduced by Dawkins; and (2) this is unlike the existing sense of the word in current MA as a hybridization of individual learning and EA. This algorithm is intended as a new tool focused exclusively on the memetic evolution of knowledge itself, which can find use in knowledge-based systems, reasoning, and creativity.
Our algorithm proceeds similarly to a conventional EA cycle (Algorithm 1), with a relatively small set of parameters. We implement semantic networks as linked-list data structures of concept and relation objects. The descriptions of the representation, fitness evaluation, variation, and selection steps are presented in the following sections. Parameters affecting each step of the algorithm are given in Table 1.

Algorithm 1 Outline of the algorithm
1: procedure MEMETICALGORITHM
2:   P(t = 0) ← INITIALIZE(Popsize, Cmax, Rmin, T)
3:   repeat
4:     Φ(t) ← EVALUATEFITNESSES(P(t))
5:     S(t) ← SELECTION(P(t), Φ(t), Ssize, Sprob)
6:     V(t) ← VARIATION(S(t), Pc, Pm, T)
7:     P(t + 1) ← V(t)
8:     t ← t + 1
9:   until stop criterion
10: end procedure

4 Striking examples of demonstrated GP creativity include the replication of historically important discoveries in engineering, such as the reinvention of negative feedback circuits originally conceived by Harold Black in the 1920s. Representation The algorithm is centered on the use of semantic networks (Sowa 1991) for encoding evolving memotypes. An important characteristic of a semantic network is whether it is definitional or assertional: in definitional networks the emphasis is on taxonomic relations (e.g. IsA(bird, animal)5) describing a subsumption hierarchy that is true by definition; in assertional networks, relations describe instantiations that are contingently true (e.g. AtLocation(human, city)). In this study we combine the two approaches for increased expressivity. As such, semantic networks provide a simple yet powerful means to represent the "memes" of Dawkins as data structures that are algorithmically manipulable, allowing a procedural implementation of memetic evolution. In terms of representation, our approach is similar to several existing graph-based encodings of individuals in EA. The most notable is genetic programming (GP) (Koza et al. 2003), where candidate solutions are computer programs represented in a tree hierarchy. Montes and Wyatt (2004) present a detailed overview of graph-based EA techniques besides GP, including parallel distributed genetic programming (PDGP), genetic network programming (GNP), evolutionary graph generation, and neural programming. Using a graph-based representation makes it necessary to design variation operators specific to graphs. In works such as GNP, this is facilitated by using a string-based encoding of node types and connectivity, permitting operators very close to their counterparts in conventional EA; in PDGP, operations are simplified by making nodes occupy points in a fixed-size two-dimensional grid. What is common to GP-related algorithms is that the output of each node in the graph can constitute an input to another node. In comparison, the range of connections that can form a semantic network of a given set of concepts is limited by commonsense knowledge, i.e. the relations have to make sense to be useful (e.g. IsA(bird, animal) is meaningful while Causes(bird, table) is not). To address this issue, we introduce new crossover and mutation operations for memetic variation, making use of commonsense reasoning (Mueller 2006) and adapted to work on semantic networks. Commonsense Knowledge Bases Commonsense reasoning refers to the type of reasoning involved in everyday thinking, based on commonsense knowledge that an ordinary person is expected to know, or "the knowledge of how the world works" (Mueller 2006).
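A minimal Python rendering of the cycle in Algorithm 1 follows. It is a sketch under the assumption that the four step functions (here passed in as callables, with their Table 1 parameters already bound in) are implemented as described in the following sections; the names are ours, not the authors'.

```python
def memetic_algorithm(initialize, evaluate, select, vary, generations=50):
    """Skeleton of Algorithm 1: INITIALIZE once, then repeat
    EVALUATEFITNESSES -> SELECTION -> VARIATION until a stop criterion
    (here simply a fixed number of generations)."""
    population = initialize()
    for _ in range(generations):
        fitnesses = [evaluate(individual) for individual in population]
        parents = select(population, fitnesses)
        population = vary(parents)  # crossover + mutation + elitism
    return population
```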
Knowledge bases such as the ConceptNet6 project of the MIT Media Lab (Havasi, Speer, and Alonso 2007) and Cyc7, maintained by the Cycorp company, are set up to assemble and classify commonsense information. The lexical database WordNet8, maintained by the Cognitive Science Laboratory at Princeton University, also has characteristics of a commonsense knowledge base, via synonym, hypernym9, and hyponym10 relations (Fellbaum 1998). In our implementation we make use of ConceptNet version 4 and WordNet version 3 to process commonsense knowledge, where ConceptNet contributes around 560,000 definitional and assertional relations involving 320,000 concepts, and WordNet contributes definitional relations involving around 117,000 synsets11. The hypernym and hyponym relations among noun synsets in WordNet provide a reliable collection of IsA relations. In contrast, the variety of assertions in ConceptNet, contributed by volunteers across the world, makes it more prone to noise. We address this by ignoring all assertions with a reliability score (determined by contributors' voting) below a set minimum Rmin (Table 1). 5 Here we adopt the notation IsA(bird, animal) to mean that the concepts bird and animal are connected by the directed relation IsA, i.e. "bird is an animal." 6 http://conceptnet.media.mit.edu 7 http://www.cyc.com 8 http://wordnet.princeton.edu 9 Y is a hypernym of X if every X is a (kind of) Y (IsA(dog, canine)). 10 Y is a hyponym of X if every Y is a (kind of) X. 11 A synset is a set of synonyms that are interchangeable without changing the truth value of any propositions in which they are embedded. Initialization At the start of each run of the algorithm, the population of size Popsize is initialized with individuals created by random semantic network generation (Algorithm 1). This is achieved by starting from a network comprising only one concept, randomly picked from the commonsense knowledge bases, and running a semantic network expansion algorithm that (1) randomly picks a concept in the given network (e.g. human); (2) compiles a list of relations, drawn from the commonsense knowledge bases, that the picked concept can be involved in (e.g. {CapableOf(human, think), Desires(human, eat), ...}); (3) appends to the network a relation randomly picked from this list, together with the other involved concept; and (4) repeats this until a given number of concepts has been appended or a set timeout T has been reached (covering situations where there are not enough relations). Note that even if grown in a random manner, the resulting network itself is entirely meaningful and consistent, because it is a combination of rational information from commonsense knowledge bases. The initialization algorithm depends upon the parameters Cmax, the maximum number of initial concepts, and Rmin, the minimum ConceptNet relation score (Table 1). Fitness Measure Since the individuals in our approach represent knowledge, or memes, the fitness for evolutionary selection is defined as a function of the represented knowledge. For the automated generation of analogies through evolution, we introduce a memetic fitness based on analogical similarity with a given semantic network, utilizing the Structure Mapping Engine (SME) (Falkenhainer, Forbus, and Gentner 1989; Gentner and Markman 1997). Taking the analogical matching score from SME as the fitness, our algorithm can evolve collections of information that are analogous to a given one.
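The initialization procedure described above can be sketched as follows, reusing the SemanticNetwork class from the earlier sketch. The two query callables are hypothetical wrappers around ConceptNet/WordNet lookups (with the Rmin score filter already applied); the loop structure follows steps (1) through (4) in the text.

```python
import random

def initialize_network(random_concept, relations_involving, c_max, timeout):
    """Grow a random but consistent semantic network: start from one
    random concept and repeatedly append a randomly chosen relation
    involving a concept already in the network."""
    net = SemanticNetwork()
    grown = [random_concept()]                   # the single seed concept
    failures = 0
    while len(grown) < c_max and failures < timeout:
        focus = random.choice(grown)             # step (1)
        candidates = relations_involving(focus)  # step (2): (rel, src, tgt)
        if not candidates:
            failures += 1                        # not enough relations
            continue
        relation, source, target = random.choice(candidates)  # step (3)
        net.add(relation, source, target)
        for concept in (source, target):         # step (4): grow and repeat
            if concept not in grown:
                grown.append(concept)
    return net
```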
In SME, an analogy is a one-to-one mapping from the base domain into the target domain; in our fitness measure, these correspond to the semantic network supplied at the start and the individual networks whose fitnesses are evaluated by the function. The mapping is guided by the structure of relations between concepts in the two domains, ignoring the semantics of the concepts themselves, and is based on the systematicity principle, whereby connected knowledge is preferred over independent facts and is assigned a higher matching score. As an example, the Rutherford-Bohr atom and Solar System analogy (Gentner and Markman 1997) would involve a mapping from sun and planet in the first domain to nucleus and electron in the second domain. The labels and structure of relations in the two domains (e.g. {Attracts(sun, planet), Orbits(planet, sun), ...} and {Attracts(nucleus, electron), Orbits(electron, nucleus), ...}) define and constrain the possible mappings between concepts that can be formed by SME. We make use of our own implementation of SME, based on the original description by Falkenhainer et al. (1989), and adapt it to the simple concept-relation structure of semantic networks by mapping the predicate calculus constructs of entities into concepts, relations to relations, and attributes to IsA relations, and by excluding functions. Selection After assigning fitness values to all individuals in the current generation, these are replaced with offspring generated by variation operators applied to parents. The parents are probabilistically selected from the population according to their fitness, with reselection allowed. While individuals with a higher fitness have a better chance of being selected, even those with low fitness have a chance to produce offspring, however small. In our experiments we employ tournament selection (Coello Coello, Lamont, and Van Veldhuizen 2007), meaning that for each selection, a "tournament" is held among a few randomly chosen individuals, and the fitter individual of each successive pair is the winner according to a winning probability (Table 1). In each cycle of the algorithm, crossover is applied to parents selected from the population until Popsize × Pc offspring are created (Table 1). Mutation is applied to Popsize × Pm selected individuals, supplying the remaining part of the next generation (i.e. Pc + Pm = 1). We also employ elitism, replacing a randomly picked offspring in the next generation with the individual with the current best fitness. Variation Operators In contrast with the existing graph-based evolutionary approaches that we have mentioned, our representation does not permit arbitrary connections between different nodes, and it requires variation operators based on information provided by commonsense knowledge bases. This means that any variation operation on the individuals should: (1) preserve the structure within boundaries set by commonsense knowledge; and (2) ensure that even vertices and edges randomly introduced into a semantic network connect to existing ones through meaningful and consistent relations12. Here we present commonsense crossover and mutation operators specific to semantic networks. 12 It should be noted that we depend on the meaningfulness and consistency (i.e. compatibility of relations with others involving the same concepts) of information in the commonsense knowledge bases, which should be ensured during their maintenance.
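The tournament selection and elitism described in the Selection paragraph above might be sketched as below; this is our reading of the probabilistic tournament (the fitter of each successive pair wins with probability Sprob), not the authors' code.

```python
import random

def tournament_select(population, fitnesses, s_size, s_prob):
    """One tournament among s_size random entrants; low-fitness
    individuals keep a small chance of winning when s_prob < 1."""
    entrants = random.sample(range(len(population)), s_size)
    winner = entrants[0]
    for challenger in entrants[1:]:
        fitter, weaker = ((winner, challenger)
                          if fitnesses[winner] >= fitnesses[challenger]
                          else (challenger, winner))
        winner = fitter if random.random() < s_prob else weaker
    return population[winner]

def apply_elitism(offspring, best_individual):
    """Replace one randomly picked offspring with the current best."""
    offspring[random.randrange(len(offspring))] = best_individual
    return offspring
```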
Commonsense Crossover In classical EA, the features representing individuals are commonly encoded as linear strings, and the crossover operation simulating genetic recombination is simply a cutting and merging of this one-dimensional object from two parents. In graph-based approaches such as GP, subgraphs can be freely exchanged between parent graphs (Koza et al. 2003; Montes and Wyatt 2004). Here, as mentioned, the requirement that a semantic network has to make sense imposes significant constraints on the nature of recombination. We introduce two types of commonsense crossover that are tried in sequence by the variation algorithm. The first type attempts a subgraph interchange between two selected parents, similar to common crossover in standard GP; where this is not feasible due to the commonsense structure of relations forming the parents, the second type falls back to a combination of both parents into a new offspring. Type I (subgraph crossover): A pair of concepts, one from each parent, that are interchangeable13 are selected as crossover concepts, picked randomly out of all possible such pairs. For instance, in Figure 1, bird and airplane are interchangeable, since they can replace each other in the relations CapableOf(·, fly) and AtLocation(·, air). In each parent, a subgraph is formed, containing: (1) the crossover concept; (2) the set of all relations, and associated concepts, that are not common with the other crossover concept (in Figure 1 (a), HasA(bird, feather) and AtLocation(bird, forest); and in (b), HasA(airplane, propeller), MadeOf(airplane, metal), and UsedFor(airplane, travel)); and (3) the set of all relations and concepts connected to these (in Figure 1 (a), PartOf(feather, wing) and PartOf(tree, forest); and in (b), MadeOf(propeller, metal)), excluding those that are also common with the other crossover concept (the concept fly in Figure 1 (a), because of the relation CapableOf(·, fly)). This, in effect, forms a subgraph of information specific to the crossover concept, which is insertable into the other parent. Any relations between the subgraph and the rest of the network not going through the crossover concept are severed (e.g. UsedFor(wing, fly) in Figure 1 (a)). The two offspring are formed by exchanging these subgraphs between the parent networks (Figure 1 (c) and (d)). Type II (graph merging crossover): A concept from each parent that is attachable14 to the other parent is selected as a crossover concept. The two parents are merged into an offspring by attaching a concept in one parent to another concept in the other parent, picked randomly out of all possible attachments (CreatedBy(art, human) in Figure 2; another possibility is Desires(human, joy)). The second offspring is formed in the same way. In the case that no attachable concepts are found, the parents are merged as two separate clusters within the same semantic network. Commonsense Mutation We introduce several types of commonsense mutation operators that modify a parent by means of information from commonsense knowledge bases. For each mutation to be performed, the type is picked at random with uniform probability. 13 We define two concepts from different semantic networks as interchangeable if both can replace the other in all, or part, of the relations the other is involved in, queried from commonsense knowledge bases. 14 We define a distinct concept as attachable to a semantic network if at least one commonsense relation connecting the concept to any of the concepts in the network can be discovered from commonsense knowledge bases.
If the selected type of mutation is not feasible due to the commonsense structure of the parent, another type is again picked. In the case that a set timeout of T trials has been reached without any operation, the parent is returned as it is. Type I (concept attachment): A new concept randomly picked from the set of concepts attachable to the parent is attached through a new relation to one of the existing concepts (Figure 3 (a) and (b)). Type IIa (relation addition): A new relation connecting two existing concepts in the parent is added, possibly connecting unconnected clusters within the same network (Figure 3 (c) and (d)). Figure 1: Commonsense crossover type I (subgraph crossover), centered on the concepts of bird for parent 1 and airplane for parent 2; panels show (a) parent 1, (b) parent 2, (c) offspring 1, and (d) offspring 2. Figure 2: Commonsense crossover type II (graph merging crossover), merging by the relation CreatedBy(art, human); panels show (a) parent 1, (b) parent 2, and (c) the offspring. If no concepts attachable through commonsense relations are encountered, the offspring is formed by merging the parent networks as two separate clusters within the same semantic network. Type IIb (relation deletion): A randomly picked relation in the parent is deleted, possibly leaving unconnected clusters within the same network (Figure 3 (e) and (f)). Type IIIa (concept addition): A randomly picked new concept is added to the parent as a new cluster (Figure 3 (g) and (h)). Type IIIb (concept deletion): A randomly picked concept is deleted with all the relations it is involved in, possibly leaving unconnected clusters within the same network (Figure 3 (i) and (j)). Type IV (concept replacement): A concept in the parent, randomly picked from the set of those with at least one interchangeable concept, is replaced with one (randomly picked) of its interchangeable concepts. Any relations left unsatisfied by the new concept are deleted (Figure 3 (k) and (l)). Results and Discussion In this introductory study, we adopt values for crossover and mutation probabilities similar to earlier studies in graph-based EA (Koza et al. 2003; Montes and Wyatt 2004) (Table 1). We use a crossover probability of Pc = 0.85 and a somewhat-above-average mutation rate of Pm = 0.15, accounting for the high tendency of mutation postulated in the memetic literature15. In our experiments, we subject a population of Popsize = 200 individuals to tournament selection with tournament size Ssize = 8 and winning probability Sprob = 0.8.
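The retry-with-timeout dispatch over the mutation types described above can be sketched as follows; the operator callables for types I through IV are assumed to raise an exception when the parent's commonsense structure admits no valid edit (all names hypothetical).

```python
import random

class Infeasible(Exception):
    """Raised by an operator when no valid edit of its type exists."""

def commonsense_mutate(parent, operators, timeout):
    """Pick a mutation type uniformly at random; if it is infeasible
    for this parent, pick again, and after `timeout` failed trials
    return the parent unchanged, as described in the text."""
    for _ in range(timeout):
        mutate = random.choice(operators)  # types I, IIa, IIb, IIIa, IIIb, IV
        try:
            return mutate(parent)
        except Infeasible:
            continue
    return parent
```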
Using this parameter set, we present the results from two experimental runs: evolved analogies for a network describing some basic astronomical knowledge are shown in Figure 4, and for a network of familial relations in Figure 5. 15 See Gil-White (2008) for a review and discussion of mutation in memetics. Figure 3: Examples illustrating the types of commonsense mutation used in this study (before/after pairs for types I, IIa, IIb, IIIa, IIIb, and IV). We show in Figure 6 (a) the progress of the best and average fitness in the population during the run that produced the results in Figure 4. The best and average size of the semantic networks forming the individuals are shown in Figure 6 (b). We observe that evolution asymptotically reaches a fitness plateau after about 40 generations. This coincides roughly with the point where the size of the best individual (13-14) becomes comparable with that of the given base semantic network (11, in Figure 4), after which improvements in the one-to-one analogy become sparser and less feasible. We also note that, between generations 21-34, the best network size actually gets smaller, demonstrating the possibility of improvement in network configuration without adding further nodes. Our experiments demonstrate that the proposed algorithm is capable of spontaneously creating collections of knowledge analogous to the one given in a base semantic network, with very good performance. In most cases, our implementation was able to reach extensive analogies within 50 generations and reasonable computational time. From an analogical reasoning viewpoint, the algorithm achieves the generation of diverse novel cases analogous to a given case. Compared with the Kilaza model of O'Donoghue (2004) for finding novel analogous cases, which works by evaluating possible analogies to a given target case from a collection of candidate source domains that are assumed to be available, our approach is capable of open-ended and spontaneous creation of analogous cases from the ground up, replicating an essential mode of creative behavior observed in psychology (Clement 1988). An important result is that, even if the use of commonsense knowledge in our algorithm was prompted by concerns that are practical in nature (i.e.
restrictions on the meaningfulness and consistency of memetic variation by the introduced crossover and mutation operators), it eventually serves to tackle a very fundamental and long-standing problem in computational creativity: as put forth by Boden (2009), "no current AI system has access to the rich and subtly structured stock of concepts that any normal adult human being has built up over a lifetime" and "what's missing, as compared with the human mind, is the rich store of world knowledge (including cultural knowledge) that's often involved." We believe that the inherent commonsense reasoning element in our approach provides a means to address this criticism of the lack of world knowledge in computational approaches to creativity. Conclusions and Future Work We have presented a novel evolutionary algorithm that employs semantic networks as evolving individuals, paralleling the model of cultural evolution in the field of memetics. This algorithm is, to our knowledge, the first of its kind. The use of semantic networks provides a suitable basis for implementing variation and selection of memes as put forth by Dawkins (1989). We have introduced preliminary versions of variation operators that work on this representation, utilizing knowledge from commonsense knowledge bases. We have also contributed a memetic fitness measure based on the structure mapping theory from psychology. Even if it is an intuitive fact that human culture and knowledge are evolving with time, existing models of culture, in their current state, are too minimalistic and weak in their descriptions of individual creativity and novelty; conversely, theories modeling individual creativity lack consideration of cultural transmission and replication (Gabora 1997). We believe that studies exploring creativity with evolutionary approaches have the potential to bridge this gap. Figure 4: Experiment 1: (a) given semantic network, 10 concepts, 11 relations (base domain); (b) evolved individual, 9 concepts, 9 relations (target domain). The evolved individual is encountered after 35 generations, with fitness value 2.8. Concepts and relations of the individual not involved in the analogy are not shown here for clarity. Table 1: Parameters used during experiments. Evolution: population size (Popsize) = 200, crossover probability (Pc) = 0.85, mutation probability (Pm) = 0.15. Semantic networks: max. initial concepts (Cmax) = 5, min. relation score (Rmin) = 2.0, timeout (T) = 10. Selection: type = tournament, tournament size (Ssize) = 8, tournament win prob. (Sprob) = 0.8, elitism employed. In future work, an interesting possibility is to start the random semantic network generation procedure with several given concepts, allowing the discovery of cases formed around a particular set of seed concepts.
The simple fitness function used in this introductory study can be extended to take graph-theoretical properties of semantic networks into account, such as the number of nodes or edges, shortest path length, or the clustering coefficient. Figure 5: Experiment 2: (a) given semantic network, 11 concepts, 11 relations (base domain); (b) evolved individual, 10 concepts, 9 relations (target domain). The evolved individual is encountered after 42 generations, with fitness value 2.7. Concepts and relations of the individual not involved in the analogy are not shown here for clarity. The research would also benefit from exploring different types of mutation and crossover, and from grounding the design of such operators in existing theories of cultural transmission and variation, discussed in sociological theories of knowledge. A direct and very interesting application of our approach would be to devise experiments with realistically formed fitness functions modeling selectionist theories of knowledge, which remain untested to this day. One such theory is the evolutionary epistemology of Campbell (Bickhard and Campbell 2003), describing the development of human knowledge and creativity through selectionist principles such as blind variation and selective retention (BVSR). Figure 6: Evolution of (a) fitness and (b) semantic network size during the course of an experiment with parameters given in Table 1. Filled circles represent the best individual in a generation, empty circles represent the population average. Network size is taken to be the number of relations (edges). Acknowledgments This work was supported by a JAE-Predoc fellowship from CSIC, and the research grants 2009-SGR-1434 from the Generalitat de Catalunya, CSD2007-0022 from MICINN, and Next-CBR TIN2009-13692-C03-01 from MICINN. References Bickhard, M. H., and Campbell, D. T. 2003. Variations in variation and selection: the ubiquity of the variation-and-selective-retention ratchet in emergent organizational complexity. Foundations of Science 8:215-282. 2012_40 !2012 CrossBee: Cross-Context Bisociation Explorer Matjaž Juršič1,2, Bojan Cestnik3,1, Tanja Urbančič4,1, Nada Lavrač1,4 1 Jožef Stefan Institute, Ljubljana, Slovenia 2 International Postgraduate School Jožef Stefan, Ljubljana, Slovenia 3 Temida d.o.o., Ljubljana, Slovenia 4 University of Nova Gorica, Nova Gorica, Slovenia {matjaz.jursic, bojan.cestnik, tanja.urbancic, nada.lavrac}@ijs.si CrossBee is an exploration engine for text mining and cross-context link discovery, implemented as a Web application with a user-friendly interface.
The system supports the expert in advanced document exploration, providing document retrieval, analysis and visualization. It enables document retrieval from public databases like PubMed, as well as by querying the Web, followed by document cleaning and filtering through several filtering criteria. Document analysis includes document presentation in terms of statistical and similarity-based properties, and topic ontology construction through document clustering. A distinguishing feature of CrossBee is its powerful cross-context and cross-domain document exploration facility and bisociative (Koestler 1964) term discovery, aimed at finding potential cross-domain linking terms/concepts. Term ranking based on an ensemble heuristic (Juršič et al. 2012) enables the expert to focus on the terms with the highest potential for cross-context link discovery. CrossBee's document visualization and user interface customization additionally support the expert in finding relevant documents and terms, through similarity graph visualization, a color-based domain separation scheme, and highlighted top-ranked bisociative terms. A typical user scenario starts by inputting two sets of documents of interest and by setting the parameters of the system. The required input is a file with documents from the two domains. Each line of the file contains exactly three tab-separated entries: (a) a document identification number, (b) a domain acronym, and (c) the document text. The other options available to the user include specifying the exact preprocessing options, specifying the base heuristics to be used in the ensemble, specifying outlier documents identified by external outlier detection software, defining the already known bisociative terms (b-terms), and others. Next, CrossBee performs a computationally very intensive step in which it prepares all the data needed for the fast subsequent exploration phase. During this step the actual text preprocessing, base heuristics, ensemble, bisociation scores and rankings are computed. This step does not require any user intervention. After this computation, the user is presented with a ranked list of b-term candidates. The list provides the user with additional information, including the ensemble's individual base heuristic votes and the term's occurrence statistics in both domains. The user then browses through the list and chooses the term(s) he believes to be promising b-terms, i.e. terms for finding meaningful connections between the two domains. At this point, the user can start inspecting the actual appearances of the selected term in both domains, using the efficient side-by-side document inspection. In this way, he can verify the rationale behind selecting this term. CrossBee is available at http://crossbee.ijs.si/. The system's home page is shown below. 2012_41 !2012 Computer Software for Measuring Creative Search Kyle E. Jennings Department of Psychology University of California, Davis Davis, CA 95616 USA kejennings@ucdavis.edu The creative process can be thought of as the search through a space of possible solutions for one that best satisfies the problem criteria. To better understand this search process, two face-valid creative tasks have been created, both of which track the intermediate configurations that creators explore. These data, called search trajectories, may yield valuable insights into the creative process.
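A search trajectory is, at minimum, a time-stamped sequence of the intermediate configurations a creator visits. The tasks' storage format is not specified here, so the following minimal representation is purely illustrative (all names hypothetical):

```python
import time
from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    """One intermediate configuration visited during a creative task,
    e.g. (camera angle, camera zoom, light angle) in the orbital task."""
    timestamp: float
    configuration: tuple

@dataclass
class SearchTrajectory:
    steps: list = field(default_factory=list)

    def record(self, configuration) -> None:
        self.steps.append(TrajectoryStep(time.time(), tuple(configuration)))

# Usage: log every change the participant makes.
trajectory = SearchTrajectory()
trajectory.record((45.0, 1.2, 210.0))
```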
This demonstration allows visitors to try both tasks and to see the sorts of data that are produced. The first task is a computerized version of Amabile's (1982) popular collage task, wherein participants make themed collages using colored shapes (see Figure 1). The software allows the shapes to be moved, rotated, flipped, and stacked using an intuitive mouse-based interface. The creator's moves can be characterized according to the extent to which the set of shape movements actually performed exceeds the minimal set of movements needed to produce the final collage. Figure 1: Intermediate screen of the collage task. The white area represents a piece of paper and the gray area is a work area. Initially all of the shapes are stacked in the work area, similar to the triangles in the upper-right corner. The second task, called the orbital composition task (Jennings 2010; Jennings, Simonton, and Palmer 2011), involves arranging a camera and light that lie in fixed circular orbits around a set of objects. The configuration space has only three dimensions (camera angle, camera zoom, and light angle), but the scene is constructed in a way that permits many interesting and varied images (see Figure 2). While less face-valid than the collage task, the orbital task's simplicity permits more consistent analyses and makes it possible to collect ratings from a standardized subset of the space, thereby providing a sense of the search landscape topology that can help overcome some of the ambiguities inherent in analyzing search trajectories alone (see Jennings 2012). Figure 2: Final images from the orbital composition task as selected by four different research participants. 2012_42 !2012 ANGELINA: Coevolution in Automated Game Design Michael Cook and Simon Colton Computational Creativity Group, Imperial College, London (ccg.doc.ic.ac.uk) Figure 1: Screenshot from a game about a murdered aid worker from Scotland. The background image is of the Scottish landscape, and a red ribbon image has been selected to represent the aid charity. ANGELINA ANGELINA is a co-operative co-evolutionary system for automatically creating simple videogames. It has previously been used to design both simple arcade-style games and two-dimensional platformers. In the past, ANGELINA's efforts have been focused on mechanical aspects of design, such as level creation, rule selection and enemy design. We are now in the process of expanding ANGELINA's remit to cover other aspects of videogame design, including aesthetic problems such as art direction and the selection and use of external media to evoke emotion or communicate meaning. ANGELINA has produced several new games for this demonstration, exemplifying the new abilities the system now has. Its co-operative co-evolutionary system for platform games is composed of four modules: (i) a level designer that places solid blocks and locked doors to shape the progress of the player; (ii) a layout designer that places and designs the enemies the player faces, as well as the start and end of the level; (iii) a powerup designer that defines what bonus items the player can acquire during gameplay; and (iv) a creative direction module that arranges a set of media resources in the level for the player to discover during gameplay. This latter module is the newest addition to the system, and takes advantage of many new capabilities built into ANGELINA for retrieving content from the web dynamically for use in themed videogames.
Design Task Inspired by the collage-creation problem described in (Cook and Colton 2011), ANGELINA obtains current affairs articles by accessing the website of the British newspaper The Guardian. It selects a news story and attempts to design a short platform game whose theme is inspired by the selected news article. Currently, this allows ANGELINA to demonstrate simple abilities such as the appropriate selection of media from a wide variety of sources, and their arrangement in a potentially nonlinear level space. Figure 2: Media retrieved for a game inspired by an inquiry into a newspaper. Left: an image retrieved using the phrase 'newspapers and magazines'. On the right is Rebekah Brooks, one of the journalists in the investigation. ANGELINA uses online knowledge sources such as Wikipedia to extract additional information about data retrieved from the news articles. It can, for instance, identify when a country is the subject of a news article, allowing the system to search photography websites such as Flickr for photographs of that country to use as a backdrop to the game. Keyword-based searches can also be augmented with emotional keywords to alter the results they return, based on techniques described in (Cook and Colton 2011). By reading live Twitter search results about a named person in the news article, ANGELINA can use search augmentation appropriate to the opinions it finds, to retrieve media that reflect perceived public opinion of a particular topic. Although a simple technique, it is a first step towards the system dealing with opinion and bias through the work it produces. Games The games produced are simple platform games, loosely following the design tenets of the Metroidvania subgenre. The player must navigate the level space to reach the exit, but in order to gain access to later level sections, it is necessary to seek out and obtain items that add to the player's capabilities (for example: unlocking doors or changing the player's jumping abilities). As the player explores further, they will encounter enemies, as well as image and sound content appropriate to the game's theme. ANGELINA is implemented in Java, but the games the system produces are Flash-based. When ANGELINA has evolved a game design, it modifies an existing ActionScript game template to include the generated design content, as well as incorporating the media downloaded and selected from the internet. All of the games available in the demonstration, as well as others developed by the system, are available on the project website: www.gamesbyangelina.org 2012_43 !2012 SynAPP: An online web application for exploring creativity Alberto Gael Abadín Martínez, Universidade de Vigo (Spain) / AGH-UST (Cracow, Poland) Bipin Indurkhya, AGH University of Science and Technology (Cracow, Poland) Juan Carlos Burguillo Rial, Universidade de Vigo (Spain) DESCRIPTION SynAPP is a web application currently hosted at AGH-UST (http://149.156.205.250:15180) designed to stimulate users' creative skills through image association tasks and a rating feedback system. In SynAPP users perform two tasks related to image-image associations: -Associating two images using a word or short phrase. The two images can be presented simultaneously, left and right, or sequentially with a five-second delay in between. The user is allowed to make only one association per image pair. -Evaluating associations generated by other users according to two criteria: originality (0, 0.5 or 1 points) and intelligibility (0, 0.5 or 1 points).
The sets of image pairs for these two tasks are mutually disjoint, so if a user generates an association for an image pair, then she or he does not evaluate associations generated by other users for the same pair, and vice versa. All the responses are recorded with their respective time stamps, and the time taken to perform each association is also recorded. A user can see how her or his associations were rated (with respect to their originality and intelligibility) by other users, and also how this evaluation evolved over time. This information is shown in an intuitive way using tables and graphs. Users perform three standard tests of creativity before and after using SynAPP: -Will Shortz & Morgan Worthy's word equation (ditloid) puzzles, like "24 = H. in O. D." ("24 = Hours in One Day"). Different equations are used for the before-SynAPP and after-SynAPP tests. -Guilford's alternative uses task: the user is asked to give as many uses as possible for a common item. Different objects are used for the before-SynAPP and after-SynAPP tests. -Wallace & Kogan's assessment of creativity: a test similar to Guilford's, but the user is asked to find objects with a common property instead. The answers given by each user are evaluated by the other users, similarly to the image associations, and statistics on these evaluations are also displayed graphically to the user. We hypothesize that SynAPP helps users to improve their creative, out-of-the-box divergent thinking cognitive abilities, and our goal is to properly evaluate this hypothesis based on the analysis of the data collected from the association tasks and the creativity tests. APPLICATION WORKFLOW 2012_44 !2012 Co-creating Game Content using an Adaptive Model of User Taste Antonios Liapis, Georgios N. Yannakakis, and Julian Togelius Center for Computer Games Research IT University of Copenhagen Rued Langgaards Vej 7, 2300 Copenhagen, Denmark {anli, yannakakis, juto}@itu.dk Mixed-initiative procedural content generation can augment and assist human creativity by allowing the algorithm to take care of the mechanisable parts of content creation, such as consistency and playability checking. But it can also enhance human creativity by suggesting new directions and structures, which the designer can choose to adopt or not. The proposed framework generates spaceship hulls and their weapon and thruster topologies in order to match a user's visual taste as well as conform to a number of constraints aimed at playability and game balance. The 2D shapes representing the spaceship hulls are encoded as compositional pattern-producing networks (CPPNs) and evolved in two populations using the feasible-infeasible two-population approach (FI-2pop). One population contains spaceships which fail ad-hoc constraints pertaining to rendering, physics simulation and game balance, and individuals in this population are optimized towards minimizing their distance to feasibility. The second population contains feasible spaceships, which are optimized according to ten fitness dimensions pertaining to common attributes of visual taste such as symmetry, weight distribution, simplicity and size. These fitness dimensions are aggregated into a weighted sum which is used as the feasible population's fitness function; the weights in this quality approximation are adjusted according to a user's selection among a set of presented spaceships.
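As an illustration of the weighted-sum aggregation and its user-driven adjustment, consider the toy sketch below. The adaptation rule here (nudge weights toward dimensions where the selected spaceship outscores the average unselected one, then renormalize) is our simplification; the published model fits the weights differently (see the Liapis et al. papers cited under Related Work).

```python
def weighted_fitness(scores, weights):
    """Aggregate the ten per-dimension scores of a feasible spaceship
    into the single fitness used by the feasible population."""
    return sum(w * s for w, s in zip(weights, scores))

def adapt_weights(weights, selected, unselected, rate=0.1):
    """Toy update after one user selection. `selected` is the chosen
    spaceship's score vector; `unselected` is a non-empty list of the
    score vectors of the spaceships the user passed over."""
    averages = [sum(u[i] for u in unselected) / len(unselected)
                for i in range(len(weights))]
    raw = [max(0.0, w + rate * (selected[i] - averages[i]))
           for i, w in enumerate(weights)]
    total = sum(raw) or 1.0
    return [w / total for w in raw]  # renormalize so weights sum to 1
```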
This adaptive aesthetic model aims to enhance the visual patterns behind the user's selection and minimize the visual patterns of unselected content, thus generating a completely new set of spaceships which more accurately match the user's tastes. A small number of user selections allows the system to recognize the user's preferences, minimizing user fatigue. The proposed two-step adaptation system, where (1) the user implicitly adjusts their preference model through content selection and (2) the preference model affects the patterns of generated content, should demonstrate the potential of a flexible tool both for personalizing game content to an end-user's visual taste and for inspiring a designer's creative task with content guaranteed to be playable, novel and yet conforming to the intended visual style. Related Work A. Liapis, G. N. Yannakakis, and J. Togelius, "Adapting Models of Visual Aesthetics for Personalized Content Creation," IEEE Transactions on Computational Intelligence and AI in Games, Special Issue on Computational Aesthetics in Games, 2012 (to appear). A. Liapis, G. N. Yannakakis, and J. Togelius, "Optimizing Visual Properties of Game Content Through Neuroevolution," in Artificial Intelligence for Interactive Digital Entertainment Conference, 2011. A. Liapis, G. N. Yannakakis, and J. Togelius, "Neuroevolutionary Constrained Optimization for Content Creation," in Computational Intelligence and Games (CIG), 2011 IEEE Conference on, 2011, pp. 71-78. Figure 1: The fitness dimensions used to evaluate spaceships' visual properties and sample spaceships optimized for each fitness dimension. Weapons are displayed in green and thrusters in red. Figure 2: The graphical user interface for spaceship selection. 2012_5 !2012 Cross-domain Literature Mining: Finding Bridging Concepts with CrossBee Matjaž Juršič1,2, Bojan Cestnik3,1, Tanja Urbančič4,1, Nada Lavrač1,4 1 Jožef Stefan Institute, Ljubljana, Slovenia 2 International Postgraduate School Jožef Stefan, Ljubljana, Slovenia 3 Temida d.o.o., Ljubljana, Slovenia 4 University of Nova Gorica, Nova Gorica, Slovenia {matjaz.jursic, bojan.cestnik, tanja.urbancic, nada.lavrac}@ijs.si Abstract In literature-based creative knowledge discovery, one of the challenging tasks is to identify interesting bridging terms or concepts which relate different domains. To find these bridging concepts, our cross-domain literature mining approach assumes that one first has to identify two seemingly unrelated domains of interest. Bridging terms, found in the intersection of these domains, are then ranked according to their potential to uncover useful, previously unexplored links between the two domains. Term ranking, based on the voting of an ensemble of heuristics, is the main functionality of the CrossBee (Cross-Context Bisociation Explorer) system presented in this paper. The utility of the proposed approach is showcased by exploring scientific papers on migraine and magnesium, which is a reference domain in literature mining. Introduction This paper1 investigates the creative knowledge discovery process which has its grounds in Mednick's associative creativity theory (Mednick 1962) and Koestler's domain-crossing associations, called bisociations (Koestler 1964). Mednick defines creative thinking as the capacity of generating new combinations of distinct associative elements (concepts). He explains how thinking about concepts that are not strictly related to the elements under investigation inspires unforeseen useful connections between these elements.
On the other hand, according to Koestler, a bisociation is a result of creative processes of the mind when making completely new associations between concepts from domains that are usually considered separate. Consequently, discovering bisociations may considerably improve creative discovery processes. According to Koestler, through the history of science this mechanism has been a crucial element of progressive insights and paradigm shifts. 1 This work was supported by the European Commission under the 7th Framework Programme FP7-ICT-2007-C FET-Open project BISON-211898, and Slovenian Research Agency grant Knowledge Technologies (P2-0103). The approach to creative knowledge discovery from text documents presented in this paper is based on identifying and exploring terms which have the potential to relate different domains of interest, i.e., two distinct domain literatures. While in general literature refers to any document corpus (articles, novels, stories, etc.), our approach to cross-domain literature mining focuses on the task of mining scientific papers in the so-called closed discovery2 setting (Weeber et al., 2001), where two domains of interest, A and C, are identified by the expert prior to starting the knowledge discovery process, and the goal is to find interesting bridging terms that relate the two literatures. 2 In contrast with closed discovery, open discovery leads the creative knowledge discovery process from a given starting domain towards a yet unknown second domain, which at the end of this process turns out to be connected with the first one. Weeber et al. (2001) have followed the work on literature-based knowledge discovery in medical domains by Swanson (1986), who designed the so-called ABC model approach to investigate whether the phenomenon of interest C in the first domain is related to some phenomenon A in the other literature through some interconnecting phenomenon B addressed in both literatures. If the literature about C relates C with B, and the literature about A relates A with the same B, then combining these relations may suggest a relation between C and A. If closer inspection confirms that an uncovered relation between C and A is new, meaningful and interesting, this can be viewed as new evidence or considered as a new piece of knowledge. Smalheiser and Swanson (1998) developed an online system, ARROWSMITH, which takes as input two sets of titles from disjoint domains A and C and lists terms that are common to literatures A and C; the resulting bridging terms (b-terms, forming set B) are further investigated for their potential to generate new scientific hypotheses (see an example in Figure 1). Investigation of pairs of documents might seem rather straightforward, as in the example documents titled "Migraine treatment with calcium channel blockers" (Anderson et al., 1986) and "Magnesium: nature's physiologic calcium blocker" (Iseri and French, 1984). However, it should be left to domain experts to check whether the bridging term calcium channel blocker suggests a valid, new and interesting relation (in this case, the relation that migraine could be treated with magnesium). To this end, it is helpful not just to identify a set of candidate bridging terms B between literatures A and C, but also to provide the expert with easy access to the documents to be checked, and to support this laborious process by ranking bridging term candidates so that the exploration starts from the most promising terms. Figure 1:
Figure 1. Gold standard cross-domain literature mining example: migraine (domain C) on the left, magnesium (domain A) on the right, and in between a selection of bridging terms B as identified by Swanson et al. (2006).

The approach presented in this paper is closely related to bridging term identification in the RaJoLink system (Urbančič et al. 2007; Petrič et al. 2009). RaJoLink can be used to identify interesting scientific articles in the PubMed database, to compute different statistics, and to analyze the articles with the aim of discovering new knowledge. The RaJoLink method involves three principal steps, Ra, Jo and Link, named after the key elements of each step: Rare terms, Joint terms and Linking terms, respectively. In the Ra step, interesting rare terms in the literature about the phenomenon C under investigation are identified. In the Jo step, all available articles about the selected rare terms are inspected, and interesting joint terms that appear in the intersection of the literatures about the rare terms are identified as candidates for A. This results in a candidate hypothesis that C is connected with A. To provide explanations for hypotheses generated in the Jo step, in the final Link step the method searches for b-terms linking literatures A and C. Note that the Ra and Jo steps implement open discovery, while the Link step corresponds to the closed discovery process of searching for b-terms when A and C are already known (as illustrated in Figure 1). Focusing on the closed discovery process, the method proposed in this paper aims at finding bridging terms in documents of two given domains A and C, enabling the exploration of potentially interesting bisociative links between the given domains with the aid of an ensemble of new heuristics for bridging term discovery. Term ranking, based on the voting of an ensemble of heuristics, is the main functionality of the new CrossBee (Cross-Context Bisociation Explorer) system presented in this paper. To verify the utility of the proposed approach, CrossBee was tested on the problem of rediscovering links between the migraine and magnesium literatures, first explored by Swanson (1986) and later by numerous other authors, including Weeber et al. (2001) and Urbančič et al. (2009).

This paper is organized as follows. Section 2 presents and relates two creative knowledge discovery frameworks: Koestler's bisociative link discovery (Koestler 1964) and Swanson's ABC model of closed discovery in literature mining (Swanson 1986). It also relates our work to Boden's definition of creativity (Boden 1992) and Wiggins's definition of computational creativity (Wiggins 2006). Section 3 presents the heuristics used for selecting the most promising bridging concepts (bridging terms or b-terms) in the intersection of two different sets of documents (two domains of interest), evaluated on the migraine-magnesium domain pair explored originally in Swanson's research. It also presents an ensemble heuristic composed of six selected elementary heuristics. Section 4 presents the functionality and implementation of our system CrossBee for cross-context bridging term discovery. We conclude with a discussion and directions for further work.

Koestler's Bisociations, Cross-domain Literature Mining and Computational Creativity
Let us present some background on the mechanism of bisociative reasoning, which is at the heart of creative, accidental discovery, referred to as serendipity by Roberts (1989).
Bisociative discovery, studied in this work, is focused on finding unexpected terms/concepts linking different domains. Scientific discovery requires creative thinking to connect seemingly unrelated information, for example by using metaphors or analogies between concepts from different domains. These modes of thinking allow the mixing of conceptual categories or domains which are normally separated. One of the functional bases for these modes is the idea of bisociation, coined by Arthur Koestler (1964): "The pattern . . . is the perceiving of a situation or idea, L, in two self-consistent but habitually incompatible frames of reference, M1 and M2. The event L, in which the two intersect, is made to vibrate simultaneously on two different wavelengths, as it were. While this unusual situation lasts, L is not merely linked to one associative context but bisociated with two."

[Figure 1 depicts the literature about migraine (domain C), the literature about magnesium (domain A), and between them a selection of bridging terms B, including 5-hydroxytryptamine, prostaglandin, serotonin and calcium channel blocker.]

Koestler investigated bisociation as the basis for human creativity in seemingly diverse human endeavors, such as humor, science, and the arts. In this paper we explore a specific pattern of bisociation (Berthold 2012): terms, appearing in documents, which represent bisociative links between concepts of different domains, where the creative act is to find links which lead ‘out of the plane' in Koestler's terms, i.e., links which cross two or more different domains. Following Berthold (2012), we claim that two concepts are bisociated if (a) there is no direct, obvious evidence linking them, (b) one has to cross domains to find the link, and (c) this new link provides some novel insight into the problem domain. We explore an approach to bisociative cross-domain link discovery based on the identification and ranking of terms which have the potential to act as previously unexplored links between different predefined domains of expertise. In terms of Swanson's ABC model used in literature mining, this is an approach to closed knowledge discovery, where two domains of interest, A and C, are identified by the expert in advance. In terms of Koestler's model, the two domains, A and C, correspond to the two habitually incompatible frames of reference, M1 and M2. Moreover, the linking terms (called bridging terms or b-terms in this paper) that are common to literatures A and C, explored by Smalheiser and Swanson (1998), clearly correspond to Koestler's notion of a situation or idea, L, which is not merely linked to one associative domain but bisociated with the two domains M1 and M2. Since our work originates from Koestler's definition of the creative process, it naturally satisfies his notion of creativity. However, the concepts of creativity and computational creativity have several other definitions. We argue that our approach can be labeled creative according to at least two other definitions, introduced by Boden (1992) and Wiggins (2006). Boden (1992) defines creativity as "the ability to come up with ideas or artefacts that are new, surprising and valuable." Considering this definition, and given that the main output of our methodology is a ranked list of potentially interesting bridging terms/concepts, we argue that, although we do not produce new concepts, the ranking of potentially interesting bridging concepts may itself represent a new, surprising and valuable idea or artefact.
The proposed approach produces new term rankings because, to the best of our knowledge, no similar methodologies are available. The results are often also surprising, both because of their unlikeliness (terms that are not commonly used may appear at the top of the ranked list) and because of the subjective surprise they trigger (as noted by observing the experts using our system). Our weakest claim concerns the value of the system: so far the developed approach has not produced any scientific breakthroughs, but we have already observed that it triggered novel insights for the expert who tested early versions of our system. Therefore we conclude that, using Boden's definition, the level of our system's creativity is limited by the value of its results, and only reduced exploration time and a growing number of users will show how valuable the system and its results really are. Considering computational creativity, Wiggins (2006) proposes the following definition, which he states is commonly adopted by the AI community: computational creativity refers to "performance of tasks (by a computer) which, if performed by a human, would be deemed creative." We argue that, although the ranking problem we solve is not something people usually do, our system can be considered creative according to this definition. Take the analogy of online search engines, whose task is finding documents and ranking the search results; we believe that, if such rankings were performed by a human, this could be considered a very creative process. The final results of our methodology, the insights which might arise from using our system, could also be considered scientifically creative, where the ultimate creative act is performed by the experts using the system and not by the system alone. We designed the methodology to make the expert more productive when generating such creative ideas. Therefore, we argue that this added effectiveness of the expert's creative process originates from the system and its underlying methodology. Hence we believe our system possesses some elements of computational creativity as proposed by Wiggins.

Bridging Term Detection Methodology
Creative thinking requires focusing on problems from new perspectives. In this paper we follow Koestler (1964), who investigated bisociation as the basis for human creativity, with the goal of developing a computational system with the ability to bridge different domains. Such relations between distinct domains can be revealed through bridging concepts (bridging terms, referred to as b-terms in this paper). Since this may lead to the generation of many possible ideas, both the innovative generation of hypotheses and support for the facilitated exploration of alternatives are needed for creative cross-domain knowledge discovery. Based on this assumption, we have developed and experimented with different heuristics for finding bridging terms in the context of closed knowledge discovery from two domains of expertise. The intuition behind this research is that, with appropriate heuristics for term evaluation and ranking, the user needs to inspect only the top-ranked terms, with a high probability of finding observations that may lead to the discovery of new bridges between the literatures of the two domains.
In summary, our research aim is to find cross-domain links by exploring the bridging terms in the intersection of two literatures, terms that establish previously unknown links between literature A and literature C. In more detail, our method of b-term discovery is performed as follows.
1. Perform text preprocessing to encode the input texts into the standard bag-of-words (BoW) representation. As in standard text preprocessing for text mining, this is performed through a number of steps (a minimal code sketch of this pipeline is given below): a. text tokenization (where a continuous character sequence is split into meaningful sub-tokens, i.e., individual words or terms); b. stop-word removal (removing predefined words of a language that usually carry no relevant information: and, or, a, an, the, ...); c. stemming or lemmatization (the process that converts each word/token into its morphologically neutral form); d. n-gram construction (n-grams are terms defined as a concatenation of 1 to n words which appear consecutively in the text); e. bag-of-words (BoW) representation, i.e., a vector representation of a document, with value 1 (or a word frequency-based weight) for words/terms appearing in the document, and value 0 for the rest of the corpus vocabulary.
2. Calculate the values of heuristics which favor b-terms over other terms.
3. Sort the intersecting terms according to the values of the best performing heuristics and present the top-ranked terms (hopefully representing the b-terms) to the expert during interactive exploration of the two domains.
The development of the best performing heuristics consisted of two phases:
1. Training: we proposed over 40 elementary heuristics, varying from very simple term-frequency statistics to very elaborate combined measures. We then evaluated their quality on the migraine-magnesium gold standard domain pair already investigated by Swanson (1988). The results of the evaluation were used to select some of the best performing and most complementary heuristics, which were joined into a new ensemble heuristic. The ensemble heuristic proposed in this paper is generally more accurate and robust than any of the elementary heuristics used in its construction.
2. Testing: we independently evaluated the ensemble heuristic on a second dataset, the autism-calcineurin documents investigated by Petrič et al. (2009), to confirm its domain independence and its potential for b-term identification. Note that, due to space restrictions, the description of testing the system on the autism-calcineurin domain pair is out of the scope of this paper; more information is provided in (Juršič et al. 2012).
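As an illustration of step 1, here is a minimal Python sketch of the preprocessing pipeline; it is our simplification, not the authors' implementation: the stop-word list is purely illustrative, stemming/lemmatization (step c) is omitted, and n-gram construction is capped at bigrams.

import re
from collections import Counter

STOP_WORDS = {"and", "or", "a", "an", "the", "of", "in", "with"}   # tiny illustrative list

def preprocess(text, n=2):
    """Steps (a), (b) and (d): tokenize, remove stop words, build 1- to n-grams.
    Stemming/lemmatization (step c) is omitted for brevity."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    ngrams = []
    for k in range(1, n + 1):
        ngrams += [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return ngrams

def bag_of_words(text):
    """Step (e): a frequency-weighted BoW vector as a sparse term -> count mapping."""
    return Counter(preprocess(text))

print(bag_of_words("Migraine treatment with calcium channel blockers"))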
Elementary Heuristics for b-term Detection
We have proposed over 40 elementary heuristics for b-term evaluation (Juršič et al. 2012), divided into four groups: frequency based, tf-idf based (tf-idf stands for the Term Frequency-Inverse Document Frequency word weight computation used in text mining; Feldman and Sanger 2007), similarity based, and outlier based heuristics. Most of these heuristics work in a fundamentally similar way: they manipulate the data present in the BoW document vector format to derive a measure of a term's bisociation potential, named the bisociation score. The only exceptions are the outlier based heuristics, which first detect outlier documents and only then use the BoW vector information. Instead of providing the entire list of heuristics whose performance we tested extensively, we specify only the subset which we actually selected to construct the ensemble heuristic. The selected heuristics are defined as follows.

Term to document frequency ratio is a frequency based heuristic defined as freqRatio(t) = countTerm_{D_u}(t) / countDoc_{D_u}(t), i.e., the ratio of the number of occurrences of term t in document set D_u (named term frequency in tf-idf related text preprocessing contexts) to the number of documents of D_u in which term t appears (named document frequency in tf-idf related contexts).

Sum of a term's importance in both domains is a heuristic based on tf-idf metrics, defined as tfidfDomnSum(t) = tfidf_{D_1}(t) + tfidf_{D_2}(t), i.e., the sum of the tf-idf value of term t in the centroid vector of document set D_1 and its tf-idf value in the centroid vector of document set D_2, where the centroid vector is defined as the sum of all document vectors and thus represents an average document of the given document collection.

Sum of term frequencies in three outlier sets is an outlier based heuristic which computes outFreqSum(t) = countTerm_{D_CS}(t) + countTerm_{D_RF}(t) + countTerm_{D_SVM}(t), i.e., the sum of the term's frequencies in three outlier document sets, where the sets of outliers were identified by three classifiers (Sluban et al. 2012): the Centroid Similarity (CS) classifier, the Random Forest (RF) classifier, and the Support Vector Machine (SVM) classifier.

Relative frequency in the CS outlier set, outFreqRelCS(t) = countTerm_{D_CS}(t) / countTerm_{D_u}(t), also focuses on outlier sets. Documents in an outlier set frequently embody new information that is often hard to explain in the context of existing knowledge. We concentrate on specific outliers, namely domain outliers, i.e., documents that tend to be more similar to the documents of the other domain than to those of their own domain. The procedure we use to detect outlier documents first builds a classification model for each domain and then classifies all the documents using the trained classifier. The documents that are misclassified are declared outlier documents, since according to the classification model they do not belong to their initial domain. The other two outlier based heuristics, relative frequency in the RF outlier set (outFreqRelRF) and relative frequency in the SVM outlier set (outFreqRelSVM), are defined in the same way as the outFreqRelCS heuristic.

We have also defined a supplementary baseline heuristic, random(t) = randNum(), which serves as a baseline for the others, as it returns a random number from the interval (0, 1) regardless of the term under investigation.
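To make the frequency based and tf-idf based definitions concrete, the sketch below computes freqRatio and tfidfDomnSum over documents given as token lists. It is our simplification: the idf component is computed over the union of both domains and the centroid tf is the summed raw count, assumptions that may differ from the exact weighting used in CrossBee.

import math

def freq_ratio(term, docs):
    """freqRatio(t): occurrences of t in the document set divided by the
    number of documents containing t (term frequency / document frequency)."""
    count_term = sum(doc.count(term) for doc in docs)    # docs are token lists
    count_doc = sum(1 for doc in docs if term in doc)
    return count_term / count_doc if count_doc else 0.0

def tfidf_domn_sum(term, domain1, domain2):
    """tfidfDomnSum(t): tf-idf of t in the centroid of each domain, summed.
    The centroid tf is the sum of per-document counts (centroid = sum of
    document vectors); the idf base set is an assumption of this sketch."""
    all_docs = domain1 + domain2
    df = sum(1 for doc in all_docs if term in doc)
    idf = math.log(len(all_docs) / df) if df else 0.0
    tf1 = sum(doc.count(term) for doc in domain1)
    tf2 = sum(doc.count(term) for doc in domain2)
    return (tf1 + tf2) * idf      # equals tf1*idf + tf2*idf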
Evaluation of Elementary Heuristics
To test the proposed heuristics for b-term detection, we evaluated them on the problem of detecting bisociative links between migraine and magnesium in the respective literatures. To this end, we replicated Swanson's early migraine-magnesium experiment, which represents a gold standard for literature-based discovery. The evaluation procedure used in this experiment differs from Swanson's original method and from the RaJoLink method in that no human expert was involved. Magnesium deficiency has been shown in several studies to cause migraine headaches (e.g., Swanson 1990; Thomas et al. 1992; Thomas et al. 2000; Demirkaya et al. 2001; Trauninger et al. 2002). In the literature-based closed discovery process, Swanson managed to find more than 60 pairs of articles connecting the migraine domain with magnesium deficiency via several bridging concepts. His closer inspection of the literature about migraine and the literature about magnesium showed that 11 pairs of documents, when put together, provided confirmation of the hypothesis that magnesium deficiency may cause migraine headaches (Swanson 1990). Some of the detected bridging terms are shown in Figure 1.

Similarly to Swanson's original study of the migraine literature (Swanson 1988), we used article titles as input to our closed discovery process. We performed the experiments on a subset of PubMed titles of articles that were published before 1988 (i.e., before Swanson's literature-based discovery of the migraine-magnesium relation) and were retrieved with the PubMed search engine. As a result we obtained 2,425 migraine titles and 5,633 magnesium titles of PubMed articles. These article titles were preprocessed with standard text mining techniques, resulting in 13,525 distinct terms, which were analyzed and scored by the presented elementary heuristics. Each heuristic assigned a score to every term, and we then sorted all 7 lists (6 elementary heuristics and the baseline heuristic), thus creating 7 ranked lists of terms. Among these 13,525 terms were all 43 terms which Swanson (1988) marked as b-terms and which we hoped to propagate to the top of the ranked lists using the designed heuristics methodology. The b-terms identified by Swanson, verified by the expert to provide new discoveries in the field, are used as the gold standard in the evaluation in this work. We compared the heuristics using ROC (Receiver Operating Characteristic) curves and AUC (Area Under ROC) analysis. The idea underlying ROC curve construction is the following: go through the ranked list from the beginning and, every time a b-term is seen, draw a line up on the ROC canvas; otherwise draw a line to the right. The ideal curve (when all b-terms are at the very beginning of a ranked list) goes straight up to the top, followed by a straight section to the rightmost part of the graph; the area under the ideal ROC curve is equal to 1 when both scales are normalized. The ROC analysis (see Figure 2) shows the performance of the elementary heuristics on the migraine-magnesium gold standard dataset. Details of the heuristics evaluation can be found in (Juršič et al. 2012), while the main observations and results are outlined below. It can be observed that some heuristics are very well suited to b-term discovery. We are especially satisfied with heuristics which perform well at the start of the ranked list; e.g., the heuristic outFreqRelRF places four b-terms among the first 50 terms of its ranked list, while the random approach finds less than one b-term among its first 200 terms. On the other hand, some heuristics do not perform so well at the start of the list; e.g., outFreqSum and tfidfDomnSum do not look promising at first sight. However, we included them in the set of six heuristics on the basis of complementarity, so that they fit together well when used in the ensemble heuristic, providing not only better performance but also greater robustness of the ensemble.

Figure 2. ROC curves representing the performance of the elementary heuristics on the learning (migraine-magnesium) dataset. [AUC values from the legend: outFreqRelRF 95.24%, outFreqRelSVM 95.06%, outFreqSum 94.96%, outFreqRelCS 94.96%, tfidfDomnSum 93.85%, freqRatio 93.35%, random 50%.]
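The ROC construction just described translates directly into code. The sketch below is ours; it assumes ranked_terms is a complete ranked list of all terms and b_terms is the gold-standard set, and it normalizes the area so that an ideal ranking scores 1.

def roc_points(ranked_terms, b_terms):
    """Walk the ranked list: a b-term moves the curve up, any other term right."""
    x, y, points = 0, 0, [(0, 0)]
    for term in ranked_terms:
        if term in b_terms:
            y += 1
        else:
            x += 1
        points.append((x, y))
    return points

def auc(ranked_terms, b_terms):
    """Normalized area under the step curve (1.0 when all b-terms come first)."""
    points = roc_points(ranked_terms, b_terms)
    n_neg, n_pos = points[-1]
    area = sum((x2 - x1) * y1 for (x1, y1), (x2, _) in zip(points, points[1:]))
    return area / ((n_pos or 1) * (n_neg or 1))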
The Ensemble Heuristic
The ensemble heuristic combines the results of the selected elementary heuristics (outFreqRelRF, outFreqRelSVM, outFreqRelCS, outFreqSum, tfidfDomnSum, and freqRatio) into an aggregated result. In principle, the ensemble heuristic score is the sum of two parts, the ensemble voting score and the ensemble position score, and is computed as: s_t = s_t^vote + s_t^pos.
1. The ensemble voting score (s_t^vote) of a term t is based on the number of times the term appears in the first third of the elementary heuristics' ranked lists. Each selected base heuristic h_i gives one vote (s_{t_j,h_i}^vote = 1) to each term which is in the first third of its ranked list of terms, and zero votes (s_{t_j,h_i}^vote = 0) to all other terms. Formally, the ensemble voting score of a term t_j that is at position p_j in a ranked list of n terms is computed as the sum of the individual heuristics' voting scores: s_{t_j}^vote = Σ_{i=1}^{k} s_{t_j,h_i}^vote, where s_{t_j,h_i}^vote = 1 if p_j < n/3 in the list of h_i, and 0 otherwise. Therefore, each term can obtain a score s_{t_j}^vote ∈ {0, 1, 2, ..., k}, where k is the number of base heuristics used in the ensemble.
2. The ensemble position score (s_t^pos) of a term t is based on the average position of the term in the elementary heuristics' ranked lists. For each heuristic h_i, the term's position score s_{t_j,h_i}^pos is calculated as (n - p_j)/n, which results in position scores lying in the interval [0, 1). For an ensemble of k heuristics, the ensemble position score is computed as the average of the individual heuristics' position scores: s_{t_j}^pos = (1/k) Σ_{i=1}^{k} s_{t_j,h_i}^pos = (1/k) Σ_{i=1}^{k} (n - p_j)/n, where p_j denotes the position of t_j in the ranked list of the heuristic h_i under consideration.
Using the migraine-magnesium domain pair, we experimentally confirmed, through the ROC curve evaluation of the different heuristics in terms of the quality of b-term retrieval, that the ensemble heuristic is the best measure for b-term detection and is able to retrieve b-terms approximately 7 times faster than the random approach. Besides the testing on the migraine-magnesium dataset, we also evaluated the ensemble heuristic on an independent autism-calcineurin dataset (Petrič et al. 2009) and confirmed the utility and domain independence of the proposed approach.
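A compact sketch of the two-part ensemble score, mirroring the formulas above (our code; positions are 1-based, the strict '<' in the voting rule follows the text, and every list is assumed to rank the same n terms):

def ensemble_scores(rankings):
    """rankings: k ranked term lists (best first), one per elementary heuristic.
    Returns a mapping term -> s_t = s_t_vote + s_t_pos."""
    k = len(rankings)
    n = len(rankings[0])
    votes, positions = {}, {}
    for ranked in rankings:
        for p, term in enumerate(ranked, start=1):           # p = 1-based position
            votes[term] = votes.get(term, 0) + (1 if p < n / 3 else 0)
            positions[term] = positions.get(term, 0.0) + (n - p) / n
    return {t: votes[t] + positions[t] / k for t in votes}   # average the position part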
The CrossBee System
This section presents our system, which helps experts search for hidden links that connect two seemingly unrelated domains. We designed and implemented an online system named CrossBee (Cross-Context Bisociation Explorer), available at http://crossbee.ijs.si/. The system was first designed as an online implementation of the ensemble ranking methodology. To this core functionality we have, however, added other functionalities and content presentations which effectively turn CrossBee into a user-friendly tool for the ranking and exploration of bisociative terms that have the potential for cross-context link discovery. This enables the user not only to spot but also to efficiently investigate terms that represent potential cross-domain links. Below we describe a typical use case and the extended system functionality.

A Typical CrossBee Use Case
The most standard use case is the following. The user starts at the system's home page by inputting two sets of documents of interest and by tuning the parameters of the system. The minimal required user input at this point is a file with the documents from the two domains. The prescribed format of the input file is kept simple, to enable all users, regardless of their computing skills, to prepare the files. Each line of the file contains exactly three tab-separated entries: (a) a document identification number, (b) a domain acronym, and (c) the document text (for instance, a line might consist of the identifier 42, the acronym MIG and the title Migraine treatment with calcium channel blockers, separated by tabs). The other options available to the user include specifying the exact preprocessing options, specifying the base heuristics to be used in the ensemble, specifying outlier documents identified by external outlier detection software, defining already known b-terms, and others. When the user has selected all the desired options, he proceeds to the next step. CrossBee then starts a computationally very intensive step in which it prepares all the data needed for the fast subsequent exploration phase. During this step the actual text preprocessing, base heuristics, ensemble, bisociation scores and rankings are computed in the way presented in the previous section. This step does not require any user intervention. After the computation, the user is presented with a ranked list of b-term candidates. The list provides the user with additional information, including the ensemble's individual base heuristic votes and the term's occurrence statistics in both domains. The user then browses through the list and chooses the term(s) he believes to be promising for finding meaningful connections between the two domains. At this point, the user can start inspecting the actual appearances of the selected term in both domains, using the efficient side-by-side document inspection shown in Figure 3. In this way, he can verify whether his rationale for selecting this term as a bridging term can be justified based on the contents of the inspected documents. The most important result of the exploration procedure is evidence that a chosen term is an actual bridge between the two domains, based on supporting facts from the documents. As experienced in sessions with the experts, the identified documents are an important result as well, as they usually turn out to be a valuable source of information providing deeper insight into the discovered terms which indicate new cross-domain relations.

Extended CrossBee Functionality
Below we list the implemented functionalities of the CrossBee system.
- Document focused exploration empowers the user to filter and order the documents by various criteria. The user may find it more pleasing to start exploring the domains by reading documents rather than by browsing through the term lists. The ensemble ranking can be used to propose to the user which documents to read, by suggesting those with the highest proportion of highly ranked terms.
- Detailed document view provides a more detailed presentation of a single document, including various term statistics and a similarity graph showing the similarity between this document and the other documents in the dataset.
- Methodology performance analysis supports the evaluation of the methodology by providing various data which can be used to measure the quality of the results, e.g., data for plotting the ROC curves.
- High-ranked term emphasis marks the terms according to their bisociation score calculated by the ensemble heuristic. When this feature is used, all high-ranked terms are emphasized throughout the whole application, making them easier to spot.
- b-term emphasis marks the terms defined as b-terms by the user.
- Domain separation is a simple but effective option which colors all the documents from the same domain with the same color, making an obvious distinction between the documents from the two domains.
- UI customization enables the user to decrease or increase the intensity of the following features: high-ranked term emphasis, b-term emphasis and domain separation. In cooperation with the experts, we discovered that some of them like the emphasizing features while others do not. We therefore introduced UI customization, whereby everybody can set the intensity of these features according to their own preferences.

Figure 3. Illustration of CrossBee's side-by-side inspection of documents containing potential cross-domain links, using an example from the migraine-magnesium dataset analysis and focusing on the analysis of the term paroxysmal.

Discussion and Further Work
Current literature-based approaches depend strictly on simple, associative information search. Commonly, a literature-based association is computed using measures of similarity or co-occurrence. Because of their ‘hard-wired' underlying criteria of co-occurrence or similarity, association-based methods often fail to discover relevant information which is not related in obvious associative ways. Information related across separate domains is especially hard to identify with conventional associative approaches. In such cases the domain-crossing connections, called bisociations (Berthold 2012), can help generate creative and innovative discoveries. Previous research by Swanson (1986), Weeber et al. (2001), Petrič et al. (2009) and several other authors has investigated means of finding novel, interesting connections between disparate research findings which can be extracted from the published literature. They have shown that the analysis of implicit cross-context associations hidden in scientific literature can guide hypothesis formulation and lead to the discovery of new knowledge. The methodology presented in this paper has the potential to improve computational creativity by supporting the expert in the task of cross-domain literature mining. The main novelty is an approach to ensemble-based bridging term ranking. The creative act of finding bridging terms is supported by the user-friendly CrossBee system for literature mining, implementing closed cross-domain link discovery. It has the potential to identify bridging concepts in the intersection of different domain literatures, as confirmed by the experiments in mining the literature on migraine and magnesium. In further work we will apply the CrossBee system to new domain pairs, focusing on the system's potential to lead to new scientific discoveries. In addition to linking to PubMed, we will also explore ways of connecting CrossBee to other document sources, including its connection to keyword search over documents on the web. Moreover, it would be interesting to explore the potential of CrossBee in media research, as well as in linguistics, where metaphors could potentially be discovered by cross-context text mining. One of the priorities of our work will, however, be to use CrossBee in collaboration with experts from different fields (e.g. physicists and biologists) to address real-life domain problems and to obtain valuable feedback from these targeted users.

2012_6 !2012 A closer look at creativity as search
Graeme Ritchie, Computing Science, University of Aberdeen, Aberdeen AB24 3UE, g.ritchie@abdn.ac.uk

Abstract
Several papers by Wiggins (building on ideas by Boden) have outlined a view of creative concept generation as a very general search process, but that formalisation has not been developed much in the past few years.
Also, there are some aspects where clarification or a spelling out of details would be useful. We present a re-formulation of the central ideas in Wiggins's framework, with slightly more rigorous statements of the definitions and a number of minor extensions. We also explain how this framework relates to some hitherto completely separate proposals by Ritchie.

Introduction
In recent years, there have been various formalisations of aspects of the computational creative process ((Pease, Winterstein, and Colton 2001), (Colton, Pease, and Ritchie 2001), (García et al. 2006), (Thornton 2007), (Colton, Charnley, and Pease 2011)). Hence there is a consensus among at least some established researchers that it is methodologically beneficial to have fully precise, detailed and formal accounts of any mechanisms being considered as ‘creative'. A prominent example is Wiggins's creative systems framework (CSF), presented in a number of papers (particularly Wiggins (2006a; 2006b)). That framework emphasises the notion of search as the central mechanism for simulating creativity, and outlines how a metalevel search could represent some phenomena sometimes discussed as ‘transformational' creativity. Although these ideas are very helpful in clarifying the nature of creative computation, the published versions of the CSF are at best a preliminary sketch: some details are unspecified, some natural extensions are undeveloped, and there are some formal errors or infelicities. The current paper starts from the central ideas of the CSF, but redefines the formal mechanisms in a way which leaves fewer gaps, aspires to have fewer formal inconsistencies, and includes the description of more aspects of computationally creative processes. The central motivation for this is that, if we subscribe to a belief in the benefits of formal models (as noted above), then these models should not be left undeveloped, but should continue to be maintained, repaired, and extended as necessary. It is important to realise that the underlying intuitive ideas (creation as the exploration of a ‘conceptual space', and possible ‘transformation' of that space) have been set out in numerous articles by Boden, with many illustrative examples from human creativity. Wiggins's contribution was to take those informal, broad-brush ideas and outline a formal framework which both captured the core notions and made sense computationally. The reader is referred to publications by Boden, Wiggins, and many others for more about the intuitive motivation; our aim here is to refine and extend Wiggins's proposals.

A summary of Wiggins's CSF
Although this paper is centrally concerned with formalisation, we start with a very brief informal overview of the ideas in the existing version of the CSF. The framework posits a universal set of all concepts, a term which covers both abstract ideas (e.g. a mathematical theorem, a design for a better political system) and concrete artefacts (e.g. a painting, a poem). Within this wide-ranging set, there are particular types of idea/artefact (e.g. stories, paintings, poems), and what counts as a recognisable example of a story/painting/poem/etc. may be highly dependent on socio-cultural norms. For many such creative genres, it is not realistic for there to be a firm definition of acceptability, as the specific concept may be vague in the sense of van Deemter (2010). That is, the extent to which a text is or is not a well-formed story (or other artistic category) is a matter of degree, rather than an all-or-nothing decision.
Hence, within the CSF, the criterion for acceptability/recognisability is represented as a rating between 0 and 1; in effect, the set of examples of a genre is treated as a fuzzy set. As well as the question of whether something falls within the definition of some artistic genre, there is the separate question of whether it is a good (high quality) instance (e.g. a profound poem, a beautiful painting). This is similarly a ‘vague' notion, and again is represented within the CSF as a score between 0 and 1. This means that an artefact can be an acceptable instance (e.g. recognisably a story) without being of high quality (e.g. it may be a poor story); hence the need for two different ratings (mappings from concepts to values). The inspiration for the CSF was the work of Boden (1998; 2003), in which creativity was described as occurring within a conceptual space, which could be explored or, in more radical creativity, transformed into a new space. This view has some resemblance to the traditional ideas of search within AI (Nilsson 1971), and Wiggins set out both to clarify the relationship between conventional AI search and creative computation, and to provide a formal framework for describing the latter. The idea is that a creative system starts from some set of concepts, and by a series of steps creates further concepts one after another, thus ‘exploring the space'. Although the term ‘creative' has connotations of ‘construction', the CSF, following the practice in formalising AI search, regards ‘new' concepts as not so much ‘constructed' as ‘reached'. That is, notionally all the possible concepts are elements of some universal set, but the creative system computes a route through that set to particular concepts, and those which have been thus reached represent ‘discoveries' or ‘inventions'. The exploration of the space of concepts (the search method) is modelled by an operator which acts on a sequence of concepts (a list of the concepts the system has already processed) and yields as its result a new (presumably longer) list of concepts, which can then be processed in turn in the next cycle of search. Sequences are used because the search method, like an agenda-based AI search system, has to maintain some record of where it has reached within the space of available concepts, and of what to work on next. Wiggins points out that by separating the acceptability rating from the search method, we can describe the situation where different composers, each with a personal way of finding new artefacts (different search procedures), are working within the same style of music (a shared notion of acceptability). A creative agent should be able to recognise something as a recognisable artefact, or a high quality artefact, without necessarily having a search method that would allow the agent to reach (create) that artefact. More concretely, this is an intuitively plausible account of certain potentially creative programs. The MCGONAGALL poetry generator (Manurung, Ritchie, and Thompson 2012) uses an explicit search mechanism (a genetic algorithm) to find suitable candidate texts. Each text must at least be syntactically well-formed according to the system's linguistic grammar; this could be regarded as the acceptability mapping. During the search, items are scored for rhythmic suitability and proximity to an initial semantic message; this would be the quality mapping. At each stage, the system keeps an ordered list of the current candidates, from which each cycle in the search starts.
A small formal detail is that the CSF search operator takes as arguments the two fuzzy criteria (for acceptability and quality), and from these computes a way of going from an existing concept-sequence to a new concept-sequence. This means that the search method can be sensitive to these two criteria if necessary, and that we can describe two systems as having the same search strategy relative to their own different definitions of validity and quality. For Wiggins, the mappings (the two fuzzy sets and the search mechanism) are to be represented as expressions in some symbolic language, translatable into actual mappings. Hence, in the CSF, an exploratory creative system consists of the following seven components: (i) the universal set of concepts; (ii) the language for expressing the relevant mappings; (iii) a symbolic representation of the acceptability mapping; (iv) a symbolic representation of the quality mapping; (v) a symbolic representation of the search mechanism; (vi) an interpreter for expressions like (iii) and (iv); (vii) an interpreter for expressions like (v). That constitutes the object level of the creative system, which searches through concepts in the domain (e.g. melodies). Wiggins also proposes that there can be a metalevel, which searches through possible object-level systems to find an interestingly different ‘conceptual space', thus modelling Boden's idea of ‘transforming' the space. The metalevel in the CSF is structured in the same way as the object level (i.e. the seven parts set out above), except that its set for exploration (i.e. its universal set) contains expressions in the symbolic language used at the object level. In this way, the metalevel searches through expressions describing object-level systems, assessing these descriptions for acceptability and for quality (using the metalevel's criteria for these two measures). Relative to the published accounts of the CSF, the revisions or extensions made here are:
• The symbolic language for expressing the various mappings is given a much less central role.
• The way in which the metalevel defines the object level is explicitly stated. In particular, the notion of ‘transformation' of an (object-level) space is defined.
• Some minor inconsistencies in definitions are removed.
• We outline how search methods within the CSF can be compared at the metalevel.
• The CSF is related to a proposal for the formal assessment of creative systems (Ritchie 2001; 2007).

The object level
The structure of an object level system
Wiggins posits the existence of one universal set, U, the set of all concepts, but then defines a creative system as a tuple, one component of which is the universal set. If the set is truly universal across all systems, it should not need to be mentioned as a defining component of a specific system. On the other hand, it would be useful to allow different creative systems to consider only specific subsets of this hugely general set. The compromise here is to accept the existence of the wholly universal set, but for the definition of each creative system to specify a subset of this universe; this could, in principle, be a non-proper subset. We will use P (mnemonic for ‘possibilities') for these subsets in our definitions below. The idea is that U is universal enough to contain concepts for every type of artefact that might ever be conceived: it includes poems, stories, sculptures, jokes, paintings, theorems, architectural plans, designs for food mixers, etc.
On the other hand, P represents the whole range of concepts within some narrower sphere, such as two-dimensional arrays of coloured pixels (which could act as a ‘universal' set for the creation of visual art), or finite sequences of words and punctuation (a possible ‘universal' set for various textual artistic forms).

Notation: For any sets A, B, B^A denotes the set of mappings from A to B. In particular, for any set X, [0,1]^X denotes the set of mappings from X to values between 0 and 1 inclusive. Since a fuzzy set is defined by a mapping from possible elements to values between 0 and 1, our fuzzy sets of ‘acceptable' elements and of ‘valuable' elements will be stated in this way, that is, as members of [0,1]^P. We also take tuples(X) to denote the set of finite tuples (of any length) of elements of the set X.

Definition 1: An exploratory creative system comprises: (i) a subset P of U (the possible concepts within this type or genre); (ii) N ∈ [0,1]^P, the acceptability mapping (mnemonically, this describes norms); (iii) V ∈ [0,1]^P, the value mapping (mnemonically, this describes value); (iv) a mapping Q (the exploration scheme) from [0,1]^P × [0,1]^P to the set of mappings from tuples(P) to tuples(P) (mnemonically, this describes a quest for creations; ‘s' for ‘search' is used elsewhere).

The four components in our definition are direct counterparts of those in Wiggins's CSF, but we have chosen different symbols for the components of a system, to avoid confusion; the relationships to Wiggins's notation are: N ≅ [[R]], V ≅ [[E]], Q(N, V) ≅ ⟨⟨R, T, E⟩⟩. The intuitive meanings of the components are the same: P is the set of possible concepts (e.g. arrays of pixels, sequences of words), N defines the fuzzy set of acceptable instances of whatever domain/genre is being explored, V indicates how ‘good' an instance is, and Q defines how to explore the space. The component Q, the search method, may need some explanation. Directly following Wiggins's proposals, Q is applied to a particular N and V, and from these produces a mapping which takes sequences of concepts into sequences of concepts; hence N and V could in principle influence Q, or could be ignored. It might seem odd to describe Q as taking these two parameters when the only possible values for the parameters seem to be fixed as N and V (why not just ‘compile in' these two values, as they are specified in the same 4-tuple package as Q?). At present, this level of parameterisation has no real advantage, but it leaves open the possibility, as the framework is elaborated, of considering a ‘transformed' version of an exploratory creative system in which Q is unchanged but one or both of N, V are altered, with automatic consequences for the operation of Q. As in the original Wiggins formulation, Q(N, V) maps from sequences of concepts to sequences of concepts, providing an agenda-like exploration of the set of possibilities, with the sequence representing the current search state. As noted earlier, the CSF includes a symbolic language in which mappings are expressed as rules, which are then interpreted into mappings. Here, we abstract away from the use of a language, and define a creative system using mappings. The advantage of this is that it states the essential relations within a creative system without regard for representational issues. In a later section, we show how the symbolic language can be incorporated, directly reflecting the Wiggins approach.
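Definition 1 translates almost directly into typed code. The following sketch is ours, not part of the CSF papers: concepts are left as strings, and the mappings N, V and the exploration scheme Q are represented as plain functions.

from dataclasses import dataclass
from typing import Callable, Tuple

Concept = str                               # stand-in for elements of P
Rating = Callable[[Concept], float]         # members of [0,1]^P
Agenda = Tuple[Concept, ...]                # elements of tuples(P)

@dataclass
class ExploratoryCreativeSystem:
    P: frozenset                            # possible concepts, a subset of U
    N: Rating                               # acceptability (norms) mapping
    V: Rating                               # value mapping
    Q: Callable[[Rating, Rating], Callable[[Agenda], Agenda]]   # exploration scheme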
Characterising the conceptual space
As noted above, the basic definition of an exploratory creative system contains a fuzzy set, N, of concepts, which are, intuitively, those concepts which conform to the current norms of the domain. In Wiggins's formulation, this fuzzy set is not regarded as modelling Boden's ‘conceptual space'. Instead, Wiggins stipulates that conceptual spaces are ordinary (not fuzzy) subsets of the universal set U, and that the conceptual space C (of a given creative system) consists of all concepts mapped by his [[R]] (our counterpart is N) to values greater than or equal to 0.5. Similarly, Wiggins defines the valued set as those concepts which his [[E]] (our equivalent is V) maps to values greater than or equal to 0.5. That is, although the CSF allows both these sets to have graded membership (0 to 1), Wiggins immediately simplifies them to non-graded sets by imposing a threshold. Within our formulation, the counterpart to Wiggins's non-fuzzy definitions would be as follows.

Definition 2: Given an exploratory creative system S = (P, N, V, Q), we define, for any α ∈ [0,1] and X ⊆ P: (i) N_α(X) = {c ∈ X | N(c) ≥ α} (the set of concepts which reach the threshold α in their ‘normality'); (ii) V_α(X) = {c ∈ X | V(c) ≥ α} (the set of concepts which reach the threshold α in their ‘quality'); (iii) the fuzzy conceptual space of S is N (this just confirms the status of N as outlined earlier); (iv) the conceptual space of S is N_0.5(P) (this is for backwards compatibility with Wiggins's 0.5 threshold); (v) the fuzzy valued set of S is V (this just confirms the status of V as outlined earlier); (vi) the valued set of S is V_0.5(P) (this is for backwards compatibility with Wiggins's 0.5 threshold).

Searching
In the CSF, the searching process begins from an initial set of concepts. This is to allow for the situation where the creative system starts from some given concept set, representing the status quo. It is also useful later, when considering metalevel search. In Wiggins's definitions, exploration always starts from the single, totally unspecified concept, ⊤, representing a situation in which the system has no known concepts already. Here, we generalise this slightly. If the sequences on which Q operates are to be like an agenda in conventional AI search, then those sequences should contain only items which have been produced by previous steps of the search (i.e. applications of Q). This means that the initial agenda has to include every item (concept) which could ever participate in discoveries but is not produced by an application of Q. Wiggins defines ‘reachable' concepts using indefinitely many applications of the search operator. There is a minor slip in his definition, in that repeated applications of ⟨⟨R, T, E⟩⟩ (which corresponds to our Q) will compute sequences (tuples) of concepts, not individual concepts. This is easily remedied (and we can add an intermediate version, for limited search). First we need a minor definition of all the items appearing within a set of tuples.

Definition 3: Given any set of tuples A, we define elements(A) ≡ {x | ∃⟨y_1, ..., y_n⟩ ∈ A, ∃i, 1 ≤ i ≤ n : x = y_i}.
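Definitions 2 and 3 are directly computable; here is a minimal sketch (ours), assuming ratings are ordinary functions returning floats.

def elements(tuples_set):
    """Definition 3: all items occurring in any tuple of the given set."""
    return {x for tup in tuples_set for x in tup}

def threshold_set(rating, alpha, X):
    """Definitions 2(i)/(ii): the concepts in X whose rating reaches alpha."""
    return {c for c in X if rating(c) >= alpha}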
The Wiggins formalisation defines all the concepts which can be reached with any amount of search, i.e. without limit. Although this case is of theoretical interest, in practice any system will search only to some finite depth, and so we also define the notion of ‘reachable in a fixed number of steps', relying on the fact that a single application of the search mapping Q corresponds to one step in the search process.

Definition 4: Given an exploratory creative system (P, N, V, Q) and a set B ⊂ P of concepts (B is the starting set of concepts for the search): (i) the set reachable from B in m steps is ⋃_{n=0}^{m} elements(Q(N,V)^n(B)) (i.e. any number of repeated applications of Q, up to m; this describes search up to some depth); (ii) the set reachable from B is ⋃_{n=0}^{∞} elements(Q(N,V)^n(B)) (i.e. any number of repeated applications of Q; this allows any depth of search); (iii) the set of reachable concepts is the set reachable from {⊤} (this matches Wiggins's notion, where all search starts from a single unspecified concept); (iv) the set of concepts reachable in m steps is the set reachable from {⊤} in m steps (this is a bounded-search variant of Wiggins's ‘start from nothing' definition).

In considering the behaviour of a creative system, it is important to know which parts of its final output (i.e. creations) were provided to it initially and which were computed by the system itself. We can define these thus.

Definition 5: Given an exploratory creative system (P, N, V, Q), a subset B of P, and a set of concepts K reachable from B, the discoveries in K are the set of concepts in K \ B.

Wiggins defines the set of valued concepts as being those concepts, reachable from the undefined concept ⊤, which exceed a particular threshold value (0.5) when his ‘value' mapping (our V) is applied. That definition can be restated in the terminology used here.

Definition 6: Given an exploratory creative system (P, N, V, Q), where RC is the set of reachable concepts, and a value α ∈ [0,1]: (i) the α-valued set of reachable concepts is V_α(RC); (ii) the valued set of reachable concepts is the 0.5-valued set of reachable concepts (i.e. V_0.5(RC)); this mirrors Wiggins's definition.
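Definition 4(i) can be sketched as bounded iteration of the agenda mapping. The code below is ours and reuses the ExploratoryCreativeSystem sketch above; it treats the starting set as an initial agenda tuple, a detail the definitions leave implicit.

def reachable(system, start, m):
    """Definition 4(i): the concepts reachable from the starting agenda in up
    to m steps, i.e. the union of elements(Q(N,V)^n(start)) for n = 0..m."""
    step = system.Q(system.N, system.V)     # the concrete agenda-to-agenda mapping
    agenda, seen = tuple(start), set(start)
    for _ in range(m):
        agenda = step(agenda)
        seen |= set(agenda)
    return seen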
The metalevel
Structure of the metalevel
For the metalevel, the first matter to be clarified concerns the set of items used for exploration. In the Wiggins papers, this is the set of possible expressions in a symbolic language L. An expression in L is defined earlier as defining a rule-set representing either R (acceptability rules), E (value rules) or T (search rules), with different interpreters applying depending on which of these is intended. If the metalevel is considering single L-expressions, how does one such expression represent an entire object-level system, which contains all three of R, E, and T? Within the language-based formalisation, a possible response would be to say that L must contain notation which allows one expression to represent three rule sets. If following this path, Wiggins's definitions of the language interpreters would also have to be patched. As we are separating out the language aspect, we have a different solution. Here, the object-level space is defined by the mappings N, V, Q, so it seems reasonable to have the metalevel searching through triples (N, V, Q), whose components are possible values for N, V, Q respectively (and hence are elements of the appropriate sets, such as [0,1]^P). The exploration set at the metalevel will be the set of such triples; for brevity, we will call this set of triples ECS(P) (as it is the set of possible exploratory creative systems for P). Given a subset P of U, a metalevel creative system for P is an exploratory creative system made up of: (i) ECS(P) (i.e. this is the metalevel's set to explore); (ii) an element N_meta of [0,1]^ECS(P); this rates potential object-level systems as to how ‘normal' they are, thus providing a (fuzzy) set of ‘acceptable' triples (N, V, Q); (iii) an element V_meta of [0,1]^ECS(P); this rates potential object-level systems as to their ‘quality', thus providing a (fuzzy) set of ‘valuable' triples (N, V, Q); (iv) a mapping Q_meta from [0,1]^ECS(P) × [0,1]^ECS(P) to the set of mappings from tuples(ECS(P)) to tuples(ECS(P)); this is structured like the search device Q in an object-level creative system, but operates on elements of ECS(P) instead of P. That is, the structure at the metalevel is exactly parallel to the structure at the object level, as in the Wiggins version. A metalevel has information about what an object-level creative system should look like (N_meta) and what would count as a ‘good' object-level system (V_meta). It also contains a way of searching through potential object-level systems (Q_meta).

Characterising an object level system
Given a definition of the components of a metalevel, it is essential to then define exactly how the parts of the metalevel characterise an object-level system. This is not discussed in detail by Wiggins, but he indicates that the metalevel is to operate (in terms of search, etc.) exactly as an object-level creative system. An object-level creative system will ascribe various characteristics to a set of concepts. Each concept will be: rated (by N) as to how acceptable it is as a member of the conceptual space in question, rated (by V) as to its value/quality, and defined (by Q) as either reachable or not. As noted earlier, Wiggins proposes that the ratings by (his equivalents of) N and V are turned into non-fuzzy sets using a threshold. However, even then the object level does not characterise a single object, or a unique set of systems: it defines three independent sets, via N, V and Q. Since the metalevel has exactly the same structure as an object-level system, the items which it explores (for Wiggins, expressions in a symbolic language L) are presumably similarly allocated to three sets: the recognisable, the valued and the reachable (and for Wiggins, reachability is always relative to ⊤, not some specified starting set of items). Hence, the metalevel assigns (potential) object-level systems to these three categories. What the metalevel does not do is characterise a single object-level system, or even a unique set of systems. This means that we do not, from the published papers, have a definition of how one object-level system is a transformation of another, or of how a computation at the metalevel will yield a new object-level system; all that the metalevel provides is this tripartite classification. We will remedy this by defining how a metalevel can define (or transform) a specific object-level system. In the next few definitions, we assume two exploratory creative systems S_obj = (P, N, V, Q) and S′_obj = (P, N′, V′, Q′), and a metalevel system S_meta = (ECS(P), N_meta, V_meta, Q_meta) for P. Where the relation ‘≠' is used here, the two items in question are still allowed to have elements in common.

Definition 7: (i) S′_obj is a revision of S_obj using S_meta if S′_obj is in the set reachable from {S_obj} within S_meta, and S′_obj ≠ S_obj. (ii) For any α ∈ [0,1], S′_obj is α-valued with respect to S_meta if V_meta((N′, V′, Q′)) ≥ α.
(iii) S′_obj is a transformation of S_obj using S_meta if S′_obj is a revision of S_obj using S_meta and also N ≠ N′. Here we have taken a ‘transformation' to be a revision in which the definition of the conceptual space (the acceptable set) changes, as indicated by the condition ‘N ≠ N′'. As this condition could hold even if N and N′ differ on only one element, proponents of transformation as a form of radical change might wish to enhance this definition.

Relationship to the original CSF
To clarify the amendments we have made to the formalisation, we can compare it with the original version in the papers by Wiggins. As mentioned earlier, the original CSF includes a symbolic language in which the components (the counterparts of our N, V, Q) are expressed. We can add this to our framework by defining a symbolic-language version of an exploratory creative system, with appropriate links to the definitions given above. In order to mimic Wiggins's definitions, we first have to clarify certain aspects which are unclear in the published papers. Sometimes the mapping N (or what corresponds to it in Wiggins's framework) is represented as a single expression in a symbolic language, and sometimes it is said to be a set of expressions. Either of these accounts could be made to work, if applied consistently. Here, we have opted for the single-expression version, with the observation that the symbolic language could contain connective symbols such as ‘conjunction', ‘disjunction' or other logical operators, thereby getting the effect of a set of rules in one syntactic expression. Wiggins's version does not clearly separate the definition of the language from the specific language expressions used in a particular creative system. We have tried to draw this distinction in the next two definitions.

Definition 8: Given a set of concepts P, a creative systems language for P is a tuple (A, L_R, L_T, [[.]], ⟨⟨.⟩⟩) where: (i) A is a set of symbols, the alphabet; (ii) L_R and L_T are languages over A (only two are needed, because the language L_R can be used for both the ‘acceptability' rules and the ‘value' rules, since these both describe fuzzy sets of concepts); (iii) [[.]] is a mapping from L_R to [0,1]^P (this is the interpreter which takes an expression in the symbolic language and returns a mapping; that mapping is then a fuzzy set of concepts); (iv) ⟨⟨.⟩⟩ is a mapping from L_R × L_T × L_R to tuples(P)^tuples(P) (this is the interpreter which turns symbolic expressions specifying a search method into an actual mapping to carry out the search).

The above definition (which is closely modelled on Wiggins's proposals) provides the symbolic mechanisms separately from any particular creative system which might use these representations. The next two definitions state how these mechanisms can be used to define a specific system.

Definition 9: Given a set of concepts P and a creative systems language (A, L_R, L_T, [[.]], ⟨⟨.⟩⟩) for P, a symbolically represented exploratory creative system for P consists of a tuple (W_R, W_E, W_T) where: (i) W_R ∈ L_R: the norms or acceptability rules; (ii) W_E ∈ L_R: the rules assigning value to items; (iii) W_T ∈ L_T: the rules which define the search method.

Definition 10: Given a set of concepts P, a creative systems language (A, L_R, L_T, [[.]], ⟨⟨.⟩⟩), and a symbolically represented exploratory creative system SE = (W_R, W_E, W_T), the exploratory creative system associated with SE is the tuple S = (P, N, V, Q) where: (i) N = [[W_R]], i.e. the meaning of this rule expression is the normality mapping.
Using his formalisation, Wiggins (2006a) provides a number of definitions of specific behaviours that a creative system could display, in terms of what concepts are valued, which can be reached, etc. These terms can all be defined within the formalisation given here, as follows, assuming an exploratory creative system S = (P, N, V, Q), and using some of the terminology already defined above:
Hopeless uninspiration: The valued set of concepts is empty. That is, there are no concepts anywhere within the universal set that meet the 'quality' threshold.
Conceptual uninspiration: V_0.5(N_0.5(P)) = ∅. That is, there are no concepts within the acceptable ('normal') set of concepts that meet the 'quality' threshold.
Generative uninspiration: The valued set of reachable concepts is empty. That is, there are no concepts which the search mechanism can reach which meet the quality threshold.
Aberration: Where B consists of exactly those elements in the reachable set which are not in N_0.5(P), aberration occurs if B ≠ ∅. That is, aberration is when the search mechanism goes outside the 'normal' set of concepts. Perfect aberration is where V_0.5(B) = B (i.e. all the non-normal concepts meet the quality threshold); productive aberration is where V_0.5(B) ≠ ∅ and V_0.5(B) ≠ B (i.e. just some of the non-normal concepts meet the quality threshold); pointless aberration is where V_0.5(B) = ∅ (i.e. no non-normal concepts meet the quality threshold).

Evaluating search methods

Ventura's analysis

Ventura (2011) gives an analysis of the limitations of uninformed search strategies in a creative context. His definitions and results are general enough that they should be applicable to the framework here, although there is one small formal point that needs to be stipulated first. Ventura (implicitly) makes the following assumption:
One concept per step: Each formal 'step' in the search process corresponds to the generation of exactly one concept/artefact.
That is, Ventura's analysis does not allow for intermediate computational steps behind the scenes which do not directly correspond to the generation of an artefact. The Wiggins definitions (and our reformulations) do not demand this restriction, but it is a plausible constraint, and could be formalised thus: Given an exploratory creative system (P, N, V, Q), Q is a one-concept-per-step scheme if, whenever z′ = Q(N, V)(z), there exists c′ ∈ P such that: (i) c′ is an element of z′; (ii) c′ is not an element of z; (iii) for every element c of z′ where c′ ≠ c, c is an element of z.
This perspective could be taken even further, by revising our definition of an exploratory creative system to include a set OP of operators, which are mappings from tuples of concepts to concepts; that is, each operator is a member of P^(P^k) for some integer k. Then we would stipulate that each step in a search meets the constraint that it corresponds to the invocation of one operator: if two concept-sequences ⟨c1, ..., cn⟩ and ⟨d1, ..., dm⟩ are such that Q(N, V)(⟨c1, ..., cn⟩) = ⟨d1, ..., dm⟩, then there must be an operator p ∈ OP and concepts e1, ..., ek (where k ≤ n) such that for every 1 ≤ i ≤ k, ei = cj for some j, and p(e1, ..., ek) = dr for some dr ∈ {d1, ..., dm}, with dr ∉ {c1, ..., cn}, and for every di ∈ {d1, ..., dm}, either i = r or di ∈ {c1, ..., cn}.
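The one-concept-per-step condition is easy to state operationally. A minimal sketch (our own; concept sequences are assumed to be tuples to which Q has already been applied):

    def one_concept_per_step(z, z2):
        """True iff z2 extends z by exactly one new concept, with everything
        else in z2 already present in z (clauses (i)-(iii) in the text)."""
        new = [c for c in z2 if c not in z]
        if len(new) != 1:          # exactly one genuinely new concept
            return False
        return all(c in z or c == new[0] for c in z2)

    print(one_concept_per_step((2, 3), (2, 3, 6)))   # True
    print(one_concept_per_step((2, 3), (2, 6, 7)))   # False: two new concepts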
Ventura's analysis provides one possible formalisation of the intuitive notion of a search strategy being 'better' (or 'best'). It posits a set of target elements (concepts, in our terminology) and considers the probability of the search reaching one of these elements. In a footnote, Ventura also offers a definition where the desirability of elements is a function to [0, 1] (cf. our V), and computes the probability of reaching an element with a maximal value. It is arguable that in the area of creative systems, there is less emphasis on finding specific target items (or even finding a maximal-scoring item) and more on generally reaching acceptable or highly valued concepts (a direction in which Ventura's footnote moves). Our formalisation allows an alternative perspective on the assessment or comparison of search methods; see the next subsection.

Comparing search

Our formulation of the CSF allows the comparison of two search methods according to how the concepts they reach are rated by the related mappings N and V, taking into account the depth of search involved. As we will want to apply certain definitions to various mappings (including N and V), we will start with a general schematic definition in which the mapping F can be anything of the appropriate type (so F is not mnemonic for anything, being just a placeholder for now). Also, AGG will stand for either AVG (arithmetic mean) or MAX (maximum) of a function applied to a set.

Definition 11: Suppose there are two exploratory creative systems (P, N, V, Q1) and (P, N, V, Q2), with S_i(k, k′) = the set of concepts reachable in no fewer than k and no more than k′ steps in these two systems (i = 1, 2). Also, F ∈ [0, 1]^P (i.e. a rating of concepts, of some sort). Then
(i) Q1 has higher AGG F-values up to k′ steps than Q2 if AGG(F, S_1(0, k′)) > AGG(F, S_2(0, k′)).
(ii) Q1 has higher AGG F-values beyond k steps than Q2 if AGG(F, S_1(k, k′)) > AGG(F, S_2(k, k′)) for any k′ > k.
This compares two variants of a system in which only the search method Q is different. The above definitions will be applied, below, to specific values for F.

Definition 12: Given two exploratory creative systems (P, N, V, Q1) and (P, N, V, Q2):
(i) Q1 is higher valued on average up to k steps than Q2 if Q1 has higher AVG V-values up to k steps than Q2.
(ii) Q1 is more normal on average up to k steps than Q2 if Q1 has higher AVG N-values up to k steps than Q2.
(iii) Q1 achieves higher value up to k steps than Q2 if Q1 has higher MAX V-values up to k steps than Q2.
(iv) Q1 achieves greater conformity up to k steps than Q2 if Q1 has higher MAX N-values up to k steps than Q2.
For each of the four subparts of the above definition, we can frame a corresponding definition which says that there exists some depth k0 after which one of the search methods Qi gives a higher value than the other; e.g.: Q1 is higher valued eventually than Q2 if there is some integer k0 > 0 such that Q1 has higher AVG V-values beyond k0 steps than Q2. Similar substitutions can be made in the other definitions. In this way, we have several ways of describing one search method Q1 as being 'better' than another, Q2. Next, we consider how comparisons of search methods can be more detailed.
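A sketch of how Definitions 11 and 12 could be computed for finite systems (our own illustration; the search mappings and rating below are toy assumptions, with a search mapping simplified to a function from a concept to its successor concepts):

    def reachable_between(start, q, k, k2):
        """Concepts first reached in no fewer than k and no more than k2
        steps; q maps a concept to its successor concepts."""
        level, seen, out = {start}, {start}, set()
        for depth in range(k2 + 1):
            if depth >= k:
                out |= level
            level = {s for c in level for s in q(c)} - seen
            seen |= level
        return out

    def agg_f(f, concepts, agg=max):
        return agg(f(c) for c in concepts) if concepts else 0.0

    def higher_agg_f_up_to(q1, q2, f, start, k2, agg):
        # Def 11(i): AGG(F, S1(0, k2)) > AGG(F, S2(0, k2))
        s1 = reachable_between(start, q1, 0, k2)
        s2 = reachable_between(start, q2, 0, k2)
        return agg_f(f, s1, agg) > agg_f(f, s2, agg)

    # Def 12(iii), "achieves higher value up to k steps": MAX with F = V.
    V = lambda c: min(1.0, c / 10.0)
    q1 = lambda c: {c + 1, c * 2}     # two toy search mappings over integers
    q2 = lambda c: {c + 1}
    print(higher_agg_f_up_to(q1, q2, V, start=1, k2=3, agg=max))   # True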
Descriptive criteria

Ritchie (2001; 2007) defines a set of formal criteria which can be used to describe aspects of a potentially creative system's behaviour. Central to these formal criteria are two rating schemes for assigning values in [0, 1] to elements of the set of basic items (i.e. the set of possible artefacts). One rating scheme (typ) represents typicality, indicating the extent to which an item lies within the norm for this type of artefact. The other rating (val) is for value, and indicates the 'quality' of an item. Ritchie's typ and val, Wiggins's R and E, and our N and V all appear to capture the same intuitive notions: that we can rate possible creations as to their membership of a concept set, and in terms of the quality of such creations. There are some differences of nuance between Ritchie's constructs and those in the CSF, to which we will return later, but for the moment let us consider how the central ideas in some of Ritchie's criteria could be used within the CSF as stated here. The first eight of Ritchie's criteria are stated in terms of a result set R, which is the set of artefacts produced by the computer program, and the two ratings typ and val. There is not space here to reproduce them all, but Criterion 7 illustrates the general idea:
ratio(V_γ,1(R) ∩ T_0,ε(R), T_0,ε(R)) > θ, for suitable ε, γ, θ,
where V_γ,1(R) is the set of elements of R which are rated above γ by val, T_0,ε(R) is the set of elements of R which are rated below ε by typ, and ratio computes the ratio of the sizes of two sets. That is, this computes the proportion of the untypical items which are of good quality. Criteria like this could be applied to a creative system (P, N, V, Q), using N to define T_i,j and V to define V_i,j. There are various ways in which the result set R could be defined in terms of reachable concepts: all reachable concepts? concepts reachable from a starting set B? concepts reachable after some number of steps k? All of these are plausible models of a 'result set'. Hence there would be a few families of very similar formulae, parameterised according to starting set or number of steps. Ritchie (2007) emphasises that these criteria are not all measures of creative success, but can be used to 'profile' a (potentially) creative program by describing its behaviour in more detail. In the same way, they could give a more detailed picture of a creative system, in the CSF sense. Ritchie also postulates an inspiring set, I, which contains the existing artefacts upon which the design of the creative program was based. The remaining criteria (9-18 in Ritchie (2007)) make comparisons of different sorts between I and the result set R. There is no exact counterpart within the CSF, as there is no 'design' stage in the formalisation. However, the formal structure of Ritchie's criteria which involve I could be coerced into service within the CSF, by replacing I with some initial set of concepts B, from which search starts. That is, where Ritchie has a criterion such as:
ratio(V_γ,1(R − I) ∩ T_α,1(R − I), (R − I)) > θ, for suitable α, γ, θ
(informally, 'a high proportion of the novel results are both highly valued and very typical of the genre'), we would have:
ratio(V_γ,1(R − B) ∩ T_α,1(R − B), (R − B)) > θ, for suitable α, γ, θ,
where R is the set of concepts reachable from B. (Again, there is a possible variant where a number of steps k is stipulated.)
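For concreteness, here is a sketch of a Criterion 7 counterpart built from internal components, as just described (our own illustration; the half-open threshold-set conventions are assumptions on our part):

    # T and V threshold sets built from the system's own N and V mappings.

    def t_set(items, typ, lo, hi):
        """T_lo,hi: items whose typicality rating lies in (lo, hi]."""
        return {x for x in items if lo < typ(x) <= hi}

    def v_set(items, val, lo, hi):
        """V_lo,hi: items whose value rating lies in (lo, hi]."""
        return {x for x in items if lo < val(x) <= hi}

    def ratio(a, b):
        return len(a) / len(b) if b else 0.0

    def criterion7(result_set, typ, val, eps, gamma, theta):
        """Proportion of untypical items (typ below eps) that are highly
        valued (val above gamma) exceeds theta."""
        untypical = t_set(result_set, typ, 0.0, eps)
        good = v_set(result_set, val, gamma, 1.0)
        return ratio(good & untypical, untypical) > theta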
Although this seems to indicate that we can simply port the Ritchie criteria into the CSF, there is one further issue to consider: there is a difference in the overall perspective of the two formal accounts. There are various viewpoints one could assume in devising a formal model in this area. For example, it would be possible to have an abstract declarative formalisation of the nature of the creative task, without including any details of how this task might be executed. Or a model might be proposed as describing (at some suitably abstract level) how a creative system operates. Ritchie's criteria arguably take a third viewpoint, in which one treats the program as a conveyor of input/output data, and attempts, from an external viewpoint, to say more precisely how it has performed. The typ and val mappings are certainly not proposed as components of the program or system. Instead, these are measures which might be applied by, for example, having humans judge the program's output. In Wiggins's CSF, the formal definitions of R and E (the symbolic counterparts of our N, V) would be compatible either with a characterisation of the abstract nature of creativity, or with a model of a creative system. However, the terminology used, and the inclusion of the T ('agenda') mapping (our Q), determine that this is a model (at a very abstract level) of the working of a creative system. Hence, the whole intent of the CSF is radically different from that of Ritchie's definitions, even though there is a clear parallel between the intuitive meanings of N and typ, V and val. This means that what we have sketched above is not the direct application of Ritchie's criteria within the framework, but the definition of counterparts within the CSF, where the T_i,j and V_i,j mappings are defined using internal components (N, V) of the system, not external judgements. However, this means that if we are scrutinising the behaviour of a creative system, we can apply the conditions outlined above (i.e. the counterparts of Ritchie's criteria) in distinct ways:
Internal: How is the system performing, in its own terms? For this, we use N and V to define T_i,j and V_i,j.
External: How is the system performing, in terms of independent measures such as human judgements? For this, we use the independent measures to define T_i,j and V_i,j.
In both of these, we retain the notions of B (initial set) and R (reachable concepts) discussed above. Given that the set of reachable concepts is in effect the 'result set', if the inspiring set I is known, then using I instead of B, with an External perspective, is effectively the scenario in the Ritchie papers. These various adaptations of the criteria to the CSF can be viewed alongside the definitions in our section Comparing search above, and provide a slightly finer-grained and more detailed vocabulary for comparing search strategies.

Guiding the metalevel

We have already shown how the metalevel of a creative system can start from an existing object-level system and search for a variant system, using a metalevel value function V_meta. What should the content of V_meta be? Since the metalevel has access to the object-level mappings N and V, it would be possible for V_meta to be defined using composite criteria such as those we have outlined above, the counterparts of Ritchie's criteria. That is, the metalevel search could be guided by how candidate object-level systems performed according to these criteria. For this, the distinction between internally-parameterised and externally-parameterised versions of the criteria is important. Whereas the externally-parameterised versions (using human judgements or other measures) are exactly appropriate for profiling or assessing the success of the object-level system, the internally-parameterised versions (using N and V) are the only ones that make sense within the creative (metalevel) system itself.
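One possible shape for such a V_meta, sketched under toy assumptions (this is our own illustration, not a construction from the CSF papers: a candidate object-level system is a triple of Python functions, and the step bound, thresholds and 0/1 scoring are arbitrary choices):

    def v_meta(candidate, start, k2=5, eps=0.3, gamma=0.7, theta=0.2):
        """Score an object-level system (n, v, q) by an internally-
        parameterised Criterion-7-style proportion computed over the
        concepts its own search reaches."""
        n, v, q = candidate
        # Concepts reachable from `start` in at most k2 steps.
        reached, frontier = {start}, {start}
        for _ in range(k2):
            frontier = {s for c in frontier for s in q(c)} - reached
            reached |= frontier
        untypical = {c for c in reached if n(c) < eps}
        good = {c for c in untypical if v(c) > gamma}
        score = len(good) / len(untypical) if untypical else 0.0
        return 1.0 if score > theta else 0.0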
This glosses over the significant question of whether a real creative program would be implemented with the meta/object strata of the CSF, or whether the formal framework is only a way of describing, at some fairly abstract level, what creative systems do (or could do). It is possible that the actual use of structured criteria which compare ratings of initial sets and of reachable concepts would not be realistically applicable in implemented systems.

Summing up

We have presented a reformalisation of Wiggins's CSF, which:
• makes the use of a symbolic language an optional extra
• indicates how search strategies can be formally compared within the CSF
• clarifies the metalevel, defining some metalevel constructs in more detail and making explicit some formal comparisons
• shows how Ritchie's criteria can be adapted, in a number of ways, to the CSF formalisation, thereby clarifying the relationship between these frameworks
In this way, we have extended the development of formal descriptive frameworks for creative systems.

Acknowledgments

The author is very grateful to Geraint Wiggins for detailed discussions of this material, and for comments on a draft of this paper.

2012_7 !2012

Creative Search Trajectories and their Implications
Kyle E. Jennings, Department of Psychology, University of California, Davis, Davis, CA 95616 USA, kejennings@ucdavis.edu

Abstract. Creative search trajectories are chronologically organized intermediate products (such as sketches and drafts) from the creative process. We discuss what sorts of conclusions can be made when these trajectories show non-monotonic progress toward the final creation. We introduce several key distinctions that are often overlooked, and argue that two null hypothesis processes must be rejected before non-monotonicity can be claimed to support more complex processes. We show that these null hypotheses are in fact difficult to rule out definitively using the sorts of evidence that past research has offered.

The sketches, drafts, revisions, and rejected ideas that creators leave in their wake on the way toward great masterpieces offer glimpses of the mental processes that are responsible for their achievements. Ordered in time, these artifacts trace a trajectory through a mental space, and may be the signatures of the specific exploration strategies that differentiate great thinking from the mundane. Even the differences in creativity that can be observed among study participants may be explainable by tracing and analyzing the detailed steps taken with simple creative problems. This approach, which we call trajectory analysis, borrows from the problem solving research of Newell and Simon (1972) and others. However, whereas most problem solving research uses problems that are concrete, objective, and formalizable in terms of specific states, operators, and goals, the creative problems faced by artists, writers, and scientists are less easily reduced to symbols, rules, and computational steps.
Though there are examples where researchers have meticulously examined creative search trajectories using objective features (Weisberg 2004), most research has let subjective judgments stand in for complete formal details (Kozbelt 2006; Simonton 2007; Kozbelt and Serafin 2009; Damian and Simonton 2011). Despite taking different approaches, many of these results point to the conclusion that creative outcomes are not arrived at directly, but rather through the twists and turns of false starts and retraced steps. These observations have been taken as evidence that the creative process is tentative and experimental rather than deliberate and informed. Given how complex it would be to devise comprehensive and objective descriptors of intermediate states in creative problems, it may not be practical to overcome all of the imperfections of subjective judgments. However, this does not mean that conclusions drawn about the creative process on the basis of these judgments do not need to be rigorously justified. While it seems intuitively clear that a process that is tentative, uncertain, or experimental would produce trajectories that do not move monotonically closer to the final solution (Simonton 2007; Damian and Simonton 2011) or that do not show monotonic improvement over time (Kozbelt 2006; Kozbelt and Serafin 2009), caution must be exercised before concluding that trajectories with these features could only have been produced by such processes. This paper aims to clarify what can and cannot be concluded about the creative process on the basis of creative search trajectories. Though we are ultimately optimistic about the potential of this approach and advocate the view that creativity requires uncertainty and experimentation, we will show that existing evidence that creative search trajectories are non-monotonic is in fact compatible with very straightforward search processes, and that more care and precision must be used when analyzing search trajectories. We begin with a more detailed discussion of existing trajectory analysis approaches, and then describe in overview the distinctions that past work has overlooked. We then illustrate these distinctions by presenting simple but formally complete examples that demonstrate the need for caution when drawing the conclusion that non-monotonic search trajectories necessarily reflect something other than a straightforward process. We conclude by discussing the burden of proof placed on researchers who wish to infer the causes of non-monotonicity and by offering suggestions for future work.

Background

While there are many approaches that people have taken to using intermediate products to understand the creative process (Getzels and Csikszentmihalyi 1976; Finke, Ward, and Smith 1992; Hennessey 1994; Ruscio, Whitney, and Amabile 1998; Rostan 2010), this paper focuses specifically on techniques that characterize the nature of the changes between successive revisions of the work (Kozbelt 2006; Simonton 2007; Kozbelt and Serafin 2009; Damian and Simonton 2011). Common to these analyses is the prediction that creativity should be associated with non-monotonicity, which is perhaps best understood as the opposite of direct, incremental progress toward the final creation (other theories predict this, too, e.g., Perkins 2000).
Figure 1: Depiction of the key questions, answers (italicized), and terms (bolded) that impact how search trajectories are analyzed.
Beyond this commonality, the theories differ in their reasons that non-monotonicity should be expected, and indeed in how they operationalize monotonicity. One approach to characterizing monotonicity is that taken by Kozbelt (2006) in his analysis of the sketches preceding Matisse's Large Reclining Nude. His approach focuses on the evaluation of each sketch, and in particular on whether the sketches become monotonically better with time. Given that Matisse's own evaluations are not available, Kozbelt has both artists and non-artists rate each sketch (presented in a random order) on 26 items, which are then analyzed in order to extract a latent quality dimension. The artists' ratings are found to be maximal at the final image, but beyond this they show non-monotonic variations over time (the contour of which is replicated in the non-artist sample). Similarly, Kozbelt and Serafin (2009) analyze intermediate sketches from drawings by non-eminent artists, and find that artwork that had been previously rated as more creative resulted from less monotonic trajectories. This is suggested to be due to an "interactive, hypothesis-testing dynamic" (Kozbelt and Serafin 2009, p. 358), though the specifics of this process are not articulated. The other approach taken to characterizing monotonicity involves looking at the structure of the sketches themselves. The preponderance of this evidence stems from sketches Picasso left of Guernica. After an informal suggestion by Simonton (1999) that these sketches showed "false starts and wild experiments" (p. 197), Weisberg (2004) undertook a detailed analysis of the features shared between sketches and concluded that they were elaborations on a basic idea that itself had precedent in other work, including Picasso's own Minotauromachy. In response, Simonton (2007) undertook an alternative analysis in which various raters arranged the sketches in the order they would most logically have been generated, which reliably resulted in an order that did not match the actual temporal order. Later, Damian and Simonton (2011) had raters judge the similarity of the components from Minotauromachy to their counterparts in the Guernica sketches, and again found that the Guernica work did not get monotonically closer to (or further from) the prototypes. Simonton claims this as evidence in support of the Blind Variation and Selective Retention (BVSR) theory of creativity (Simonton 2003; 2010), which holds that both desirable and undesirable variations will be generated during the creative process. The Simonton work, in particular, has generated a great deal of recent controversy (Dasgupta 2011; Gabora 2011), much of it focused on the computational and algorithmic specifics of BVSR theory.
In fact, neither Kozbelt's nor Simonton's analyses offered precise and detailed accounts of the processes that would lead to non-monotonicity, with Gabora (2011) suggesting that BVSR wouldn't even predict non-monotonicity. While Simonton has begun better formalizing BVSR theory (Simonton 2011; 2012), the fact is that there are basic problems for the trajectory analysis approach that any theory of the creative process must overcome. Therefore, rather than specifically addressing the claims by Kozbelt and Simonton, this paper will instead try to address some basic inconsistencies regarding how monotonicity is conceptualized and operationalized, and will demonstrate that non-monotonicity can result from processes that are more straightforward than most theories suggest.

Overview of Distinctions

In this paper we introduce several distinctions that are essential when analyzing search trajectories. As depicted in Figure 1, these distinctions revolve around questions about what the search must find, what aspect of trajectory monotonicity is of interest, and how monotonicity is measured. Throughout the paper we will define and illustrate the terms that these questions delineate, using the coordinates in the margins to reference the relevant portions of the figure. The first and most important question is what the search must find (Figure 1, D1). Problem solving research describes problems by the set of states and the operators that move between them. The problem solver's goal is to find a sequence of operator applications that transforms the initial state into (one of) the goal state(s). Because the goal state is already known, the thing being searched for is the path itself, which is why we refer to these as path searches (Figure 1, AB23). (See also Jennings, Simonton, and Palmer 2011.) Because the goal states are well known in path search, solution quality depends more on the quality of the path than on the specific goal state reached, with shorter (or less costly) paths being better. Though this situation aptly describes many problems (e.g., proving a theorem, inventing a process for synthesizing a given protein), there are other problems where the end points are not known in advance. For instance, in painting the artist seeks to depict a certain scene, theme, or emotion using brush and paint. In most cases we compare paintings not by the set of brushstrokes that led from an empty canvas to the completed image, but rather by that image itself. Thinking of these final images as places in a solution space, we refer to this as a place search (Figure 1, EF23). Creativity is possible with both path and place search, and most real problems involve some element of each (e.g., choosing a place and then finding the path to it). Though we'll speak of these as distinct kinds of searches in this paper, we recognize that understanding joint path-place search is an essential task for future work.

Monotonicity in Path Searches

Our discussion begins with path search. For simplicity we'll consider the classical Towers of Hanoi problem. (Though this problem leaves little room for creativity, it nicely illustrates our key points.) As depicted in Figure 2, there are three disks of decreasing size that are initially stacked on the leftmost of three pegs. The problem is to move the disks to the rightmost peg by moving one disk at a time without placing larger disks on top of smaller disks.
The figure shows the shortest sequence of states that solves the problem, which together constitute the path found in this path search. The Towers of Hanoi is often used to illustrate the failure of a heuristic called difference reduction, which entails iteratively applying the operator that most reduces the discrepancy between the current state and the goal state (Anderson 1993). In fact, solving this problem requires selecting operators that temporarily make the current state less similar to the goal state, and for this reason it could be argued that even a simple problem like the Towers of Hanoi has a non-monotonic solution. By this logic, there is no controversy behind claiming that creative problems exhibit non-monotonicity. However, we will see that the Towers of Hanoi isn't inherently non-monotonic, at least in the sense that matters when making inferences about the search process. Let's begin by looking at the monotonicity of the sequence of states forming the shortest path in Figure 2. Though we'll ultimately conclude that these are not necessarily the states that we should be analyzing when making inferences about the search process, they conveniently illustrate the different ways that monotonicity can be judged. The difference reduction heuristic works by economically comparing the current and goal states. For example, we could compare states by looking directly at their representations. Each state in Figure 2 can be described as an ordered triple indicating which pegs the smallest, middle, and largest disks are on. (This is sufficiently descriptive since larger disks cannot be on top of smaller disks.) Thus, the starting state is (1, 1, 1), the final state is (3, 3, 3), and the intermediate states are (3, 1, 1), (3, 2, 1), etc. By analogy to genetics, we can think of the state encoding as being a genotype. We'll define genotypic distance to be the dissimilarity between the encodings of two states, which here we can define as the sum of absolute differences in state representations (Figure 1, C78). For instance, the third state, (3, 2, 1), differs from the final state, (3, 3, 3), by |3−3| + |3−2| + |3−1| = 3. We could also compare states by looking at their salient features, which may or may not be directly related to the state encoding. Again by analogy to genetics, we'll call this phenotypic distance (Figure 1, D78). For the Towers problem, let's calculate phenotypic distance by counting the number of disks that are on different pegs. Thus, the first and final states differ by three, the second and final by two, and so on. As Figure 2 shows, the best solution to the Towers problem is indeed non-monotonic in genotypic distance (DG) and phenotypic distance (DP), with the solution becoming less similar to the goal at various points, and difference reduction would indeed fail with either of these methods for choosing operators. However, the fact remains that the sequence of moves shown is minimal (the path is the shortest path possible). Defining the transformation distance (DT) as the length of the shortest path between two states (Figure 1, E78), we can see in Figure 2 that the path does get monotonically closer to the final state. On the one hand this is a useless insight, since performing difference reduction with DT just pushes the work of finding the solution into calculating DT. On the other hand it reveals that the solution really isn't non-monotonic, in the sense that there is no point when the path takes steps leading away from the goal state.
Figure 2: Illustration of non-monotonicity in genotypic distance, DG, and phenotypic distance, DP, but monotonicity in transformation distance, DT, with the Towers of Hanoi problem. Distances are between the current state and the goal state, and have been normalized. The sequence of moves shown is optimal.
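These three distances are simple to compute for the Towers of Hanoi. The following sketch (our own illustration, using the ordered-triple encoding above and breadth-first search for DT) may make the definitions concrete:

    # States are triples giving the peg (1-3) of the smallest, middle,
    # and largest disk.
    from collections import deque

    def genotypic(a, b):
        # Sum of absolute differences between state encodings.
        return sum(abs(x - y) for x, y in zip(a, b))

    def phenotypic(a, b):
        # Number of disks on different pegs.
        return sum(x != y for x, y in zip(a, b))

    def moves(state):
        # Legal successors: a disk may move if no smaller disk shares its
        # peg, onto any peg holding no smaller disk.
        for i, peg in enumerate(state):
            if any(state[j] == peg for j in range(i)):
                continue  # a smaller disk is on top of disk i
            for target in (1, 2, 3):
                if target != peg and not any(state[j] == target for j in range(i)):
                    yield state[:i] + (target,) + state[i+1:]

    def transformation(a, b):
        # Length of the shortest legal move sequence from a to b (BFS).
        frontier, seen = deque([(a, 0)]), {a}
        while frontier:
            s, d = frontier.popleft()
            if s == b:
                return d
            for s2 in moves(s):
                if s2 not in seen:
                    seen.add(s2)
                    frontier.append((s2, d + 1))

    goal = (3, 3, 3)
    print(genotypic((3, 2, 1), goal))       # 3
    print(phenotypic((1, 1, 1), goal))      # 3
    print(transformation((1, 1, 1), goal))  # 7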
Having defined the various distance metrics that can apply to states, we need to ask whether we're in fact looking at the states that are relevant when making process inferences. Recall that in path search the solution is itself a sequence of states connected by operators. Thus, the states at the bottom of Figure 2 jointly constitute the solution path (Figure 1, A45). If the problem solver only represents the current state and the goal state, essentially treating each step in the solution as a new problem, then each of the individual states in Figure 2 may indeed describe the problem solver's internal state at each step, and we would conclude that the path is in fact monotonic, since no state is ever revisited. Suppose instead that the problem solver uses a technique like means-ends analysis (Newell and Simon 1961). In this case the internal search state would need to represent subgoals and paths with gaps. For example, the first subgoal in means-ends analysis would be to move the largest disk to the rightmost peg. This might be represented as:
(1, 1, 1) → ??? → (?, ?, 3) → ??? → (3, 3, 3)
The path would then be built recursively by filling in the steps before (?, ?, 3) and then the steps after (?, ?, 3). Whether this process is monotonic depends on, for example, whether the problem solver ever fails to achieve one subgoal and tries another one. Monotonicity would have to be evaluated according to the sequence of internal states (the search trajectory; Figure 1, B45), not the states of the problem itself (Figure 1, A45). The importance of looking at internal states is clear when we realize that any search process that successfully finds a monotonic path may have explored longer paths that were ultimately edited before emitting the solution. For instance, a mathematician may prove several different lemmas as part of the proof of a larger theorem before realizing that they are all part of one general lemma and collapsing them accordingly. The final published proof would reflect this realization, but that does not imply that the mathematician's own thought processes followed the minimal path presented in the publication. To summarize, when analyzing path searches, the relevant states to consider are the internal states, which will contain but may not be identical with the problem states. Non-monotonicity should be judged using transformation distance, as this is the most direct measure of whether the trajectory includes apparently wasted effort.

Monotonicity in Place Searches

We're now ready to shift our emphasis to place searches, where the major aim of the search is to find the most desirable final state (Figure 1, F23). Here we'll adopt a landscape metaphor, wherein states, x, are thought of as being topographically organized according to the available operators. Each state's desirability is denoted f(x), which we'll call its fitness, in keeping with the genetics-inspired language used for genotypic and phenotypic distance.
Our discussion in this section considers search processes that maintain a single current state that is iteratively improved. Though these processes may consider several alternate states in each iteration, only one survives into the next iteration. (As with path-place searches, we leave place searches where multiple states are under simultaneous refinement for future work.) We're not going to attempt to show that any particular process fitting these parameters is most plausible. Instead, we discuss two processes that are relatively implausible and yet not straightforward to rule out with trajectory analysis. The first is a computationally implausible process that always finds the most efficient path between the initial and final state, which we call the direct process. (Practically speaking, finding evidence for the direct process suggests that the entire search occurred mentally, making trajectory analysis the wrong analytical approach for that problem.) The second, which we call the hill climbing process, takes the psychologically implausible steps of evaluating every available move and always choosing the one that most improves fitness, stopping when no improvement is possible (regardless of how low the current state's fitness is).
Figure 3: Illustration of (a) the hill-climbing process, which shows non-monotonicity in genotypic distance but monotonicity in fitness (solid lines), and (b) the direct process, which shows monotonicity in genotypic distance but non-monotonicity in fitness (dashed lines). Times and distances have been normalized.
Whereas monotonicity in path search always referred to the states in the search trajectory, place search lets us look at the monotonicity of both the intermediate states and their fitnesses (Figure 1, F3). We'll call the monotonicity of the intermediate states state monotonicity (Figure 1, DE45). The same distinctions between genotypic, phenotypic, and transformation distance apply, and as before transformation distance is the most relevant form of monotonicity (though usually not the most convenient to calculate). We'll call the monotonicity of the fitness function over time fitness monotonicity (Figure 1, G45). As we'll see, different conclusions can result from considering fitness monotonicity as assessed during search (Figure 1, FG78) or after search (Figure 1, H78). In the following we'll present visual examples of searches over two-dimensional grids, with the states in the x-z plane and fitness on the y-axis. In this way, the ideal endpoint for a place search is the highest point on the landscape. We'll allow single-unit moves in the up-down, left-right, and diagonal directions. Note that in this case DG, defined as the Euclidean distance, is a good proxy for DT.

Insufficiency of Either Monotonicity Alone

Now we can evaluate whether fitness or state trajectories are individually sufficient to rule out either the direct or hill climbing processes. For the landscape shown in Figure 3, a hill-climber that follows the fitness function, f, will trace a path like the solid line shown in the figure. This trajectory is non-monotonic in genotypic distance but monotonic in fitness. The direct process would form a trajectory that is monotonic in genotypic distance but non-monotonic in fitness. Therefore, finding one but not both of state or fitness non-monotonicity is not sufficient to rule out both null hypothesis processes.
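For concreteness, here is a sketch of the hill climbing null-hypothesis process on a toy grid landscape (our own illustration; the landscape, grid size and move set are assumptions):

    def hill_climb(start, fitness, width=20, height=20):
        # Single-unit moves in the up-down, left-right and diagonal directions.
        moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                 if (dx, dy) != (0, 0)]
        x, y = start
        trajectory = [start]
        while True:
            neighbors = [(x + dx, y + dy) for dx, dy in moves
                         if 0 <= x + dx < width and 0 <= y + dy < height]
            best = max(neighbors, key=fitness)
            if fitness(best) <= fitness((x, y)):
                return trajectory                # local maximum: stop
            x, y = best
            trajectory.append(best)

    # A toy unimodal landscape with its peak at (12, 7).
    f = lambda p: -((p[0] - 12) ** 2 + (p[1] - 7) ** 2)
    path = hill_climb((0, 0), f)
    print(path[-1])   # (12, 7); fitness rises monotonically along the path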
Partially Observable States

The trajectory traced in Figure 3 is a spiral. While this is literally a roundabout path to take, it is at least free of cycles where the trajectory revisits previously-encountered states. A trajectory with cycles seems to clearly contradict both null hypotheses. Indeed, the presence of a cycle can rule out the direct process, as editing out the cycle will not prevent the path from reaching the same final state. A cycle is likewise impossible with hill climbing, unless the internal state is only partially observable (cf. Gabora 2011). (Cycles could also occur in the special case where all of the states in the cycle are indistinguishable in terms of fitness.) Consider the landscape in Figure 4, where the complete state (x, x̂) has an observable part (x) and an unobservable part (x̂). Projecting the full trajectory into x shows a cycle, but the cycle disappears with (x, x̂).
Figure 4: Illustration of an apparent cycle using a hill climbing process where the state is only partially observable. Projecting the trajectory into observable space (horizontal axis) shows a cycle, while the complete trajectory is non-cyclic.

Criteria Change

In a case almost identical to the previous example, suppose that people's criteria change during search. This is akin to having an unobserved portion of the state that affects the evaluation function. Here again, cycles could emerge in the observable state that are in fact monotonic if the evaluation function's hidden parameter were observable. (The cases are not quite identical, since changes to this hidden parameter are dependent on some higher-level process, and hence aren't fully explainable by reference to f.) If one considers changes to evaluation criteria as separate from the main search process, then it is still possible for this apparent non-monotonicity to occur with a hill climber. A special problem can occur when criteria change occurs, since the same state will be evaluated differently before and after the change. This could lead to retrospective evaluations of fitness over time showing non-monotonicity while in fact fitness was experienced as increasing monotonically during the search (see Figure 1, FH78). Suppose that the state is a scalar x and that f(x) depends on a weight parameter w such that f(x) = w·f1(x) + (1 − w)·f2(x). Figure 5 demonstrates how retrospective fitness evaluation could show non-monotonicity even though contemporaneous fitness was monotonic.
Figure 5: Illustration of a search over x while a criteria weight w is simultaneously changing. The path (solid line) starts in the foreground and proceeds to the background. As shown in the right graph, contemporaneous fitness is monotonic. However, retrospective evaluation of the traversed points would be non-monotonic, as shown by the dashed lines in both graphs.
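The retrospective/contemporaneous distinction can be simulated directly. In this sketch (our own; the two criteria and the drift schedule are arbitrary choices), a greedy climber follows the criteria in force at each moment, and the visited states are then re-scored under the final criteria; the retrospective series comes out non-monotonic because the path initially moves away from what the final criteria favour:

    f1 = lambda x: 1 - (x - 0.3) ** 2     # criterion dominant early
    f2 = lambda x: 1 - (x - 0.8) ** 2     # criterion dominant late

    def f(x, w):
        # Overall evaluation under criteria weight w.
        return w * f1(x) + (1 - w) * f2(x)

    def is_monotonic(seq, tol=1e-9):
        return all(b >= a - tol for a, b in zip(seq, seq[1:]))

    x, path = 0.5, []
    for step in range(80):
        w = max(0.0, 1.0 - step / 40)     # hidden criteria drift over time
        path.append((x, w))
        # Greedy single-unit move under the criteria in force right now.
        x = max((x - 0.02, x, x + 0.02), key=lambda v: f(v, w))

    retrospective = [f(xi, path[-1][1]) for xi, _ in path]  # final criteria
    print(is_monotonic(retrospective))    # False: the early steps moved
                                          # away from the final criteria's peak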
Infeasible Regions

So far we have considered problems where DG is a good proxy for DT. However, there are problems where the state representation does a poor job of reflecting the actual difficulty of transforming one state into another. Consider the landscape in Figure 6, where the gray regions are infeasible states. Though a hill climber would fail on this particular landscape, the direct process would find the path shown. As can be seen, the state trajectory appears non-monotonic with DG but isn't when judged with DT. The fitness trajectory is also non-monotonic, though that could occur with any direct process.
Figure 6: Illustration of a path that is non-monotonic with DG but monotonic with DT. The gray region in the left plot shows infeasible configurations that can be represented with the genotype but that cannot be realized.

Phenotypic Distance

In real-world situations, both transformation distance and genotypic distance can be impractical to compute. In contrast, phenotypic distance, which is based on comparing the state's salient features, can be assessed fairly directly, such as by having several human raters make intuitive similarity judgments for pairs of intermediate products. However, there are two issues with this approach. From a psychological standpoint, it has long been known that human similarity judgments are not well-behaved metrics, meaning that they can be context-dependent, asymmetric, and intransitive (Tversky 1977). These properties could introduce non-monotonicity that is not in the original stimulus (Gabora 2011). Another problem with similarity judgments lies in the fact that genotype and phenotype may be non-monotonically related. Let the state be a scalar x ∈ [0, 1] and suppose that this maps to a single salient property, p(x) = sin(2πx) (see top half of Figure 7). If a search proceeds linearly from x = 0 to x = 1, DG (the absolute difference in state representations) will be monotonically decreasing, but DP (the absolute difference in the level of the property) will be non-monotonic (see bottom half of Figure 7). This is true regardless of what sort of landscape or search process is used.
Figure 7: The top graph shows the relationship between the state x and its salient property, p(x). The left graph shows monotonicity in DG (and presumably DT) as x goes from zero to one, but non-monotonicity in DP.
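This genotype/phenotype mismatch is easy to verify numerically (a minimal sketch, with the goal at x = 1 and p as above):

    import math

    p = lambda x: math.sin(2 * math.pi * x)

    goal = 1.0
    xs = [i / 20 for i in range(21)]
    dg = [abs(goal - x) for x in xs]            # monotonically decreasing
    dp = [abs(p(goal) - p(x)) for x in xs]      # rises and falls along the way

    def is_monotone_decreasing(seq, tol=1e-9):
        return all(b <= a + tol for a, b in zip(seq, seq[1:]))

    print(is_monotone_decreasing(dg))   # True
    print(is_monotone_decreasing(dp))   # False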
Discussion

This paper has introduced a set of important distinctions that must be made when analyzing creative search trajectories, as summarized in Figure 1. We have argued that path searches and place searches must be understood differently. With path searches, we have claimed that the important measure of the efficiency of the search process is the transformation monotonicity of the problem solver's internal states, which may or may not be the same as the states of the problem itself. With place searches, we have argued that researchers claiming that people use complex search processes must first reject two null hypothesis processes, the direct and hill-climbing processes. We have shown how rejecting these processes entails demonstrating not just that there is non-monotonicity in both states and fitness, but also that:
• State non-monotonicity occurs with transformation distance, not just with the more convenient indices of genotypic and phenotypic distance
• The observed state non-monotonicity does not reflect movement on unobservable dimensions or the effects of criteria changes
• Fitness non-monotonicity assessed retrospectively would also have been non-monotonic when assessed contemporaneously
We have also illustrated how non-monotonicity may result from processes that are at either end of the spectrum from intelligent and informed (the direct process) to mechanical and uninformed (the hill climbing process). Regarding existing empirical work, we have shown that authors have differed on whether they've analyzed state (Simonton 2007; Damian and Simonton 2011) or fitness (Kozbelt 2006; Kozbelt and Serafin 2009) monotonicity. Though we don't claim that the counterexamples provided here disprove the claims made by these authors, we have raised important issues that future work must address. We also acknowledge that our work has several significant limitations. First, for simplicity we have falsely dichotomized path and place search, even though we are quite sure that real creativity involves elements of each. Indeed, the joint operation of these searches may map nicely onto the problem finding/problem solving distinction that Getzels and Csikszentmihalyi (1976) introduced. Second, though we have alluded to criteria change as an important consideration when analyzing place searches, we have avoided discussing how and when this would occur and what could drive it. This decision was ultimately practical, since any discussion of changes to criteria leads to the question of what guides criteria selection, and whether this itself can change, leading to an infinite regress that may not be resolvable at the level of an individual creator (cf. Jennings 2010a). We have also treated criteria change as compatible with the hill climbing process, which we recognize may not be without controversy. Third, we recognize that we have not carefully considered the role of operators, and in particular how the discovery of new operators mid-search may affect conclusions about state monotonicity. Finally, we are fully aware that all of our counterexamples involve highly abstracted problems, and so ultimately they serve more to illustrate our points than to provide an existence proof. We look forward to addressing these and other limitations in future work. Our purpose here is not to cast aspersions on the trajectory analysis approach. Indeed, we are actively pursuing empirical techniques that rely upon trajectory analysis (Jennings 2010b; Jennings, Simonton, and Palmer 2011), although these techniques promise to reveal more about the underlying problem than can be obtained from a search trajectory alone. Beyond this, we continue to believe that trajectory monotonicity is an eminently practical way to study the creative process, particularly in cases where the available data are slim. The objections that we've raised in this paper do not seal the fate of this approach, but rather offer constructive critiques that should be addressed in future research.

2012_8 !2012

The Creative Computer as Romantic Hero? Computational Creativity Systems and Creative Personæ
Colin G. Johnson, School of Computing, University of Kent, Canterbury, UK, C.G.Johnson@kent.ac.uk

Abstract. A popular definition of computational creativity is that it consists in behaviour that would be regarded as creative if performed by humans. This raises the question of which humans, as there are many different styles of human creative behaviour. This paper unpacks a number of ways in which human artistic creativity can be characterised, compares them with the kinds of creative actions found in computational creativity, and explores some aspects of human creativity that are underrepresented in computational creativity systems.

Introduction

Computational creativity (CC) has been pursued for decades, with intense activity in recent years. One of the most common definitions of CC is "building software that exhibits behavior that would be deemed creative in humans" (Colton and others 2009). This paper explores this definition. Creative behaviour is not monolithic; there are many different ways in which to validly be a creative person.
It might be asked what kind of creative behaviour in humans we want to compare computer systems with. This paper is concerned with creative systems in artistic domains such as visual art, music and literature, not with so-called everyday creativity.

Creative Personæ

There are clear parallels between the idea of CC as exhibiting human-like behaviour and Turing test-style definitions of intelligence. In both, the system is designated as acting in a creative/intelligent way if it can generate behaviour or products that a human observer would recognise as requiring creative/intelligent action had a human produced them. Creative acts in artistic domains are not actions that most people carry out regularly, so care must be taken in selecting which humans are taken as exemplars. In traditional Turing-style AI tests, the exemplar human is a member of the general population. In these tasks we will have to assume that the exemplar has some specific skills and knowledge. One approach is to choose beginners in the domain as exemplars. This fits with the approach to CC system development that sees the development of systems that can do beginner tasks as the first step towards the development of more sophisticated systems. Another approach is to build systems to be compared with mature creative work, where the exemplars are mature artists. There is a danger with any of these exemplar-based systems (Pease and Colton 2011) that they encourage pastiche; but, if evaluators are primed sufficiently, perhaps this can be avoided. McCormack (2005) notes that CC algorithms will be valued when they "produce art recognized by humans for its artistic contribution (as opposed to any purely technical fetish or fascination)". This seems reasonable; yet it might just be a temporary state. Once we have got beyond the point at which the products/activities of computational artistic creative systems are acknowledged as valid artworks, we might become interested in "biographical" aspects of them, and produce works that reflect on the origins of the work without this seeming like "technical fetish". This reflection on origins might become part of the depth of the works. An important point is that not all creative people are creative in the same way. This paper will consider a number of dimensions of what will be termed creative persona space: an informally defined space representing broad attitudes/approaches. This paper considers three dimensions: the social vs. individualist dimension; the importance (or not) of ongoing tradition, development and "craft skills"; and ideas of new and old media and the way in which technology is used in the artistic production itself. Ritchie (2007) uses the inspiring set in evaluating CC: a set of human works as exemplars of what a successful CC system would generate. The intention here is similar, but with regard to the creator rather than the works: for a particular CC system, can exemplars of the creator that is being represented by the system be given?

Dimension 1: Socially Embedded vs. Individualistic

One dimension of difference between creative artists is between those that work as individuals and those that create work in a socially embedded context. No creative artist is entirely divorced from social context, but we focus on those that work directly with others in collaborative creation. This is rare in the literary arts and uncommon in the visual arts; occasionally small groups will work consistently together over a long period of time (e.g.
artistic duos such as Thomson & Craighead, the Chapman brothers and Cardiff & Miller), but literature, visual art and theatre/film writing are dominated by individual creators (comedy writing, especially for TV, is a notable exception, as are the works of groups such as the Dogme 95 filmmaking collective). In music, collaboration is more common. This reaches its peak in performance styles such as free improvisation, where a number of performers work together to collaboratively create a work without a preformed plan or an idea of leadership or direction. In some more commercial creative domains, such as advertising, group creative work is standard. Most work in CC focuses on the individualistic concept of creativity: writing of stories, creation of jokes, composition of melodies, creation of pictures. There has been some interest (Cook and Colton 2011) in mapping out the various contributors to creativity in CC; this is explored further below in the discussion of computer as medium. Whilst CC systems might create interactive works, for example in game level design (Togelius and others 2010), it is rare that the system continues to be creatively active during interaction. There are a number of potential reasons for this focus on the individual. These are readily criticisable and there is no reason to believe that all of them are believed by all practitioners, but listing them gives us an initial scoping out of the potential reasons:
• The work in CC is coming out of an AI tradition, which has focused on the idea of the simulation of the individual mind interacting with a task (though multi-agent systems are a counterexample).
• From an artistic perspective, there is a tradition of the "lone genius" in the romanticist tradition in art (Lovejoy 1948), which sees the role of the artist as developing their own authentic and original voice. This idea of the artist as romantic hero rejects the idea of collaboration and the development of an ongoing tradition; instead the great artist is seen as creating an individual body of work expressing their own personal world-view.
• Creativity might be seen as happening because of various interacting processes within the mind (see e.g. Koestler (1964)). However, people have a curious reluctance to admit hierarchical models of interacting networks, both in intelligence and creativity: creativity/intelligence might be seen as a product of interactions, but it is tempting to contain those interactions to one level in a complex system. It is difficult to conceive of a system where creativity is a product both of interacting systems within the mind and of a network of interacting minds.
• There is a desire to be able to pin down exactly where the creativity is coming from. If a system is embedded in a complex social system with both human and computer agents, it is harder to point to a specific creative act by the computers. An easy criticism of such systems is that all of the creativity is coming from the human agents, and the seeming creativity of the computer agents is replicating, decorating or making trivial responses to or elaborations of the creative acts of the human agents.
• Individual creativity might be seen as the first stage in the development of more sophisticated interactive creative systems; therefore, until CC systems have demonstrated individual creativity, there is no point in tackling the "more complicated" task of group creativity.
• In practice, most creative work is individual; group creativity, in the arts, is confined to specialised areas. Therefore, CC systems are just emulating the world.
There are a number of such collaborative systems that have been created. Consider Voyager (Lewis 2000), where heuristics interact within a listening/responding musical system that improvises alongside human musicians; and Sanfilippo's LIES (Sanfilippo 2012), where sound processing systems are connected, the parameters of these interactions being adjusted by the user. Is this a CC system? It is easy to say that the creativity is coming from the user in the form of the parameter changes; but those might be provoked by sounds from the system, what Blackwell and Young call "strong interactivity", which "depends on instigation and surprise as well as response" (Blackwell and Young 2005). What might a CC system that was designed to work in a free, collaborative environment look like? Take musical improvisation as an example. One source of inspiration might be the broad guidelines that are given to beginners making a start in improvisation; whilst improvisation might be "free", it is not "anything goes", and there is a strong, often unarticulated, tradition about what is acceptable behaviour in such performances. This needs to be learned by participation and reflection, but guidelines can help beginners so that they are not floundering totally. Consider the three guidelines put forward by Dave Smith: Listen; Don't waste sounds; and Develop a sense of social responsibility. How might a CC system attempt to work with these? "Listen" is both trivial (we need to have some means of getting input from the other performers) and, of course, actually a very deep and complex guideline, especially considering that listening is an active process. Clearly, "listen" means that improvisation should take account of that listening. However, account-taking can easily become trivial, and be just imitation; learning to develop material, and links between different heard sounds, is important to produce depth. How might a CC system "listen" in this deeper sense? One characteristic of listening is that people frame listening with regard to what has been heard in the past, making subtle distinctions between some objectively very similar sounds, and grouping other sounds together that are objectively different (e.g. Goto 1971). One way to do this would be to run a system for a long time, and accumulate a set of listenings. This runs into the problem (Bown and McCormack 2010) of getting people to interact over a long period of time with a "naive" system that isn't providing engaging feedback. An alternative would be to take inspiration from the idea of an adult learner who is new to free improvisation but already has a body of sonic knowledge from listening, speaking and playing an instrument. For example, a system might match listened phrases to a corpus of sonic information: a set of melodies, or a set of nature sounds like birdsong, or an artificially generated set such as sound contours derived from spoken text. Developments in audio information retrieval might mean that such information could be gained from web search. The system could then base responses on these matches, which would bring in a broader, allusive set of responses than those that just work with varying the input based on that input alone.
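A minimal sketch of this corpus-matching idea (entirely hypothetical: the corpus entries, the pitch-contour representation and the distance measure are all placeholder assumptions of our own, not a description of any existing system):

    CORPUS = {
        # Hypothetical corpus entries: name -> pitch contour (semitone steps).
        "birdsong_fragment": [2, 2, -1, -3],
        "folk_melody": [0, 2, 2, 1, -2],
        "speech_contour": [1, -1, 1, -1],
    }

    def contour_distance(a, b):
        # Compare step-by-step over the shorter length; crude but illustrative.
        n = min(len(a), len(b))
        return sum(abs(x - y) for x, y in zip(a[:n], b[:n])) / n

    def respond(heard):
        """Return the corpus entry closest to the heard phrase, as the seed
        of an allusive response (rather than direct imitation)."""
        name = min(CORPUS, key=lambda k: contour_distance(CORPUS[k], heard))
        return name, CORPUS[name]

    print(respond([2, 1, -1, -2]))   # closest allusive match in the corpus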
This is simplistic, but could provide the basis for an implementation; indeed, this is how many beginning improvisers might, with some success, interpret it. However, it has more depth. "Waste" could include aspects of listening: don't waste the other sounds that are going on in the environment, whether by ruining them with your own sounds or by failing to exploit the sounds that have potential. There is a suggestion of depth of contribution: providing a contribution that is constantly striving for more depth and development and eschewing cliché. Again, simple versions of these could be implemented in a CC system, particularly when combined with the deeper listening ideas above. For example, the agent could measure the complexity of the current activity and hold back in over-complex situations, or the agent could hold a medium-term memory of what material is being worked on and wait until a particular piece of material is being "played out" before introducing new material, and so on. Turning to the final point, how could a computational agent "develop a sense of social responsibility"? The "society" in question clearly includes the fellow players; also, perhaps, the audience. One aspect of being socially responsible is recognising when behaviour is regarded by others as gauche or inappropriate. Could a computer agent measure this by the inference of affective states from fellow players (Picard 1997)? Clumsily, this could be done by direct feedback from the players to the agent. Another way in which humans learn social cues is by observing social interactions amongst others; could a creative agent build a model of such interactions within the group of improvisers and then use this model to develop its own interactions? Or, more simply, could it analyse whether the material it performs is taken up by fellow players and use that as a proxy for the appropriateness of certain kinds of interaction? Finally, how could such a system be evaluated? Evaluation of "individualist" creative systems is typically done via the impression that the system makes on the audience. However, a collaborative system could also be evaluated by the fellow players, and such players might use different criteria from those used by (or, indeed, perceptible by) an audience. For example, a creative improvisation system might be rated highly by fellow (human) improvisers if it provides a contextually-sensitive way of provoking the other improvisers to produce more engaging music.
Dimension 2: Tradition-Classicism vs. Romantic Individualism
Another aspect of the notion of the "romantic hero" is the contrast between the artist working in isolation as an individual genius and the artist working in a tradition. This is somewhat different to the above discussion: this section considers the contrast between the artist who pursues their own "individual voice" and one who is concerned with developing out from, and contributing to the development of, an ongoing tradition. By contrast, consider a classicist view which sees originality happening via the gradual development of a tradition, constantly underpinned by ideals of balance and proportion. The idea of romantic originality was a significant shift in the concept of artistic development within the European mainstream tradition, contrasting with earlier traditions in which artistic creation was about skillful execution within a style, including reuse and redevelopment of material.
The romantic artist creates work that reflects their own, individual, often troubled, engagement with the world (Butler 1981). The relationship of CC to this dimension is complex. The idea of creativity expressed by Boden's model of transformational creativity reflects the classicist notion of a gradually developing tradition; the space is, after all, transformed, not rejected. But perhaps too much weight shouldn't be placed on this—after all, all artistic work builds to some extent on previous work, and even the most individualist romantic hero uses tools developed by previous generations. Indeed, individual initiative eventually tips over into eccentricity; witness the reviews collected by Slonimsky (2000), or the reception of "outsider art" (Rhodes 2000). In areas where there is no ground truth, radical novelty is difficult to evaluate. A seemingly incomprehensible piece could be madness or genius. In Boden's terms, if a transformation consisted of taking one space and replacing it by another, how would works in this new space be evaluated? For a more sober transformation, evaluation can start with our existing ideas of evaluation and push them a little; in a completely new space there is no corresponding grounding. The view along this axis therefore gives a contrast to the previous one. Typical CC systems represent individual creativity rather than social creativity; but they represent individual creativity within a tradition rather than that of the radical outsider.
Dimension 3: Old vs. New Media
Another distinction that can be made in artistic creativity is between so-called old media and new media. Defining new media is complex and contestable. Manovich presents an initial, naive understanding of new media as those cultural objects that essentially involve the use "of a computer for distribution and exhibition" (Manovich 2001). He goes on to describe ways in which digital technology has influenced the process of creating cultural works. One example is where speedup facilitates a difference in kind rather than just a difference in degree: real-time rendering of 3D scenes, for instance, makes interactive games possible as well as improving the production process of traditional animation. Furthermore, computer technologies provoke artists into exploring new creative areas: for example, the notion of transcoding (Manovich 2001), i.e. the ready exchange of data between different media formats, makes us think about creating works in which different media streams are created from the same source material, whether in a supportive or disruptive manner.
Computational Creativity: Old or New Media?
Are the products of CC systems old media or new media? Perhaps counterintuitively, the vast majority of CC systems are in an old media tradition. Whilst the computer is essential in CC, the role it plays is that of creator; in new media, the computer is essential in the work as it is presented. The work on computational creation of stories (Gervás 2009), poetry (Manurung et al. 2000) and jokes (Ritchie 2009) is clearly in this vein: the aim of the vast majority of such systems is to create a work that is presented as words-on-the-page; whether these words are presented on paper or on the computer screen (or read out loud) is not part of their essence. There are a few examples of literary creativity systems that are clearly within the new media tradition: for example, nn by Montfort (2007) is an example of the generation of interactive fiction. In the case of CC systems for music, the landscape is more mixed.
Systems such as Voyager produce sequences of notes to be performed by a synthesis system (sounding like a traditional instrument) or by a mechanical instrument. However, there are a number of examples that demonstrate how a creative music computing system could work in a new media fashion. For example, Magnus's evolutionary musique concrète experiments (Magnus 2006) and Sanfilippo's LIES system discussed above show how creative computer systems can create electronic music that is concerned with sound manipulation rather than the production of notes. CC works in a real new media tradition are rare. There is little CC work producing internet art (Greene 2004) or multimedia works. There are few works that use computationally-based means of organising material such as transcoding, or dynamic creation of work from a database (Manovich 2001), or whose aesthetic is a computational one such as the database aesthetics discussed by Vesna (2007). One example is Dance Evolution by Dubbin & Stanley (2010), which uses a computer game engine as the basis for a system whereby characters learn to dance in time to music. The characters are stock video-game images of soldiers; it is not clear whether this was simply a use of the resources that were readily available within the engine, or whether the choice was deliberate. Regardless, this unusual choice of avatar provides a provocative image, reminiscent of various artists' attempts to subvert video-game culture by, for example, the performance of street theatre within MMORPG environments (Greene 2004).
The Creative Networked Computer
Despite CC arising during the "Internet age", the typical CC system produces creative output in a fixed medium (e.g. words, pixels on a screen, MIDI notes) that ignores the networked context of the computer. CC systems that draw on allusions to the world beyond are rare. Most storytelling systems (Gervás 2009) produce stories about a fixed set of ideas and characters. Most of the CC systems for visual arts work in an abstract medium. Where they do provide external reference, this has typically been provided by the system designer directly. One way is that the designer builds into the system some understanding of the external world: for example, whilst the details of how people are drawn in Cohen's AARON system (Cohen 1995; Boden 1990) are created by the system, the basic idea of a person-shape is part of the system design. The second way is that the designer might place processes within the system that generate allusions to the world beyond, for example in ecologically inspired systems such as those discussed by Bown & McCormack (2010), where the interactions between components of the system are inspired by the kinds of interactions found in natural ecological systems. This could well suggest aspects of the natural world to viewers of the system, even if the presentation is very abstracted; but the decision to make this allusion is that of the system designer. There are exceptions. Krzeczkowska et al. (2010) present a system where the initial source material is drawn from current news stories, and keywords extracted are used in web searches for pictures, which are used as the source material for the creation of a collage using Colton's Painting Fool system (Colton 2012).
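To make the shape of such a networked pipeline concrete, here is a minimal Python sketch of a news-driven source-gathering stage in the spirit of the Krzeczkowska et al. system: headlines in, keywords out, keywords mapped to candidate images. This is an illustration, not that system's code: the feed URL and the image-search function are hypothetical stand-ins, and keyword extraction is reduced to stopword filtering.

    import re
    from collections import Counter
    import feedparser  # parses RSS/Atom news feeds

    STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "with", "is", "at", "as", "by"}

    def headline_keywords(feed_url, top_n=5):
        # Pull the latest headlines and keep the most frequent content words.
        feed = feedparser.parse(feed_url)
        words = []
        for entry in feed.entries:
            words += [w for w in re.findall(r"[a-z]+", entry.title.lower()) if w not in STOPWORDS]
        return [word for word, _ in Counter(words).most_common(top_n)]

    def search_images(keyword):
        # Placeholder: a real system would call a web image-search API here.
        return ["https://example.org/images/%s.jpg" % keyword]

    def collage_sources(feed_url):
        # Map the day's news onto candidate source images for a collage.
        return {kw: search_images(kw) for kw in headline_keywords(feed_url)}

The point of the sketch is where the allusion comes from: what the collage is "about" is decided by whatever is in the news feed at run time, not by the system designer.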
An important part of much art (particularly visual and conceptual art) is connotation: the depth of the work comes from ideas that are suggested, triggering connections that remain beneath conscious awareness, or revealed via "ah-ha" moments where the link or allusion is suddenly revealed (the idea of CC systems framing information as part of their creativity (Colton et al. 2011) captures some of this).
Computer as Creator; Computer as Medium
CC researchers might eschew working in new media because of the potential for confusion between two different roles for the computer. For a CC system in a computer-based medium, the computer is playing two roles: that of creator, and that of the medium in which the work is realised. In theory, there is no reason why such a system should not be successful. However, in terms of evaluation, the creative computer working in new media is harder to assess. There is a question of the role that the "creative" computer played, versus the role played by the broader computational context. McCormack (2005) has argued that to be taken seriously, CC systems need to create works that are not just examples of "technical fetish", but that are accepted in their own right. This may be why CC systems have avoided new media works: many new media works play with the idea of their technological groundedness in a self-referential and essential way, and CC system creators avoid building systems that work in such media to avoid accusations of "mere" technical obsession. But awareness of origins need not reflect an over-obsession with a trivial part of the work; indeed, the lack of any sense of biographical depth is one of the shallowing aspects of many CC-produced works. One new direction would be to build CC systems that explore and celebrate what computers are good at. One inspiration for this comes from John Cage's account of his development:
After I had been studying with him for two years, Schoenberg said, "In order to write music, you must have a feeling for harmony." I explained to him that I had no feeling for harmony. He then said that I would always encounter an obstacle, that it would be as though I came to a wall through which I could not pass. I said, "In that case I will devote my life to beating my head against that wall." (John Cage, Lecture on Indeterminacy (Cage 1973))
A CC system could adopt a similar attitude: to acknowledge that computers might not be capable of "passing off" certain aspects of human creativity (e.g. creating fluent natural language text), but that certain new forms of creativity are facilitated that draw on the specific capabilities of computers and the computational infrastructure. Let us consider two recent works. The first is Kessels' 2011 piece Photography In Abundance, where a day's uploads to Flickr were printed out and placed in a gallery—hundreds of thousands of photos. Such a work could not have been achieved without computational infrastructure; yet, is this computational creativity? Presumably not, as the creative decision about what to do with the computational capability of Flickr was made by the human artist. Yet the computational infrastructure seems to have played a stronger role here than that of medium—the contribution of the computer to this work is greater than the contribution of paint to a painting. Perhaps we need to get used to the idea of artists working in a richer medium, which "pushes back" in terms of creative contribution much more than a conventional medium.
The second example is my own 2011 piece Blank: nine printed panels containing image search hits for "blank". Most of the images are of empty objects: map outlines, blank signs, empty music paper. One image in the seventh panel consists of a photo of a number of gun cartridges—referencing "blanks" as cartridges without bullets. Had this work been produced, unaided, by a human artist, people might ascribe creative depth to it. Having set up the expectation of empty, neutral images, there is this single flash of violence in the middle of one panel, causing us to reinterpret the meaning of the emptiness. Of course, there was no intention to create this as such; it is "mere" serendipity. Yet it is deeper than the readerly (Barthes 1970) interpretation of a purely aleatoric work; the images were "chosen" by a process that has a huge amount of infrastructure behind it. Is this any shallower than some purely human-produced work which has "come to mind" as the end result of the memory structures produced by the artist's life experiences?
2012_9 !2012 Whence is Creativity? Bipin Indurkhya (bipin@agh.edu.pl) Department of Computer Science, AGH University of Science and Technology, Cracow, Poland; Cognitive Science Lab, International Institute of Information Technology, Hyderabad, India
Abstract
We start with a critical examination of the traditional view of creativity in which the creator is the major player. We analyze many different examples to point out that the origin of all different creativity scenarios is rooted in the viewer-artifact interaction. To recognize this explicitly, we propose an alternative formulation of creativity by putting the viewer in the driver's seat. We examine some implications of this formulation, especially for the role of computers in creativity, and argue that it captures the essence of creativity more accurately.
Introduction: Traditional View of Creativity
In a typical creativity scenario, there is a creator, a product and an audience. The creator creates the product, and the audience appreciates it. The creativity is almost always imputed to the creator. One could talk about a creative process, but that is again associated with the creator. The audience plays a role, and has been dubbed the field by Csikszentmihalyi (1996), but somewhat indirectly; and even then, one usually makes a distinction between a popular artist and a creative artist, which are not synonymous. This framework is questioned in this paper by analyzing a number of creativity scenarios and tracing the root factor that allows us to dub them creative. It is not the first time that these issues are being raised, for they are well known in the literature. But it may be the first time they are all brought together to suggest that perhaps there is something fundamentally wrong with this view of creativity. We then propose an alternative formalism for creativity by looking at it from the audience's point of view. Though this may seem almost heretical at first sight, we argue that it provides a more accurate framework to address various issues surrounding creativity, including the role of computers therein.
Analysis of Some Creativity Scenarios
We present here several creativity scenarios and analyze them to identify the root cause as to why they are labeled creative.
Case of Creative Individuals
If we try to think of creative people, who comes to mind? Perhaps Einstein, Mozart, Michelangelo or Leonardo da Vinci. In modern times, we might think of Steve Jobs.
But what do we mean when we say that they are creative? Perhaps music came naturally to Mozart. In a letter to his father on Nov. 8, 1777, he wrote: "I cannot write in verse, for I am no poet. I cannot arrange the parts of speech with such art as to produce effects of light and shade, for I am no painter. Even by signs and gestures I cannot express my thoughts and feelings, for I am no dancer. But I can do so by means of sounds, for I am a musician." However, what makes his work great is the way people have responded to his music over more than two centuries. (See also Kozbelt 2005; and Painter 2002.) Steve Jobs sprouted and nurtured many creative ideas, but again it is how people responded to the artifacts based on his ideas, like the Mac, iPod, and iPhone, that is the key factor in his having become an icon of technological innovation and creativity. And one could easily dispute whether he was creative when it came to his clothes. Einstein's brain was preserved after his death so that people could study it for clues about the biological basis for creativity. But here also it is the impact of his theory of relativity, and its eventual acceptance by the scientific community, that was a key factor in his becoming an icon of the scientific creativity of the twentieth century. Moreover, Einstein was also dogmatic at times, perhaps the most famous case being his rejection of Alexander Friedmann's expanding universe hypothesis (Singh 2004). Needless to say, we are not trying to argue that these people were not creative, but merely to point out that they were creative some of the time, and in some areas; and, more importantly for the discussion here, that we determine when they were creative based on how the audience responded to their ideas or artifacts.
Creativity in Mentally Different Individuals
Take the case of Stephen Wiltshire, discussed in Sacks (1995). He has an amazing ability to draw a landscape from memory after seeing it only once. Though he is diagnosed with autism, his work is highly regarded both by critics and the general population. He was awarded Member of the Order of the British Empire for services to art in 2006. So he is no doubt a very creative person, no matter which criterion one chooses to apply. But let us think about it a minute. What do we mean by saying that he is creative? His work has a certain style, a level of detail that most people cannot reach, aesthetic appeal, and all that. As with Mozart, we can go further and say that perhaps this is the way he expresses himself naturally: just like you and I might describe what we did on our last summer vacation, he draws fantastic landscapes. The landscapes are fantastic to us, his audience, and that is the crucial factor in his being recognized as a creative genius. We can now throw in here examples of people with schizophrenia or brain damage, savants or manic-depressive people, and so on (Sawyer 2006). When these people produce work that is considered creative, it is exclusively the evaluation of the audience that is the key factor in this judgment. For many of them, this is their mode of being, and it could not have been otherwise. Often the intention is missing as well. (See also Abraham et al. 2007; Glicksohn 2011.)
Cultural Creativity
In many cultures, art is practiced as a group activity. For example, Maduro (1976) provides a study of a Mewari painting community in a village in the Rajasthan province of India.
It describes a strictly hierarchical group, with each artist belonging to one of three classes: laborers, master craftsmen, or creative artists. They mostly copy existing forms and patterns, with rarely an innovation, at least as innovation is understood in Western art. Sawyer (2006) discusses this and many other examples to argue that for such scenarios one needs to take into account cultural context to evaluate creativity, and that novelty may be neither necessary nor sufficient. However, the audience response is still a key factor. Notice that the culture itself can be an audience.
Computer and Creativity
We now consider the scenario at the other extreme, when technically there is no creator and the intent is missing. Even though the last two or three decades have seen steady progress in the development of computer systems that produce artifacts in the domains of visual art (Cohen 1981; McCorduck 1991), music (Chordia & Rae 2010; López et al. 2010; Monteith et al. 2010), literature (Kurzweil 2001; Pérez y Pérez et al. 2010), and so on, generally they have received a negative press as regards their creativity: computers cannot have emotions, programs do not have intents, creativity cannot be algorithmic, etc. (Boden 2009; Sawyer 2006). In fact, such views blatantly expose the implicit assumptions underlying creativity: namely, that it crucially needs a creator with emotions, intentions, and such. However, we have just seen a number of scenarios above involving humans where the intentions and emotions are missing and, even when they are there, what determines whether something is creative or not is the audience response. So why should we not apply the same yardstick to computer-generated artifacts?
Creativity in Viewer-Artifact Interaction
In all the scenarios above, we have seen that it is the audience response that determines whether an action (of the creator or the group), a process, or a product is creative. So suppose we drop the pretense, stop being apologetic about it, and embrace this view formally: We define creativity as the process by which a cognitive agent acquires a novel perspective that is useful (or meaningful) to it in some way by interacting with an object or a situation. There are two aspects of this definition that need to be emphasized here. One is that we are taking completely the audience's perspective, so the creator is not even mentioned. Needless to say, this is not the first time that such a position has been articulated. Barthes (1977) concluded: "We know that to restore to writing its future, we must reverse its myth: the birth of the reader must be ransomed by the death of the Author," and he traced this view to even earlier scholars. Moreover, even when one does not take this extreme position, most accounts of creativity do acknowledge the role of the audience (Cropley et al. 2011; Csikszentmihalyi 1996; Horn & Salvendy 2006; Maher 2010). Our aim here is to explore the implications of this authorless view of creativity, especially for computational systems. Secondly, both the novelty and the usefulness are defined from the agent's personal point of view. So in this way, this definition refers to little-c creativity, or P-creativity (Boden 1990; Kaufman & Beghetto 2009). However, as we will see soon, it can be extended to Big-C or H-creativity.
Implications of the Proposed View
We will now discuss how many intuitive notions associated with creativity can be rooted in the formulation proposed above. We start with the four Ps of creativity (Runco & Kim 2011).
Product: Potential of Artifacts for Creativity
If an agent can interact with an object to get a novel and useful perspective, we can impute creative potential to that object; if many people can get a novel and useful perspective from it, so much the more. It should be emphasized that the perspective that each agent gets need not be the same, nor need it be useful in the same way. So, for instance, different agents may see a work of modern art in various ways, and find it meaningful in different ways, and some may not see anything at all. Though most accounts of creativity incorporate audience response to some extent, the extreme view we are examining here would allow oracle readings of tea leaves and such as creative interactions. (See also Indurkhya 2007.) Just to contrast, in the traditional view, such objects lack a creator and so, by association, lack creativity as well. This is a common argument used to deny creativity to computer systems.
Process: Creativity in Generating Artifacts
Now we can take another step back and consider the process of generating a creative artifact. In other words, we need to consider the process of generating an artifact with which a viewer can interact and get a novel and useful perspective. So the viewer is always hovering in the background, and has a significant impact on whether the process is really generating an ordinary artifact as opposed to a creative artifact. The main implication for modeling the generative aspect of creativity is that we cannot pursue it without considering the audience, and making some assumptions about how they are likely to interact with the generated artifacts.
Person: Creativity of an Individual/Group
We can step back some more and consider the creativity of an individual or a group of individuals. Here we are looking at the ability of the individual to generate artifacts with which viewers can interact and get novel and useful perspectives. So the viewer is again in the background and is playing a critical role. Moreover, even though we speak of this person or that person as being creative, we are really focusing on certain artifacts that they have generated in their career, which have given their audience some novel and useful perspective. The implication of this is that though we can certainly study the personality traits of certain individuals who generated some artifacts during their career that were deemed creative by the audience, it does not follow that those personality traits in a different culture, in a different context and with a different audience will necessarily result in the generation of artifacts that would also be considered creative. This point is highlighted in the essay ‘Late Bloomers' (Gladwell 2009), where early geniuses are contrasted with late bloomers. The relevant point here is that whether a work is accepted by the audience or not does not depend much on whether it was produced early or late in the career, but on the kind of work and the context and the culture in which it was produced.
Press: Context, Culture and H-Creativity
Press refers to the environmental factors that have an influence on the generation of the artifact; but that is taking the traditional perspective, where the focus is on the creator. If we are putting the viewer in the driver's seat, then an analogous set of environmental factors can be identified that determine how a work is received by the viewer, and whether it is successful or not. Let us first consider the artifact's interaction with an individual viewer.
Clearly, the context in which a viewer interacts with the artifact can have a major influence on what perspective is gleaned from it, and whether it is novel or meaningful. The most classic example of this might be Marcel Duchamp's Fountain, which was a urinal turned around. (See also ‘When is Art?' in Goodman 1978.) There is also the effect of the viewer's background knowledge: when one views the Parthenon in Athens, one's knowledge of the history and culture of ancient Greece certainly affects one's perceptions and aesthetic experience. Moving to larger groups and societies, there are many instances when a novel and potentially useful idea was not successful when introduced in one context, but the same idea was a big hit in another context. We mentioned above the example of Alexander Friedmann's expanding universe hypothesis, which was rejected when introduced because of Einstein's influence but was widely heralded later. Wegener's (1966) theory of continental drift suffered a similar fate when it was first introduced in 1915, even though that was no fault of Einstein. There are also cases where a theory, although novel and carefully worked out, never received acceptance: for instance, Velikovsky's theory, which hypothesized Earth's encounters with a large comet expelled from Jupiter and provided explanations for many biblical events (Casti 1989, pp. 7-10). The history of marketing and product development also provides many such examples, which are studied in business schools all over. Gladwell (2009), for instance, recounts Jim Wigon's not-so-successful odyssey to develop creative ketchup flavors, and contrasts this with mustard and spaghetti sauce, for which similar efforts were more readily accepted by consumers. The implication of all this is simply that one needs to study all these contextual factors that make an idea or an artifact novel and meaningful, and thereby eligible for the ‘creativity' label. But this is essentially what is called H-creativity: novelty and usefulness for a culture or society. We should emphasize here that this novelty and usefulness with respect to a culture is not the same as popularity. Certain ideas or artifacts can be popular in a society without being considered novel (by the members of the society themselves), and vice versa.
Who is Creative?
This is one question that is often asked with regard to artificial creativity systems. For example, who is being creative when the computer program AARON generates a painting in which many people see some aesthetic value? Is it the program? Is it the programmer? In many complex computational systems, the programmer cannot see all the consequences of what their system can generate, and can be quite surprised by the artifacts it produces. We would like to argue that this question is meaningless in the framework of creativity we are proposing here. Just to get away from the computational system scenario, consider a painting by a schizophrenic person that many people find interesting and insightful. Now obviously, the schizophrenic person is the creator of the painting. But is she or he creative? How can we determine this? Perhaps it is a natural way for them to express themselves; they may not see what all the fuss is about the artifact they created. In other words, if we zoom out and look at the larger picture, they are creating artifacts that many people find insightful, and so in that sense we can ascribe them creativity as explained above.
But if we zoom in on the processes by which they generate these artifacts, where is the creativity? They may not be trying to generate some aesthetically pleasing object, and may not even be aware of any audience. The point is that there is nothing distinctive about the generation process itself that would let us label it as creative.
Creativity and Computational Systems
Once we acknowledge that it is meaningless to ask who is being creative, the stigma surrounding the potential creativity of computational systems recedes. The goal becomes simply to create artifacts that give the audience some novel and meaningful perspectives. This seemingly small shift of focus has far-reaching consequences, and our society is already moving towards it. To start with, modeling the audience, their cultural tastes and preferences, their cognitive processes that influence their response to novel stimuli, and so on, becomes very crucial. In the last 10-15 years, research in neuroscience has revealed that at least some of our aesthetic values are hardwired in the structure of the brain (Ramachandran & Hirstein 1999; Zeki 2000). Added to that, machine learning techniques can learn about the cultural preferences of an audience from past data. For instance, Ni et al. (2011) trained their program with the official UK top-40 singles chart over the past 50 years to learn what makes a song popular. A program like this might successfully predict, for instance, the winners of future Eurovision competitions. To reiterate a subtle point, creativity is not the same as popularity. So being able to predict whether a song, or a book, or a video will become popular (Szabo & Huberman 2010) is not the same thing as evaluating its creativity. Nonetheless, we expect that similar techniques, perhaps with some adaptations, are likely to yield the key to creativity. Going one step further, once audience-based models of creativity are articulated, we can design, implement and experiment with computational systems that generate artifacts that are more likely to appeal to the audience, both with respect to their novelty and their meaningfulness. One could even argue that computational systems are more ideally suited than humans to explore this space of creative possibilities (Harry 1992; Indurkhya, to appear).
Designing Creativity-Support Systems
A related issue is how to stimulate and enhance creativity in people, if that is possible at all. Indeed, a number of approaches have been proposed and tried out over the years (de Bono 1975; Gordon 1961; Holstein 1970; Rodari 1996; Shapira & Liberman 2009). One key observation that recurs in many of these studies is that trying to associate unrelated objects or situations stimulates creativity (Indurkhya 2010). In our past research, we have explored some approaches to designing creativity-support systems based on this observation (Indurkhya 1997; Indurkhya et al. 2008; Ishii et al. 1998), but much more remains to be done.
Conclusions
Though most of the research on computational creativity has implicitly assumed that the creative value is in the artifact, researchers have been somewhat apologetic about it. For example, Colton (2008) argues that it is not enough to generate an interesting or creative artifact, but that one must also take into account the process by which the artifact was generated. Krzeczkowska et al. (2010) took pains to project some notion of purpose into their painting tool so that it might be perceived as creative.
In this paper we have sought not only to drop this veil of apology, but to move to the other extreme by proposing a formulation of creativity that puts the onus on the viewer, characterizing it as the process of getting a new and meaningful insight about an object or situation. We have argued that this formulation reflects more accurately what actually goes on in the whole creativity cycle. Moreover, in other situations, such as when we deem an artifact or an individual to be creative, we are really implicitly relying on the viewer-artifact interaction to make this judgment. Therefore, the creativity of an agent and that of an artifact are best seen as derived concepts based on our proposed formulation of creativity. We hope that this will stimulate further discussion about the nature of creativity and, more importantly, will generate new approaches to the design and development of computational creativity systems.
2013_1 !2013 Computationally Created Soundscapes with Audio Metaphor Miles Thorogood and Philippe Pasquier School of Interactive Art and Technology Simon Fraser University Surrey, BC V3T0A3 CANADA mthorogo@sfu.ca
Abstract
Soundscape composition is the creative practice of processing and combining sound recordings to evoke auditory associations and memories within a listener. We present Audio Metaphor, a system for creating novel soundscape compositions. Audio Metaphor processes natural language queries derived from Twitter to retrieve semantically linked sound recordings from online user-contributed audio databases. We used simple natural language processing to create audio file search queries, and we segmented and classified audio files based on general soundscape composition categories. We used our prototype implementation of Audio Metaphor in two performances, seeding the system with keywords of current relevance, and found that the system produced a soundscape that reflected Twitter activity and kept audiences engaged for more than an hour.
1 Introduction
Creativity is a preeminent attribute of the human condition that is being actively explored in artificial intelligence systems aiming at endowing machines with creative behaviours. Artificial creative systems have simulated or been inspired by human creative processes, including painting, poetry, and music. The aim of these systems is to produce artifacts that humans would judge as creative. Much of the successful research in musical creative systems has focussed on symbolic representations of music, often with corpora of musical scores. Non-symbolic forms of music, by contrast, have been explored in much less detail. Soundscape composition is a type of non-symbolic music that aims to rouse listeners' memories and associations of soundscapes using sound recordings. A soundscape is the audio environment perceived by a person in a given locale at a given moment. A listener brings a soundscape to mind with higher cognitive functions like template matching of the perceived world with known sound environments, and derives meaning from the triggered associations (Botteldooren et al. 2011). People communicate their subjective appraisal of soundscapes using natural language descriptions, revealing the semiotic cues of soundscape experiences (Dubois and Guastavino 2006). Soundscape composition is the creative practice of processing and combining sound recordings to evoke auditory associations and memories within a listener.
It is positioned along a continuum with concrete music, which uses found sound recordings, and electro-acoustic music, which uses more abstracted types of sounds. Central to soundscape composition is the processing of sound recordings. There is a range of approaches to using sound recordings. One approach is to portray a realistic place and time by using untreated audio recordings, or recordings with only minor editing (such as cross-fades). Another is to evoke imaginary circumstances by applying more intensive processing. In some cases, these manufactured sound environments appear imaginary through the combination of largely untreated with more highly processed sound recordings. For example, the soundscape composition Island, by Canadian composer Barry Truax (Truax 2009), adds a mysterious quality to a recognizable sound environment by contrasting clearly discernible wave sounds against less-recognizable background drone and texture sounds. Soundscape composition requires many decisions about selecting and cutting audio recordings and their artistic combination. These processes become exceedingly time consuming for people when large amounts of audio data are available, as is now the case with online databases. As such, different generative soundscape composition systems have automated many sub-procedures of the composition process, but we have not found any systems in the literature to date that use natural language processing for generative soundscape composition. Likewise, automatic audio segmentation into soundscape-composition-specific categories is an area not yet explored. The system described here searches online for the most recent Twitter posts about a small set of themes. Twitter provides an accessible platform for millions of discussions and shared experiences through short text-based posts (Becker, Naaman, and Gravano 2010). In our research, audio file search queries are generated from natural language queries derived from Twitter. However, these requests could equally be a memory described by a user, a phrase from a book, or a section of a research paper. Audio Metaphor accepts a natural language query (NLQ), which is made into audio file search queries by our algorithm. The system searches online for audio files semantically related to word features in the NLQ. The resulting audio file recommendations are classified and segmented based upon the soundscape categories background, foreground, and background with foreground. A composition engine autonomously processes and combines segmented audio files. The title of Audio Metaphor refers to the idea that the audio representations the system generates for NL queries may not have literal associations. In some cases, an object referenced in the NL query may have a direct referential sound, such as with "raining outside", which results in a type of audio analogy. However, an example that is not as direct, such as "A brooding thought struck me down", has no such direct referent to an object in the world. In this latter case, Audio Metaphor would create a composition by processing sound recordings that have some semantic relationship with words in the NL query. For example, the sound of a storm and the percussive striking of an object are the types of sounds that would be processed in this case. Margaret A. Boden actively proposes types of creativity that could be synthesized by computational means (Boden 1998). She states that combinatorial-type creativity "involves novel (improbable) combinations of familiar ideas ...
wherein newly associated ideas share some inherent conceptual structure." The artificial creative system here uses semantic inference driven by NLQs as a way to frame the soundscape composition and make use of semantic structures inherent in crowdsourced systems. Further to this, the system associates words with sound recordings and combines them into novel representations of texts. For this reason, the system is considered to exhibit combinatorial creative behaviour. Our contribution is a creative and autonomous soundscape composition system with a novel method of generating compositions from natural language input and crowd-sourced sound recordings. Furthermore, we present a method of audio file segmentation based on soundscape categories, and a soundscape composition engine that contrasts sound recording segments with different levels of processing. We outline our research in the design of an autonomous soundscape composition system called Audio Metaphor. In the next section, we present related work in the domains of soundscape studies and generative soundscape composition. We go on to describe the system architecture, including natural language processing, classification and segmentation, and the soundscape composition engine. The system is then discussed in terms of a number of performances and presentations. We conclude with our ideas for future work.
2 Related Work
Birchfield, Mattar, and Sundaram (2005) describe a system that uses an adaptive user model for context-aware soundscape composition. In their work, the system has a small set of hand-selected and hand-labelled audio recordings that were autonomously mixed together with minimal processing. Similarly, Eigenfeldt and Pasquier (2011) employ a set of hand-selected and hand-labelled environmental sound recordings for the retrieval of sounds from a database by autonomous software agents. In their work, agents analyze audio when selecting sounds to mix based on low-level audio features. In both cases, listening to and searching through audio files in order to select them is very time consuming.
Figure 1: Audio Metaphor system architecture overview.
A different approach to selecting and labelling sound recordings is to take advantage of collaborative tagging of online user-contributed collections of sound recordings. This is a crowdsourcing process where a body of tags is produced collaboratively by human users connecting terms to documents (Halpin, Robu, and Shepherd 2007). In online environments, collaborative tags are part of a shared language made manifest by users (Marlow et al. 2006). Online audio repositories such as pdSounds (Mobius 2009) and Freesound (Akkermans et al. 2011) demonstrate collaborative tagging systems applied to sound recordings. A system that uses collaborative tags to retrieve sound recordings is described by Janer, Roma, and Kersten (2011). In their work, a user defines a soundscape composition by entering locations on a map that has sound tags associated with various locations. As the user navigates the map, a soundscape is produced. In related research, the locations on a map are used as a composition environment (Finney and Janer 2010). Their compositions use hand-selected sounds, which are placed in close and far proximity based upon semantic identifiers derived from tags.
3 System Architecture
Audio Metaphor creates unique soundscape compositions that represent the words in an NLQ using a series of processes as follows:
• receive an NLQ from a user, or from Twitter;
• transform the NLQ into audio file search queries;
• search online for audio file recommendations;
• segment audio files into soundscape regions;
• process and combine audio segments into a soundscape composition.
In the Audio Metaphor system, these processes are handled sequentially, as shown in Figure 1. (A modular approach was taken for the system design; accordingly, the system is flexible enough to be used for separate objectives, including making audio file recommendations to a user from an NLQ and deriving a corpus of audio segments.)
Table 1: All sub-lists generated from the word-feature list for the query "On a rainy autumn day in Vancouver".
rainy autumn day vancouver
rainy autumn day
autumn day vancouver
rainy autumn
autumn day
day vancouver
rainy
autumn
day
vancouver
3.1 Audio File Retrieval Using Natural Language Processing
The audio file recommendation module creates audio file search queries given a natural language request and a maximum number of audio file recommendations for each search. The Twitter web API (Twitter API) is used to retrieve the 10 most recent posts related to a theme to find current associations. The longest of these posts is then used as a natural language query. To generate audio file search queries, a list of word features is extracted from the input text, and a queue of all unique sublists is generated. These sublists are used as search queries, starting with the longest first. The aim of the algorithm is to minimize the number of audio files returned while still representing all the word features in the list. When a search query returns a positive result, all remaining queries that contain any of the successful word features are removed from the queue. To extract the word features from the natural language query, we use essentially the same method as that proposed by Thorogood, Pasquier, and Eigenfeldt (2012), but with some modifications. The algorithm first removes common words listed in the Oxford English Dictionary Corpus, leaving only nouns, verbs, and adjectives. Words are kept in order and treated as a list. For example, for the natural language query "The angry dog bit the crying man," the word-feature list "angry dog bit crying man" is more valid than "angry man bit crying dog." The algorithm for generating audio file queries essentially extracts all the sublists from the NLQ that have a length greater than or equal to 1. For example, a simple request such as "On a rainy autumn day in Vancouver" is first processed to extract the word-feature list: rainy, autumn, day, vancouver. After that, sub-lists are generated as shown in Table 1. Audio Metaphor accesses the Freesound audio repository for audio files with the Freesound API. Freesound is an online collaborative database with over 120,000 audio clips. The indexed data includes user-entered descriptions and tags. The content of an audio file is inferred from user-contributed commentary and social tags. Although there is no explicit user rating of audio files, a download counter for each file provides a measure of its popularity, and search results are presented by descending popularity count. The sublists are used to search online for semantically related audio files using an exclusive keyword search. Sublists are used in the order created, from largest to smallest.
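To make the query-generation step concrete, here is a minimal Python sketch of the sublist queue and the pruning rule just described. It is an illustration under stated assumptions, not the system's actual code: common-word removal is reduced to a stopword set, and search is a stand-in for the Freesound keyword search.

    import re

    def word_features(nlq, common_words):
        # Keep only content words from the query, in their original order.
        return [w for w in re.findall(r"[a-z]+", nlq.lower()) if w not in common_words]

    def sublists(features):
        # All contiguous sub-lists, longest first (cf. Table 1).
        n = len(features)
        return [features[i:i + k] for k in range(n, 0, -1) for i in range(n - k + 1)]

    def build_recommendations(nlq, common_words, search):
        # Run queries longest-first; once a query succeeds, drop any pending
        # query that shares a word feature with it.
        queue = sublists(word_features(nlq, common_words))
        results = []
        while queue:
            query = queue.pop(0)
            hits = search(query)  # stand-in for the Freesound keyword search
            if hits:
                results.append((query, hits))
                queue = [q for q in queue if not set(q) & set(query)]
        return results

With a common-word set containing "on", "a" and "in", the query "On a rainy autumn day in Vancouver" yields exactly the ten sub-lists of Table 1, and a successful four-word search empties the queue in one step, which is the most favourable case described below.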
A search is considered successful when it returns one or more recommendations. Additionally, the algorithm optimizes audio file recommendations by ignoring future sublists that contain word features from a previously successful search. The most favourable result is a recommendation for the longest sub-list, with the worst case being no recommendations. In practice, the worst case is typically a recommendation for each singleton word feature. For each query, the URLs of the recommendations are logged in a separate list. The list is constrained to a number specified at system startup. Furthermore, if a list has fewer than the number of files requested, it is considered sparsely populated and no further modification is made to its items. For example, if the maximum number of recommendations specified for each query is five, and there are two queries where one returns nine recommendations and the other three, the longer list will be constrained to five, and the empty items of the second list are ignored. The separate lists of audio file recommendations are then presented to the audio segmentation module.
3.2 Audio File Classification and Segmentation
Audio segmentation is an essential preprocessing step in many audio applications (Foote 2000). In soundscape composition, a composer will choose background and foreground sound regions to combine into new soundscapes. Background and foreground sounds are general categories that refer to a signal's perceptual class. Background sounds seem to come from farther away than foreground sounds, or occur often enough to belong to the aggregate of all sounds that make up the background texture of a soundscape. This is synonymous with a ubiquitous sound (Augoyard and Torgue 2006): a sound that is diffuse, omnidirectional, constant, and prone to sound absorption and reflection factors that have an overall effect on its quality. Urban drones and the purring of machines are two examples of ubiquitous or background sound. Conversely, foreground sounds are typically heard standing out clearly against the background. At any moment in a sound recording, there may be either background sound, foreground sound, or a combination of both. Segmenting an audio file is a process of listening to the recording for salient features and cutting regions for later use. To automate this process, we have designed an algorithm to classify segments of an audio file and concatenate neighbouring segments with the same label. An established technique for classification of an audio recording is to use a supervised machine learning algorithm trained with examples of classified recordings.
3.3 Audio Features Used for Segmentation
The classifier models the generic soundscape categories background, foreground, and background with foreground. We use a vector of the low-level audio features total loudness and the first three mel-frequency cepstral coefficients (MFCC). These features reflect the behaviour of the human auditory system, which is an important aspect of soundscape studies. They are extracted at frame level from an audio signal with a window of 23 ms and a step size of 11.5 ms using the Yaafe audio feature extraction software package (Mathieu et al. 2010). MFCC audio features represent the spectral characteristics of a sound by a small number of coefficients calculated from the logarithm of the magnitude of a triangular filter bank.
We use an implementation of MFCC that builds a logarithmically spaced filter bank according to 40 coefficients mapped along the perceptual Mel scale by

M(f) = 1127 log(1 + f/700),   (1)

where f is the frequency in Hz. Total loudness is the characteristic of a sound associated with the sensation of intensity. The human auditory system affects the perception of intensity at different frequencies. One model of loudness (Zwicker 1961) takes into account the disparity of loudness at different frequencies along the Bark scale, which corresponds to the first 24 critical bands of hearing. Bands near human speech frequencies have a lower threshold than those of low and high frequencies. The conversion from a frequency f in Hz to the equivalent frequency B in the Bark scale is calculated with the following formula (Traunmuller 1990):

B(f) = 13 arctan(0.00076 f) + 3.5 arctan((f/7500)^2).   (2)

A specific loudness is the loudness calculated at each Bark band; the total loudness is the sum of the individual specific loudnesses over all bands. Because a soundscape is perceived by a human not at the sample level but over longer time periods, we use a so-called bag-of-frames approach (Aucouturier and Defreville 2007) to account for longer signal durations. Essentially, this kind of approach treats a signal as a collection of frames with possibly different values, and the density distribution of those frames provides a more effective representation than any single frame. Statistical summaries, such as the mean and standard deviation of features, recapitulate the texture of an audio signal and provide a more effective representation than a single frame. In our research, audio segments are represented with an eight-dimensional feature vector of the means and standard deviations of the total loudness and the first 3 MFCCs. The mean and standard deviation of the feature vector model the background, foreground, and background with foreground soundscape categories well. For example, sounds distant from the listener and considered background sound will typically have a smaller mean total loudness. Sounds that occur often enough will have a smaller standard deviation than those heard as foreground. The MFCCs take into account the spectrum of the sound as affected by its source placement in the environment.
3.4 Supervised Classifier Used for Segmentation
We used a Support Vector Machine (SVM) classifier to classify audio segments. SVMs have been used in environmental sound classification problems and have consistently demonstrated good classification accuracy. An SVM is a non-probabilistic classifier that learns optimal separating hyperplanes in a higher-dimensional space from the input. Typically, classification problems present non-linearly separable data that can be mapped to a higher-dimensional space with a kernel function. We use the C-support vector classification (C-SVC) algorithm described by Chang and Lin (2011). This algorithm uses a radial basis function as a kernel, which is suited to a vector with a small number of features and can capture non-linear relations between class labels and attributes.
Training Corpus
The classifier was trained using feature vectors from a pre-labelled corpus of audio segments. The training corpus consists of 30 segments between 2 and 7 seconds long. Audio segments were labelled by a consensus vote of human subjects in an audio segment classification study. The study was conducted online through a web browser.
Audio was played to participants using an HTML5 audio player object. This player allowed participants to listen to a segment repeatedly. Depending on the browser software, the audio format of segments was either MP3 at 196 kbps, or Vorbis at an equivalent bit rate. Participants selected a category from a set of radio buttons, and each selection was confirmed when the participant pressed a button to listen to the next segment. There were 15 unique participants in the study group, from Canada and the United States. Before the study started, an example for each of the categories background, foreground, and background with foreground was played, and a short description of the categories was displayed. Participants were asked to use headphones or audio monitors to listen to segments. Each participant was asked to listen to the randomly ordered soundscape corpus. On completing the study, the participant's classification results were uploaded into a database for analysis. The results of the study were used to label the recordings by a majority vote; Figure 2 shows the results of the vote. There are a total of 10 recordings for each of the categories. A quantitative analysis of the voter results shows the average agreement of recordings for each category as follows: background 84.6% (SD=18.6%); foreground 77.0% (SD=10.4%); and background with foreground 76.2% (SD=13.4%). The overall agreement was shown to be 79.3% (SD=4.6%).
Classifier Evaluation
We evaluated the classifier, using the training corpus, with a 10-fold cross validation. The results summary is shown in Table 2. The classifier achieved an overall sample accuracy of 80%, which shows that the classifier was human-competitive against the overall human agreement statistic of 79.3%. The kappa statistic is a chance-corrected measure showing the accuracy of prediction among each k-fold model. A kappa score of 0 means the classifier is performing only as well as chance; 1 implies perfect agreement; and a kappa score of .7 is generally considered satisfactory. The kappa score of .7 in the results shows that good classification accuracy was achieved using the described method.
Figure 2: Audio classification vote results from human participants for 30 sound recordings with three categories: Background, Foreground, and Background with Foreground sound.
Table 2: Summary of the SVM classifier with the mean and standard deviation of the features total loudness and 3 MFCCs.
Correctly classified instances: 24 (80%)
Incorrectly classified instances: 6 (20%)
Kappa statistic: 0.7
These performance measures are reflected by the confusion matrix in Table 3. All 10 of the audio segments labelled "background" from the study were classified correctly. The remaining audio segments, labelled "foreground" and "background with foreground," were correctly classified 7 out of 10 times, with the highest level of confusion between these latter categories.
3.5 Background-Foreground Segmentation
In our segmentation method, we use a 500 ms sliding analysis window with a hop size of 250 ms. We found that for our application an analysis window of this length provided reasonable information for the bag-of-frames approach and ran with satisfactory computation time. The resulting feature vector is classified and labelled as belonging to one of the three categories.
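As an illustration of the bag-of-frames windowing and the merging of same-label windows described here, the following Python sketch assumes that per-frame features (total loudness plus the first three MFCCs, at the 11.5 ms frame step of Sect. 3.3) have already been extracted; the feature extractor and the trained classifier are taken as given, and scikit-learn's SVC with an RBF kernel stands in for the C-SVC setup of Sect. 3.4.

    import numpy as np
    from sklearn.svm import SVC

    FRAME_HOP_S = 0.0115      # 11.5 ms frame step from the feature extraction
    WIN_S, HOP_S = 0.5, 0.25  # 500 ms analysis window, 250 ms hop

    def hz_to_mel(f):   # Eq. (1)
        return 1127.0 * np.log(1.0 + f / 700.0)

    def hz_to_bark(f):  # Eq. (2)
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def window_vector(frames):
        # Bag-of-frames summary: mean and std of each per-frame feature
        # (8 dimensions for the 4 features used here).
        return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

    def segment(frame_feats, clf):
        # Classify each sliding window, then merge neighbouring windows
        # that received the same label into a single region.
        win, hop = int(WIN_S / FRAME_HOP_S), int(HOP_S / FRAME_HOP_S)
        regions = []
        for start in range(0, len(frame_feats) - win + 1, hop):
            label = clf.predict([window_vector(frame_feats[start:start + win])])[0]
            t0 = start * FRAME_HOP_S
            if regions and regions[-1][2] == label:
                regions[-1] = (regions[-1][0], t0 + WIN_S, label)  # extend region
            else:
                regions.append((t0, t0 + WIN_S, label))
        return regions

    # Training on the labelled corpus: X holds one bag-of-frames vector per
    # segment, y holds labels in {"Bg", "Fg", "BgFg"}.
    # clf = SVC(kernel="rbf").fit(X, y)

The Mel and Bark conversions are included only to show equations (1) and (2) in executable form; in the actual system the frame-level features come from Yaafe.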
In order to create labelled regions of more than one window, neighbouring windows with the same label are concatenated, and the start and end times of the new window are logged. To demonstrate the segmentation algorithm, we used a 9-second audio file containing a linear combination of background, foreground, and background with foreground regions. Figure 3 shows the ground truth (solid black line) and the algorithm's segmentation of the audio file, with background, foreground, and background with foreground labelled regions applied. We use the SuperCollider3 software package for visualizing the segmented waveform. This example shows concatenated segments labelled as regions. One of the background with foreground segments was misclassified, resulting in a slightly longer foreground region than in the ground truth classification. The audio files and the accompanying segmentation data are then presented to the composition module.

Table 3: Confusion matrix of the SVM classifier for the categories background (Bg), foreground (Fg), and background with foreground (BgFg).

      Bg  Fg  BgFg
Bg    10   0     0
Fg     0   7     3
BgFg   1   2     7

Figure 3: Segmentation of the audio file with ground-truth regions (black line) and segmented regions Background (dark-grey), Foreground (mid-grey), and Background with Foreground (light-grey).

3.6 Composition

The composition module creates a layered two-channel soundscape composition by processing and combining classified audio segments. Each layer in the composition consists of processed background, foreground, and background with foreground sound recordings. Moreover, an agent-based model is used in conjunction with a heuristic in order to handle different sound recordings and mimic the decisions of a human composer. Specifically, we based this heuristic on the production notes for the soundscape composition Island, by Canadian composer Barry Truax. In these production notes, Truax gives detailed information on how sound recordings are treated with effects, and on the temporal arrangement of sounds. In our modelling of these processes, we chose to use the first page of the production notes, which corresponds to around 2 minutes of the composition. Furthermore, we framed the model to comply with the protocol of the segmentation labels and with aesthetic evaluations by the authors. A summary of the model is as follows:

• Regions labelled background are played sequentially in the order presented by the segmentation. They are processed to form a dramatic textured background. This processing is carried out by first playing the region at 10% of its original speed and applying a stereo time-domain granular pitch shifter with ratios 1:0.5 (down an octave) and 1:0.667 (down a fifth). We added a Freeverb reverb (Smith 2010) with a room size of 0.25 to give the texture a more spacious quality. A low-pass filter with a cutoff frequency of 800 Hz is used to obscure any persistent high-end detail. Finally, a slow spatialization is applied in the stereo field at a rate of 0.1 Hz.

• Regions labelled foreground are chosen from the foreground pool by a roll of the dice. They are played individually, separated by a period proportional to the duration of the current region: t = 0.75d + d + C, where t is the time before the next region is played, d is the duration of the current region, and C is a constant controlling the minimum duration between regions.
In order to separate them from the background texture, foreground regions are processed by applying a band-pass filter with a resonant frequency of 2,000 Hz and a high Q value of 0.5. Finally, a moderate spatialization is applied in the stereo field at a rate of 0.125 Hz.

• Regions labelled background with foreground are slowly faded in and out to give the soundscape a mysterious quality. They are chosen from the pool of regions by a roll of the dice and are played for an arbitrarily chosen duration of between 10 and 20 seconds. Regions shorter than the chosen duration are looped. In order to achieve a separation from the background texture and foreground sounds, regions are processed by applying a band-pass filter with a resonant frequency of 8,000 Hz and a high Q value of 0.1. The addition of a Freeverb reverb with a room size of 0.125 and a relatively fast spatialization at a rate of 1 Hz further adds to the mysterious quality of the sound.

This composition model is deployed individually by each of the agents of the system, each of which is responsible for processing a different audio file. An agent's decisions are choosing labelled regions of an audio recording, and processing and combining them into a layered soundscape composition according to the composition model. Because of the potentially large number of audio files available to the system, and in order to limit the acoustic density of a composition, a maximum number of agents is specified on system start-up. If there are more audio file results than there are agents to handle them, the extra results are ignored. Equally, if the number of results is smaller than the number of agents, agents without tasks remain temporarily idle. An agent uses the region labels of the audio file to decide which region to process. An audio file may have a number of labelled regions; if there is no region of a given type, then that type is ignored. The agent can play one region of each type simultaneously.

4 Qualitative Results

Audio Metaphor has been used in performance environments. In one case, the system was seeded with the words "nature," "landscape," and "environment." There were roughly 150 people in the audience. They were told that the system was responding to live Twitter posts and shown the console output of the search results. During the performance, there was an earthquake off the coast of British Columbia, Canada, and the current Twitter posts focused on news of the earthquake. Audio Metaphor used these as natural language requests, searched online for sound recordings related to earthquakes, and created a soundscape composition. The sound recordings processed by the system included an earthquake warning announcement, the sound of alarms, and a background texture of heavy destruction. The audience reacted by checking to see if this event was indeed real. This illustrated how the semantic space of the soundscape composition effectively maps to the concepts of a natural language request.

In a separate performance, Audio Metaphor was presented to a small group of artists and academics. This took place during the height of the 2012 conflict in Syria, and the system was seeded with the words "Syria," "Egypt," and "conflict." The soundscape composition presented segments of spoken word, traditional instruments, and other sounds. The audience listened to the composition for over an hour without losing its engagement with the listening experience. One comment was, "It was really good, and we didn't get bored."
The sounds held people's attention because they were linked to current events, and the processing of sound recordings added to the interest of the composition. Because the composition model deployed in Audio Metaphor is based on a relatively short section of one composition, there was not a great deal of variation in the processing of sound recordings. The fact that people were engaged for such long periods of time suggests that other factors contributed to the novel stimulus. Our nascent hypothesis is that the dynamic audio signal of the recordings, in addition to the processing of the audio files, contributed to listeners' ongoing engagement.2

5 Conclusions and Future Work

We describe a soundscape composition engine that chooses audio segments using natural language queries, segments and classifies the resulting files, processes them, and combines them into a soundscape composition at interactive speeds. This implementation uses current Twitter posts as natural language queries to generate search queries, and retrieves audio files that are semantically linked to those queries from the Freesound audio repository. The ability of Audio Metaphor to respond to current events was shown to be a strong point in audience engagement. The presence of signifier sounds evoked listeners' associations with concepts, and listener engagement was further reinforced through the artistic processing and combination of sound recordings.

2 Sound examples of Audio Metaphor using the composition engine can be found at http://www.audiometaphor.ca/aume

Audio Metaphor can be used to help sound artists and autonomous systems retrieve and cut sound field recordings from online audio repositories, although its primary function, as we have demonstrated, is autonomous machine-generated soundscape composition for performance environments and installations. In the future, we will evaluate people's responses to these compositions by distributing them to user-contributed music repositories and analyzing user comments. These comments can then be used to inform the Audio Metaphor soundscape composition engine. Although the system generates engaging and novel soundscape compositions, the composition structure is tightly regulated by the handling of background and foreground segments. In future work, we aim to equip our system with the ability to evaluate its audio output, in order to make more in-depth composition decisions. By developing these methods, Audio Metaphor will not only be capable of processing audio files to create novel compositions, but will additionally be able to respond to the compositions it has made.

6 Acknowledgments

This research was funded by a grant from the Natural Sciences and Engineering Research Council of Canada. The authors would also like to thank Barry Truax for his composition and production documentation.

2013_10 !2013 Considering Vertical and Horizontal Context in Corpus-based Generative Electronic Dance Music Arne Eigenfeldt School for the Contemporary Arts Simon Fraser University Vancouver, BC Canada Philippe Pasquier School of Interactive Arts and Technology Simon Fraser University Surrey, BC Canada

Abstract We present GESMI (Generative Electronica Statistical Modeling Instrument) - a computationally creative music generation system that produces Electronic Dance Music through statistical modeling of a corpus. We discuss how the model requires complex interrelationships between simple patterns, relationships that span both time (horizontal) and concurrency (vertical).
Specifically, we present how context-specific drum patterns are generated, and how auxiliary percussion parts, basslines, and drum breaks are generated in relation to both the generated material and the corpus. Generated audio from the system has been accepted for performance in an EDM festival.

Introduction

Music consists of complex relationships between its constituent elements. For example, a myriad of implicit and explicit rules exist for the construction of successive pitches - the rules of melody (Lerdahl and Jackendoff 1983). Furthermore, as music is time-based, composers must take into account how the music unfolds: how ideas are introduced, developed, and later restated. This is the concept of musical form - the structure of music in time. As these relationships are concerned with a single voice, and are thus monophonic, we can consider them to be horizontal.1 Similarly, relationships between multiple voices need to be assessed. As with melody, explicit production rules exist for concurrent relationships - harmony - as well as for the relationships between melodic motives: polyphony. We can consider these relationships to be vertical (see Figure 1).

1 The question of whether melody is considered a horizontal or vertical relationship is relative to how the data is presented: in traditional music notation, it would be horizontal; in sequencer (list) notation, it would be vertical. For the purposes of this paper, we will assume traditional musical notation.

Music has had a long history of applying generative methods to composition, due in large part to the explicit rules involved in its production. A standard early reference is the Musikalisches Würfelspiel of 1792, often attributed to Mozart, in which pre-composed musical sections were assembled by the user based upon rolls of the dice (Chuang 1995); however, the "Canonic" compositions of the late 15th century are even earlier examples of procedural composition. In these works, a single voice was written out, and singers were instructed to derive their own parts from it by rule: for example, singing the same melody delayed by a set number of pulses, or at inversion (Randel 2003).

Figure 1. Relationships within three musical phrases, a, a1, b: melodic (horizontal) between pitches within a; formal (horizontal) between a and a1; polyphonic (vertical) between a and b.

Exploring generative methods with computers began with some of the first applications of computers in the arts. Hiller's Illiac Suite of 1956, created using the Illiac computer at the University of Illinois at Urbana-Champaign, utilized Markov chains for the generation of melodic sequences (Hiller and Isaacson 1979). In the next forty years, a wide variety of approaches were investigated - see (Papadopoulos and Wiggins 1999) for a good overview of early uses of computers within algorithmic composition. However, as the authors suggest, "most of these systems deal with algorithmic composition as a problem solving task rather than a creative and meaningful process". Since that time, this separation has continued: with a few exceptions (Cope 1992, Waschka 2007, Eigenfeldt and Pasquier 2012), contemporary algorithmic systems that employ AI methods remain experimental, rather than generating complete and successful musical compositions. The same cannot be said about live generative music, sometimes called interactive computer music due to its reliance upon composer or performer input during performance.
In these systems (Chadabe 1984, Rowe 1993, Lewis 1999), the emphasis is less upon computational experimentation and more upon musical results. However, many musical decisions - notably formal control and polyphonic relationships - essentially remain in the hands of the composer during performance.

Joel Chadabe was the first to interact with musical automata. In 1971, he designed a complex analog system that allowed him to compose and perform Ideas of Movement at Bolton Landing (Chadabe 1984). This was the first instance of what he called interactive composing, "a mutually influential relationship between performer and instrument." In 1977, Chadabe began to perform with a digital synthesizer/small computer system: in Solo, the first work he finished using this system, the computer generated up to eight simultaneous melodic constructions, which he guided in realtime. Chadabe suggested that Solo implied an intimate jazz group; as such, all voices aligned to a harmonic structure generated by the system (Chadabe 1980). Although the complexity of interaction increased between the earlier analog and the later digital work, the conception/aesthetic between Ideas of Movement at Bolton Landing and Solo did not change in any significant way. While later composers of interactive systems increased the complexity of interactions, Chadabe's conceptions demonstrate common characteristics of interactive systems:

1. Melodic constructions (horizontal relationships) are not difficult to codify, and can easily be "handed off" to the system;
2. harmonic constructions (vertical relationships) can be easily controlled by aligning voices to a harmonic grid, producing acceptable results;
3. complex relationships between voices (polyphony), as well as larger formal structures of variation and repetition, are left to the composer/performer in realtime.

These limitations are discussed in more detail in Eigenfeldt (2007). GESMI (Generative Electronica Statistical Modeling Instrument) is an attempt to blend autonomous generative systems with the musical criteria of interactive systems. Informed by methods of AI in generating horizontal relationships (i.e. Markov chains), we apply these methods in order to generate vertical relationships, as well as high-level horizontal relationships (i.e. form), so as to create entire compositions, yet without the human intervention of interactive systems.

The Generative Electronica Research Project (GERP) is an attempt by our research group - a combination of scientists involved in artificial intelligence, cognitive science, and machine-learning, as well as creative artists - to generate stylistically valid EDM using human-informed machine-learning. We have employed experts to hand-transcribe 100 tracks in four genres: Breaks, House, Dubstep, and Drum and Bass. Aspects of transcription include musical details (drum patterns, percussion parts, bass lines, melodic parts), timbral descriptions (i.e. "low synth kick, mid acoustic snare, tight noise closed hihat"), signal processing (i.e. the use of delay, reverb, compression and its alteration over time), and descriptions of overall musical form. This information is then compiled in a database and analysed to produce data for generative purposes. More detailed information on the corpus is provided in (Eigenfeldt and Pasquier 2011).

Applying generative procedures to electronic dance music is not novel; in fact, it seems to be one of the most frequent projects undertaken by nascent generative musician/programmers.
EDM's repetitive nature, explicit forms, and clearly delimited style suggest a parameterized approach. Our goal is both scientific and artistic: can we produce complete musical pieces that are modeled on a corpus, and indistinguishable from that corpus' style? While minimizing human/artistic intervention, can we extract formal procedures from the corpus and use this data to generate all compositional aspects of the music, so that a perspicacious listener of the genre will find it acceptable? We have already undertaken empirical validation studies of other styles of generative music (Eigenfeldt et al. 2012), and now turn to EDM.

It is, however, the artistic purpose that dominates our motivation around GESMI. As the authors are also composers, we are not merely interested in creating test examples that validate methods. Instead, the goals remain artistic: can we generate EDM tracks and produce a full-evening event that is artistically satisfying, yet entertaining for the participants? We feel that we have been successful, even at the current stage of research, as output from the system has been selected for inclusion in an EDM concert2 as well as a generative art festival3.

2 http://www.metacreation.net/mumewe2013/
3 http://xcoax.org/

Related Work

Our research employs several avenues that combine the work of various other researchers. We use Markov models to generate horizontal continuations, albeit with contextual constraints placed upon the queries. These constraints are learned from the corpus, which thus involves machine-learning. Lastly, we use a specific corpus of expert-transcribed EDM in order to generate style-specific music.

Markov models offer a simple and efficient method of deriving correct short sequences based upon a specific corpus (Pachet et al. 2011), since they are essentially quoting portions of the corpus itself. Furthermore, since the models are unaware of any rules themselves, they can be quickly adapted to essentially "change styles" by switching the corpus. However, as Ames points out (Ames 1989), while simple Markov models can reproduce the surface features of a corpus, they are poor at handling higher-level musical structures. Pachet points out several limitations of Markov-based generation, and notes how composers have used heuristic measures to overcome them (Pachet et al. 2011). Pachet's research aims to allow constraints upon selection, while maintaining the statistical distribution of the initial Markov model. We are less interested in maintaining this distribution, as we attempt to explore more unusual continuations for the sake of variety and surprise.

Using machine-learning for style modeling has been researched previously (Dubnov et al. 2003); however, their goals were more general, in that composition was only one of many possible outcomes suggested by their initial work. Their examples utilized various monophonic corpora, ranging from "early Renaissance and baroque music to hard-bop jazz", and their experiments were limited to interpolating between styles rather than creating new, artistically satisfying music. Nick Collins has used music information retrieval (MIR) for style comparison and influence tracking (Collins 2010). The concept of style extraction for reasons other than artistic creation has been researched more recently by Tom Collins (Collins 2011), who tentatively suggested that, given the state of current research, it may be possible to successfully generate compositions within a style, given an existing database.
Although the use of AI within the creation of EDM has been, so far, mainly limited to drum pattern generation (for example, Kaliakatsos-Papakostas et al. 2013), the use of machine-learning within the field has been explored: see (Diakopoulos 2009) for a good overview. Nick Collins has extensively explored various methods of modeling EDM styles, including 1980s synth-pop, UK Garage, and Jungle (Collins 2001, 2008). Our research is unique in that we are attempting to generate full EDM compositions using completely autonomous methods informed by AI.

Description

We have approached the generation of EDM as a producer of the genres would: from both a top-down (i.e. form and structure) and a bottom-up (i.e. drum patterns) perspective at the same time. While a detailed description of our formal generation is not possible here (see Eigenfeldt and Pasquier 2013 for a detailed description of our evolutionary methods for form generation), we can mention that an overall form is evolved based upon the corpus, which determines the number of individual patterns required in all sixteen instrumental parts, as well as their specific relationships in time. It is therefore known how many different patterns are required for each part, which parts occur simultaneously - and thus require vertical dependencies - and which parts occur consecutively - and thus require horizontal dependencies. The order of generation is as follows:

1. Form - the score, determining which instruments are active for specific phrases, and their pattern numbers;
2. Drum patterns - also called beats4 (kick, snare, closed hihat, open hihat);
3. Auxiliary percussion - (ghost kick/snare, cymbals, tambourine, claps, shakers, percussive noises, etc.) generation is based upon the concurrent drum patterns;
4. Bassline(s) - onsets are based upon the concurrent drum pattern; pitches are derived from associated data;
5. Synth and other melodic parts - onsets are based upon the bassline; pitches are derived from associated data. All pitch data is then corrected according to an analysis of the implied harmony of the bassline (not discussed here);
6. Drum breaks - when instruments stop (usually immediately prior to a phrase change) and a pattern variation (i.e. drum fill) occurs;
7. One hits - individual notes and/or sounds that offer colour and foreground change, and that are not part of an instrument's pattern (not discussed here).

Drum Pattern Generation

Three different methods are used to generate drum patterns:

1. Zero-order Markov generation of individual subparts (kick, snare, closed hihat, and open hihat);
2. first-order Markov generation of individual subparts;
3. first-order Markov generation of combined subparts.

In the first case, probabilities for onsets on a given beat subdivision (i.e. sixteen subdivisions per four-beat measure) are calculated for each subpart based upon the selected corpus (see Figure 2). As with all data derived from the corpus, the specific context is retained. Thus, if a new drum pattern is required, and it first appears in the main verse (section C), only data derived from that section is used in the generation.

Figure 2. Onset probabilities for individual subparts, one measure (sixteenth-note subdivisions), main verse (C section), "Breaks" corpus.

In the second case, data is stored as subdivisions of the quarter note, as simple on/off flags (i.e. 1 0 1 0) for each subpart, and separate subparts are calculated independently.4

4 The term "beat" has two distinct meanings.
In traditional music, beat refers to the basic unit of time - the pulse of the music - and thus the number of subdivisions in a measure; within EDM, beat also refers to the combined rhythmic pattern created by the individual subparts of the drums (kick drum, snare drum, hi-hat), as well as any percussion patterns.

Continuations5 are considered across eight-measure phrases, rather than being limited to specific patterns: for example, the contents of an eight-measure pattern are considered as thirty-two individual continuations, while the contents of a one-measure pattern that repeats eight times are considered as four individual continuations with eight instances, because they are heard eight separate times. As such, the inherent repetition contained within the music is captured in the Markov table.

In the third case, data is stored as in the second method just described; however, each subpart is considered 1 bit in a 4-bit nibble for each subdivision, encoding the four subparts together: bit 1 = open hihat; bit 2 = closed hihat; bit 3 = snare; bit 4 = kick. This method ensures that polyphonic relationships between parts - vertical relationships - are encoded, as well as time-based - horizontal - relationships (see Figure 3).

Figure 3. Representing the 4 drum subparts (of two beats) as a 4-bit nibble (each column of the four upper rows), translated to decimal (lower row), for each sixteenth-note subdivision.

These values are stored as 4-item vectors representing a single beat. It should be noted that EDM rarely, if ever, ventures outside of sixteenth-note subdivisions, and this representation is appropriate for our entire corpus. The four vectors are stored, and later accessed, contextually: separate Markov tables are kept for each of the four beats of a measure, and for separate sections. Thus, all vectors that occur on the second beat are considered queries to continuations for the onsets that occur on the third beat; similarly, these same vectors are continuations for onsets that occur on the first beat. The continuations are stored over eight-measure phrases, so the first beat of the second measure is a continuation for the fourth beat of the first measure. We have not found it necessary to move beyond first-order Markov generation, since our data involves four-item vectors representing four onsets.

We found that the third method produced the most accurate re-creations of drum patterns found in the corpus, yet the first method produced the most surprising, while maintaining usability.5 Rather than selecting only a single method for drum pattern generation, it was decided that the three separate methods provided distinct "flavors", allowing users several degrees of separation from the original corpus. Therefore, all three methods were used in the generation of a large (>2000) database of potential patterns, from which actual patterns are contextually selected. See (Eigenfeldt and Pasquier 2013) for a complete description of our use of populations and the selection of patterns from these populations.

5 The generative music community uses the term "continuations" to refer to what are usually called transitions (weighted edges in the graph).

Auxiliary Percussion Generation

Auxiliary percussion consists of non-pitched rhythmic material not contained within the drum pattern. Within our corpus, we have extracted two separate auxiliary percussion parts, each with up to four subparts.
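To make this shared representation concrete, the following is a minimal sketch (in Python; all names are ours, not the paper's, and we take bit 1 as the most significant bit) of encoding four subparts into per-subdivision nibbles and collecting the per-beat first-order transition tables:

```python
from collections import defaultdict

# Encode one measure (16 sixteenth-note subdivisions) of four drum
# subparts into 4-bit values: bit 1 = open hihat, bit 2 = closed hihat,
# bit 3 = snare, bit 4 = kick (as in Figure 3).
def encode_measure(open_hh, closed_hh, snare, kick):
    return [(open_hh[i] << 3) | (closed_hh[i] << 2) | (snare[i] << 1) | kick[i]
            for i in range(16)]

# Group the 16 values into four 4-item vectors, one per beat.
def beats(nibbles):
    return [tuple(nibbles[b * 4:(b + 1) * 4]) for b in range(4)]

# One transition table per beat position: the vector on beat n is the
# query; the vector on beat n+1 (wrapping into the next measure of the
# phrase) is the continuation. Repeats add duplicate entries, which
# preserves the weighting of repetition in the corpus.
tables = [defaultdict(list) for _ in range(4)]

def add_phrase(measures):   # measures: list of 16-element nibble lists
    vecs = [b for m in measures for b in beats(m)]
    for i in range(len(vecs) - 1):
        tables[i % 4][vecs[i]].append(vecs[i + 1])
```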
The relationship of these parts to the drum pattern is intrinsic to the rhythmic drive of the music; however, there is no clear or consistent musical relationship between these parts, and thus no heuristic method is available for their generation. We have chosen to generate these parts through first-order Markov chains, using the same contextual beat-specific encoding just described; as such, logical horizontal relationships found in the corpus are maintained. Using the same 4-bit representation for each auxiliary percussion part as described in method 3 for drum pattern generation, vertical consistency is also imparted; however, the original relationship to the drum pattern is lost. Therefore, we constrain the available continuations.

Figure 4. Maintaining contextual vertical and horizontal relationships between auxiliary percussion beats (a) and drum beats (b).

As the drum patterns are generated prior to the auxiliary percussion, the individual beats from these drum patterns serve as the query to a cross-referenced transition table made up of auxiliary percussion pattern beats (see Figure 4). Given a one-measure drum pattern consisting of four beats b1 b2 b3 b4, all auxiliary percussion beats that occur simultaneously with b1 in the corpus are considered as available concurrent beats for the auxiliary percussion pattern's initial beat. One of these, a1, is selected as the first beat, using a weighted probability selection. The available continuations for a1 are a2-a6. Because the next auxiliary percussion beat must occur at the same time as the drum pattern's b2, the auxiliary percussion beats that occur concurrently with b2 are retrieved: a2, a3, a5, a7, a9. Of these, only a2, a3, and a5 intersect both sets; as such, the available continuations for a1 are constrained, and the next auxiliary percussion beat is selected from a2, a3, and a5. Of note is the fact that any selection from the constrained set will be horizontally correct due to the transition table, as well as vertically consistent in its relationship to the drum pattern due to the constraints; however, since the selection is made randomly from the probabilistic distribution of continuations, the final generated auxiliary percussion pattern will not necessarily be a pattern found in the corpus. Lastly, we have not experienced a shortage of continuations, since we are working with individual beats rather than entire measures: while there are only a limited number of four-element combinations that can serve as queries, a high number of 1-beat continuations exist.

Bassline Generation

Human analysis determined that there were up to two different basslines in the analysed tracks, not including bass drones, which are considered a synthesizer part. Bassline generation is a two-step process: determining onsets (which include held notes longer than the smallest quantized value of a sixteenth note), then overlaying pitches onto these onsets.

Figure 5. Overlaying pitch-classes onto onsets, with continuations constrained by the number of pitches required in the beat.

Bassline onset generation uses the same method as that of the auxiliary percussion - contextually dependent Markov sequences, using the existing drum patterns as references. One Markov transition table encoded from the corpus' basslines contains rhythmic information: onsets (1), rests (.), and held notes (-). The second transition table contains only pitch data: pitch-classes relative to the track's key (-24 to +24).
Like the auxiliary percussion transition tables, both the queries and the continuations are limited to a single beat. Once a bassline onset pattern is generated, it is broken down beat by beat, with the number of onsets occurring within a given beat serving as the first constraint on pitch selection (see Figure 5). Our analysis derived 68 possible 1-beat pitch combinations within the "Breaks" corpus. In Figure 5, an initial beat contains 2 onsets (1 - 1 .). Within the transition table, 38 queries contain two values (not grayed out in Figure 5's vertical column): one of these is selected as the pitches for the first beat using a weighted probability selection (circled). As the next beat contains 2 onsets (1 1 . .), the first beat's pitches (0 -2) serve as the query to the transition table, and the returned continuations are constrained by matching the number of pitches required (not grayed out in Figure 5's horizontal row). One of these is selected for the second beat (circled) using additional constraints described in the next section. This process continues, with pitch-classes being substituted for onset flags (bottom).

Additional Bassline Constraints

Additional constraints are placed upon the bassline generation, based upon user-set "targets". These include the following:

- density: favouring fewer or greater onsets per beat;
- straightness: favouring onsets on the beat versus syncopated onsets;
- dryness: favouring held notes versus rests;
- jaggedness: favouring greater or lesser differentiation between consecutive pitch-classes.

Each available continuation is rated against the user-set targets using a Euclidean distance function, and an exponential random selection is made from the top 20% of these ranked continuations. This notion of targets appears throughout the system. While such a method does allow some control over the generation, the main benefit will be demonstrated in the next stage of our research: successive generations of entire compositions - generating hour-long sets of tracks, for example - can be guaranteed to be divergent by ensuring that targets for parameters differ between runs.

Contextual Drum-fills

Fills, also known as drum-fills, drum-breaks, or simply breaks, occur at the end of eight-measure phrases as variations of the overall repetitive pattern, and serve to highlight the end of the phrase and the upcoming section change. Found in most popular music, they are often restricted to the drums, but can involve other instruments (such as auxiliary percussion), as well as a break, or silence, in the other parts.

Fills are an intuitive aspect of composition in pattern-based music, and can be conceptually reduced to a rhythmic variation. As such, they are not difficult to code algorithmically: for example, following seven repetitions of a one-measure drum pattern, a random shuffle of the pattern will produce a perfectly acceptable fill for the eighth measure (see Figure 6).

Figure 6. Left: drum pattern for kick, snare, and hihat; right: a pattern variation created by shuffling onsets can serve as a fill.

Rather than utilizing such creative "shortcuts", our fill generation is based entirely upon the corpus. First, the location of the fill is statistically generated, based upon the location of fills within phrases in the corpus and the generated phrase structure.
Secondly, the type of fill is statistically generated based upon the analysis: for example, the described pattern variation using a simple onset shuffle has a 0.48 probability of occurring within the Breaks corpus - easily the most common fill type. Lastly, the actual variation is based upon the specific context.

Figure 7. Fill generation, based upon contextual similarity.

Fills always replace an existing pattern; however, the actual pattern to be replaced within the generated drum part may not be present in the corpus, and thus no direct link would be evident from a fill corpus. As such, the original pattern is analysed for various features, including density (the number of onsets) and syncopation (the percentage of onsets that are not on strong beats). These values are then used to search the corpus for patterns with similar features, and one pattern is selected from those that most closely match the query. The relationship between the database's pattern and its fill is then analysed for consistency (how many onsets remain constant), density change (how many onsets are added or removed), and syncopation change (the percentage change in the number of onsets that are not on strong beats). This data is then used to generate a variation on the initial pattern (see Figure 7). The resulting fill will display a relationship to its original pattern that is contextually similar to the relationships found in the corpus.

Conclusions and Future Work

The musical success of EDM lies in the interrelationship of its parts, rather than in the complexity of any individual part. In order to successfully generate a complete musical work that is representative of the model, rather than generating only components of the model (i.e. a single drum pattern), we have taken into account both horizontal relationships between elements, in our use of a Markov model, and vertical relationships, in our use of constraint-based algorithms. Three different methods to model these horizontal and vertical dependencies at generation time have been proposed, in regard to drum pattern generation (through the use of a combined representation of kick, snare, open and closed hihat, as well as context-dependent Markov selection), auxiliary percussion generation (through the use of constrained Markov transitions), and bassline generation (through the use of both onset- and pitch-constrained Markov transitions). Each of these decisions contributes to what we believe to be a more successful generation of a complete work that is stylistically representative and consistent.

Future work includes validation to investigate our research objectively. We have submitted our work to EDM festivals and events that specialize in algorithmic dance music, and our generated tracks have been selected for presentation at two festivals so far. We also plan to produce our own dance event, in which generated EDM will be presented alongside the original corpus, and to use various methods of polling the audience to determine the success of the music. Lastly, we plan to continue research in areas not discussed in this paper, specifically autonomous timbral selection and signal processing, both of which are integral to the success of EDM. This research was created in MaxMSP and Max4Live running in Ableton Live. Example generations can be heard at soundcloud.com/loadbang.

Acknowledgements

This research was funded by a grant from the Canada Council for the Arts, and the Natural Sciences and Engineering Research Council of Canada.
2013_11 !2013 Harmonising Melodies: Why Do We Add the Bass Line First? Raymond Whorley and Christophe Rhodes Department of Computing Goldsmiths, University of London New Cross, London, SE14 6NW, UK {r.whorley, c.rhodes}@gold.ac.uk Geraint Wiggins and Marcus Pearce School of Electronic Engineering and Computer Science Queen Mary, University of London Mile End Road, London, E1 4NS, UK {geraint.wiggins, marcus.pearce}@eecs.qmul.ac.uk

Abstract We are taking an information-theoretic approach to the question of the best way to harmonise melodies. Is it best to add the bass first, as has traditionally been the case? We describe software which uses statistical machine learning techniques to learn how to harmonise from a corpus of existing music. The software is able to perform the harmonisation task in various different ways. A performance comparison using the information-theoretic measure cross-entropy shows that the bass-first approach does indeed appear to be best. We then use this overall strategy to investigate the performance of specialist models for the prediction of different musical attributes (such as pitch and note length), compared with single models which predict all attributes. We find that the use of specialist models affords a definite performance advantage. Final comparisons with a simpler model show that each has its pros and cons. Some harmonisations are presented which have been generated by some of the better performing models.

Introduction

In our ongoing research, we are developing computational models of four-part harmony such that alto, tenor and bass parts are added to a given soprano part in a stylistically suitable way. In this paper we compare different strategies for carrying out this creative task. In textbooks on four-part harmony, students are often encouraged to harmonise a melody in stages. In particular, it is usual for the bass line to be added first, with harmonic symbols such as Vb (dominant, first inversion) written underneath. The harmony is then completed by filling in the inner (alto and tenor) parts. This paper sets out to show what information theory has to say about the best way to approach harmonisation. Is adding the bass line first optimal, or is there a better approach? In order to investigate questions such as this, we have written software based on multiple viewpoint systems (Conklin and Witten 1995) which enables the computer to learn for itself how to harmonise by building a statistical model from a corpus of existing music. The multiple viewpoint framework allows different attributes of music to be modelled. The predictions of these individual models are then combined to give an overall prediction. The multiple viewpoint systems are selected automatically, on the basis of minimising the information-theoretic measure cross-entropy. We have developed and implemented three increasingly complex versions of the framework, which allow models to be constructed in different ways. The first two versions are particularly pertinent to the aims of this paper, since they facilitate precisely the comparisons we wish to make without the time complexity drawbacks of the more complex version 3. The latter is therefore not utilised in this part of our research. The fact that the resulting models are statistical (and indeed self-learned from a corpus) means that harmonies are generated in a non-deterministic way.
The harmonies are more or less probable, rather than right or wrong, and there is an astronomical number of ways for a melody to be harmonised from the probability distributions. Of course, there is little point in producing something novel if it is also deemed to be bad. Our aim is to hone the models in such a way that the subjective quality and style of the generated harmony is consistently similar to that of the corpus, whilst retaining almost infinite variety. In this way, the computational models can be thought of as creative in much the same way as a human composer (or at the very least as imitating such creativity). Finding a good overall strategy for carrying out the harmonisation task is an important part of this improvement process.

Multiple Viewpoint Systems

There follows a brief description of some essential elements of multiple viewpoint systems. In order to keep things simple, we describe them from the point of view of melodic modelling (except for the subsection entitled Cross-entropy and Evaluation).

Types of Viewpoint

Basic viewpoints are the fundamental musical attributes that are predicted, such as Pitch and Duration. The domain (or alphabet) of Pitch is the set of MIDI values of notes seen in the melodies comprising the corpus. A semibreve (or whole note) is divided into 96 Duration units; therefore the domain of Duration is the set of integer values representing note lengths seen in the corpus. Derived viewpoints such as Interval (sequential pitch interval) and DurRatio (sequential duration ratio) are derived from, and can therefore predict, basic types (in this case Pitch and Duration respectively). A B4
An escape method (in this research, method C) determines prediction probabilities, which are generally high for predictions appearing in high-order models, and vice versa. If necessary, a probability distribution is completed by backing off to a uniform distribution. Combining Viewpoint Models A multiple viewpoint system comprises more than one viewpoint; indeed, usually many more. The prediction probability distributions of the individual viewpoint models must be combined. The first step is to convert these distributions into distributions over the domain of whichever basic type is being predicted at the time. A weighted arithmetic or geometric (Pearce, Conklin, and Wiggins 2004) combination technique is then employed to create a single distribution. A run-time parameter called a bias affects the weighting. See Conklin (1990) for more information. Long-term and Short-term Models Conklin (1990) introduced the idea of using a combination of a long-term model (LTM), which is a general model of a style derived from a corpus, and a short-term model (STM), which is constructed as a piece of music is being predicted or generated. The latter aims to capture musical structure peculiar to that piece. Currently, the same multiple viewpoint system is used for each. The LTM and STM distributions are combined in the same way as the viewpoint distributions, for which purpose there is a separate bias (L-S bias). Cross-entropy and Evaluation Cross-entropy is used to objectively compare the prediction performance of different models. If we define Pm(Si|Ci,m) as the probability of the i th musical symbol given its context for a particular model m, and assume that there are a total of n sequential symbols, then cross-entropy is given by −(1/n) !n i=1 log2 Pm(Si|Ci,m). Jurafsky and Martin (2000) note that because the cross-entropy of a sequence of symbols (according to some model) is always higher than its true entropy, the most accurate model (i.e., the one closest to the true entropy) must be the one with the lowest crossentropy. In addition, because it is a "per symbol" measure, it is possible to similarly compare generated harmonisations of any length. Harmonisations with a low cross-entropy are likely to be simpler and more predictable to a listener, while those with a high cross-entropy are likely to be more complex, more surprising and in the extreme possibly unpleasant. See Manning and Sch ¨utze (1999) for more details on cross-entropy. Model Construction Cross-entropy is also used to guide the automatic construction of multiple viewpoint systems. Viewpoints are added (and sometimes removed) from a system stage by stage. Each candidate system is used to calculate the average crossentropy of a ten-fold cross-validation of the corpus. The system producing the lowest cross-entropy goes on to the next stage of the selection process. For example, starting with the basic system {Duration, Pitch}, of all the viewpoints tried let us assume that ScaleDegree lowers the crossentropy most on its addition. Our system now becomes {Duration, Pitch, ScaleDegree}. Duration cannot be removed at this stage, as a Duration-predicting viewpoint must be present. Assuming that on removing Pitch the cross-entropy rises, Pitch is also retained. Let us now assume that after a second round of addition we have the system {Duration, Pitch, ScaleDegree, Interval}. Trying all possible deletions, we may now find that the cross-entropy decreases on the removal of Pitch, giving us the system {Duration, ScaleDegree, Interval}. 
The process continues until no addition can be found to lower the cross-entropy by a predetermined minimum amount. When selection is complete, the biases are optimised. Development of Multiple Viewpoints The modelling of melody is relatively straightforward, in that a melody comprises a single sequence of nonoverlapping notes. Such a sequence is ideal for creating N-grams. Harmony is much more complex, however. Not 2013 80 only does it consist (for our purposes) of four interrelated parts, but it usually contains overlapping notes. In other words, music is usually not homophonic; indeed, very few of the major key hymn tune harmonisations (Vaughan Williams 1933) in our corpora are completely homophonic. Some preprocessing of the music is necessary, therefore, to make it amenable to modelling by means of N-grams. We use full expansion on our corpora (corpus ‘A' and corpus ‘B' each contain fifty harmonisations), which splits notes where necessary to achieve a sequence of block chords (i.e., without any overlapping notes). This technique has been used before in relation to viewpoint modelling (Conklin 2002). To model harmony correctly, however, we must know which notes have been split. Basic viewpoint Cont is therefore introduced to distinguish between notes which are freshly sounded and those which are a continuation of the preceding one. Currently, the basic viewpoints (or attributes) are predicted at each point in the sequence in the following order: Duration, Cont and then Pitch. Version 1 The starting point for the definition of the strictest possible application of viewpoints is the formation of vertical viewpoint elements (Conklin 2002). An example of such an element is !69, 64, 61, 57", where all of the values are from the domain of the same viewpoint (i.e., Pitch, as MIDI values), and all of the parts (SATB) are represented. This method reduces the entire set of parallel sequences to a single sequence, thus allowing an unchanged application of the multiple viewpoint framework, including its use of PPM. Only those elements containing the given soprano note are allowed in the prediction probability distribution, however. This is the base-level model, to be developed with the aim of substantially improving performance. Version 2 In this version, it is hypothesised that predicting all unknown symbols in a vertical viewpoint element at the same time is neither necessary nor desirable. It is anticipated that by dividing the overall harmonisation task into a number of subtasks (Allan and Williams 2005; Hild, Feulner, and Menzel 1992), each modelled by its own multiple viewpoint system, an increase in performance can be achieved. Here, a subtask is the prediction or generation of at least one part; for example, given a soprano line, the first subtask might be to predict the entire bass line. This version allows us to experiment with different arrangements of subtasks. As in version 1, vertical viewpoint elements are restricted to using the same viewpoint for each part. The difference is that not all of the parts are now necessarily represented in a vertical viewpoint element. Comparison of Subtask Combinations In this section we carry out the prediction of bass given soprano, alto/tenor given soprano/bass, tenor given soprano, alto/bass given soprano/tenor, alto given soprano, and tenor/bass given soprano/alto (i.e., prediction in two stages), in order to ascertain the best performing combination for subsequent comparisons. 
Prediction in three stages is not considered here because of time limitations. Earlier studies in the realm of melodic modelling revealed that the model which performed best was an LTM updated after every prediction in conjunction with an STM (a BOTH+ model) using weighted geometric distribution combination. Time constraints dictate the assumption that such a model is likely to perform similarly well with respect to the modelling of harmony. In addition, only corpus ‘A', a bias of 2 and an L-S bias of 14 are used for viewpoint selection (as for the best melodic BOTH+ runs using corpus ‘A'). As usual, the biases are optimised after completion of selection. Here, we predict Duration, Cont and Pitch together (i.e., using a single multiple viewpoint system at each prediction stage). We also use the seen Pitch domain at this juncture (i.e., the domain of Pitch vertical viewpoint elements seen in the corpus, as opposed to all possible such elements). It is appropriate at this point to make some general observations about the bar charts presented in this paper. Comparisons are made for a range of ¯h (maximum N-gram order) from 0 to 5. Each value of ¯h may have a different automatically selected multiple viewpoint system. Please note that all bar charts have a cross-entropy range of 2.5 bits/prediction, often not starting at zero. All bars have standard errors associated with them, calculated from the cross-entropies obtained during ten-fold cross-validation (using final multiple viewpoint systems and optimised biases). Figure 1 compares the prediction of alto given soprano, tenor given soprano, and bass given soprano. The first thing to notice is that the error bars overlap. This could be taken to mean that we cannot (or should not) draw conclusions in such cases; however, the degree of overlap and the consistency of the changes across the range of ¯h is highly suggestive of the differences being real. A clinching quantitative argument is reserved until consideration of Figure 3. Prediction of the alto part has the lowest cross-entropy and prediction of the bass has the highest across the board. This is very likely to be due to the relative number of elements in the Pitch domains for the individual parts (i.e., 18, 20 and 23 for alto, tenor and bass respectively). The lowest crossentropies occur at an ¯h of 1 except for the bass, which has its minimum at an ¯h of 2 (this cross-entropy is only very slightly lower than that for an ¯h of 1, however). There is a completely different picture for the final stage of prediction. Figure 2 shows that, having predicted the alto part with a low cross-entropy, the prediction of tenor/bass has the highest. Similarly, the high cross-entropy for the prediction of the bass is complemented by an exceptionally low cross-entropy for the prediction of alto/tenor (notice that the error bars do not overlap with those of the other prediction combinations). Once again, this can be explained by the number of elements in the part domains: the sizes of the cross-product domains are 460, 414 and 360 for tenor/bass, alto/bass and alto/tenor respectively. Although we are not using cross-product domains, it is likely that the seen domains are in similar proportion. The lowest cross-entropies occur at an ¯h of 1. 
Combining the two stages of prediction, we see in Fig 2013 81 0 1 2 3 4 5 A given S T given S B given S Maximum N-gram Order Cross-entropy (bits/prediction) 1.0 1.5 2.0 2.5 3.0 3.5 Figure 1: Bar chart showing how cross-entropy varies with ¯h for the version 2 prediction of alto given soprano, tenor given soprano, and bass given soprano using the seen Pitch domain. Duration, Cont and Pitch are predicted using a single multiple viewpoint system at each prediction stage. ure 3 that predicting bass first and then alto/tenor has the lowest cross-entropy. Notice, however, that the error bars of this model overlap with those of the other models. This is a critical comparison, requiring a high degree of confidence in the conclusions we are drawing. Let us look at the ¯h = 1 and ¯h = 2 comparisons in more detail, as they are particularly pertinent. In both cases, all ten cross-entropies produced by ten-fold cross-validation are lower for B then AT than for A then TB; and nine out of ten are lower for B then AT than for T then AB. The single increase is 0.11 bits/chord for an ¯h of 1 and 0.09 bits/chord for an ¯h of 2 compared with a mean decrease of 0.22 bits/chord for the other nine values in each case. This demonstrates that we can have far greater confidence in the comparisons than the error bars might suggest. A likely reason for this is that there is a range of harmonic complexity across the pieces in the corpus which is reflected as a range of cross-entropies (ultimately due to compositional choices). This inherent cross-entropy variation seems to be greater than the true statistical variation applicable to these comparisons. We can be confident, then, that predicting bass first and then alto/tenor is best, reflecting the usual human approach to harmonisation. The lowest cross-entropy is 4.98 bits/chord, occurring at an ¯h of 1. Although having the same cross-entropy to two decimal places, the very best model combines the bass-predicting model using an ¯h of 2 (optimised bias and L-S bias are 1.9 and 53.2 respectively) with the alto/tenor-predicting model using an ¯h of 1 (optimised bias and L-S bias are 1.3 and 99.6 respectively). Table 1 gives some idea of the complexity of the multiple viewpoint systems involved, listing as it does the first six viewpoints automatically selected for the prediction of bass given soprano (¯h = 2) and alto/tenor given soprano/bass 0 1 2 3 4 5 TB given SA AB given ST AT given SB Maximum N-gram Order Cross-entropy (bits/prediction) 1.0 1.5 2.0 2.5 3.0 3.5 Figure 2: Bar chart showing how cross-entropy varies with ¯h for the version 2 prediction of tenor/bass given soprano/alto, alto/bass given soprano/tenor and alto/tenor given soprano/bass using the seen Pitch domain. Duration, Cont and Pitch are predicted using a single multiple viewpoint system at each prediction stage. (¯h = 1). Many of the primitive viewpoints involved have already been defined or are intuitively obvious. LastInPhrase and FirstInPiece are either true of false, and Piece has three values: first in piece, last in piece or otherwise. Metre is more complicated, being an attempt to define metrical equivalence within and between bars of various time signatures. Notice that only two of the viewpoints are common to both systems. In fact, of the twenty-four viewpoints in the B given S system and twelve in the AT given SB system, only five are common. This demonstrates the degree to which the systems have specialised in order to carry out these rather different tasks. 
Table 1 gives some idea of the complexity of the multiple viewpoint systems involved, listing as it does the first six viewpoints automatically selected for the prediction of bass given soprano (h̄ = 2) and alto/tenor given soprano/bass (h̄ = 1). Many of the primitive viewpoints involved have already been defined or are intuitively obvious. LastInPhrase and FirstInPiece are either true or false, and Piece has three values: first in piece, last in piece or otherwise. Metre is more complicated, being an attempt to define metrical equivalence within and between bars of various time signatures. Notice that only two of the viewpoints are common to both systems. In fact, of the twenty-four viewpoints in the B given S system and twelve in the AT given SB system, only five are common. This demonstrates the degree to which the systems have specialised in order to carry out these rather different tasks.

Table 1: The first six viewpoints automatically selected for the prediction of bass given soprano (B, h̄ = 2) and alto/tenor given soprano/bass (AT, h̄ = 1); a × marks membership of a system.

    Viewpoint                                   B    AT
    Pitch                                       ×
    Interval ⊗ InScale                          ×
    Cont ⊗ TactusPositionInBar                  ×    ×
    Duration ⊗ (ScaleDegree # LastInPhrase)     ×    ×
    Interval ⊗ (ScaleDegree # Tactus)           ×
    ScaleDegree ⊗ Piece                         ×
    Cont ⊗ Interval                                  ×
    DurRatio ⊗ TactusPositionInBar                   ×
    ScaleDegree ⊗ FirstInPiece                       ×
    Cont ⊗ Metre                                     ×

The difference in the size of the systems suggests that the prediction of the bass part is more complicated than that of the inner parts, as reflected in the difference in cross-entropy.
The Effect of Model Order
Figure 1 indicates that, for example, there is only a small reduction in cross-entropy from h̄ = 0 to h̄ = 1. The degree of error bar overlap means that even this small reduction is questionable. Is it possible that there is no real difference in performance between a model using unconditional probabilities and one using the shortest of contexts? Let us, in the first place, examine the individual ten-fold cross-validation cross-entropy values. All ten of these values are lower for an h̄ of 1, giving us confidence that there is indeed a small improvement. Having established that, however, it would be useful to explain why the improvement is perhaps smaller than we might have expected. One important reason for the less than impressive improvement is that although the h̄ = 0 model is nominally unconditional, the viewpoints Interval, DurRatio and Interval ⊗ Tactus appear in the h̄ = 0 multiple viewpoint system (linked with other viewpoints). These three viewpoints make use of attributes of the preceding chord; therefore, with respect to the predicted attributes Duration and Pitch, this model is partially h̄ = 1. This hidden conditionality is certainly enough to substantially improve performance compared with a completely unconditional model. Another reason is quite simply that the corpus has failed to provide sufficient conditional statistics; in other words, the corpus is too small. This is the fundamental reason for the performance dropping off above an h̄ of 1 or 2. We would expect peak performance to shift to higher values of h̄ as the quantity of statistics substantially increases. Supporting evidence for this is provided by our modelling of melody. Much better melodic statistics can be gathered from the same corpus because the Pitch domain is very much smaller than it is for harmony. A BOTH+ model shows a large fall in cross-entropy from h̄ = 0 to h̄ = 1 (with error bars not overlapping), while peak performance occurs at an h̄ of 3.
Figure 3: Bar chart showing how cross-entropy varies with h̄ for the version 2 prediction of alto then tenor/bass, tenor then alto/bass and bass then alto/tenor given soprano using the seen Pitch domain. Duration, Cont and Pitch are predicted using a single multiple viewpoint system at each prediction stage.
Figure 2 reveals an even worse situation with respect to performance differences across the range of h̄. For TB given SA, for example, it is not clear that there is a real improvement from h̄ = 0 to h̄ = 1. In this case, there is a reduction in five of the ten-fold cross-validation cross-entropy values, but an increase in the other five.
This is almost certainly due to the fact that, having fixed the soprano and alto notes, the number of tenor/bass options is severely limited; so much so that conditional probabilities can rarely be found. This situation should also improve with increasing corpus size.
Separate Prediction of Attributes
We now investigate the use of separately selected and optimised multiple viewpoint systems for the prediction of Duration, Cont and Pitch. Firstly, however, let us consider the utility of creating an augmented Pitch domain. Approximately 400 vertical Pitch elements appear in corpus ‘B' which are not present in corpus ‘A', and there are undoubtedly many more perfectly good chords which are absent from both corpora. Such chords are unavailable for use when the models generate harmony, and their absence must surely skew probability distributions when predicting existing data. One solution is to use a full Cartesian product, but this is known to result in excessively long run times. Our preferred solution is to transpose chords seen in the corpus up and down, a semitone at a time, until one of the parts goes out of the range seen in the data; elements not previously seen are added to the augmented Pitch domain (a sketch of this procedure follows below). Derived viewpoints such as ScaleDegree are able to make use of the extra elements. We shall see shortly that this change increases cross-entropies dramatically; but since this is not a like-for-like comparison, it is not an indication of an inferior model.
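As a rough illustration of the transposition procedure just described, the following sketch (with assumed representations: chords and per-part ranges as tuples of MIDI note numbers) transposes each seen chord up and down a semitone at a time until a part leaves the seen range:

```python
# A minimal sketch (assumed details) of building the augmented Pitch domain
# by semitone transposition of chords seen in the corpus.
def augment_pitch_domain(seen_chords, lowest, highest):
    """seen_chords: set of tuples of MIDI note numbers (one per part).
    lowest/highest: per-part pitch ranges seen in the corpus, as tuples."""
    augmented = set(seen_chords)
    for chord in seen_chords:
        for step in (+1, -1):                  # transpose up, then down
            shift = step
            while True:
                candidate = tuple(p + shift for p in chord)
                in_range = all(lowest[i] <= candidate[i] <= highest[i]
                               for i in range(len(candidate)))
                if not in_range:
                    break                      # a part left the seen range
                augmented.add(candidate)       # unseen elements join the domain
                shift += step
    return augmented

chords = {(48, 64, 67, 72), (43, 62, 67, 71)}  # hypothetical chords (B, T, A, S)
print(len(augment_pitch_domain(chords, (36, 48, 55, 60), (60, 69, 74, 81))))
```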
Figure 4 shows that better models can be created by selecting separate multiple viewpoint systems to predict individual attributes, rather than a single system to predict all of them. The difference in cross-entropy is quite marked, although there is a substantial error bar overlap. An h̄ of 1 is optimal in both cases. All ten cross-entropies produced by ten-fold cross-validation are lower for the separate system case, providing confidence that the improvement is real. The lowest cross-entropy for separate prediction at h̄ = 1 is 5.44 bits/chord, compared with 5.62 bits/chord for prediction together. The very best model for separate prediction, with a cross-entropy of 5.35 bits/chord, comprises the best performing systems, whatever the value of h̄.
Figure 4: Bar chart showing how cross-entropy varies with h̄ for the version 2 prediction of bass given soprano followed by alto/tenor given soprano/bass using the augmented Pitch domain. The prediction of Duration, Cont and Pitch separately (i.e., using separately selected multiple viewpoint systems) and together (i.e., using a single multiple viewpoint system) are compared.
Comparison of Version 1 with Version 2
A comparison involving Duration, Cont and Pitch would show that version 2 has a substantially higher cross-entropy than version 1. This is due to the fact that whereas the duration of an entire chord is predicted only once in version 1, it is effectively predicted twice (or even three times) in version 2. Prediction of Duration is set up such that, for example, a minim may be generated in the bass given soprano generation stage, followed by a crotchet in the final generation stage, whereby the whole of the chord becomes a crotchet. This is different from the prediction and generation of Cont and Pitch, where elements generated in the first stage are not subject to change in the second. The way in which the prediction of Duration is treated, then, means that versions 1 and 2 are not directly comparable with respect to that attribute. By ignoring Duration prediction, and combining only the directly comparable Cont and Pitch cross-entropies, we can make a judgement on the overall relative performance of these two versions. Figure 5 is strongly indicative of version 2 performing better than version 1. Again, there is an error bar overlap; but for an h̄ of 1, nine out of ten cross-entropies produced by ten-fold cross-validation are lower for version 2; and for an h̄ of 2, eight out of ten are lower for version 2. The single increase for an h̄ of 1 is 0.07 bits/chord, compared with a mean decrease of 0.22 bits/chord for the other nine values. The mean of the two increased values for an h̄ of 2 is 0.03 bits/chord, compared with a mean decrease of 0.20 bits/chord for the other eight values. As one might expect from experience of harmonisation, predicting the bass first followed by the alto and tenor is better than predicting all of the lower parts at the same time. It would appear that the selection of specialist multiple viewpoint systems for the prediction of different parts is beneficial in rather the same way as specialist systems for the prediction of the various attributes. The optimal version 2 cross-entropy, using the best subtask models irrespective of the value of h̄, is 0.19 bits/prediction lower than that of version 1.
Figure 5: Bar chart showing how cross-entropy varies with h̄ for the separate prediction of Cont and Pitch in the alto, tenor and bass given soprano using the augmented Pitch domain, comparing version 1 with version 2.
Finally, the systems selected using corpus ‘A' are used in conjunction with corpus ‘A+B'. Compared with Figure 5, Figure 6 shows a much larger drop in cross-entropy for version 1 than for version 2: indeed, the bar chart shows the minimum cross-entropies to be exactly the same. Allowing for a true variation smaller than that suggested by the error bars, as before, we can certainly say that the minimum cross-entropies are approximately the same. The only saving grace for version 2 is that the error bars are slightly smaller. We can infer from this that version 1 creates more general models, better able to scale up to larger corpora which may deviate somewhat from the characteristics of the original corpus. Conversely, version 2 is capable of constructing models which are more specific to the corpus for which they are selected. This hypothesis can easily be tested by carrying out viewpoint selection in conjunction with corpus ‘A+B' (although this would be a very time-consuming process). Notice that there are larger reductions in cross-entropy from h̄ = 0 to h̄ = 1 in Figure 6 than in Figure 5. The only difference between the two sets of runs is the corpus used; therefore this performance change must be due to the increased quantity of statistics gathered from a larger corpus, as predicted earlier in the paper.
Generated Harmony
Generation is achieved simply by random sampling of overall prediction probability distributions. Each prediction probability has its place in the total probability mass; for example, attribute value X having a probability of 0.4 could be positioned in the range 0.5 to 0.9. A random number from 0 to 1 is generated, and if this number happens to fall between 0.5 and 0.9 then X is generated.
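The sampling step just described is ordinary inverse-CDF sampling over the prediction distribution. A minimal sketch follows; the attribute values and probabilities are hypothetical:

```python
# A minimal sketch of generation by random sampling: each prediction
# probability occupies a segment of the unit interval, and a uniform random
# number selects the segment it falls in.
import random

def sample(distribution):
    """distribution: dict mapping attribute values to probabilities (sum to 1)."""
    r = random.random()
    cumulative = 0.0
    for value, p in distribution.items():
        cumulative += p
        if r < cumulative:
            return value
    return value  # guard against floating-point rounding at the top end

# Example: X has probability 0.4 and here occupies the range 0.5 to 0.9.
print(sample({'W': 0.5, 'X': 0.4, 'Y': 0.1}))
```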
It quickly became very obvious, judging by the subjective quality of generated harmonisations, that a modification to the generation procedure would be required to produce something coherent and amenable to comparison.
Figure 6: Bar chart showing how cross-entropy varies with h̄ for the separate prediction of Cont and Pitch in the alto, tenor and bass given soprano using the augmented Pitch domain and corpus ‘A+B' with systems selected using corpus ‘A', comparing versions 1 and 2.
The problem was that random sampling sometimes generated a chord of very low probability, which was bad in itself because the chord was likely to be inappropriate in its context, but also bad because it then formed part of the next chord's context, which had probably rarely or never been seen in the corpus. This led to the generation of more low-probability chords, resulting in harmonisations of much higher cross-entropy than those typically found in the corpus (quantitative evidence supporting the subjective assessment). The solution was to disallow the use of predictions below a chosen value, the probability threshold, defined as a fraction of the highest prediction probability in a given distribution. This definition ensures that there is always at least one usable prediction in the distribution, however high the fraction (the probability threshold parameter). Bearing in mind that an expert musician faced with the task of harmonising a melody would consider only a limited number of the more likely options for each chord position, the removal of low-probability predictions was considered to be a reasonable solution to the problem. Separate thresholds have been implemented for Duration, Cont and Pitch, and these thresholds may differ between stages of generation. It is hoped that as the models improve, the thresholds can be reduced. The probability thresholds of models used for generating harmony are optimised such that the cross-entropy of each subtask, averaged across twenty harmony generation runs using the ten melodies from test dataset ‘A+B', approximately matches the corresponding prediction cross-entropy obtained by ten-fold cross-validation of corpus ‘A+B'.
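A minimal sketch of the probability threshold as described above follows. The renormalisation of the surviving predictions before sampling is an assumption on our part, as is the example distribution:

```python
# A minimal sketch of thresholded prediction: predictions below a chosen
# fraction of the highest probability in the distribution are disallowed.
# Because the maximum always survives its own cutoff, at least one
# prediction is always usable, however high the fraction.
def apply_probability_threshold(distribution, fraction):
    cutoff = fraction * max(distribution.values())
    kept = {v: p for v, p in distribution.items() if p >= cutoff}
    total = sum(kept.values())
    return {v: p / total for v, p in kept.items()}   # renormalise (assumed)

dist = {'C major': 0.50, 'A minor': 0.30, 'F major': 0.15, 'D dim': 0.05}
print(apply_probability_threshold(dist, 0.5))        # drops entries below 0.25
```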
One of the more successful harmonisations of hymn tune Das walt' Gott Vater (Vaughan Williams 1933, hymn no. 36), automatically generated by the best version 1 model with optimised probability threshold parameters, is shown in Figure 7. It is far from perfect, with the second phrase being particularly uncharacteristic of the corpus. There are two parallel fifths in the second bar and another at the beginning of the fourth bar. The bass line is not very smooth, due to the many large ascending and descending leaps. One of the more successful harmonisations of the same hymn tune, automatically generated by the best version 2 model with optimised probability threshold parameters, is shown in Figure 8. The first thing to notice is that the bass line is more characteristic of the corpus than that of the version 1 harmonisation. This could well be due to the fact that this version employs specialist systems for the prediction of bass given soprano. It is rather jumpy in the last phrase, however, and in the final bar there is a parallel unison with the tenor. The second chord of the second bar does not fit in with its neighbouring chords, and there should be a root position tonic chord on the third beat of the fourth bar. On the positive side, there is a fine example of a passing note at the beginning of the fifth bar; and the harmony at the end of the third phrase, with the chromatic tenor movement, is rather splendid.
Conclusion
The first set of version 2 viewpoint selection runs, for attribute prediction together using the seen Pitch domain, compare different combinations of two-stage prediction. By far the best performance is obtained by predicting the bass part first followed by the inner parts together, reflecting the usual human approach to harmonisation. It is interesting to note that this heuristic, almost universally followed during harmonisation, therefore has an information-theoretic explanation for its success. Having demonstrated the extent to which multiple viewpoint systems have specialised in order to carry out these two rather different prediction tasks, we use an even greater number of specialist systems in a second set of runs. These show that better models can be created by selecting separate multiple viewpoint systems to predict individual musical attributes, rather than a single system to predict them all. In comparing version 1 with version 2, only Cont and Pitch are taken into consideration, since the prediction of Duration is not directly comparable. On this basis, version 2 is better than version 1 when using corpus ‘A', which again tallies with human experience of harmonisation; but when corpus ‘A+B' is used, their performance is identical. We can infer from this that version 1 creates more general models, better able to scale up to larger corpora which may deviate somewhat from the characteristics of the original corpus. Conversely, version 2 is capable of constructing models which are more specific to the corpus for which they are selected.
Acknowledgements
We wish to thank the three anonymous reviewers for their constructive and insightful comments, which greatly improved this paper.
Figure 7: Relatively successful harmonisation of hymn tune Das walt' Gott Vater (Vaughan Williams 1933, hymn no. 36) automatically generated by the best version 1 model with optimised probability threshold parameters, using corpus ‘A+B'.
Figure 8: Relatively successful harmonisation of hymn tune Das walt' Gott Vater (Vaughan Williams 1933, hymn no. 36) automatically generated by the best version 2 model with optimised probability threshold parameters, using corpus ‘A+B'.
2013_12 !2013
A Fully Automatic Evolutionary Art
Tatsuo Unemi
Department of Information Systems Science, Soka University, Tangi-machi 1-236, Hachioji, Tokyo 192-8577 Japan, unemi@iss.soka.ac.jp
Figure 1: Sample image.
This is an automatic art project in which the computer autonomously produces animations of a type of abstract imagery. Figure 1 shows a typical frame image from an animation. Custom software, SBArt4 version 3, developed by the author, plays the main role in the work; it is based on a genetic algorithm that uses computational aesthetic measures as its fitness function (Unemi 2012a). The fitness value is a weighted geometric mean of measures including complexity, global contrast factor, distribution of color values, distribution of edge angles, difference of color values between consecutive frame images, and so on. Figure 2 illustrates the system configuration, using two personal computers connected by Ethernet. The left side is for the evolutionary process, and the right side is for rendering and sound synthesis.
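The fitness combination described above is a weighted geometric mean of aesthetic measures. A minimal sketch of that rule follows; the measure names, values and weights are hypothetical stand-ins, not SBArt4's actual measures or weights:

```python
# A minimal sketch of a weighted geometric mean of aesthetic measures,
# the form of fitness function described in the text.
def weighted_geometric_mean(measures, weights):
    """measures, weights: dicts keyed by measure name; measure values in (0, 1]."""
    total_weight = sum(weights.values())
    product = 1.0
    for name, value in measures.items():
        product *= value ** (weights[name] / total_weight)
    return product

measures = {'complexity': 0.72, 'global_contrast': 0.65,
            'color_distribution': 0.80, 'frame_difference': 0.55}
weights = {'complexity': 2.0, 'global_contrast': 1.0,
           'color_distribution': 1.0, 'frame_difference': 1.5}
print(weighted_geometric_mean(measures, weights))
```

One property of the geometric mean worth noting is that a single near-zero measure drags the whole fitness toward zero, so an animation must score reasonably on every measure at once.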
Starting from a population randomly initialized with mathematical expressions that determine the color value of each pixel in a rectangular area, a never-ending series of abstract animations is continuously displayed on the screen, accompanied by synchronized sound effects (Unemi 2012b). Each 20-second animation corresponds to an individual of relatively high fitness chosen from the population in the evolutionary process. The evolutionary part uses the Minimal Generation Gap model (Satoh, Ono, and Kobayashi 1997) for generational alternation, which keeps the time taken by each computation step minimal. After 120 steps of generational alternation, the genotypes of the best ten individuals are sent to the player side in turn.
Figure 2: System setup.
To avoid convergence narrowing the variation of individuals in the population, the quarter of the population with the lowest fitness is replaced with random genotypes every 600 steps. Visitors will notice the recent progress in the power of computer technology, and may also be given an occasion to think about what artistic creativity is. These technologies are useful not only for building a system that produces unpredictable, interesting phenomena, but also for providing an occasion for people to reconsider how we should relate to the artifacts around us. We know that nature is complex and often unpredictable, but we, people in modern democratic societies, tend to assume that artificial systems should be under our control and that there must be some person who takes responsibility for their effects. The author hopes that visitors will notice that it is difficult to keep some of these complex artifacts under our control, and will learn how we can enjoy them.
2013_13 !2013
Implications from Music Generation for Music Appreciation
Amy K. Hoover, Paul A. Szerlip, and Kenneth O. Stanley
Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816-2362 USA
{ahoover@eecs.ucf.edu,paul.szerlip@gmail.com,kstanley@eecs.ucf.edu}
Abstract
This position paper argues that fundamental principles that are exploited to achieve effective music generation can also shed light on the elusive question of why humans appreciate music, and which music is easiest to appreciate. In particular, we highlight the key principle behind an existing approach to assisted accompaniment generation called functional scaffolding for musical composition (FSMC). In this approach, accompaniment is generated as a function of the preexisting parts. The success of this idea at generating plausible accompaniment, according to studies with human participants, suggests that perceiving a functional relationship among parts in a composition may be essential to the appreciation of music in general. This insight is intriguing because it can help to explain, without any appeal to traditional music theory, why humans with no knowledge or training in music can nevertheless find satisfaction in coherent musical structure.
Introduction
Among the most fundamental questions on the human experience of music is why we appreciate it so universally and what makes some pieces more appealing than others (Hanslick, 1891; Sacks, 2008; Frith, 2004; Gracyk, 1996). There are many possible approaches to addressing these questions, from studies of expectation fulfillment (Huron, 2006; Schmuckler, 1989; Pearce and Wiggins, 2012; Abdallah and Plumbley, 2009) to cultural factors (Balkwill and Thompson, 1999; Peddie, 2006).
Our aim in this paper is to propose an alternative route to addressing the fundamental basis for music appreciation, by beginning with an approach to music generation and, from its mechanics, drawing implications for at least one key underlying ingredient in the appreciation of music. The motivation is that the process of designing an effective music generator implicitly forces the designer to confront the basis of music appreciation as well. After all, a music generator is of little use if its products are not appealing. Particularly revealing would be a simple principle that can almost always be applied: the simpler such a principle is, the more plausible it becomes that it might explain some aspect of music appreciation. One approach to assisted music generation based on such a simple principle is called functional scaffolding for musical composition (FSMC) (Hoover and Stanley, 2009; Hoover, Szerlip, and Stanley, 2011b,a; Hoover et al., 2012). Our position is that the principle at the heart of this approach, initially conceived as a basis for generating accompaniment, offers a unique hint at the machinery behind human musical appreciation. In this way, it can contribute to explaining in part both when and why humans appreciate music.
Functional Scaffolding for Musical Composition (FSMC)
The FSMC approach is based on the insight that music is at heart a pattern of notes played over time with some regularity. As a result, one way to conceptualize music is as a function of time. Formally, for any musical voice, the pattern of pitches and the pattern of durations and rests can be expressed together as a vector function of time f(t) that outputs both pitch and rhythm information. In practice, to generate a sequence of notes, f could be queried at every time t and the complete output sequence would constitute the pattern. The parts played by each instrument in an ensemble piece could also be output simultaneously by such a function. This perspective is helpful for music generation when combined with the insight that all the instrumental sequences (i.e. each track) in a single piece must be somehow related to each other. For example, in a popular rock piece, the drum pattern, say d(t), typically establishes the rhythm for the rest of the piece. Therefore, the bass pattern, b(t), which helps structure the harmonic form, will by necessity depend in some way upon the drum pattern. This idea of relatedness between parts can be expressed more formally by saying that the bass pattern is a function of the drum pattern, which can be expressed by a function h that relates b(t) to d(t): b(t) = h(d(t)). Building on the drum and bass patterns, vocalists and other instrumental parts can then explore more complicated melodic patterns that are themselves also related to the established rhythmic and chord patterns. It follows then that not only can each of these instrumental parts be represented as a function of time, but that they are indeed each functions of each other. Beyond just observations, these insights imply a practical opportunity for generating musical accompaniment. By casting instrumental parts as functions of each other, the problem of accompaniment is illuminated in a new light:
Given an existing part f(t), the problem of formulating an appealing accompaniment becomes the problem of searching for an accompaniment g(t) such that g(t) complements f(t). Yet while applying a search algorithm directly to finding such a function g(t) would be difficult because the search space is vast, the search can instead be significantly constrained by searching for h(f(t)), as depicted in figure 1.
Figure 1: Representing and Searching for Accompaniments with FSMC. The function f(t), depicted by a piano keyboard, represents the human composition, called the scaffold, from which the computer-generated accompaniments are created. A possible such accompaniment, g(t) = h(f(t)), is shown atop, depicted by the image of a computer. Each accompaniment is internally represented by a helper function, h(f(t)), which is represented by a special type of artificial neural network (ANN) called a compositional pattern producing network (CPPN). Like ANNs, CPPNs can theoretically approximate any continuous function. Thus these CPPNs represent h, which transforms the scaffold into an accompaniment.
The major benefit of this approach is that because h is a function of the part it will accompany, it cannot help but follow its contours to some extent. Therefore, the idea for generating accompaniment in FSMC is to search, with the help of a human user, for a function h(f(t)), where f(t) is a preexisting part or scaffold. By searching for a transforming function instead of an explicit sequence of notes, the plausibility of output accompaniments is enhanced. In effect, f(t) provides the functional scaffolding for the accompaniment. The idea in FSMC that searching for h(f(t)) can yield plausible accompaniment to f(t) can be exploited in practice by programming a search algorithm to explore possible variations of the function h. In fact, this approach has been tested extensively in practice through an implementation called MaestroGenesis (http://maestrogenesis.org/), whose results have been reported in a number of publications (Hoover, Szerlip, and Stanley, 2011b,a; Hoover et al., 2012). In MaestroGenesis, the function h is represented by a special kind of artificial neural network called a compositional pattern producing network (CPPN). A population of candidate CPPNs is evolved interactively by allowing a human user to direct the search algorithm, picking his or her favorite candidate accompaniments to produce the offspring for the next generation. Thus the representation of the transforming function is a CPPN (which is a kind of function approximator) and the search algorithm is interactive evolution (an evolutionary algorithm guided by a human; Takagi, 2001). A full technical description is given in Hoover, Szerlip, and Stanley (2011a), Hoover, Szerlip, and Stanley (2011b), and Hoover et al. (2012). Interestingly, listener study results from FSMC-generated music showed that musical pieces with accompaniments generated purely through functional relationships were indistinguishable from fully human-composed pieces (Hoover, Szerlip, and Stanley, 2011a). In fact, some fully human-composed pieces were rated more mechanical-sounding than those that were only partially human-composed. Similarly positive results were also reported in other studies (Hoover and Stanley, 2009; Hoover, Szerlip, and Stanley, 2011b,a; Hoover et al., 2012). Although more variation in the initial human composition (i.e. a polyphonic versus monophonic scaffold) provides more richness from which to work, as Hoover et al. (2012) show, plausible accompaniments can nevertheless be generated from as little as a single monophonic starting melody.
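The relationship at the heart of FSMC, g(t) = h(f(t)), can be sketched in a few lines. Everything here is a hypothetical simplification: the note representation, the toy scaffold, and especially the stand-in h, which in MaestroGenesis would be an evolved CPPN rather than a fixed rule:

```python
# A minimal sketch of accompaniment as a function of the scaffold.
def make_accompaniment(scaffold, h):
    """scaffold: list of (midi_pitch, duration) pairs; h: transform function."""
    return [h(pitch, duration) for pitch, duration in scaffold]

# A stand-in for an evolved CPPN: shadow the melody nine semitones (a major
# sixth) below and double its durations. A real CPPN would be a learned,
# nonlinear function of the scaffold.
def h(pitch, duration):
    return (pitch - 9, duration * 2)

f = [(72, 1.0), (74, 0.5), (76, 0.5), (77, 2.0)]   # toy scaffold melody
print(make_accompaniment(f, h))
```

Because the output is computed from the scaffold, even this crude h inherits the scaffold's contour, which is the property the approach relies on.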
Furthermore, often in MaestroGenesis even the first generation of candidate accompaniments, which are randomly generated CPPNs, sounds plausible because the functional relationship ensures at least some relationship between the scaffold and its accompaniment (Hoover and Stanley, 2009).
From Music Generation to Music Appreciation
These results are of course relevant to progress in music generation, but they hint at a deeper implication. In particular, it is notable that MaestroGenesis (and FSMC behind it) has almost no musical knowledge programmed into it. In fact, the only real musical rule in the program is that CPPN outputs are forced to be interpreted as notes within the key of the scaffold track. Aside from that, MaestroGenesis has no knowledge of chords, rhythm, progression, melody, harmony, dissonance, style, genre, or anything else that a typical music generator might have (Simon, Morris, and Basu, 2008; Chuan, 2009; Ebcioglu, 1990). It thus relies almost entirely on the functional relationship between the scaffold and the accompaniment to achieve plausibility. In effect, the functional relationship causes the accompaniment to inherit the gross structure of the scaffold, thereby endowing it with many of the same aesthetic properties. Thus the key observation behind this position paper is that establishing such a functional relationship between different parts of a song seems to be sufficient on its own to achieve plausible musical structure. This observation is intriguing because it implies a hypothesis about the nature of musical appreciation: If a functional relationship alone is sufficient to achieve musical plausibility in the experience of human listeners, then perhaps musical appreciation itself is at least in part the result of perceiving a functional relationship between different parts of a composition. That is, functional relationships, which are mathematical properties of patterns that do not require any specific musical knowledge to perceive, could explain why listeners without any musical training or expertise nevertheless experience and appreciate music and separate it firmly from cacophony. In effect the human is appreciating the functional relationship that binds different parts of a composition together. If true, this hypothesis can explain to some extent when humans will or will not appreciate a composition. For example, the harder it is to perceive how one part is functionally related to another, the less pleasing that piece may be. Such functional relationships are potentially perceived not only between different instrumental parts or tracks, but also within a single instrumental part played over time. That is, if a functional relationship can be perceived between an earlier sequence of notes and a later one, then the entire sequence may succeed as musically plausible or even appealing. At the same time, it may also explain why some compositions are more difficult to enjoy. For example, some research in computer music explores the sonification of non-auditory data (Cope, 2005; Park et al., 2010; Vickers, 2005). Typically, the user inputs semi-random data to a computer model (e.g. cellular automata, swarms, etc.) that outputs music. While the output is a function of the input, because the initial seed does not stem from inherently musical events, the outputs are often difficult for non-creators to immediately understand. However, as these systems develop, composers begin to build musical frames for anticipation and expectation.
While there can certainly be beauty in such pieces, the audience, like the composer, needs some familiarity with the style to begin to perceive the important relationships. Though perhaps not explicitly, composers have long intuited the importance of incorporating functional relationships into compositions. Not only are musical lines regularly translated, inverted, and reflected, but some logarithmic and modular transformations (and set theory concepts) predate the mathematical formalisms themselves (Risset, 2002; Harkleroad, 2006). The implicit nature of the composers' insight is that people appreciate these functional transformations within a "relatable" musical context. In fact, much of compositional music theory was developed to produce consistent aesthetic results (Payne, 1995; Christensen, 2002). By following certain heuristics and established patterns, composers ensure that pieces fit within particular styles and genres. Such heuristics generally encompass narrow sets of phenomena, e.g. waltzes, counterpoint, jigs, etc. The hypothesis that functional relationships provide a general principle for musical appreciation offers a unifying perspective for all such disparate stylistic conventions: at some level, all of them ultimately establish some kind of functional relationship among the parts of a composition. Furthermore, this perspective suggests that as long as they are perceptible (i.e. not so complex as to sound cacophonous), relatively simple functions likely exist that generate relationships among parts that are aesthetically appealing yet not related to any genre, rule, or heuristic currently taught or even yet conceived. For example, music generated with FSMC exhibits a range of complexity, suggesting little restriction on the type of function necessary to create plausible accompaniments. Some of the most appealing accompaniments are generated from very simple relationships (Hoover, Szerlip, and Stanley, 2011b,a), while sometimes more complex relationships between melody and percussion are also appreciated by listeners (Hoover and Stanley, 2009). To some extent this theory thereby suggests, without any other musical theory, when breaking the rules might be appealing and when it might not: as long as a functional relationship among parts can still be perceived, the average listener will not necessarily react negatively to breaking established conventions. Sometimes it can also take a while to habituate to styles or genres that do not follow conventional rules. For example, atonal pieces can be difficult to enjoy for the uninitiated. Interestingly, the functional hypothesis provides a potential insight into why such unconventional styles can become appealing with experience. The explanation is that initially the functional relationships among different parts in such an unconventional context are difficult to perceive because the relationships are both complex and unfamiliar. Therefore, the brain initially struggles to identify the functional relationship. However, over time, repeated exposure familiarizes the listener with the kinds of transformations that are typical in the new context, such that eventually the brain can pick out functional relationships that once were too complex to perceive. At that point, it again becomes possible to appreciate the music. In fact, as noted by Huron (2006), the patterns associated with a particular style are designed to elicit emotions by playing on the listeners' expectations.
Those expectations can be viewed as mediated by the kinds of functional relationships with which the listener is familiar. Music theory provides many heuristics for composing plausible types of music like fugues or walking bass lines. But as any musician knows, simply following such rules without the elusive element of inspiration results in plausible yet dry-sounding pieces. A good musician must know the standard rules for composition and also when to break them, but the problem of when to break the rules in music theory is less understood than the standard rules for composition. The insight that a rule is well-broken if it still preserves a perceptible functional relationship provides a possible direction for studying this issue further.
Conclusions
This type of general theory follows directly from taking a minimalist approach to music generation. Approaches that rely on acquiring or enumerating all the complexities of music of certain types or composers of certain types, such as through statistical inference (Rhodes, Lewis, and Müllensiefen, 2009; Kitani and Koike, 2010) or grammatical rules (Holtzman, 1981; McCormack, 1996), cannot probe the possibility of deeper underlying principles than the rules that are apparent at the surface. In contrast, FSMC and MaestroGenesis took the minimalist approach to music generation by predicating everything only on functional relationships. While a potential criticism of such an approach is that it is too simplistic to capture all the subtlety of sophisticated musical composition, its benefit in a scientific context is that it isolates a single phenomenon so that the full implication of that phenomenon can be tested. The result is a simple hypothesis that reduces musical theory to a mathematical principle, i.e. perceiving functional relationships, that can plausibly be appreciated even by listeners without musical training. It also becomes a tool for music generation, as in MaestroGenesis, that does not require enumerating complex rules. While functional relationships need not constitute the entire explanation for all musical appreciation, they are an appealing ingredient because of their simplicity and possibility for future study: they suggest that within the mind of a composer, perhaps at some level, such a function is realized as the overall pattern of a musical piece is first conceived. In a broader context, explaining the appreciation of music through perceiving functional relationships also connects musical appreciation to non-musical aesthetics. After all, across the spectrum of art, architecture, and even human beauty, symmetry, repetition, and variation on a theme are paramount. It is notable that all such regularities ultimately reduce to one instance of a pattern being functionally related to another. Given that we appreciate such relationships in so many spheres of our experience, that music too would draw from such an affinity follows elegantly.
Acknowledgments
This work was supported in part by the National Science Foundation under grant no. IIS-1002507 and also by an NSF Graduate Research Fellowship. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
2013_14 !2013
Autonomously Communicating Conceptual Knowledge Through Visual Art
Derrall Heath, David Norton, Dan Ventura
Computer Science Department, Brigham Young University, Provo, UT 84602 USA
dheath@byu.edu, dnorton@byu.edu, ventura@cs.byu.edu
Abstract
In visual art, the communication of meaning or intent is an important part of eliciting an aesthetic experience in the viewer. Building on previous work, we present three additions to DARCI that enhance its ability to communicate concepts through the images it creates. The first addition is a model of semantic memory based on word associations for providing meaning to concepts. The second addition composes universal icons into a single image and renders the image to match an associated adjective. The third addition is a similarity metric that maintains recognizability while allowing for the introduction of artistic elements. We use an online survey to show that the system is successful at creating images that communicate concepts to human viewers.
Introduction
DARCI (Digital ARtist Communicating Intention) is a system for generating original images that convey meaning. The system is part of ongoing research in the subfield of computational creativity, and is inspired by other artistic image-generating systems such as AARON (McCorduck 1991) and The Painting Fool (Colton 2011). Central to the design philosophy of DARCI is the notion that the communication of meaning in art is a necessary part of eliciting an aesthetic experience in the viewer (Csíkszentmihályi and Robinson 1990). DARCI differs from other computationally creative systems in that it creates images that explicitly express a given concept. DARCI is composed of two major subsystems: an image analysis component and an image generation component. The image analysis component learns how to annotate images with adjectives by training a series of neural networks with labeled images. The specific inputs to these neural networks, called appreciation networks, are global features extracted from each image, including information about the general occurrence of color, lighting, and texture in the images (Norton, Heath, and Ventura 2010). The image generation component uses a genetic algorithm, governed partly by the analysis component, to render a source image to visually convey an adjective (Norton, Heath, and Ventura 2011). While often effective, excessive filtering and extreme parameters can leave the source image unrecognizable. In this paper we introduce new capabilities to DARCI: primarily, the ability to produce original source images rather than relying upon pre-existing, human-provided images. DARCI composes these original source images as a collage of iconic concepts in order to express a range of concepts beyond adjectives, similar to a recently introduced system for The Painting Fool that creates collages from the text of web documents (Krzeczkowska et al. 2010). However, in contrast to that system, ours creates collages from conceptual icons discovered with a semantic memory model. The resulting source images are then rendered according to an adjective discovered with this same semantic memory model. In order to preserve the content of the collages after rendering them, we introduce a variation on DARCI's traditional image rendering technique. Figure 1 outlines the two major components and their interaction, including the new elements presented in this paper.
By polling online volunteers, we show that with these additions, DARCI is capable of creating images that convey selected concepts while maintaining the aesthetics achieved with filters.
Figure 1: A diagram outlining the two major components of DARCI. Image analysis learns how to annotate new images with adjectives using a series of appreciation networks trained with labeled images. Image generation uses a semantic memory model to identify nouns and adjectives associated with a given concept. The nouns are composed into a source image that is rendered to reflect the adjectives, using a genetic algorithm that is governed by a set of evaluation metrics. The final product is an image that reflects the given concept. Additions from this paper are highlighted.
Methodology
Here we introduce the improvements to DARCI that enhance the system's capability to communicate intended meaning in an aesthetic fashion: a semantic memory model for broadening the range of concepts the system can communicate, an image composer for composing concrete representations of concepts into source images to be rendered, and a new metric for governing the evolution of the rendering process. We also describe an online survey that we use to evaluate the success of these additions.
Semantic Memory Model
In cognitive psychology, the term semantic memory refers to the memory of meaning and other concept-based knowledge that allows people to consciously recall general information about the world. It is often argued that creativity requires intention (and we are certainly in this camp). In this context, we mean creativity in communicating a concept, and at least one part of this can be accommodated by an internal knowledge of the concept (i.e., a semantic memory). The question of what gives words (or concepts) meaning has been debated for years; however, it is commonly agreed that a word, at least in part, is given meaning by how the word is used in conjunction with other words (i.e., its context) (Erk 2010). Many computational models of semantic memory consist of building associations between words (Sun 2008; De Deyne and Storms 2008), and these word associations essentially form a large graph that is typically referred to as a semantic network. Associated words provide a level of meaning to a concept (word) and can be used to help convey its meaning. Word associations are commonly acquired in one of two ways: from people, or automatically by inferring them from a corpus. Here we describe a computational model of semantic memory that combines human free association norms with a simple corpus-based approach. The idea is to use the human word associations to capture general knowledge and then to fill in the gaps using the corpus method.
Lemmatization and Stop Words
In gathering word associations, we use the standard practice of removing stop words and lemmatizing. The latter process is accomplished using WordNet's (Fellbaum 1998) database of word forms; it should be noted, however, that lemmatization with WordNet has its limits. For example, we cannot lemmatize a word across different parts of speech. As a result, words like ‘redeem' and ‘redeeming' will remain separate concepts, because ‘redeeming' could be the gerund form of the verb ‘redeem' or it could be an adjective (as in ‘a redeeming quality').
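The part-of-speech limit noted above can be seen directly with NLTK's WordNet lemmatizer. This is a minimal sketch, assuming NLTK and its WordNet data are installed; the paper does not say which WordNet interface was actually used:

```python
# A minimal sketch of per-part-of-speech lemmatization with WordNet:
# 'redeeming' only reduces to 'redeem' under a verb reading; under an
# adjective reading it stays as it is, so the two remain separate concepts.
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('redeeming', pos='v'))   # expected: 'redeem'
print(wnl.lemmatize('redeeming', pos='a'))   # expected: 'redeeming'
```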
Free Association Norms
One of the most common means of gathering word associations from people is through Free Association Norms (FANs), which is done by asking hundreds of human volunteers to provide the first word that comes to mind when given a cue word. This technique is able to capture many different types of word association, including word co-ordination (pepper, salt), collocation (trash, can), super-ordination (insect, butterfly), synonymy (starving, hungry), and antonymy (good, bad). The association strength between two words is simply a count of the number of volunteers who said the second word given the first word. FANs are considered to be one of the best methods for understanding how people, in general, associate words in their own minds (Nelson, McEvoy, and Schreiber 1998). In our model we use two preexisting databases of FANs: the Edinburgh Associative Thesaurus (Kiss et al. 1973) and the University of Florida's Word Association Norms (Nelson, McEvoy, and Schreiber 1998). Note that in this model we consider word associations to be undirected. In other words, if word A is associated with word B, then word B is associated with word A. Hence, when we encounter data in which word A is a cue for word B and word B is also a cue for word A, we combine them into a single association pair by adding their respective association strengths. Between these two databases, there are a total of 19,327 unique words and 288,069 unique associations. We refer to these associations as human data.
Corpus Inferred Associations
Discovering word associations from a corpus is typically accomplished using a family of techniques called Vector Space Models (Turney and Pantel 2010), which use a matrix for keeping track of word counts either co-occurring with other words (a term × term matrix) or within each document (a term × document matrix). One of the most popular vector space models is Latent Semantic Analysis (LSA) (Deerwester et al. 1990), based on the idea that similar words will appear in similar documents (or contexts). LSA builds a term × document matrix from a corpus and then performs Singular Value Decomposition (SVD), which essentially reduces the large sparse matrix to a low-rank approximation of that matrix along with a set of vectors, each representing a word (as well as a set of vectors for each document). These vectors also represent points in semantic space, and the closer words are to each other in this space, the closer they are in meaning (and the stronger the association between words). Another popular method is the Hyperspace Analog to Language (HAL) model (Lund and Burgess 1996). This model is based on the same idea as LSA, except that the notion of context is reduced more locally to a word co-occurrence window of ±10 words instead of an entire document. Thus, the HAL model builds a term × term matrix of word co-occurrence counts from a corpus. HAL then uses the co-occurrence counts directly as vectors representing each word in semantic space. The size of the term × term matrix is invariant to the size of the corpus and has been argued to be more congruent with human cognition than the term × document matrix used in LSA (Wandmacher, Ovchinnikova, and Alexandrov 2008; Burgess 1998). The corpus component of our model is constructed similarly to HAL but with some important differences. We restrict the model to the same number of unique words as the human-generated free associations, building a 19,327 × 19,327 (term × term) co-occurrence matrix M using a co-occurrence window of ±50 words.
To account for the fact that common words will generally have higher co-occurrence counts, we scale these counts by weighting each element of the matrix by the inverse of the total frequency of both words at each element. This is done by considering each element $M_{i,j}$, adding the total number of occurrences of each word (i and j), subtracting out the value at $M_{i,j}$ (to avoid counting it twice), and then dividing $M_{i,j}$ by this computed number, as follows:

$$M_{i,j} \leftarrow \frac{M_{i,j}}{\sum_{i} M_{i,j} + \sum_{j} M_{i,j} - M_{i,j}} \qquad (1)$$

The result could be a very small number, and therefore we then also normalize the values between 0 and 1. For our corpus we use Wikipedia, as it is large, easily accessible, and covers a wide range of human knowledge (Denoyer and Gallinari 2006). Once the co-occurrence matrix is built from the entire text of Wikipedia, we use the weighted/normalized co-occurrence values themselves as association strengths between words. This approach works, since we only care about the strongest associations between words, and it allows us to reduce the number of irrelevant associations by ignoring any word pairs with a co-occurrence count less than some threshold. We chose a threshold of 100 (before weighting), which provides a good balance between producing a sufficient number of associations and reducing the number of irrelevant ones. When looking up a particular word, we return the top n other words with the highest weighted/normalized co-occurrence values. This method, which we will call corpus data from now on, gives a total of 4,908,352 unique associations.
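A minimal sketch of the weighting in Equation (1) follows, using a small hypothetical term-by-term count matrix; each count is divided by the total occurrences of both words, minus the count itself so it is not included twice:

```python
# A minimal sketch of Equation (1): scale each co-occurrence count by the
# inverse of the combined frequency of the two words involved.
def weight_cooccurrences(M):
    n = len(M)
    row_totals = [sum(M[i]) for i in range(n)]                  # occurrences of word i
    col_totals = [sum(M[i][j] for i in range(n)) for j in range(n)]  # occurrences of word j
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = row_totals[i] + col_totals[j] - M[i][j]
            W[i][j] = M[i][j] / denom if denom else 0.0
    return W   # still to be normalized to [0, 1], as in the text

M = [[0, 120, 10],
     [120, 0, 400],
     [10, 400, 0]]   # hypothetical symmetric co-occurrence counts
print(weight_cooccurrences(M))
```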
Combining Word Associations
Since each source (human and corpus) provides different types of word associations, combining these methods into a single model has the potential to take advantage of the strengths of each. The hypothesis is that the combined model will communicate meaning to a person better than either model individually, because it presents a wider range of associations. Our method merges the two separate databases into a single database before querying it for associations. This method assumes that the human data contains more valuable word associations than the corpus data, because the human data is typically used as the gold standard in the literature. However, the corpus data does contain some valuable associations not present in the human data. The idea is to add the top n associations for each word from the corpus data to the human data, but to weight the association strength low. This is beneficial for two reasons. First, if there are any associations that overlap, adding them again will strengthen the association in the combined database. Second, new associations not present in the human data will be added to the combined database and provide a greater variety of word associations. We keep the association strength low because we want the corpus data to reinforce, but not dominate, the human data. To do this, we first copy all word associations from the human data to the combined database. Next, let W be the set of all 19,327 unique words, let $A_{i,n} \subseteq W$ be the set of the top n words associated with word $i \in W$ from the corpus data, let $score_{i,j}$ be the association strength between words i and j from the corpus data, let $max_i$ be the maximum association score present in the human data for word i, and let $\theta$ be a weight parameter. Now for each $i \in W$ and for each $j \in A_{i,n}$, the new association score between words i and j is computed as follows:

$$score_{i,j} \leftarrow (max_i \cdot \theta) \cdot score_{i,j} \qquad (2)$$

This equation scales $score_{i,j}$ (which is already normalized) to lie between 0 and a certain percentage ($\theta$) of $max_i$. The n associated words from the corpus are then added to the combined database with the updated scores. If the word pair is already in the database, then the updated score is added to the score already present. For the results presented in this paper we use n = 20 and $\theta$ = 0.2, which were determined based on preliminary experiments. After the merge, the combined database contains 443,609 associations.
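A minimal sketch of the merge in Equation (2) follows; the association data is hypothetical, and the dictionary-of-dictionaries representation is our assumption rather than the paper's:

```python
# A minimal sketch of combining human and corpus associations: the top-n
# corpus associations for each word are rescaled to at most a fraction
# theta of that word's strongest human association, then added to the
# combined database (summing with any score already present).
def merge(human, corpus, n=20, theta=0.2):
    """human, corpus: dicts mapping a word to {associate: strength}."""
    combined = {w: dict(assocs) for w, assocs in human.items()}
    for word, assocs in corpus.items():
        max_human = max(human.get(word, {'': 0.0}).values()) or 1.0
        top_n = sorted(assocs.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for associate, score in top_n:       # corpus scores already in [0, 1]
            rescaled = (max_human * theta) * score     # Equation (2)
            entry = combined.setdefault(word, {})
            entry[associate] = entry.get(associate, 0.0) + rescaled
    return combined

human = {'fire': {'hot': 40.0, 'water': 25.0}}
corpus = {'fire': {'flame': 0.9, 'smoke': 0.7, 'hot': 0.5}}
print(merge(human, corpus))   # 'hot' is reinforced; 'flame'/'smoke' are new
```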
The final image will both be artistic and in some way communicate the concept to the viewer. Figure 1 shows how this process is incorporated into the full system. Similarity Metric To render an image, DARCI uses a genetic algorithm to discover a combination of filters that will render a source image (in this case, the collage) to match a specified adjective. The fitness function for this process combines an adjective metric and an interest metric. The former measures how effectively a potential rendering, or phenotype, communicates the adjective, and the latter measures the "difference" between the phenotype and the source image. Both metrics use only global image features and so fail to capture important local image properties correlated with image content. In this paper we introduce a third metric, similarity, that borrows from the growing research on bag-of-visual-word models (Csurka et al. 2004; Sivic et al. 2005) to analyze local features, rather than global ones. Typically, these interest points are those points in an image that are the most surprising, or said another way, the least predictable. After an interest point is identified, it is described with a vector of features obtained by analyzing the region surrounding the point. Visual words are quantized local image features. A dictionary of visual words is defined for a domain by extracting local interest points from a large number of representative images and then clustering them (typically with kmeans) by their features into n clusters, where n is the desired dictionary size. With this dictionary, visual words can be extracted from any image by determining which clusters the image's local interest points belong. A bag-of-visualwords for the image can then be created by organizing the visual word counts for the image into a fixed vector. This model is analogous to the bag-of-words construct for text documents in natural language processing. For the new similarity metric, we first create a bag-ofvisual-words for the source image and each phenotype, and then calculate the Euclidean distance between these two vectors. This metric has the effect of measuring the number of interest points that coincide between the two images. We use the standard SURF (Speeded-Up Robust Features) detector and descriptor to extract interest points and their features from images (Bay et al. 2008). SURF quickly identifies interest points using an approximation of the difference of Gaussians function, which will often identify corners and distinct edges within images. To describe each interest point, SURF first assigns an orientation to the interest point based on surrounding gradients. Then, relative to this orientation, SURF creates a 64 element feature vector by summing both the values and magnitudes of Haar wavelet responses in the horizontal and vertical directions for each square of a four by four grid centered on the point. We build our visual word dictionary by extracting these SURF features from the database of universal icons mentioned previously. The 6334 icons result in more than two hundred thousand interest points which are then clustered into a dictionary of 1000 visual words using Elkan k-means (Elkan 2003). Once the Euclidean distance, d, between the source image's and the phenotype's bags-ofvisual-words is calculated, the metric, S, is calculated to provide a value between 0 and 1 as follows: S = MAX( d 100, 1) where the constant 100 was chosen empirically. 
Online Survey Since our ultimate goal is a system that can create images that both communicate intention and are aesthetically interesting, we have developed a survey to test our most recent attempts at conveying concepts while rendering images that are perceived as creative. The survey asks users to evaluate images generated for ten concepts across three rendering techniques. The ten concepts were chosen to cover a variety of abstract and concrete topics. The abstract concepts are ‘adventure', ‘love', ‘music', ‘religion', and ‘war'. The concrete concepts are ‘bear', ‘cheese', ‘computer', ‘fire', and ‘garden'. We refer to the three rendering techniques as unrendered, traditional, and advanced. For unrendered, no rendering is applied; these are the plain collages. For the other two techniques, the images are rendered using one of two fitness functions to govern the genetic algorithm. For traditional, the fitness function is the average of the adjective and interest metrics. For advanced rendering, the new similarity metric is added. Here the adjective metric is weighted by 0.5, while the interest and similarity metrics are each weighted by 0.25. For each rendering technique and image, DARCI returned the 40 highest ranking images discovered over a period of 90 generations. We then selected, from the pools of 40 for each concept and technique, the image that we felt best conveyed the intended concept while appearing aesthetically interesting. An example image that we selected from each rendering technique can be seen in Figure 2. To query the users about each image, we followed the survey template that we developed previously to study the perceived creativity of images rendered with different adjectives (Norton, Heath, and Ventura 2013). In this study, we presented users with six five-point Likert items (Likert 1932) per image; volunteers were asked how strongly they agreed or disagreed (on a five-point scale) with each statement as it pertained to one of DARCI's images. The six statements we used were (abbreviation of item in parentheses):

I like the image. (like)
I think the image is novel. (novel)
I would use the image as a desktop wallpaper. (wallpaper)
Prior to this survey, I have never seen an image like this one. (never seen)
I think the image would be difficult to create. (difficult)
I think the image is creative. (creative)

Figure 2: Example images [1] for the three rendering techniques, (a) unrendered, (b) traditional, and (c) advanced, representing the concept ‘garden'.

Figure 3: Example dummy images [2] for the concept ‘water' that appeared in the survey for the indicated rendering techniques: (a) unrendered, (b) traditional, (c) advanced.
These dummy images were created for arbitrary concepts and then assigned different arbitrary concepts for the survey so that the image contents would not match their label. Unfiltered dummy collages were added to the unrendered set of images, while traditionally rendered versions were added to the traditional and advanced sets of images. The three concepts used to generate the dummy images were ‘alien', ‘fruit', and ‘ice'. The three concepts that were used to describe these images in the survey were, respectively, ‘restaurant', ‘water', and ‘freedom'. To avoid confusion, from here on we will always refer to these dummy images by their description word. The dummy images for the concept of ‘water' are shown in Figure 3. In total, each volunteer was presented with 13 images.

[1] The original icons used for the images in Figure 2 were designed by Adam Zubin, Birdie Brain, Evan Caughey, Rachel Fisher, Prerak Patel, Randall Barriga, dsathiyaraj, Jeremy Bristol, Andrew Fortnum, Markus Koltringer, Bryn MacKenzie, Hernan Schlosman, Maurizio Pedrazzoli, Mike Endale, George Agpoon, and Jacob Eckert of The Noun Project.
[2] The original icons used for the images in Figure 3 were designed by Alessandro Suraci, Anna Weiss, Riziki P.M.G. Nielsen, Stefano Bertoni, Paulo Volkova, James Pellizzi, Christian Michael Witternigg, Dan Christopher, Jayme Davis, Mathies Janssen, Pavel Nikandrov, and Luis Prado of The Noun Project.

Results A total of 119 anonymous individuals participated in the online survey. Volunteers could quit the survey at any time, thus not evaluating all 13 images. Each person evaluated an average of 9 images and each image was evaluated by an average of 27 people. The highest and lowest rated images for each question can be seen in Figures 4 and 5 respectively.

Figure 4: The images [3] that were rated the highest on average for each statement. Image (a) is the advanced rendering of ‘adventure' and was rated highest for like, novel, difficult, and creative. Image (b) is the traditional rendering of ‘music' and was rated highest for wallpaper. Image (c) is the advanced rendering of ‘love' and was rated highest for never seen. Image (d) is the advanced rendering of ‘music' and was rated highest for concept.

[3] The original icons used for the images in Figure 4 were designed by Oxana Devochkina, Kenneth Von Alt, Paul te Kortschot, Marvin Kutscha, James Fenton, Camilo Villegas, Gustavo Perez Rangel, and Anuar Zhumaev of The Noun Project.

The three dummy images for each rendering technique are used as a baseline for the concept statement. The results of the dummy images versus the valid images are shown in Figure 6. The average concept rating for the valid images is significantly better than that of the dummy images, which shows that the intended meaning is successfully conveyed to human viewers more reliably than by an arbitrary image. These results confirm that the intelligent use of iconic concepts is beneficial for the visual communication of meaning. Further, it is suggestive that the ratings for the other statements are generally lower for the dummy images than for the valid images.
Figure 5: The images [4] that were rated the lowest on average for each statement. Image (a) is the advanced rendering of ‘fire' and was rated lowest for difficult and creative. Images (b) and (c) are the unrendered and advanced versions of ‘religion' and were rated lowest for never seen and wallpaper respectively. Images (d), (e), and (f) are the traditional renderings of ‘fire', ‘adventure', and ‘bear', respectively, and were rated lowest for like, novel, and concept respectively.

[4] The original icons used for the images in Figure 5 were designed by Melissa Little, Dan Codyre, Carson Wittenberg, Kenneth Von Alt, Nicole Kathryn Griffing, Jenifer Cabrera, Renee Ramsey-Passmore, Ben Rex Furneaux, Factorio.us collective, Anuar Zhumaev, Luis Prado, Ahmed Hamzawy, Michael Rowe, Matthias Schmidt, Jule Steffen, Monika Ciapala, Bru Rakoto, Patrick Trouv, Adam Heller, Marco Acri, Mehmet Yavuz, Allison Dominguez, Dan Christopher, Nicholas Burroughs, Rodny Lobos, and Norman Ying of The Noun Project.

Since the dummy images were created for a different concept than the one which they purport to convey in the survey, this may be taken as evidence that successful conceptual or intentional communication is an important factor for the attribution of creativity.

Figure 6: The average rating from the online survey for all seven statements comparing the dummy images with the valid images. The valid images were more successful at conveying the intended concept than the dummy images by a significant margin. Results marked with an asterisk (*) indicate statistical significance using the two-tailed independent t-test. The lines at the top of each bar show the 95% confidence interval for each value. The sample sizes for dummy and valid images are 251 and 818 respectively.

The results of the three rendering techniques (unrendered, traditional, and advanced) for all seven statements are shown in Figure 7. The unrendered images are generally the most successful at communicating the intended concepts. This is likely because the objects/icons in the unrendered images are left undisturbed and are therefore more clear and discernible, requiring the least perceptual effort by the viewer. The rendered images (traditional and advanced) often distort the icons in ways that make them less cohesive and less discernible, and can thus obfuscate the intended meaning. The trade-off, of course, is that the unrendered images are generally considered less likable, less novel, and less creative than the rendered images. The advanced images are generally considered more novel and creative than the traditional images, but the traditional images are liked slightly more. The advanced images also convey the intended meaning more reliably than the traditional images, which indicates that the similarity metric is finding a better balance between adding artistic elements and maintaining icon recognizability. The difference between the traditional and advanced rendering was minimized by the fact that we selected the image (out of DARCI's top 40) from each group that best conveyed the concept while also being aesthetically interesting. Out of all the traditional images, 39% had at least one recognizable icon, while 74% of the advanced images had at least one recognizable icon. This difference demonstrates that the new similarity metric helps to preserve the icons and provides a greater selection of good images from which to choose, which is consistent with the results of the survey. For comparison, Figure 8 shows some example images (both traditional and advanced) that were not chosen for the survey. The results comparing the abstract concepts with the concrete concepts are shown in Figure 9. For all seven statements, the abstract concepts are, on average, rated higher than the concrete concepts.
One possible reason for this is that concrete concepts are not easily decomposed into a collection of iconic concepts because, being concrete, they are more likely to be iconic themselves. For concrete concepts, the nouns returned by the semantic memory model are usually other related concrete concepts, and it becomes difficult to tell which object is the concept in question. For example, the concept ‘bear' returns nouns like ‘cave', ‘tiger', ‘forest', and ‘wolf', which are all related, but don't provide much indication that the intended concept is ‘bear'. A person might be inclined to generalize to a concept such as ‘wildlife'. Another possible reason why abstract concepts result in better survey results than do concrete concepts is that abstract concepts allow a wider range of interpretation and are generally more interesting. For example, the concept ‘cheese' would generally be considered straightforward by most people, while the concept ‘love' could have variable meanings to different people in different circumstances. Hence, the images generated for abstract concepts are generally considered more likable, more novel, and more creative than the concrete images.

[5] The original icons used for the images in Figure 8 are the same as those used in Figures 4 and 5, with attribution to the same designers.

Figure 7: The average rating from the online survey for all seven statements comparing the three rendering techniques. The unrendered technique is most successful at representing the concept, while the advanced technique is generally considered more novel and creative. Statistical significance was calculated using the two-tailed independent t-test. The lines at the top of each bar show the 95% confidence interval for each value. The sample sizes for the unrendered, traditional, and advanced techniques are 256, 285, and 277 respectively.

Conclusions and Future Work We have presented three additions to the computer system DARCI that enhance the system's ability to communicate specified concepts through the images it creates. The first addition is a model of semantic memory that provides conceptual knowledge necessary for determining how to compose and render an image by allowing the system to make decisions and reason (in a limited manner) about common world knowledge. The second addition uses the word associations from a semantic memory model to retrieve conceptual icons and composes them into a single image, which is then rendered in the manner of an associated adjective. The third addition is a new similarity metric used during the adjective rendering phase that preserves the discernibility of the icons while allowing for the introduction of artistic elements. We used an online survey to evaluate the system and show that DARCI is significantly better at expressing the meaning of concepts through the images it creates than an arbitrary image is. We show that the new similarity metric allows DARCI to find a better balance between adding interesting artistic qualities and keeping the icons/objects recognizable. We show that using word associations and universal icons in an intelligent way is beneficial for conveying meaning to human viewers. Finally, we show that there is some degree of correlation between how well an image communicates the intended concept and how well liked, how novel, and how creative the image is considered to be.
To further illustrate DARCI's potential, Figure 10 shows additional images, encountered during various experiments with DARCI, that we thought were particularly interesting.

Figure 8: Sample images [5] that were not chosen for the online survey. Images (a), (b), and (c) are traditional renderings of ‘adventure', ‘love', and ‘war' respectively. Images (d), (e), and (f) are advanced renderings of ‘bear', ‘fire', and ‘music' respectively.

In future research we plan to do a direct comparison of the images created by DARCI with images created by human artists, and to further investigate how semantic memory contributes to the creative process. We plan to improve the semantic memory model by going beyond word-to-word associations and building associations between words and other objects (such as images). This will require expanding DARCI's image analysis capability to include some level of image noun annotation. The similarity metric presented in this paper is a step in that direction. An improved semantic memory model could also help enable DARCI to discover its own topics (i.e., find its own inspiration) and to compose icons together in more meaningful ways, by intentional choice of absolute and relative icon placement, for example.

2013_15 !2013 A Computer Model for the Generation of Visual Compositions Rafael Pérez y Pérez, María González de Cossío, Iván Guerrero. División de Ciencias de la Comunicación y Diseño, Universidad Autónoma Metropolitana, Cuajimalpa, México D. F.; Posgrado en Ciencia e Ingeniería de la Computación, Universidad Nacional Autónoma de México. {rperez/mgonzalezc}@correo.cua.uam.mx; cguerreror@uxmcc2.iimas.unam.mx

Abstract. This paper describes a computer model for visual compositions. It formalises a series of concepts that allows a computer agent to progress a visual work. We implemented a prototype to test the model; it employs letters from the alphabet to create its compositions. The knowledge base was built from examples provided by designers. From these examples the system obtained the necessary information to produce novel compositions. We asked a panel of experts to evaluate the material produced by our system. The results suggest that we are on the right track, although much more work needs to be done.

Introduction This text reports a computer model for visual compositions. The following lines describe the motivation behind it. One of the most important topics that a student in design needs to master is that of visual composition. By composition we refer to the way in which elements in a graphic work are organised on the canvas. The design process of a composition implies the selection, planning and conscious organisation of visual elements that aim to communicate (Myers 1989; Deepak 2010). Compositions can be very complex, with several elements interacting in diverse ways. Unfortunately, an important number of design texts include what we call "unclear" explanations about composition and its characteristics; in many cases, they are based on personal appreciations rather than on more objective criteria. To illustrate our point, here are descriptions of the concept of visual balance found in some design texts: "Psychologically we cannot stand a state of imbalance for very long. As time passes, we become increasingly fearful, uncomfortable, and disoriented" (Myers 1989: 85); "The formal quality in symmetry imparts an immediate feeling of permanence, strength, and stability. Such qualities are important in public buildings to suggest the dignity and power of a government" (Lauer and Pentak 2012: 92); "exacting, noncasual and quiet, but can also be boring" (Brainard). Similar definitions can be found in Germani-Fabris (1973); Faimon and Weigand (2004); Fullmer (2012); and so on. As one can see, there is a need for clearer explanations that can guide designers, teachers and students on these topics.
We believe that computer models of creativity are very useful tools that can contribute to formalising this type of concept and, hopefully, to making such concepts more accessible and clearer to students and the general public. Therefore, the purpose of this project is to develop a computer model of visual composition and implement a prototype. Particularly, we are interested in representing the genesis of the visual composition process; cf. other computer models that represent more elaborated pieces of visual work, like ERI-Designer (Pérez y Pérez et al. 2010), The Painting Fool (Colton 2012), and DARSY (Norton et al. 2011). Related works also include shape grammars (Stiny 1972) and relational production systems (Vere 1977, 1978). Other interesting approaches are those based on evolutionary mechanisms (e.g. Goldberg 1991; Bentley 1999). However, we are interested in understanding each step in the composition process rather than looking for optimization processes. This paper is organised as follows: section 2 describes some characteristics that we consider essential in visual composition; section 3 describes the core aspects of our model; section 4 describes the core characteristics of our prototype and how we used it to test our model; section 5 discusses the results we obtained.

Characteristics of a Composition Composition is a very complex process that usually involves several features and multiple relations between them. It is out of the scope of this project to attempt to represent all the elements involved in a composition. A composition is integrated by design elements and by design principles. The design elements are dots, lines, colours, textures, shapes and planes that are placed on a canvas. The design principles are the way these elements relate to each other and to the canvas. The principles that we employ in this project are rhythm, balance and symmetry. Rhythm is the regular repetition of elements. By regular repetition we mean that the distance between adjacent elements is constant. Groups of repeated elements make patterns. The frequency of a pattern describes how many times the same element is repeated within a given area in a canvas. Thus, the frequency depends on the size of and distance between elements. A composition might include two or more patterns with the same or different frequencies. Balance is related to the distribution of visual elements on the canvas. If there is an equal distribution on both sides of the canvas, there is a formal balance. If the elements are not placed with equal distribution, there is an informal balance. Myers describes informal balance as "Off-centre balance. It is best understood as the principle of the seesaw. Any large ‘heavy' figure must be placed closer to the fulcrum in order to balance a smaller ‘lighter' figure located on the opposite side. The fulcrum is the point of support for this balancing act. It is a physical principle transposed into a pictorial field. The fulcrum is never seen but its presence must be strongly felt" (1989: 90). Symmetry (from the Greek symmetreîn, "with measure") means equal distribution of elements on both sides of the canvas.
The canvas is divided into as many equal areas as needed. The basic divisions separate the canvas into four areas using a vertical axis and a horizontal axis. Diagonal divisions can also be included. Symmetry can be explained as follows: "Given plane A, a figure is symmetrical in relation to it when it reflects in A and goes back to its initial position" (Agostini 1987:97). In other words, "symmetry of a (planar) picture [is] a motion of the plane that leaves that picture unchanged" (Field 1995:41). In this project we work with three types of symmetry:

Reflectional symmetry or mirror symmetry. It refers to the reflection of an element from a central axis or mirror line. If one half of a figure is the mirror image of the other, we say that the figure has reflectional or mirror symmetry, and the line marking the division is called the line of reflection, the mirror line or the line of symmetry (Kinsey and Moore 2002:129).

Rotational symmetry. The elements rotate around a central axis. It can be at any angle or frequency, whilst the elements share the same centre. For example, in nature, a sunflower shows each element rotating around a centre.

Bilateral symmetry or translational symmetry. Refers to equivalent elements that are placed in different locations but with the same direction. "The element moves along a line to a position parallel to the original" (Kinsey and Moore 2002:148).

Description of the Model For this work we assume that all compositions are generated on a white canvas with a fixed size. Compositions are comprised of the following elements: blank, simple elements and compound elements, the latter also referred to as groups. Blank is the space of the canvas that is not occupied by any element. A simple-element is the basic graphic unit employed to create a visual composition. A compound-element is a group formed by simple-elements (as will be explained later, all adjacent elements within a group must have the same distance). A compound-element might also include other compound-elements. Once a simple-element is part of a group, it cannot participate in another group as a simple-element. All elements have an associated set of attributes:

1. Blank has an area.
2. Simple-elements have a position (determined by the centre of the element), an orientation, an area and an inclination.
3. Compound-elements have a position, an area, a shape, a rhythm and a size. The position is calculated as the geometric centre of the element. Compound-elements can have four possible shapes: horizontal, vertical, diagonal and any other. The rhythm is defined as the constant repetition of elements. The size is defined by the number of elements (simple or compound) that comprise the group.

There are three basic primitive-actions that can be performed on simple and compound elements: insert in the canvas, eliminate from the canvas and modify its attributes. Relations. All elements in a canvas have relations with the other elements. Our model represents three types of relations: distance, balance and symmetry. Distance. We include four possible distances between elements (a small classification sketch in code follows the list):

• Lying on: one element is on top of another element.
• Touch: the edge of one element is touching the edge of another element.
• Close: none of the previous classifications apply, and the distance between the centres of element 1 and element 2 is equal to or less than a distance known as the Distance of Closeness (DC). It represents that an element is close to another element. The appropriate value of DC depends on cultural aspects and might change between different societies (see Hall 1999).
• Remote: the distance between the centres of element 1 and element 2 is greater than DC.
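The following is a minimal sketch of the four-way distance classification, assuming only that elements expose a centre and the two geometric tests; the Element class, the helper implementations and the DC value are our illustrative assumptions, not the model's actual representation.

import math

DC = 5.0  # Distance of Closeness (assumed value; the text notes it is culture-dependent)

class Element:
    """Illustrative stand-in for the model's elements: a centre point plus
    the two geometric tests the classification needs."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def on_top_of(self, other):
        return (self.x, self.y) == (other.x, other.y)  # crude stand-in for 'lying on'
    def touches_edge(self, other):
        return False  # a real test would compare element outlines; omitted here

def distance_relation(e1, e2):
    """Classify two elements into the four distance relations of the model."""
    if e1.on_top_of(e2):
        return "lying on"
    if e1.touches_edge(e2):
        return "touch"
    d = math.dist((e1.x, e1.y), (e2.x, e2.y))
    return "close" if d <= DC else "remote"

print(distance_relation(Element(0, 0), Element(3, 4)))  # -> 'close' (d = 5.0 <= DC)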
Balance. We employ two different axes to calculate balance: horizontal and vertical. Both cross the centre of the canvas. The balance between two elements is obtained as follows: the area of each element is calculated and then multiplied by its distance to the centre. If the results are alike, the elements are balanced. Unbalanced relations are not explicitly represented.

Symmetry. We work with three types of symmetry: reflectional (Rf), translational (Tr) and rotational (Rt). We employ two different axes to calculate it: horizontal (H) and vertical (V). So, two different elements in a canvas might have one of five different symmetric relations between them: horizontal-reflectional (H-Rf), vertical-reflectional (V-Rf), horizontal-translational (H-Tr), vertical-translational (V-Tr) and rotational (Rt). Asymmetrical relations are not explicitly represented.

Creation of Groups. Inspired by Gestalt studies in perception (Wertheimer 2012), in this work groups are created based on the distance between their elements. The minimum distance (MD) is the smallest distance between two elements (e.g. if the distance between element 1 and element 2 is 1 cm, the distance between element 2 and element 3 is 3 cm, and the distance between element 1 and element 3 is 4 cm, MD is equal to 1 cm). Its value ranges from zero (when the centre of element 1 is lying on top of the centre of element 2) to DC:

0 ≤ MD ≤ DC

That is, inspired by Gestalt studies that indicate that the eye perceives elements that are close together as a unit, a group cannot include elements at a remote distance. The process of grouping works as follows. All simple-elements that are separated from other simple-elements by the same distance are grouped together, as long as such a distance is less than the remote distance. If, as a result of this process, at least one group is created, the same process is performed again. The process is repeated until it is not possible to create more groups. Notice that this way of grouping ensures that all groups have an associated rhythm, i.e. all groups include the constant repetition of (at least one) element. We refer to the groups created during this process as Groups of Layer 1. Figure 1, layer 0, shows simple elements on a canvas before the system groups them; Figure 1, layer 1, shows the groups that emerge after performing this process: group 1 (the blue one), group 2 (the purple one) and group 3 (the yellow one); d1 represents the distance between elements in group 1; d2 represents the distance between elements in group 2; d3 represents the distance between elements in group 3. The following lines describe the algorithm; a code sketch of the grouping loop is given after the second listing.

First iteration (Layer 1):
1. Considering only simple-elements, find the MD value.
2. If there are not at least two simple-elements whose MD is equal to or less than DC, then finish.
3. All simple-elements that are separated from other simple-elements by a distance MD form a new group.
4. Go to step 1.

Now, employing a similar mechanism, we can try to create new groups using the Groups of Layer 1 as inputs (see Figure 1, Layer 2). We refer to the groups created during this second process as Groups of Layer 2. Groups at layer 2 are comprised of simple-elements and/or compound-elements. If at least one group was created during Layer 1, then perform Layer 2:

Second iteration (Layer 2):
1. Considering simple and compound elements that have not formed a group in this layer yet, find the value of the MD.
2. If there are not at least two elements whose MD is equal to or less than DC, then finish.
3. All elements that are separated from other elements by a distance MD form a new group.
4. Go to step 1.
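A minimal sketch of one grouping layer follows. It deliberately simplifies the model's bookkeeping: elements are bare (x, y) centres, a new group is represented only by its geometric centre (which is how the text says group positions are computed), and attributes such as areas, shapes and rhythms are omitted. All names are illustrative.

import math

def group_layer(elements, dc):
    """One grouping iteration: repeatedly find the minimum pairwise distance
    MD and merge all elements separated by (approximately) MD into a group,
    until no merge is possible (MD > DC) or fewer than two elements remain."""
    elements = list(elements)
    while len(elements) >= 2:
        # Step 1: find MD over all pairs.
        pairs = [(math.dist(a, b), a, b)
                 for i, a in enumerate(elements) for b in elements[i + 1:]]
        md = min(p[0] for p in pairs)
        # Step 2: stop when the minimum distance exceeds the Distance of Closeness.
        if md > dc:
            break
        # Step 3: every element involved in a pair at distance MD joins the
        # new group, represented here by its geometric centre.
        eps = 1e-9
        members = {a for d, a, b in pairs if abs(d - md) < eps}
        members |= {b for d, a, b in pairs if abs(d - md) < eps}
        centre = (sum(x for x, _ in members) / len(members),
                  sum(y for _, y in members) / len(members))
        elements = [e for e in elements if e not in members] + [centre]
        # Step 4: loop back to step 1 with the reduced element set.
    return elements

print(group_layer([(0, 0), (1, 0), (10, 0)], dc=5.0))  # the two close points merge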
Notice how the blue group and the purple group merge; the reason is that the distance between the purple group and the blue group (d21) is smaller than the distance between the blue group and the yellow group (d13), and than the distance between the purple group and the yellow group (d23). Because there is no other group to merge with, the yellow group has to wait until the next cycle (next layer) to be integrated (see Figure 1, layer 3).

Figure 1. A composition represented by 3 layers (layer 0: simple elements; layer 1: 3 groups; layer 2: 2 groups; layer 3: 1 group; R: rhythm; d: distance).

This process is repeated until no more layers can be created. All groups created during the first iteration are known as Groups at Layer 1; all groups created during the second iteration are known as Groups at Layer 2; all groups created during the nth iteration are known as Groups at Layer n. A composition that generates n layers is referred to as an n-Layer Composition.

Calculating rhythms. The process to calculate rhythms within a composition works as follows. Each group at layer 1 has its own rhythm (see Figure 1, layer 1). So, the blue group has a rhythm 1 (R1), the purple group has a rhythm 2 (R2) and the yellow group has a rhythm 3 (R3). When the system blends the blue and purple groups, the new group includes three different rhythms (see Figure 1, Layer 2): R1, R2 and a new rhythm R21. Rhythm R21 is the result of the distance between the centre of the blue group and the centre of the purple group. We can picture groups as accumulating the rhythms of their members. So, in Figure 1, Layer 2, we can observe four rhythms: R1, R2, R21 (inside the purple group) and R3 in the yellow group. A group that includes only one rhythm is classified as monotonous; a group that includes two or more rhythms is classified as varied. So the purple-blue group has a varied rhythm while the yellow group has a monotonous rhythm.

Analysis of the composition. Our model represents a composition in terms of all existing relations between its elements. This representation is known as the Context. Because each layer within a composition includes different elements, and possibly different relations between them, the number of contexts associated with one composition depends on its number of layers. Thus, a 3-layer composition has three associated contexts: context-layer 1, context-layer 2 and context-layer 3:

Context of the composition = Context-layer 1 + Context-layer 2 + Context-layer 3

Besides relationships, a context-layer also includes information about the attributes of each element, and what we refer to as the attributes of the layer: Density of the layer, Balance of the layer, Symmetry of the layer and Rhythm of the layer. The Density of the Layer (DeL) is the relation between the blank's area and all the elements' area:

Density of the Layer = All elements' area / Blank's area

The Balance of the layer and Symmetry of the layer indicate whether the layer as a whole is balanced and symmetrical. The Rhythm of the layer indicates the type of rhythm that the layer has as a whole. As in the case of groups, it can have the values Monotonous or Varied (see Figure 2). A short sketch of these layer attributes in code follows.

Figure 2. Components of a context layer: relations between elements, attributes of the elements, and attributes of the layer.
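The two layer attributes with explicit definitions in the text can be written directly; this is a minimal sketch under the stated formulas, with our own function names.

def density_of_layer(element_areas, canvas_area):
    """Density of the Layer (DeL): ratio of the total element area to the
    blank (unoccupied) area of the canvas."""
    occupied = sum(element_areas)
    blank = canvas_area - occupied
    return occupied / blank if blank > 0 else float("inf")

def rhythm_class(rhythms):
    """A group or layer with a single rhythm is 'monotonous'; two or more
    distinct rhythms make it 'varied'."""
    return "monotonous" if len(set(rhythms)) <= 1 else "varied"

print(density_of_layer([10, 10], 100))          # -> 0.25
print(rhythm_class(["R1", "R2", "R21"]))        # -> 'varied'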
Composition process We can describe a composition as a process that consists of sequentially applying a set of actions, which generate several partial or incomplete works, until the right composition arises or the process is abandoned (Pérez y Pérez et al. 2010). Thus, if we have a blank canvas and perform an action on it, we will produce an initial partial composition; if we modify that partial composition by performing another action, then we will produce a more elaborated partial composition; we can keep on repeating this process until, with some luck, we end up producing a whole composition. Thus, by performing actions we progress the composition (see Figure 3). The model allows calculating, for each partial composition, all its contextual layers. This information is crucial for generating novel compositions.

Figure 3. A composition process: a blank canvas (empty context) is transformed by Action 1 into Partial Composition 1 (Context 1), by Action 2 into Partial Composition 2 (Context 2), and so on.

Producing new works Our model includes two main processes: the generation of knowledge structures and the generation of compositions.

Generation of knowledge structures The model requires a set of examples that are provided by human experts; we refer to them as the previous designs. Each previous design is comprised of one or more partial compositions; each of these partial compositions is more elaborated than the previous one. At the end we have the final composition. As explained earlier, we can picture a composition process as a progression of contexts mediated by actions until the last context is generated. In the same way, if we have the sequence of actions that leads towards a composition (and that is the type of information we can get from the set of examples), we can analyse and register how the composition process occurred. The goal is to create knowledge structures that group together a context and an action to be performed. In other words, the knowledge base is comprised of contexts (representing partial compositions) and actions to transform them in order to progress the composition. Because the previous designs do not explicitly represent their associated actions, it is necessary to obtain them. The following lines explain how this is done. We compare two contexts and register the differences between them. Such differences become the next action to perform. For example, if Context 1 represents an asymmetrical composition and Context 2 represents a horizontally symmetrical one, we can associate the action "make the current composition horizontally symmetrical" to Context 1 as the next action to continue the work in progress. Once this relation has been established, it is recorded in the knowledge base as a new knowledge structure. We do the same with all the contexts in all the layers of a given partial composition. The actions that can be associated to a context are: make the current composition (reflectionally, rotationally or translationally) symmetrical; balance the current composition (horizontally or vertically); insert, delete or modify a simple or compound element; make the current composition (reflectionally, rotationally or translationally) asymmetrical; unbalance the current composition (horizontally or vertically); and end the process of composition. The following lines describe the algorithm to process the previous designs (a code sketch follows the listing):

1. Obtain the number of partial compositions of a given example (NumberPC).
2. Calculate all the contexts for each partial composition.
3. For n := 1 to (NumberPC − 1):
   3.1 Compare the differences between Context n and Context n+1.
   3.2 Find the action that transforms Context n into Context n+1.
   3.3 Create a new knowledge structure associating Context n and the new action.
   3.4 Record this new knowledge structure in the knowledge base.
4. The context of the last partial composition gets the action "end of the process of composition".
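The algorithm above can be sketched as follows. Contexts are modelled here as plain dictionaries and the context-comparison step as a dictionary diff; both are stand-ins we introduce for illustration, since the model's real contexts carry relations, element attributes and layer attributes.

def compute_context(partial_composition):
    # Stand-in: the real model derives relations and attributes from the
    # partial composition; here a context is just a dict supplied by the caller.
    return dict(partial_composition)

def diff_to_action(ctx_a, ctx_b):
    # Stand-in for steps 3.1-3.2: the action is whatever changed between
    # the two contexts (e.g. a symmetry flag turning on).
    return {k: v for k, v in ctx_b.items() if ctx_a.get(k) != v}

def build_knowledge_base(previous_design):
    """Steps 1-4: pair each context with the action that transforms it into
    the next one, ending with the terminating action."""
    contexts = [compute_context(pc) for pc in previous_design]
    kb = []
    for n in range(len(contexts) - 1):
        kb.append((contexts[n], diff_to_action(contexts[n], contexts[n + 1])))
    kb.append((contexts[-1], "end of the process of composition"))
    return kb

# Example: an asymmetrical context followed by a horizontally symmetrical one
# yields the action {'h_symmetry': True} associated with the first context.
kb = build_knowledge_base([{"h_symmetry": False}, {"h_symmetry": True}])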
We repeat the same process for each example in the set of previous designs. All the knowledge structures obtained in this way are recorded in the knowledge base. The bigger the set of previous designs, the richer our knowledge base is.

Generation of compositions: The composition process follows the E-R model described in (Pérez y Pérez and Sharples 2001). The following lines describe how it works. The E-R model has two main processes: engagement and reflection. During engagement the system generates material; during reflection such material is evaluated and, if necessary, modified. The composition is a constant cycle between engagement and reflection. The model requires an initial state, i.e. an initial partial composition, to start; then the process is triggered. The following lines describe how we defined engagement and reflection; a skeleton of the cycle in code is given after the two listings.

Engagement:
1. The system calculates all the Contexts that can be obtained from the current partial composition.
2. All these contexts are employed as cues to probe memory.
3. The system retrieves from memory all the knowledge structures that are equal or similar to the current contexts. If no structure is retrieved, an impasse is declared and the system switches to reflection.
4. The system selects one of them at random and performs its associated action. As a consequence, the current partial composition is updated.
5. And the cycle repeats again (step 1).

Reflection:
1. If there is an impasse, the system attempts to break it and then returns to the generation phase.
2. The system checks that the current composition satisfies the requirements of coherence (e.g. the system verifies that all the elements are within the area of the canvas, that elements are not accidentally on top of each other, and so on).
3. The system verifies the novelty of the composition in progress. A composition is novel if it is not similar to any of the compositions in the set of previous designs.

The system starts in engagement; after three actions it switches to reflection and then goes back to engagement. If during engagement an impasse is declared, the system switches to reflection to try to break it and then switches back to engagement. The cycle ends when an unbreakable impasse is triggered or when the action "end of the process of composition" is performed.
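A skeleton of the E-R cycle, under the same dictionary-based modelling as the previous sketch; matches, apply_action and reflect are simplified stand-ins for the model's retrieval, action and reflection processes, not the actual implementation.

import random

def matches(stored_ctx, current_ctx):
    # Stand-in retrieval test: a stored context cues memory when it is
    # contained in the current context (both modelled as dicts).
    return all(current_ctx.get(k) == v for k, v in stored_ctx.items())

def apply_action(composition, action):
    # Stand-in: merge the action's effects into the composition state.
    return {**composition, **action}

def reflect(composition, impasse):
    # Stand-in reflection: the real system checks coherence and novelty and
    # tries to break impasses; here impasses are simply reported as unbroken.
    return composition, not impasse

def engagement_reflection(initial, kb, max_steps=100):
    """Engage (retrieve and apply a random matching action), reflect after
    every three actions, stop on the terminating action or an unbreakable
    impasse."""
    composition, since_reflection = dict(initial), 0
    for _ in range(max_steps):
        candidates = [(c, a) for c, a in kb if matches(c, composition)]
        if not candidates:                        # impasse: switch to reflection
            composition, broken = reflect(composition, impasse=True)
            if not broken:
                return composition                # unbreakable impasse ends the cycle
            continue
        _, action = random.choice(candidates)     # engagement
        if action == "end of the process of composition":
            return composition
        composition = apply_action(composition, action)
        since_reflection += 1
        if since_reflection == 3:                 # reflect every three actions
            composition, _ = reflect(composition, impasse=False)
            since_reflection = 0
    return composition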
Example of a composition: For space reasons, it is impossible to describe in detail how the system creates a whole new design. Instead, in Figure 4 we show some partial compositions generated by our program and their associated contexts. To create the partial composition in Figure 4A, the system starts with a blank canvas and then inserts three elements at random (the three elements on the top-left). This partial composition has two layers; the context of each layer is depicted on the right side of Figure 4A. For the sake of clarity, the figure does not include the attributes of the elements. Then, during engagement, the system takes the current contexts as cues to probe memory and retrieves some actions to progress the work. Among the retrieved actions one is selected at random. So, it inserts three new elements that produce a vertical translational symmetry (see Figure 4B). The context in each layer clearly shows the relation between all elements in the canvas. In this case, in Layer 1 we have two Vertical Translational Symmetry (VTS) relations and in Layer 2 we have one VTS relation. The system switches to reflection and realises that some elements are on top of others. Employing some heuristics to analyse the composition, the program decides that it is better to separate them. The system switches back to engagement, takes the current contexts as cues to probe memory and retrieves actions to be performed. On this occasion, the system inserts in the third quadrant a new group with a horizontal mirrored symmetry (see Figure 4C). The right side of the figure shows the context at each layer. The process is repeated again, generating the partial composition in Figure 4D and its corresponding contexts.

Figure 4. Partial compositions and their contexts (VTS: Vertical Translational Symmetry; HMS: Horizontal Mirrored Symmetry; B: Balanced; RS: Radial Symmetry).

Tests and Results We implemented a prototype to test our model. Because of the technical complexity of implementing the whole model, we decided to include some constraints. In our prototype all simple-elements have the same size, colour and shape: in this work, simple elements are letters of the alphabet. Because of the technical difficulty of implementing relationships, in this prototype we only use symmetry and balance. Like the model, the prototype has two main parts: creation of knowledge structures and generation of new compositions. The prototype has an interface that allows the user to create her own compositions. She can insert, delete or modify letters on the canvas. By clicking one button she can also build new symmetrical or balanced elements, or generate random groups. The program automatically indicates all the existing groups in all layers; it also shows all the relationships that currently exist between the elements on the canvas. In the same way, the attributes of all elements are displayed, as well as their rhythms. So, the user only has to create her composition on the canvas (the program includes a partial-composition button that allows the user to indicate when a partial composition is ready). In this way, the system automatically creates the file of previous designs. Once the knowledge base is ready, the user can trigger the E-R cycle to generate novel compositions. We provided our prototype with five previous designs. Figures 5 and 6 show two works generated by our program.

Figure 5. A composition created by our agent. It is Composition 2 in the questionnaire.

Figure 6. A second composition created by our agent. It is Composition 3 in the questionnaire.

In order to obtain external feedback, we decided to ask a panel of experts their opinion about our program's work. The panel consisted of twelve designers: four men and eight women. All of them had studied a bachelor's degree in design and half of them had a postgraduate degree. We developed a questionnaire that included four compositions: two were created by our system (compositions 2 and 3, Figures 5 and 6) and two were created by a designer (compositions 1 and 4, Figures 7 and 8). The human compositions had to follow constraints similar to those of our program's compositions: they had to be in black and white, the designer could only employ one letter to develop her work, and so on. The participants were not told that some works had been done by a computer program. Subjects were asked to assess, in a range from 1 (lowest) to 5 (highest), four characteristics for each composition: a) whether they liked the composition, b) whether they considered that the composition had symmetry, c) whether the composition had balance and, d) what kind of rhythm the composition had. They were also invited to comment freely on each composition regarding balance and symmetry. In the last part of the questionnaire, participants were asked to rank the compositions from the best to the worst.

Figure 7: Human-generated composition. Corresponds to Composition 1 in the questionnaire.

Figure 8: Human-generated composition. Corresponds to Composition 4 in the questionnaire.

Figure 9 shows the results of the questionnaire.

Figure 9: Results of the questionnaire: experts' assessments (0 to 5) of like, balance, symmetry, rhythm and preference for Compositions 1-4.

Experts liked compositions 1 and 2. This was an interesting result because it suggested that our model was capable of generating designs with an acceptable quality.
It was also clear that most experts disliked composition 3 (Figure 6), although it is fair to say that its evaluation was only one point lower than the highest evaluation. Compositions 1 and 4 (made by the human designer) had a better evaluation regarding balance and symmetry than compositions 2 and 3 (made by our program). We could have forced our program to generate symmetrical or balanced designs, but that was exactly what we wanted to avoid. Our system had the capacity to detect such characteristics and nevertheless attempted something different. The experts' assessment of symmetry was neither clear nor unanimous. We were surprised to find this out, since symmetry does not depend on subjective judgment. Something similar occurred with balance and, to some extent, with rhythm. These results seemed to suggest that the experts had different ways of evaluating these characteristics. Experts considered that the rhythm in Composition 2 was the best. Overall, subjects preferred composition 4; compositions 1 and 2 got similar results, with a slight preference for composition 1; composition 3 got the lowest rank.

Discussion and Conclusions This project describes a computer model for visual composition. The model establishes:
• A clear criterion to define simple elements and groups.
• A set of attributes for simple elements, groups and layers.
• Relationships between elements, and a mechanism to identify such relationships.
• A method to analyse a visual composition based on layers, relationships and attributes.
• A mechanism, based on the E-R model, to produce novel compositions.

As far as we know, there is no other similar model. Although we are aware that many important features of compositions are not considered yet, we claim that our model allows a computer agent to produce novel visual designs. We tested our model by implementing a computer agent. The system was capable of producing compositions. None of them is alike to any of the previous designs, although some of their characteristics resemble the set of examples. A panel of experts evaluated two compositions generated by our system and two compositions generated by a human designer. We decided to ask a small group of experts, who we believe share core concepts about design, to evaluate our prototype's compositions, rather than asking many people with different backgrounds. The results suggest two interesting points: 1. In most cases, the opinions of the experts were not unanimous.
That is, some experts found some characteristics of the computer-generated compositions more interesting than those produced by humans. 2. Experts seem to have different ways of perceiving and evaluating compositions. Point 1 suggests that our model is capable of generating interesting compositions; that is, it seems that we are moving in the right direction. Point 2 seems to confirm the necessity of clearer mechanisms to evaluate a composition. Of course, we are not suggesting that personal taste and intuition should be eliminated from design. We are only recommending the use of clearer definitions and mechanisms for evaluations. We are convinced that they will be very useful, especially in teaching and learning graphic composition. One of the reviewers of this paper suggested comparing our work with shape grammars (Stiny 1972). Our proposal is far from being a grammar; it does not include features like terminal shape elements and non-terminal shape elements. In the same way, we do not work with shapes but with relations between the elements that comprise the composition. Those relations drive the generation of new compositions. We believe that our approach is much more flexible than the grammars approach. A second reviewer suggested comparing our work with relational productions (Vere 1977, 1978). It is true that our work also employs the "before and after" situations described by Vere. However, we are not interested in modelling inductive (or any other type of) learning; our purpose is to record the actions that the user performs in order to progress a composition. Later, the system employs this information to develop its own composition. Neither of these two approaches includes characteristics like a flexible generation process intertwined with an evaluation process, analysis by layers of the relations between the elements that comprise a composition, and other characteristics that our approach does. Thus, although some of the features that our model employs remind us of previous works, we claim that our approach introduces interesting novel features. We hope this work encourages other researchers to work on visual composition generation.

2013_16 !2013 Learning how to reinterpret creative problems Kazjon Grace, College of Computing and Informatics, University of North Carolina at Charlotte, Charlotte, NC, USA, k.grace@uncc.edu; John Gero, Krasnow Institute for Advanced Study, George Mason University, Fairfax, VA, US, john@johngero.com; Rob Saunders, Faculty of Architecture, Design and Planning, Sydney University, Sydney, NSW, Australia, rob.saunders@sydney.edu.au

Abstract. This paper discusses a method, implemented in the domain of computational association, by which computational creative systems could learn from their previous experiences and apply them to influence their future behaviour, even on creative problems that differ significantly from those encountered before. The approach is based on learning ways that problems can be reinterpreted.
These interpretations may then be applicable to other problems in ways that specific solutions or object knowledge may not. We demonstrate a simple proof-of-concept of this approach in the domain of simple visual association, and discuss how and why this behaviour could be integrated into other creative systems. Introduction Learning to be creative is hard. Experience is known to be a significant influence in creative acts: cognitive studies of designers show significant differences in the ways novices and experts approach creative problems (Kavakli and Gero, 2002). Yet each creative act is potentially so different from every other act that it is complex to operationalise the experience gained and apply it to subsequent acts of creating. Systems that can, through experience, improve their own capacity to be creative are an interesting goal for computational creativity research as they are a rich avenue for improving system autonomy. While computational creativity research has coalesced over the last decade around quantified ways to evaluate creative output, there have been few attempts to imbue a system with methods of self-evaluation and processes by which it could learn to improve. This research presents one possible avenue for pursuing that goal. A distinction should be drawn between learning about the various objects and concepts to be used in particular creative acts, which serves to aid those acts specifically, and learning about how to be a better creator more broadly. Knowledge about objects influences future creative acts with those objects, but the generalisability of that knowledge is suspect. One example of where this learning challenge is particularly relevant is analogy-making, in which every mapping created between two objects is, by the definition of an analogy as a new relationship, in some way unique. Multiple analogies using the same object or objects are not guaranteed to be similar. This makes it very difficult to generalise knowledge about making analogies and apply it to any future analogy-making act. We propose to tackle this problem of learning to be (computationally) creative by learning ways to interpret problems, rather than learning solutions to problems or learning about objects used in problems. These interpretations can be learnt, evaluated, recalled and reapplied to other problems, potentially producing useful representations. This process is based on the idea that perspectives that have been adopted in the past and have led to some valuable creative output may be useful to adopt again if a compatible problem arises. While even quite similar creative problems may require very different solutions, quite different problems may be able to be reinterpreted in similar ways. We discuss this approach specifically for association and analogy-making but it may hypothetically apply to other components of computational creativity. We develop a proof-of-concept implementation in the domain of computational association, and outline some ways in which this learning of interpretations could be more useful than object- or solution-learning in creative contexts. Models for how previous experiences can influence behaviour could be a valuable addition to learning in creative systems. A computational model able to learn ways to approach creative problems would behave in ways driven by its previous experiences, permitting kinds of autonomy of motivation and action currently missing from most models of computational creativity.
For example, it would be possible to develop a creative system that could autonomously construct aesthetic preferences based on what it has (or has not) experienced, or to learn styles by which it can categorise the work of itself and others, such as described in (Jennings, 2010). A creative system capable of using past experiences to influence its behaviour is a key step towards computationally creative systems that are embedded in the kind of rich historical and cultural contexts which are so valuable to human artists and scientists alike. Learning interpretations in computational association We have previously developed a model of computational association based on the reinterpretation of representations so as to render them able to be mapped. Our model, along with an implementation of it in the domain of ornamental design, is detailed in (Grace, Gero, and Saunders, 2012). We distinguish association from analogy by the absence of the transfer process which follows the construction of a new mapping: analogy is, in this view, association plus transfer. Interpretation-driven association uses a cyclical interaction of re-representation and mapping search processes to both construct compatible representations of two objects and produce a new mapping between them. An interpretation is considered to be a transformation that can be applied to the representations of the objects being associated. These transformations are constructed, evaluated and applied during the course of a search for a mapping, transforming the space of that search and influencing its trajectory while the search occurs. This differs from the theory of re-representation in analogy-making presented in Yan, Forbus and Gentner (2003), as in our system representations are iteratively adapted in parallel with the search for mappings, rather than only after mapping has failed. This permits interpretation to influence the search for mappings, and for mapping to influence the construction, evaluation and use of interpretations in turn. The implementation of this model presented here explores the process of Interpretation Recollection, through which interpretations that have been instrumental in creating past associations can be recalled to influence a current association problem. This process occurs in conjunction with the construction of interpretations from observations made about the current problem. In the model, interpretation recollection is a step in the iterative interpretation process in which the set of past, successful interpretations is checked for any interpretations appropriate to the current situation. These past interpretations will then be considered for application to the object representations alongside other interpretations that have previously been constructed or recalled. A successful interpretation - one that has previously led to an association - can thereby be reconstructed and reapplied to a new association problem. In this paper we demonstrate that this feature of the interpretation-driven model leads to previous experiences influencing acts of association-making, and claim that this is promising groundwork for future investigations into learning in creative contexts. In the implementation described in this paper we use simplified approaches to determining the relevance of previously successful interpretations and reapplying them to the current context.
The metric for determining appropriateness is straightforward: any previous interpretation which has a non-zero effect on a current object representation is determined to be capable of influencing the course of the current association problem, and is included. This simplifies the notion of "appropriate for future use" and leads to an obvious scalability issue, but we demonstrate that this very simple approach influences behaviour. More sophisticated methods for determining when and how known interpretations should be reapplied are an area of future investigation. Experimenting with learnt interpretations As a preliminary investigation into the potential of interpretation-based creative learning, we will demonstrate that the approach we have developed permits previous experience to influence the behaviour of an association system. To illustrate this we will prime the system to produce different results after having experienced different histories. In our system previously constructed associations can influence new association problems through interpretation learning; past associations can act to "prime" the system to produce particular results on future associations. By demonstrating that an association system's experience with one pair of objects can influence its behaviour associating different objects, we show the advantage of the interpretation-based approach to learning. Comparatively, an object-based approach to learning would not have permitted generalisation to an unfamiliar pair of objects. In our experiments the system is exposed to a particular stimulus (either a simple unambiguous association problem or nothing in the case of the control trial) and then attempts to solve an ambiguous association problem that is the same between all trials. Our association system produces many different mappings between any two objects, so a change in the distribution of mappings produced on the second problem is used as an indicator of priming effects. Three trials were conducted. In the first trial no priming association was performed, in the second trial a priming association between Objects 1 and 2 of Figure 1 was performed, and in the third trial a priming association between Objects 1 and 3 of Figure 1 was performed. In each trial an association between Objects 4 and 5, depicted in Figure 2, followed the priming stage. Each trial was performed 100 times, with the system being re-initialised (and re-primed) between each one so that the histories are identical for every association. A distribution of the results of the association between Objects 4 and 5 was produced. All trials were conducted using three relationships: relationships of the relative orientation of shapes, such as ‘~45° difference in orientation'; relationships of the relative vertical separation of shapes, such as ‘~3 units of separation in the Y axis'; and simple binary relationships when two shapes share vertices.

Figure 1: The three objects ((a) Object 1, (b) Object 2, (c) Object 3) used in the priming associations. An association between either Objects 1 and 2 or Objects 1 and 3 is used to prime the interpretation system.

The two associations used for priming are designed to repeatably produce a predictable association based on a predictable interpretation, making them well suited to testing the impact of priming an association system with that interpretation.
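Returning to the non-zero-effect test described above, the recollection step can be sketched as follows. This is a minimal sketch under our own modelling assumptions: the implementation's representations are graph-based, but here they are abstracted as immutable sets of relation labels, and interpretations as pure functions over them.

def recall_interpretations(past_interpretations, representation):
    """A previously successful interpretation is deemed applicable to the
    current association problem if applying it changes the object
    representation at all (a non-zero effect)."""
    return [interp for interp in past_interpretations
            if interp(representation) != representation]

# Example: an interpretation equating ~45-degree-rotation relationships with
# shared-vertex relationships has a non-zero effect on this representation,
# so it would be recalled and considered alongside newly built interpretations.
def equate_rot_with_vertex(relations):
    return frozenset("shared vertex" if r == "~45 rot" else r for r in relations)

print(recall_interpretations([equate_rot_with_vertex], frozenset({"~45 rot"})))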
Experimenting with learnt interpretations

As a preliminary investigation into the potential of interpretation-based creative learning, we will demonstrate that the approach we have developed permits previous experience to influence the behaviour of an association system. To illustrate this we will prime the system to produce different results after it has experienced different histories. In our system previously constructed associations can influence new association problems through interpretation learning; past associations can act to "prime" the system to produce particular results on future associations. By demonstrating that an association system's experience with one pair of objects can influence its behaviour when associating a different pair, we show the advantage of an interpretation-based approach to learning. By comparison, an object-based approach to learning would not have permitted generalisation to an unfamiliar pair of objects. In our experiments the system is exposed to a particular stimulus (either a simple unambiguous association problem or, in the case of the control trial, nothing) and then attempts to solve an ambiguous association problem that is the same across all trials. Our association system produces many different mappings between any two objects, so changes in the distribution of mappings produced on the second problem are used as an indicator of priming effects. Three trials were conducted. In the first trial no priming association was performed, in the second trial a priming association between Objects 1 and 2 of Figure 1 was performed, and in the third trial a priming association between Objects 1 and 3 of Figure 1 was performed. In each trial an association between Objects 4 and 5, depicted in Figure 2, followed the priming stage. Each trial was performed 100 times, with the system being re-initialised (and re-primed) between each run so that the histories are identical for every association. A distribution of the results of the association between Objects 4 and 5 was produced. All trials were conducted using three kinds of relationship: relationships of the relative orientation of shapes, such as '~45° difference in orientation'; relationships of the relative vertical separation of shapes, such as '~3 units of separation in the Y axis'; and simple binary relationships indicating that two shapes share vertices.

Figure 1: The three objects used in the priming associations ((a): Object 1; (b): Object 2; (c): Object 3). An association between either Objects 1 and 2 or Objects 1 and 3 is used to prime the interpretation system.

The two associations used for priming are designed to repeatably produce a predictable association based on a predictable interpretation, making them well suited to testing the impact of priming an association system with that interpretation. The system perceives Objects 1 and 2 and constructs a simple association based on equating the pattern of relative rotations between features in Object 1 with the pattern of shared vertices between features in Object 2. In the other trial, the system perceives Objects 1 and 3 and constructs another simple association, this time equating the pattern of relative rotations in Object 1 with the pattern of relative vertical positions in Object 3. These associations are depicted in Figure 3, with the thick dashed lines between features within the objects denoting relationships that were mapped, while solid lines between features joining the two objects denote which features were mapped to each other.

Figure 2: The two objects used in the test association of all three trials ((d): Object 4; (e): Object 5), which is used to measure the effects of the priming associations.

These simple associations effectively prime the system with an interpretation which will predictably bias the dependent association between Objects 4 and 5. This bias provides a proof-of-concept test of experiential influence. Future studies are needed to determine the scope of influences which historical context can exert in creative systems. The post-priming association problem used in all three trials is designed to have two dominant solutions. Over many runs the system will produce many other associations in addition to these two, but the two dominant solutions will each occur relatively often. The two associations can be seen in Figure 4, with association (a) being between the radial arrangement of shapes in Object 4 and the similar arrangement of touching shapes in Object 5, and association (b) being between the same arrangement in Object 4 and the vertically spaced shapes in Object 5. It is hypothesised that when the system is first primed with the association in Figure 3(a) the solution in Figure 4(a) will be more common (than when unprimed), and that when the system is first primed with the association in Figure 3(b) the solution in Figure 4(b) will instead be more common (than when unprimed). This outcome would demonstrate the feasibility of using interpretation-based learning to enable a creative system's experiences to influence its actions.

Experimental results

The distribution of associations produced in each trial can be seen in Figure 5. Each of the three bars represents one trial, and each of the three shading tones represents a different result, with the darkest tone representing the solution seen in Figure 4(a), the middle tone representing Figure 4(b), and the lightest tone representing all other solutions. The latter category included fragmented mappings (those for which the system could not find a complete mapping of all the shapes in Object 4) based on relationships such as 90° and 135° orientation differences, as well as similar varieties and combinations of vertical separation and vertex-sharing relationships.

Figure 3: The solutions to the association problems used to prime the system in the second and third trials ((a): the Trial 2 priming association, equating ~45° rotation differences in Object 1 with shared vertices in Object 2; (b): the Trial 3 priming association, equating ~45° rotation differences in Object 1 with ~3.0 differences in Y in Object 3). These simple problems predictably influence the experiential component of the creative system in ways that can then be measured.
Although these other solutions are irrelevant to this investigation of priming effects, we note that at present this implementation has no way of evaluating associations other than by the number of features which are mapped. See (Grace, Gero, and Saunders, 2012) for a discussion of the evaluative capabilities of this model and its current implementation. It is clear from Figure 5 that priming the association system with previous problems that rely on compatible interpretations leads to a significant influence on the outcome of the association process. Trial 1, in which no priming is performed, serves as a control against which the frequency of different associations can be compared. In Trial 2 the system is primed with a problem that relies on the adoption of an interpretation equating a pattern of rotational relationships with a pattern of shared vertices. The result for Trial 2 clearly shows that the frequency of solutions relying on this interpretation (such as the one seen in Figure 4(a)) has increased significantly, from 17% in the control to 63% in Trial 2.

Figure 4: Two of the possible solutions to the dependent association between Objects 4 and 5 performed in each trial ((a): a Trial 2 example, equating ~45° rotation differences with shared vertices; (b): a Trial 3 example, equating ~45° rotation differences with ~3.0 differences in Y). Solution (a) uses the same interpretation as used in Figure 3(a), while solution (b) uses the one found in Figure 3(b).

In Trial 3 the system is primed with a problem that relies on equating the same pattern of rotational relationships with a pattern of vertical separation, shown in Figure 4(b). The result for Trial 3 shows a similarly significant increase in frequency, with 36% frequency for the primed trial compared to only 3% in the control. The difference in absolute frequency of the two associations shown in Figure 4 can be explained by the underlying graph structures and the process for searching them used in our model. The association primed for in Trial 2 is based on the "shared vertex" relationship, which is 50% more common in Object 5's graph representation than the "3.0 difference in the Y axis" relationship used in the interpretation primed for in Trial 3. For information on how our system automatically extracts these and other relationships from vector representations of the objects, see (Grace, Gero, and Saunders, 2012).

Figure 5: The distribution of association results in each trial (percentage of solutions using each interpretation, per trial), showing the influence of the priming in Trials 2 & 3.

The commonality of that relationship makes mappings that involve features connected by that relationship similarly more common, which makes it more likely to be utilised by both the mapping and interpretation processes. This bias makes the vertex-sharing relationships much more likely to feature in associations, but priming the system towards a less common result largely overrides it. This can be seen in the twelve-fold increase in the likelihood of the less-common association as compared to the only three-fold increase in the more common one.
These results show that it is possible for the learning of interpretations to influence the behaviour of a creative system, and demonstrate our model of association's capacity for interpretation learning and experiential influence on behaviour. While the influence on behaviour produced in this implementation is limited, these experiments demonstrate that interpretation learning can influence behaviour on problems significantly different from those previously experienced. This shows the potential for more general learning than is possible by solution- or object-based methods, making this approach a valuable building block for modelling learning in computational creativity.

Discussion

The experiments described in this paper are a demonstration that the behaviour of creative systems can be influenced by storing and reusing ways to interpret creative problems. This section discusses the impact on creativity of drawing from experience to reinterpret a problem, and the ways interpretation can influence creative acts. For a more general discussion of our model and how it compares to other models see Grace et al. (2012).

Re-using interpretations for creativity?

There is an intuitive objection to the idea of re-using elements of a previous creative process: that process, or at least that element of the larger creative process, cannot by definition be p-creative. While the process may go on to produce p- or h-creative outputs, it will at least partially be based on things that have been experienced previously. The p-creativity, or lack thereof, of any element of the creative process does not imply any impact on the creativity of the final product, but the objection bears discussion: if drawing on experience will only reduce the creativity of a process, what is its value? Investigations of the diversity of solutions both with and without priming show that there is no significant reduction in the breadth of solutions produced, only in the order in which the system produces them. This is due to the novelty-favouring behaviour of our model, which over time discounts and eventually discards solutions to a particular problem which have repeatedly arisen. Such intrinsic motivations towards novelty are a necessary component of learning creative systems, balancing the desire to repeat the familiar against the desire to explore the new. Suwa, Gero, and Purcell (1999) propose a third element to complement Boden's (1992) categorisation of creativity: 'situational', or s-creativity, describing cases where an object or process is not absolutely new to an agent, but is new within the current situation. This occurs when a familiar idea is considered for the first time in an unfamiliar context, a common outcome of analogy-making and a potent component of experiential learning. This is particularly applicable to the notion of reusing interpretations, which have the potential to transform the solution space of the current problem despite not being a novel process to the agent in question.

Kinds of interpretation and their influence

In the system presented here, interpretations are simple transformations that are stored and re-applied verbatim. However, the notion that interpretations can influence future acts does not require that the previously useful interpretation be literally re-applied to the new context. It would be possible to develop a system in which exemplary, prototypical or generalised interpretations could be reconstructed from experience and applied to the current context.
We define interpretations as transformations applied to the objects being associated, but this need not be a direct transformation of the object representations used by the system. Other elements of the model could be transformed, such as evaluative processes, which would change not the information being used in the creative process but its value metrics. This could lead to experiential influence on aesthetic judgement, similar to the idea of autonomously derived aesthetics proposed by Colton (2011). Alternatively, representational processes of the model could be transformed, for example by relaxing thresholds for categorisation or similarity. This could lead to behaviours like satisficing, a common behaviour of human designers in which requirements are changed during the creative act (Simon, 1957).

Conclusions

This paper proposes the notion of interpretation-learning - the storage and recollection of ways to transform problems - as a complement to more familiar models of object- or solution-learning. Interpretation-learning is hypothesised as being of particular utility in creative contexts, as each creative problem is unique in its solutions, but potentially not in the ways it can be perceived. These remembered interpretations can be thought of as granting a creative system more autonomy over its decision making than other means of deciding how to interpret problems, such as provided heuristics or stochastic processes. We present a simple implementation of a creative system in which past experiences influence behaviour through interpretation, to serve as a proof-of-concept of the notion of interpretation-learning. With this approach demonstrated as feasible and promising, future work can explore its efficiency and effectiveness. Incorporating learning is emerging as an important component of computational creativity due to the growing prominence of desired behaviours like surprise (Maher, 2010), appreciation (Colton, Goodwin, and Veale, 2012) and autopoiesis (Saunders, 2012), which necessarily involve past experience. Learning about specific objects or outcomes is of limited utility in computational creativity, as creative problems are by definition unique. However, learning and recalling different perspectives through which to view objects is one process by which learning in creative contexts could be modelled.

2013_17 !2013 Computational Creativity in Naturalistic Decision-Making
Magnus Jändel, Swedish Defence Research Agency, Stockholm, SE-16490, Sweden, magnus.jaendel@foi.se

Abstract. Creativity can be of great importance in decision-making, and applying computational creativity to decision support is comparatively feasible since novelty and value can often be evaluated by a reasonable human effort or by simulation. A prominent model for how humans make real-life decisions is reviewed, and we identify and discuss six opportunities for enhancing the process with computational creativity. It is found that computational creativity can be employed for suggesting courses of action, unnoticed situation features, plan improvements, unseen anomalies, situation reassessments and information to explore. For each such enhancement opportunity, tentative computational creativity methods are examined. Relevant trends in decision support research are related to the resulting framework, and we speculate on how computational creativity methods such as story generation could be used for decision support.
Introduction

Creativity and decision-making

Before the battle of Austerlitz in 1805, Napoleon deceptively maneuvered to create the impression of weakness and indecision in the French forces. The opposing Russo-Austrian army took the bait, attacked, and fell into a carefully prepared trap, resulting in a crushing defeat. Detecting deception requires an act of creativity in which the reality of the situation is discerned behind a screen of trickery. European history could have taken a different turn with more creative leadership on the Russian and Austrian side. Leaders of today are likewise challenged to be more creative. Given the progress of computational creativity in other fields, it is therefore interesting to pursue its application to decision-making.

Computational creativity for decision support

The key problem in computational creativity is how to automatically assess the novelty and creative value of an idea, concept or artifact that has been generated by computational means (Boden, 2009). This is a very difficult problem in art, where novelty is judged by comparison with extensive traditions, and where evaluation would require implementing computational aesthetic taste. Decision support is fundamentally less challenging. Novelty is often judged against a reasonably short list of options that are known to the decision makers, and value is evaluated by analyzing how the idea works in the situation at hand. Computer simulations are increasingly employed for assisting decision makers, and it is often quite feasible to use simulations for automatic evaluation of suggested ideas. Given the comparative straightforwardness of applying computational creativity to decision support, it appears that there are surprisingly few applications. Some of these are discussed and put in context after we have introduced the framework that is the main result of this paper.

Decision-making models

Applying computational creativity to any given area of decision-making requires substantial domain knowledge, and it is often difficult to see how methods generalize to other domains. Our strategy is therefore to identify generic approaches by analyzing how formal decision-making models can be extended to include computational creativity techniques. Somewhat simplified, decision-making models can be partitioned into two general classes: rational models and naturalistic models. The former prescribe how decisions ought to be made, while the latter describe how people really make decisions. However, many naturalistic models surpass their purely descriptive origins and offer some suggestions on how intuitive decision-making can be improved. In the following two sections we analyze how computational creativity tools can extend rational and naturalistic decision-making models respectively, with a strong focus on a particularly prominent naturalistic model.

Rational decision-making models

Rational decision-making models provide methods for optimally selecting an action from a set of alternative actions (Towler 2010). In utility-based decision-making, for example, it is assumed that each action leads to a set of outcomes and that the probability of each outcome is known or can be estimated. Furthermore, each outcome has a utility, which is a real-valued variable, and the task of the decision-maker is to select the action that is most likely to optimize the utility of the outcome. Other rational schemes extend this approach to cases with multiple objectives and multiple constraints (Triantaphyllou, 2002).
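In its standard textbook form (a clarifying addition here, not a formula from the paper), the utility-based criterion selects the action a* with maximal expected utility:

    a^{*} = \arg\max_{a} \sum_{o} P(o \mid a)\, U(o),

where P(o | a) is the probability of outcome o given action a, and U(o) is the real-valued utility of outcome o.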
The main US Army Military Decision-Making Process (MDMP), for example, is essentially a rational process in which it is required that at least three different courses of action be compared. Rational decision-making models characteristically give little guidance on how to generate the set of action alternatives, although it is tacitly assumed that more alternatives make for better decisions. Since the 1980s, it has been claimed that decision makers typically don't employ rational models (Kahneman, Slovic, and Tversky 1982); it appears further that leaders don't find rational models to be efficient (Yates, Veinott and Patalano 2003) and that generating more alternatives can actually be detrimental to decision quality (Johnson and Raab 2003). Computational creativity could assist rational decision-making by suggesting criteria, enriching the set of action alternatives, envisioning possible outcomes of actions and suggesting factors that should be considered in mental or computer simulations. The methods that could be employed for this are often quite similar to the corresponding methods in naturalistic decision-making, which is the focus of this paper.

Extended naturalistic decision-making model

Naturalistic decision-making models are inspired by research on how decisions are made in domains such as business, firefighting and the military. Investigations indicate that experienced and effective leaders evaluate the nature of the situation intuitively and rarely consider more than one course of action (Klein 2003). Figure 1 summarizes a leading naturalistic decision-making model: the recognition-primed decision (RPD) model. For the moment, please ignore the symbols CC1, CC2, ..., CC6. This paragraph briefly reviews the work of Klein and coworkers on RPD (Klein, Calderwood, and Clinton-Cirocco 1986; Ross, Klein, Thunholm, Schmitt, and Baxter 2004; Klein 2008). The experienced decision maker evaluates the state of affairs and will normally recognize a familiar type of situation. Recognition means that the relevant cues or indicators are pinpointed; expectancies about how the situation will appear and unfold are identified; the kinds of goals that are reasonable to pursue are recognized; and a typically short list of courses of action is found. In the following we use course of action, or action, to designate the conceptual level of a top-level plan that, if implemented, will consist of a chain of component actions or plan elements. As a reality check, the expectancies are analyzed and compared to available information. Any anomalies found trigger an iteration of the recognition process, in which more information may be sought and the situation is reassessed, sometimes leading to a major shift in how the situation is perceived. Eventually, the decision maker arrives at a satisfactory, anomaly-free situation recognition and selects the most promising course of action for scrutiny. The consequences of performing the selected course of action are simulated, either mentally or by computer. This may lead to the course of action being rejected and another option being selected for a new round of simulation. Frequently it is found that the course of action is promising but has some unwanted consequences. Rather than rejecting the course of action, decision makers will try to repair the plan by modifying the chain of plan elements that implements the course of action. It is implicit in Figure 1 that modified courses of action are re-simulated.
Eventually the decision-maker will find a satisfactory course of action, which will be implemented. Note that the RPD process does not include a search for the optimal course of action, the optimal implementation or quantitative utility criteria. If the plan satisfies the recognized goals, it is deemed to be ready for implementation.

Figure 1. Computational creativity extensions to the recognition-primed model. Filled circles denote computational creativity agents. Everything else in the figure is quoted from Klein (2008).

How can decision makers that use RPD or some similar naturalistic decision-making model take advantage of computational creativity? In Figure 1, we mark six slots where a computational creativity agent could be plugged into the RPD process. The computational creativity agents are called CC1, CC2, ..., CC6, and these symbols are used in the following to highlight where the different computational creativity extensions are mentioned. For each agent we provide a mnemonic tag, discuss in which way it could improve decision-making, and provide a sketch of at least one creativity technology that could be applicable. Finally, we provide a speculative example illustrating why creative input could be of great value in the corresponding decision-making phase.

CC1 (proposing actions): Computational creativity could be used for suggesting a broader range of courses of action in a recognized situation. The CC1 agent would work under the assumption that the situation and the relevant goals are correctly identified, and that the creative task is to find unrecognized course-of-action alternatives that lead towards the "plausible goals" in Figure 1. The decision maker has identified the nature of the situation, which will suggest a well-defined search space of actions. Some of the actions in the search space are explicitly known to the decision maker and would hence be found in the list of actions indicated in Figure 1. The RPD process evaluates listed actions. Creative suggestions should hence point to feasible actions that are significantly different from already-listed actions. The CC1 algorithm must define a similarity metric in action space, and the list of actions that are known to the decision-maker should be available to the CC1 agent so that it can avoid searching too close to known actions. Ideally, the CC1 agent uses a simulation engine for confirming the approximate validity of courses of action, but it might also be possible to fall back on human adjudication. Candidate courses of action that are far from known courses of action according to the metric, and that pass the simulation test, are suggested to the decision maker and added to the known list (a sketch of this generate-filter-validate loop is given at the end of this section). A government wanting to integrate an island population into the mainland society may for example consider courses of action such as building a bridge, airport or ferry terminal. The CC1 agent, realizing that the known courses of action all relate to physical connectivity, may suggest courses of action such as investing in telepresence or locating a new university on the island.

CC2 (proposing features): Simulations are never completely realistic; they model some aspects of the situation at hand with higher fidelity than others and ignore many aspects altogether. The CC2 agent could suggest features that should be included in computer or mental simulations. Such ideas might be crucial for success, since the acuity of the simulation is essential for the quality of the plan that implements the selected course of action.
Consider for example a decision maker trying to control flooding caused by a burst dam. The core simulation would be concerned with modeling how actions influence the flow of water. A CC2 agent searching historical records of floods could come up with the suggestion that modeling the spread of cholera might be important. The CC2 agent could, for example, grade the importance of candidate features by measuring how often they are mentioned in news stories on flood-related events.

CC3 (repairing plans): Computational creativity could be used for suggesting how a promising but somewhat flawed plan can be repaired. Assume that simulation has exposed at least one problem with the current course of action, and that the decision makers have the mental or computational means for re-planning but are out of ideas. The task of the CC3 agent is to provide an idea for how the problem can be solved. The main planning process can then use the idea to drive the next iteration of re-planning. Consider a case in which the main planning process is a planning algorithm (Ghallab, Nau, and Traverso 2004) that works by searching for a chain of plan elements that implements the course of action. Each plan element has a set of prerequisites and a set of consequences. The planner searches for chains of plan elements where all prerequisites are satisfied, the consequences at the end of the chains match the goals, and the general direction of the plans is consistent with the currently considered course of action. If the planning algorithm fails to find a problem-free plan, the CC3 agent could suggest a new plan element. This creative output is validated if the planner solves or alleviates the problem by using the suggested plan element in a modified plan. The task of the CC3 agent could be construed as search in the space of possible plan elements, where the identified problem may be used for heuristic direction of the search. Note that the CC3 agent should not be another planner that by explicit planning guarantees that the suggested plan element solves the problem; it is sufficient that suggested plan elements have a high probability of contributing to the solution. The CC3 agent should obviously be aware of the present set of plan elements that are used by the main planner, and avoid suggesting elements that are identical or very similar to currently known plan elements. A government may for example have selected reduction of the national carbon footprint as the chief course of action for environmental protection, but fail to find a plan that reaches the target. A CC3 agent could then suggest enhanced weathering, in which crushed rock absorbs carbon dioxide, as a new plan element.
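As a minimal sketch of the planning substrate a CC3 agent would feed into (hypothetical names; a simplified STRIPS-like model with add-effects only, not the formulation of Ghallab, Nau, and Traverso (2004)):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // A plan element is applicable when its prerequisites hold; its
    // consequences then become true. Delete-effects are omitted for brevity.
    class PlanElement {
        final String name;
        final Set<String> prerequisites;
        final Set<String> consequences;

        PlanElement(String name, Set<String> prerequisites, Set<String> consequences) {
            this.name = name;
            this.prerequisites = prerequisites;
            this.consequences = consequences;
        }
    }

    class ChainChecker {
        // A chain of plan elements implements the goals if every element's
        // prerequisites are satisfied by the state reached so far, and the
        // final state contains all the goals.
        static boolean validChain(List<PlanElement> chain,
                                  Set<String> initialState,
                                  Set<String> goals) {
            Set<String> state = new HashSet<>(initialState);
            for (PlanElement element : chain) {
                if (!state.containsAll(element.prerequisites)) {
                    return false; // an unmet prerequisite breaks the chain
                }
                state.addAll(element.consequences);
            }
            return state.containsAll(goals);
        }
    }

In these terms, a CC3 agent would propose a new PlanElement whose prerequisites are satisfiable in the problem-ridden plan; the suggestion is validated if the main planner can now complete a valid chain using it.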
CC4 (identifying anomalies): Computational creativity could be used for identifying anomalous expectancies in the current perspective on the situation. When a decision maker has identified the situation as familiar, it is often difficult to notice aspects of the situation that do not fit into the familiar context. It is crucial to find any anomalies, since they might trigger a radical reassessment of the situation. The CC4 agent is best applied when the decision-maker has exhausted the manual search for anomalous expectancies and is ready to proceed with action evaluation. A simple version of the combinatorial approach could explore the space of situation features, searching for pairs of features that in combination stand out as anomalous. This could be done by investigating second-order attributes of the features and noting how combinations of attributes interact. The obscure features method (McCaffrey and Spector 2011) might be adapted for this purpose. Simulation methods that are used for evaluating actions could also be applied to examining anomaly candidates, with validated anomalies escalated for human consideration. The Russian and Austrian leaders at Austerlitz would have benefited from a CC4 agent suggesting that Napoleon's uncharacteristic eagerness to negotiate and seemingly panicky abandonment of important positions were anomalies deserving serious attention.

CC5 (situation assessment): Supporting reassessment of the situation is the most challenging creative task. Imagine that the decision maker has noted a number of anomalies indicating that the present situation recognition is flawed, but no viable alternative interpretations pop up in human minds. People are often locked into habitual trains of thought, and this behavior is frequently aggravated by time pressure, fear and group-think. Computational creativity is free from such human frailties and might be able to suggest new ways of looking at the situation. A single idea might be enough to provide the Aha! experience that releases the intuitive power of the decision maker. A simple implementation of a CC5 agent could use a library of case histories enshrining human expert findings in a broad range of circumstances. A sufficiently small volume of decision-making experience could, as noted by M. Boden (personal communication), advantageously be codified as a checklist. CC5 agents would be needed only in contexts in which the total span of assessment possibilities is large and inscrutable. A police officer leading the investigation of suspected arson in an indoor food market could, for example, benefit from the suggestion that spontaneous combustion of pistachio nuts might be an alternative perspective on the evidence (see Hill (2010) for further information on spontaneous combustion). The CC5 agent would in this case use encyclopedic knowledge that pistachio nuts are a kind of food and that pistachio nuts are subject to spontaneous combustion, combined with records of historical cases in which suspected arson has been found to be explained by spontaneous combustion.

CC6 (recommending information): Computational creativity could be used for suggesting what kind of information could support reassessment or resolution of apparently anomalous expectancies. Decision makers often have access to vast archives and abundant streams of news and reports. Selecting what deserves attention is a difficult and sometimes creative task. Decision makers would be biased by their present understanding of the situation, so the CC6 agent might be able to provide a fresh perspective. The task of the CC6 agent is quite similar to that of the CC2 agent; it must explore the space of information sources and information aspects for the purpose of identifying novel and valuable pieces of information. A doctor confronted with anomalous symptoms could, for example, get suggestions from a CC6 agent regarding which medical tests to apply.
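The six agents share a common generate-filter-validate shape, most explicit in CC1: propose a candidate, keep it only if it is far from everything the decision maker already knows, and validate it by simulation. The sketch below illustrates that loop; ActionSpace, SimilarityMetric and Simulator are hypothetical placeholders for the metric and simulation engine discussed under CC1, not components of any existing system:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical interfaces standing in for the components named in the
    // text: a generator over the action space, a similarity metric, and a
    // simulation engine for approximate validity checks.
    interface Action { }

    interface ActionSpace {
        Action sampleCandidate();            // propose a course of action
    }

    interface SimilarityMetric {
        double distance(Action a, Action b); // metric over action space
    }

    interface Simulator {
        boolean simulatesOk(Action a);       // approximate validity check
    }

    class Cc1Agent {
        // Suggest actions that are far from every known action and that
        // pass the simulation test.
        static List<Action> suggest(ActionSpace space, SimilarityMetric metric,
                                    Simulator sim, List<Action> known,
                                    double minDistance, int attempts) {
            List<Action> suggestions = new ArrayList<>();
            for (int i = 0; i < attempts; i++) {
                Action candidate = space.sampleCandidate();
                boolean novel = true;
                for (Action k : known) {
                    if (metric.distance(candidate, k) < minDistance) {
                        novel = false;       // too close to a known action
                        break;
                    }
                }
                if (novel && sim.simulatesOk(candidate)) {
                    suggestions.add(candidate);
                    known.add(candidate);    // avoid re-suggesting nearby ideas
                }
            }
            return suggestions;
        }
    }

The minDistance threshold encodes "significantly different from already-listed actions"; accepted candidates join the known list so that later suggestions stay spread out across the action space.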
Discussion and conclusions

In this section we first discuss some current applications of computational creativity to decision support in relation to the framework described in the previous section, and then speculate on how selected approaches to computational creativity could be applied to decision support.

Examples of computational creativity in decision support

Computer chess is probably the most advanced current application of computational creativity in decision support. Grand masters learn creativity in chess by studying how computers play (Bushinsky 2012). The main reason for this success is that chess is a very complex but deterministic game that can readily be simulated. The complexity of the game makes it possible for a computer to discover solutions that escape the attention of humans, and simulation combined with heuristic assessment of positions enables automatic evaluation of computer-generated solutions. The main components of creativity, novelty and value, are therefore attainable by chess programs. Referring to Figure 1, we note that chess players use computational creativity mainly for suggesting courses of action (CC1) and repairing plans (CC3). Tan and Kwok (2009) demonstrate how Conceptual Blending Theory (Fauconnier and Turner 2002) can be used for scenario generation intended for defense against maritime terrorism. The scenarios are examples of CC5 agent output, since they assist decision makers in assessing situations that may look peaceful and familiar at first sight but in which creative insights may reveal an insidious attack pattern. Deep Green is a DARPA research program that aims for a new type of decision support for military leaders (Surdu and Kittka 2008). According to the Deep Green vision, commanders should sketch a course of action using an advanced graphical interface while AI assistants work out the consequences and suggest how the plan can be implemented. The AI assistants are guided by thousands of simulations that explore how the situation could evolve and what factors are important. According to our analysis in the previous section, Deep Green seems to include development of CC2 and CC3 computational creativity agents, although computational creativity is not explicitly mentioned in the Deep Green program.

Emerging tools

Creative story generation could be turned into tools for decision support. Story generation techniques that spin a yarn connecting a well-defined initial state with a given final state (Riedl and Sugandh, 2008) could be used by CC2 agents for suggesting improvements in simulations forecasting the outcome of plans. The CC2 agent could generate stories that start with the present situation, implement the course of action under consideration and end with failure. Analysis of the generated story could give decision makers insights into aspects and circumstances that should be simulated carefully. CC3 agents could also use the stories for suggesting countermeasures. With a comprehensive domain-related supply of vignettes, story generation might even be used for situation assessment by CC5 agents. It is interesting to note that the techniques of vignette-based story generation are similar to those of planning algorithms and simulation engines. The difference is in purpose rather than in methodology. Story generators aim for novelty, planners for optimality, and simulations for sufficiently realistic modeling of some relevant aspects of reality. It can be difficult for decision makers to fully understand the ramifications of goals that have been adopted by opponents or partners. This may cause errors in situation recognition and in identifying relevant expectancies in the current situation assessment.
Agent-based story generation, in which an open-ended story evolves driven by conflicting goals (Meehan 1976), could be useful for both CC4 and CC5 decision support agents. Such stories could give a fresh perspective from a different point of view, help identify anomalies and possibly inspire reassessment of the situation. Consider a CC3 agent that, as discussed in the previous section, is tasked with coming up with new plan elements for the purpose of repairing a failed plan. Li et al. (2012) extend conceptual blending (Fauconnier and Turner, 2002) to incorporate goals, with application to algorithms for generating hypothetical gadgets engineered to fulfill the goals. This methodology could be applied to algorithms for CC3 agents in which the goals are derived from the needs of the jammed planning process, and generated "gadgets" would be plan elements with prerequisites that can be fulfilled in the context of the problem-ridden plan and consequences designed to be instrumental for unjamming the planning process. Jändel (2013) describes information fusion systems extended with computational creativity agents of type CC5. The agents aid in uncovering deceit by comparing generic deception strategies to the present situation, and guide the fusion process to explore alternative situation assessments.

Future applications

There are many research opportunities in the confluence of computational creativity and naturalistic decision-making, both with respect to algorithms for the six types of agents indicated in this paper and with respect to research into the effect and efficiency of computational creativity in various domains of decision-making. Pioneering areas of application will probably be in high-stake strategic decision-making, where time and resources are at hand and leaders are willing to go to great lengths in order to minimize risks and ensure decision quality. Bridgehead applications will therefore most likely be in fields such as defense strategy, major economic and environmental decisions, and strategic business planning. As methods and tools evolve and the level of automation increases, computational creativity will increasingly be applied to operative and tactical decision-making as well.

Acknowledgments

This research is financed by the R&D programme of the Swedish Armed Forces.

2013_18 !2013 Nobody's A Critic: On The Evaluation Of Creative Code Generators - A Case Study In Videogame Design
Michael Cook, Simon Colton and Jeremy Gow, Computational Creativity Group, Imperial College, London

Abstract. Application domains for Computational Creativity projects range from musical composition to recipe design, but despite all of these systems having computational methods in common, we are aware of no projects to date that focus on program code as the created artefact. We present the Mechanic Miner tool for inventing new concepts for videogame interaction, which works by inspecting, modifying and executing code. We describe the system in detail and report on an evaluation based on a large survey of people playing games using content it produced. We use this to raise issues regarding the assessment of code as a created artefact and to discuss future directions for Computational Creativity research.

Introduction

Automatic code generation is not an unusual concept in computer science. For instance, many types of machine learning work because of an ability to generate specialised programs in response to sets of data, e.g., logic programs (Muggleton and de Raedt 1994).
Also, evolutionary systems can be seen to produce code either explicitly, in the case of genetic programming, or implicitly through evolutionary art software that uses programmatic representations to store and evaluate populations of artworks. Moreover, in automated theory formation approaches, systems such as HR (Colton 2002) generate logic programs to calculate mathematical concepts. These programs are purely for representation, however, rather than in pursuit of creative programming. In software engineering circles, 'metaprogramming' is used to increase developer efficiency by expanding abstract design patterns, or to increase adaptability by reformatting code to suit certain environments. None of these instances of code generation fully embraces the act of programming for what it is - a creative undertaking. There can be no field better placed to appreciate programming in this way than Computational Creativity. Building software that can generate new software, or modify its own programming, opens up huge new areas for Computational Creativity, as well as enriching all existing lines of research by allowing us to reflect on our systems as potential artefacts of code generators or modifiers themselves. We attempt here to highlight some of these future opportunities and challenges by describing the design of a prototype system, Mechanic Miner (Cook et al. 2013), which designs a particular videogame element - game mechanics - by inspecting, modifying and executing Java game code. Mechanic Miner produced game mechanics for A Puzzling Present, a platform game released in December 2012 and downloaded more than 5900 times. This game included survey and logging code to assess, among other things, the quality of the mechanics generated by Mechanic Miner in terms of perceived enjoyability and the challenge in using them. In analysing the data and evaluating the system, however, we have noticed issues with current notions of assessment within Computational Creativity research, and how they interact with the idea of evaluating a creative system whose output is program code. We explore these issues below. In this paper we make the following contributions:
• We describe the development of a creative system that generates code as its output.
• We report on the first large-scale experimental evaluation of interactive computationally-created artefacts.
• We discuss issues involving the assessment of creative systems working in media with a high barrier to entry.
The rest of this paper is organised as follows: in Mechanic Miner - Overview we describe Mechanic Miner in full, detailing how it generates and evaluates new game mechanics through code. In A Puzzling Present - Evaluation Through Play we describe A Puzzling Present, a game designed and released using mechanics invented by Mechanic Miner; we discuss the difficulties in evaluating interactive code, how a balance can be struck between presenting a survey and offering a natural experience to the user, and present some results from our survey. In the section Creativity in Code Generation, we highlight issues for the future of code generation, as well as promising opportunities for Computational Creativity. In Related Work we briefly describe previous approaches to mechanic generation and highlight why code generation is necessary to advance in this area. Finally, in Conclusions we review our achievements and reflect on where our work with game mechanics will lead next.
Mechanic Miner - Overview

Definitions

Many conflicting definitions exist for game mechanics, as described, for instance, in (Sicart 2008), (Kelly 2010) and (Cook 2006). For our purposes here, we define a game mechanic as a piece of code, executed whenever a button is pressed by the player, that causes a change in the game's state. How a game mechanic is defined in code will vary from game to game, depending on the architecture of the game engine, the way the game has been coded within that engine, and the idiosyncrasies of the individuals who wrote the rest of the game code. For example, below is a line of code from a game written in the Flixel game engine. Executing the code causes the player character to jump, by adding a fixed value to its velocity (the player's gravity will counteract this change over time and bring the character to the ground again).

    player.velocity.y -= 400;

Mechanic Miner generates artefacts within a subspace of game mechanics, which we have called Toggleable Game Mechanics (TGMs). A TGM is an action the player can take to change the state of a variable. That is, given a variable v and a modification function f with inverse f^-1, a TGM is an action the player can take which applies f(v) when pressed the first time, and f^-1(v) when pressed a second time. The action may not be perfectly reversible; if v is changed elsewhere in the code between the player taking actions f and f^-1, the inverse may not set v back to the value it had when f was applied to it. For instance, if v is the player's x co-ordinate, and the player moves around after applying f, then their x co-ordinate will not return to its original value after applying f^-1, as it was modified by the player moving.

Generation

Mechanic Miner is written in Java, and is therefore able to take advantage of the language's built-in Reflection features that allow program code to inspect and explore other code (we further extended this core functionality by employing the Reflections library from http://code.google.com/p/reflections). For example, the following code retrieves a list of the fields of a given class:

    MyClass.getClass().getFields()

Such Field objects can be manipulated to yield their name or their type, or can even be passed objects of the appropriate type to find the value of that field within the object. Java has similar objects to represent most other language features, such as Methods and generic types. Given the definition of a TGM above, we can see that Reflection allows software to store the location of a target field at runtime, and dynamically alter its value. Using the Reflections library, Mechanic Miner can therefore obtain a list of all classes currently loaded, and iterate through them asking for their available fields. It can use information on the type of each field to conditionally select modifiers that can be applied to the field. Java's Reflection features do not provide encapsulation for primitive operations such as mathematical operators, assignment or object equality. To solve this problem, we created custom classes to represent these operations, which enabled Mechanic Miner to select modifiers for a field that could be applied during evaluation. Thus, a TGM is composed of a java.lang.reflect.Field object and a type-specific Modifier. For example, a mechanic that doubled the x co-ordinate of the player object would use the org.flixel.FlxSprite object's x field, and an IntegerMultiplyModifier defined as follows:
    public void apply(Field f) {
        // toggled_on tracks the TGM's state so its effect can be reversed:
        // one branch applies f (multiply), the other its inverse f^-1 (divide).
        // getValue/setValue are the custom field-access wrappers described above.
        if (toggled_on) {
            f.setValue(f.getValue() * coefficient);
        } else {
            f.setValue(f.getValue() / coefficient);
        }
    }

Here coefficient is set to 2 in the case of doubling, but can be set by Mechanic Miner to an arbitrary value as it evaluates potential mechanics. Note the use of a boolean flag, toggled_on, to retain the state of the TGM so that its effect can be reversed. Modifiers were selected to give coverage of key operations that might be performed on fields, such as inverting the value of a boolean field, adding or multiplying values for a numerical field, or setting numerical fields to exact values (such as zeroing a field, and then returning it to its original value). Future extensions we plan for the generation process will allow for the use of mathematical discovery tools such as HR (Colton 2002) that could invent calculations which transform the values of the fields.

Evaluation

In order to evaluate generated mechanics, we need strong criteria that describe the properties desirable mechanics should have. In the version of Mechanic Miner described here, we focus purely on the utility of a mechanic (that is, whether it affords the player new possibilities when playing the game) rather than how fun the mechanic is to use, how easy it is to understand, or how appropriate it is for the context. Utility is not only easy to define, but can be defined in absolute terms, which provides a solid target for a system to evaluate towards. To illustrate how utility can be identified by Mechanic Miner, consider the game level shown in Figure 1. The player starts in the location marked 'S' and must reach the location marked 'X', and when they do, we say that the player has solved the game level.

Figure 1: A sample level used to evaluate mechanics.

The game operates similarly to a simple game such as Super Mario: the player is subject to gravity, but can move left and right as well as jumping a small distance up. Under these rules alone, the level is not solvable, because the central wall is too high and impedes progress. Therefore, if we were to add a new game mechanic such as the inversion of gravity, and as a result the level were to become solvable, we could conclude that the new mechanic had expanded the player's abilities and allowed them to solve a level of this type. This idea is central to Mechanic Miner's evaluation of mechanics - it uses a solver to play game levels in a breadth-first fashion, trying legal combinations of button presses while remaining agnostic about what mechanics the buttons relate to. It will continue to search for combinations of button presses until it finds at least one solution; at this point it continues looking for combinations of that length, completing the breadth-first expansion of this depth, and will then return a list of all paths that led to a solution. Hence it can try arbitrary mechanics without knowing in advance what the associated code does when executed. This enables it to firmly conclude whether the mechanic has contributed to the player's abilities by assessing which areas of the level are accessible that were not previously, which in turn enables it to assess the level itself.
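A minimal sketch of such a breadth-first solver over button sequences follows (the GameSim interface is a hypothetical stand-in for the game engine; this is not the Mechanic Miner source). It expands sequences one depth at a time, and once a depth yields any solution it finishes that depth and returns every solving sequence of that length:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for the engine: reset the level, replay a
    // sequence of button presses, and report whether the exit was reached.
    interface GameSim {
        boolean solves(List<Integer> buttonPresses);
    }

    class BreadthFirstSolver {
        // Returns all shortest button-press sequences that solve the level,
        // remaining agnostic about what code each button actually executes.
        static List<List<Integer>> solve(GameSim sim, int numButtons, int maxDepth) {
            List<List<Integer>> frontier = new ArrayList<>();
            frontier.add(new ArrayList<>());          // start from the empty sequence
            for (int depth = 0; depth < maxDepth; depth++) {
                List<List<Integer>> next = new ArrayList<>();
                List<List<Integer>> solutions = new ArrayList<>();
                for (List<Integer> seq : frontier) {
                    for (int b = 0; b < numButtons; b++) {
                        List<Integer> extended = new ArrayList<>(seq);
                        extended.add(b);
                        if (sim.solves(extended)) {
                            solutions.add(extended);  // a solution at this depth
                        } else {
                            next.add(extended);
                        }
                    }
                }
                if (!solutions.isEmpty()) {
                    return solutions;  // depth fully expanded; return all solutions
                }
                frontier = next;
            }
            return new ArrayList<>();  // unsolvable within maxDepth
        }
    }

Because the solver only sees button indices and a solved/unsolved verdict, it can evaluate arbitrary generated mechanics exactly as described above.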
Level Generation

Mechanic Miner's ability to simulate gameplay in order to evaluate mechanics can also be applied in reverse, to act as a fitness function when generating levels for specific mechanics using evolutionary techniques. Representing a level design as a 20x15 array of blocks that are either solid or empty, we can evaluate the fitness of a level with respect to a mechanic M by playing the level twice - once with only the basic controls available, and once with M added to the controls. If the level is solvable with M, but not solvable without it, then the level is given a higher fitness. Using a binary utility function as our primary evaluation criterion strengthens the system's ability to provide exact solutions to the problem - either the level is completed, or it is not. In order to have a gradient between the two, so that the evolutionary level designer can progress towards good levels, we moderate the fitness based on what proportion of the level was accessible. Thus, over time, levels that are more accessible emerge, until eventually the exit is reachable from the start position (and thus the level is solvable).
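A sketch of this two-pass fitness evaluation, reusing the hypothetical simulator idea from the previous sketch (the interface and the 0.5 weighting are illustrative assumptions, not the published implementation):

    // Hypothetical interface: play a candidate level with a given control set
    // and report solvability and the fraction of tiles that were reachable.
    interface LevelSim {
        boolean solvable(boolean[][] level, boolean withMechanic);
        double accessibleFraction(boolean[][] level, boolean withMechanic);
    }

    class LevelFitness {
        static double fitness(LevelSim sim, boolean[][] level) {
            boolean withM = sim.solvable(level, true);
            boolean withoutM = sim.solvable(level, false);
            if (withM && !withoutM) {
                return 1.0;  // the mechanic is required: the level we want
            }
            if (withoutM) {
                return 0.0;  // solvable without the mechanic: reject
            }
            // Unsolvable either way: reward partial accessibility so the
            // evolutionary search has a gradient towards solvable levels.
            return 0.5 * sim.accessibleFraction(level, true);
        }
    }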
Figure 2 shows a level generated for use with a mechanic called gravity inversion. Activating the mechanic would cause gravity to pull the player towards the ceiling instead of the floor; activating it again would reverse the effect.

Figure 2: A level generated by Mechanic Miner for the 'gravity inversion' mechanic. The player starts in the 'S' position and must reach the exit, marked 'X'.

Note that the level is not solvable without this mechanic, as the platforms are too high to jump onto. The simulation-driven approach to level design allowed the resulting software to be highly parameterised. Information such as the minimum number of distinct actions required to solve a level (where each button press is considered a distinct action), or the number of times a mechanic must be used, allowed the system to generate levels with different properties. It also allows the system to remain blind to the mechanic it is designing for. This allows Mechanic Miner to exploit created mechanics without having a human intervene and describe aspects of the mechanic to it, giving it greater creative independence: it is theoretically able to discover a wholly new mechanic in an H-creative way, and generate levels for that mechanic without any assistance. We can view this within the creativity tripod framework of (Colton 2008), which advocates implementing skill, appreciation and imagination in software. In particular, we see the ability to use output from one system to inspire creative work in another, without external assistance, as an example of skill, as well as of an appreciation of what makes a game mechanic useful to the player. We also claim that simulating player behaviour is, in some sense, imagining how they would play.

Illustrative Results

Below are examples of mechanics generated by Mechanic Miner. All of the effects can be reversed by the player:
• An ability to increase the player's jump height, allowing them to leap over taller obstacles.
• An ability to rubberise the player, making them able to bounce off platforms and ceilings.
• An ability to turn gravity upside down, sucking the player upwards.
These mechanics are evident in commercially successful games, such as Cavanagh's VVVVVV, which featured gravity inversion as a core mechanic. Bouncing was an unexpected result for us, as we had no idea it was in the space of possibilities, although it has featured in some games developed in other engines, particularly Nygren's NightSky. Cavanagh has received multiple nominations in the Independent Games Festival (IGF), and NightSky was shortlisted for Excellence In Design and the Grand Prize in the 2009 IGF. Novel game mechanics are highly prized in game design circles. Many international design awards have tracks for innovative gameplay or mechanics (such as the IGF Nuovo Award2), and game design events often centre around the creation of unique methods of interaction (such as the Experimental Gameplay Workshop3). Mechanic Miner's ability to reinvent existing but niche mechanics is encouraging, given the small design space the system currently has access to. As well as creating mechanics, Mechanic Miner was also able to find exploits in the supplied game code, and use them to create emergent gameplay - something which we had not anticipated as a capability of the system.
2 http://www.igf.com/
3 http://www.experimental-gameplay.org/
One mechanic, which teleported the player a fixed distance left or right, was used by Mechanic Miner to design levels which at first glance had no legal solution. After inspecting the solution traces produced by the simulator, it became clear that the mechanic was being used in an innovative way to take advantage of a weakness in the code that described the player's jump. Jumping checked whether the player's feet were in contact with a solid surface; by teleporting inside a wall, this check would be passed, and the player could jump upwards. Repeated applications of this technique allowed the player to jump up the side of walls - a complicated exploitation of code more commonly seen in high-end gameplay by speedrunners4, i.e., gamers who compete over finding exploits in popular videogames to help them complete the games in the shortest time possible. For example, speed runs of the popular puzzle game Portal involve the abuse of 3D level geometry to escape the level's boundaries and pass through solid walls.
4 Such as the community at http://speeddemosarchive.com/

A Puzzling Present - Evaluation Through Play

To evaluate some of the mechanics and levels designed by the Mechanic Miner system, we developed a short compilation game featuring hand-selected mechanics, titled A Puzzling Present (APP). APP was released in late December 2012 on the Google Play store and desktop platforms5. The objective was to conduct a large-scale survey of players in order to gain feedback on the types of mechanic generated by the system, in addition to evaluating different metrics for level design. However, we were also conscious that interruptions to play, or overt presentation of the software as an experiment rather than a game, might deter players from completing levels or giving feedback, and/or change the nature of the experiment, which is to ask their opinion of games, not surveys. In designing APP, we therefore made several tradeoffs to balance these two factors.
5 Download from www.gamesbyangelina.org/downloads/app.html
All play sessions were logged in terms of which buttons the player pressed and at what times, which can be used to fully replay a given player's attempt at a level. In addition to this, upon starting the game for the first time, the player was asked to opt in to short surveys after each level. These took the form of two multiple-choice rating tasks on a 1-4 scale, evaluating enjoyability and difficulty. Figure 3 shows the survey screen. This presented itself to the player upon reaching the exit to a level, assuming the player had agreed to respond to surveys, although even in this case they could continue without responding to the survey. 75614 sessions were recorded in total, over 5933 unique devices. When asked to opt in to surveys, 60.7% of users agreed. Those who opted in contributed 63.4% of the total session count.
92.3% of sessions played by opt-ins resulted in at least one of the two questions being answered, and 89.9% of sessions resulted in both questions being answered. Although the survey questions provided a rich source of data, by allowing us to gain qualitative evaluations of the levels and game mechanics, the log data (which is recorded for all players) is equally valuable; by allowing players who did not wish to participate in the survey to continue to play the game (or those who initially agreed to change their minds later), we gained an additional 32,000 level traces which we otherwise might have lost.

Figure 3: Survey screen from A Puzzling Present.

APP contained thirty levels, split into sets of ten that share a common mechanic. The three game mechanics are those described in the Illustrative Results section above: higher jump, bouncing and gravity inversion. Each level required the game mechanic to be used to complete it, but levels were generated using differing metrics for difficulty, expressed through evolutionary parameters within the level designer. These were broken down as follows: two levels used a baseline setting determined through experimentation ('Baseline'); two levels put stricter requirements on the minimum reaction times needed ('Faster Reaction'); two levels selected for longer paths from start to exit ('Longer Path'); two levels selected for more mechanic use in the shortest solutions ('Higher Mechanic Use'); and two levels selected for longer action chains in the solution ('Longer Solution'). This provided a variety of levels for the player to test, and allowed us to analyse feedback data to assess these metrics for future use. In order to mitigate bias or fatigue introduced as a result of experiencing certain levels or sets of levels before others, the order in which a particular player experienced the levels was randomised when the game was first started. This was done by first randomising the order of the game mechanics, and then randomising the order of the ten levels within each set, thereby ensuring that all levels which share a mechanic are experienced together, to provide a more cohesive experience. Figure 4 shows the mean difficulty and fun ratings for the nth level played as people progressed through the 30 levels. These mean ratings remained fairly consistent throughout the game, with the exception of the 30th level. As levels were presented randomly, we assume this is an effect of the very low number of people still playing at this point. This consistency indicates that learning or fatigue did not seem to have much effect on player experience. This may be down to the interactivity of the artefact in question, and raises the question of whether the evaluation of created artefacts is more consistent when the survey participants are interactively engaged. We discuss this later as future work.
Figure 4 shows the mean difficulty and fun ratings for the nth level played, as players progressed through the 30 levels. These mean ratings remained fairly consistent throughout the game, with the exception of the 30th level. As levels were presented randomly, we assume this is an effect of the very low number of people still playing at this point. This consistency indicates that learning or fatigue did not seem to have much effect on player experience. This may be down to the interactivity of the artefact in question, and raises the question of whether the evaluation of created artefacts is more consistent when the survey participants are interactively engaged. We discuss this later as future work.
Figure 4: Mean fun (white circles) and difficulty (black circles) ratings for the nth level played.
Figure 5: Mean level fun and difficulty, broken down by ‘world' (a group of levels that share a mechanic).
The number of players completing a given set (world) of ten levels for a certain mechanic is consistent across the three game mechanics; 2,259 completed World one, 2,151 completed World two and 2,219 completed World three. The data show no bias towards players failing to complete any particular one of the three worlds, suggesting that players left due to general fatigue with the system as a whole, rather than the content generated by Mechanic Miner. This may be down to the human-designed elements of the game that were common throughout the three worlds - such as the interface, control scheme, or artwork - and therefore not attributable to the output of Mechanic Miner. Under statistical analysis of the survey scores, we found a moderate and highly significant rank correlation between mean difficulty and enjoyability (Spearman's ρ = 0.56, p = 0.002). The relationship between the difficulty of a level and the perceived enjoyability of a level is an interesting one to consider. While we might expect an inverse relationship for an audience who are easily frustrated with games, we also see many examples of games in which challenge correlates with an enjoyable game. We postulate that the correlation between mean difficulty and enjoyability exists here because the levels are, on average, too easy - the average difficulty rating across all levels is just 1.45, on a scale of 1 to 4 - and so an increase in difficulty was welcomed, as it made the levels more interesting. A later study, with improved difficulty metrics to give a broader spread of skill levels, would help confirm this hypothesis.
Group                 Mean Fun  Mean Difficulty
High Jump             1.96      1.38
Invert Gravity        2.02      1.55
Bounce                2.03      1.42
Baseline              1.96      1.30
Faster Reaction       2.01      1.51
Longer Path           1.95      1.20
Higher Mechanic Use   2.03      1.60
Longer Solution       2.06      1.66
Figure 6: Mean level fun and difficulty, broken down by game mechanic and level design parameters.
The mean fun and difficulty by world mechanic and level generation metric are shown in Figure 6. Variations in mean fun are very small between groups, whereas mean difficulty shows greater separation, especially between the metrics. An analysis of variance (ANOVA) showed highly significant (p < 0.001) separate main effects for fun and difficulty with respect to both factors. There was also a significant interaction between mechanic and metric, which we do not report here. Post-hoc Tukey's HSD tests suggested the following significant differences between groups: a) the mechanics Invert Gravity and Bounce are more fun than High Jump; b) the metrics Faster Reaction, Higher Mechanic Use and Longer Solution are more fun than Baseline and Longer Path; c) all differences in mean difficulty between mechanics, and between metrics, are significant.
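A sketch of the analyses reported above, using synthetic data (the paper's raw ratings are not reproduced here): Spearman rank correlation between difficulty and fun, a two-way ANOVA over mechanic and metric, and Tukey's HSD post-hoc comparisons.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
mechanics = np.repeat(["high_jump", "invert_gravity", "bounce"], 10)
metrics = np.tile(["baseline", "faster_reaction", "longer_path",
                   "higher_use", "longer_solution"], 6)
base_fun = {"high_jump": 1.96, "invert_gravity": 2.02, "bounce": 2.03}
fun = np.array([base_fun[m] for m in mechanics]) + rng.normal(0, 0.05, 30)
difficulty = 1.45 + 0.5 * (fun - fun.mean()) + rng.normal(0, 0.05, 30)
df = pd.DataFrame({"mechanic": mechanics, "metric": metrics,
                   "fun": fun, "difficulty": difficulty})

# Rank correlation between mean difficulty and enjoyability.
rho, p = spearmanr(df["difficulty"], df["fun"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

# Two-way ANOVA: main effects plus mechanic-by-metric interaction.
model = ols("fun ~ C(mechanic) * C(metric)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Post-hoc pairwise comparisons between mechanics.
print(pairwise_tukeyhsd(df["fun"], df["mechanic"]))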
Creativity In Code Generation Nobody's a Critic Many different approaches to assessing creativity in software have been proposed over the last decade of Computational Creativity research. Ritchie (2007) suggests that the creativity of a system might be established by considering what the system produces, evaluating the artefacts along such lines as novelty, typicality and quality. This leads to the proposal of ratios between sets of novel artefacts produced by a system, and sets that are of high quality, for instance. While this is helpful in establishing the performance of a given system, it presupposes both a minimum level of understanding in those assessing the system, and a direct connection between the means of interaction with the artefact, and the generated work itself. In the case of software - particularly interactive media whose primary purpose is entertainment - we are not guaranteed either of these. The consumers of software, such as those that evaluated A Puzzling Present, are often laypeople to the world of programming, even if they are highly experienced in interacting with software. More importantly, there is a disconnect between the presentation of code through its execution within a game environment, and the nature of the generated code itself. All software designed for use by the general public - from word processors to video games - presents a metaphorical environment in which graphics, audio and systems of rules come together to present a cohesive, interactive system with its own internal logic, symbols, language and fiction. In A Puzzling Present in particular, generated game mechanics operated on obscure variables hidden away within a complex class structure. To the interacting player, this is simply expressed as objects moving differently on-screen. This disconnect makes it hard for any user to properly evaluate the generated code itself, because they are not engaging with the underlying representation or mechanics of the software they are using. Other approaches to evaluation consider the process of creation itself as crucial to the perception of creativity. In (Colton, Charnley, and Pease 2011) the authors propose the FACE model, which considers elements of the creative process such as the generation of contextual information (which the authors call framing) and the use and invention of aesthetic judgements that affect creative decision-making. This focus on the process is a promising alternative to the artefact-heavy assessment methods that are more common in Computational Creativity, but problems abound here also, since in order to judge the creative process, a person must be able to comprehend that process to some degree. As noted in (Johnson 2012), the majority of the systems in Computational Creativity have focused on ‘old media' application domains, such as the visual arts, music and poetry. Although the skill ceiling for these media is undeniably high, they have very low barriers to entry. Most people have drawn pictures as children, attempted to crack new jokes, or hummed improvised ditties to themselves. While they may not exhibit even a small percentage of the virtuosity present at the top end of the medium in question, by engaging in the creation of artefacts, they can appreciate the process and are better placed to comment on it - or at least they feel so, even if this is not the case. As a result, creative systems operating in the realm of old media often find truth in the term ‘everyone's a critic'. By contrast, programming is a skill that is only recently being taught below university level in the western world; therefore, asking the general public to assess the creativity of a code generator by commenting on its creative process is unlikely to result in a useful or fair assessment. This phenomenon - where nobody is a critic - makes it hard to apply existing thinking on the evaluation of creative systems to large-scale public surveys. Speaking In Code If neither the artefact-centric nor the process-centric approach is suitable for assessing creative code generators, this raises the question of how we can proceed in assessing these systems on a large scale.
We believe the key may be one particular element of the FACE model of (Colton, Charnley, and Pease 2011), namely framing information that describes an artefact and the process that created it, as explored further in (Charnley, Pease, and Colton 2012). Code is not designed to be read by people. Extensive education is needed to understand the basics of programming structure and organisation, including additional time spent on learning specific languages. Even experienced programmers do not rely on these skills alone to understand program code - instead they leave plain-English comments so that others, and they themselves, will be able to understand the meaning of code long after it has been written. In interactive media, the need to explain features legibly, and correctly situated within the (possibly fictional) context of the software, is especially integral to the user's understanding and enjoyment of a piece of software. Video games, for instance, rely on their ability to create an immersive environment where all functionality is communicated through the fiction of the game world in question. The arcade game Space Invaders is not about co-ordinates overlapping and numbers being decremented - it is about shooting missiles at aliens and protecting your planet from attack. This all amounts to a clear need to build into creative code generation systems the ability to explain the function of the code they produce. This could be done either by annotating and describing the function of the raw code itself or, in the case of presenting artefacts to a layperson for assessment or consumption, by describing the function of the code in terms of the metaphors and context dictated by the software the code is part of. In the latter case, this poses interesting problems more akin to creative natural language generation. Videogames, for example, must describe the functionality of game mechanics in terms of what they enable the player to do within the game world.
Figure 7: Framing information in Stealth Bastard.
Figure 7 shows the Stealth Bastard game (Biddle 2012) explaining how to complete a level. Note the use of a physical verb (enter), a symbolic noun (exit) and a reference to meta-game objectives (completing a level). These are concepts unrelated to the technical specifics of game code, but crucial to the player's understanding of the thematic and ludic qualities of the game. The generation of textual descriptions of both the creative process and the generated code is crucial in enabling these systems to be assessed by the general public. It will also become more important in autonomously creative systems that generate code for use in interactive contexts, where the meaning of the code must be conveyed clearly to a user. This is a highly prized feature of human-designed software6 and is crucial in autonomously creative systems where artefacts are not subject to curation prior to their use.
6 E.g., as promoted in Apple's Human Interface Guidelines.
Beyond Software Considering program code as an artefact produced by a creative system allows us to reconsider existing creative systems as potential code generators themselves. Modules within creative systems might be able to integrate criteria such as those described in (Ritchie 2007) into a process of self-exploration and modification - where new code is created for generative submodules, and evaluated according to its ability to produce content along axes such as novelty, typicality or quality.
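A speculative sketch of the self-modification loop just suggested; nothing here corresponds to an existing system, and the scoring functions are toy stand-ins. Candidate variants of a generative submodule are produced, their outputs scored along Ritchie-style axes (novelty and quality), and the better variant kept.

import random

def novelty(artefact, seen):
    return 0.0 if artefact in seen else 1.0

def quality(artefact):
    return len(set(artefact)) / len(artefact)  # toy quality measure

def mutate(params, rng):
    # Vary one parameter of the generative submodule.
    return {**params, "length": max(1, params["length"] + rng.choice([-1, 1]))}

def generate(params, rng):
    return "".join(rng.choice("ab") for _ in range(params["length"]))

rng = random.Random(0)
params, seen, best = {"length": 4}, set(), 0.0
for _ in range(20):
    candidate = mutate(params, rng)
    artefact = generate(candidate, rng)
    score = novelty(artefact, seen) + quality(artefact)
    seen.add(artefact)
    if score >= best:
        best, params = score, candidate  # keep the better submodule
print(params, best)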
Code generation should not be thought of, therefore, as a distinct strand of Computational Creativity that runs alongside other endeavours in art, poetry and the like. Instead, it should be viewed as a new lens through which to view existing takes on Computational Creativity, and a new way to improve the novelty and ingenuity of creative systems of all kinds. If generic notions of novelty or typicality for code can be developed, then they can be applied across mediums to great effect. Comparisons of code segments have been explored within verification and software engineering approaches (Bonchi and Pous 2012; Turon et al. 2013), but for the purposes of Computational Creativity a significantly different approach will be required, as we consider the ludic, aesthetic and semantic similarities in the output of a piece of code, rather than its raw data. If this can be done, creative software will no longer need to be considered static, but will instead be empowered with the ability to generate new functionality within itself; creative artefacts will no longer need to be considered finished when they leave a piece of software, but could improve and iterate upon their designs in response to use; and creative software will no longer be seen as simply executing code written by humans, but as a collaborator in its own creation. Related Work The generation of game mechanics is closely related to the design of game rules in the more abstract sense. METAGAME (Pell 1992) is an early example of a system that attempted to generate new game rulesets. It worked by varying existing rulesets from well-known boardgames such as chess and checkers, using a simple grammar that could express the games as well as provide room for variation. Grammar-based approaches to ruleset generation are common in this area, perhaps most prominently seen in Ludi (Browne and Maire 2010), which evolved boardgame rulesets from a grammar of common operations, or work in (Togelius and Schmidhuber 2008) and (Cook and Colton 2012), which present similar work for real-time videogames. Grammar-based approaches work well because they explore spaces of games that are defined by common core concepts, but they are naturally limited by the nature of the human-designed grammar as a result (a toy sketch of such a grammar appears at the end of this section). An alternative approach that can cover a broader space is to use annotated databases of mechanical components, and then assemble them to suit a particular design problem. Work in (Nelson and Mateas 2007) uses this approach to design games around simple noun-and-verb input, while (Treanor et al. 2012) use an annotated database approach to develop games that represent a human-defined network of concepts. Smith and Mateas (2010) present an alternative approach, describing a generator of game rulesets without an evaluative component. The system they describe uses answer set programming to define a design space through a set of logical constraints. Solutions to these constraints describe game rulesets; therefore, if constraints are chosen to restrict solutions to a certain space of good games, solving them will yield high-quality games. These criteria can be narrowed down by adding further constraints to the answer set program. This can be seen as somewhat related to grammatical approaches - higher-level concepts are defined by hand (such as ‘character movement' or ‘kill all') which are then selected for use later. This has similar limitations to the grammatical approaches, in that it is dependent on external input to define its initial language, and this restricts the novelty of the system as a result. The future work proposed in (Smith and Mateas 2010) was to focus more on programmatic modification, however, which would have further distinguished the approach.
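As a concrete illustration of the grammar-based approach described in this section (in the spirit of METAGAME or Ludi, but with an invented toy grammar): rulesets are sampled by recursively expanding a hand-written grammar, which makes the space easy to search but bounds it by what the grammar can express.

import random

GRAMMAR = {
    "game":          [["board", "win_condition", "move_rule"]],
    "board":         [["grid_8x8"], ["grid_10x10"], ["hex_7"]],
    "win_condition": [["capture_all"], ["reach_far_row"], ["connect_sides"]],
    "move_rule":     [["step_orthogonal"], ["step_diagonal"], ["jump_over_piece"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a flat list of terminal rules."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    out = []
    for sym in production:
        out.extend(expand(sym, rng))
    return out

rng = random.Random(1)
print(expand("game", rng))  # e.g. ['grid_8x8', 'connect_sides', 'step_diagonal']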
Conclusions and Further Work We have described Mechanic Miner, a code modification system for generating executable content for videogames, and A Puzzling Present, a publicly released game built using content generated by Mechanic Miner. We showed that code can be used as both a source material and a target domain for Computational Creativity research, and that it can lead to greater depth than working with meta-level abstractions of target creative domains, offering surprise and novelty even on a small scale. Through evaluation of gameplay responses, we drew conclusions about the presentation of creative artefacts to large audiences for evaluation. Finally, we raised the issue of how created artefacts can be evaluated by an audience which, in general, has no experience in the domain the artefacts reside within. This work has also highlighted several areas of future work needed to expand the concepts behind Mechanic Miner and to prove the worth of the approach in generating more sophisticated mechanics and games. These include work to expand the expressiveness of the code generation, so that it can include higher-level language concepts such as method invocation, expression sequences, control flow and object creation. This will lead to a large expansion of the design space, which will raise issues of efficiency and evaluation that also bear further investigation. We will also be using our experimental results to tune our existing metrics for level and mechanic design, and to drive further development in systems such as Mechanic Miner, to increase their autonomy and their ability to seek out novel content. We are particularly interested in how different difficulty metrics can be combined to produce a diverse set of game content. We will also consider the looming problem of code generation's relationship with metaphorical gameplay. Game designer and critic Anna Anthropy describes games as "an experience created by rules" (Anthropy 2012). The way in which this experience is created, however, is deeply grounded in the player's ability to connect the systems inside a game with the real world. In Super Mario, for instance, eating a mushroom makes you larger, and conveys extra speed and jumping power. In the game's code, this is simply a collision of two objects, and some state changes. Notions such as size visually indicating strength or ability, or the idea that consuming food can improve your strength, are fundamentally connected to real-world knowledge, and less evident simply by looking at code. Finding ways in which software can discover these relationships for itself will be a major hurdle in developing code generators capable of designing meaningful game content, but also a gateway to an unprecedented level of creative power for software, and an opportunity to bring art, music, narrative and mechanics together in a more meaningful way than ever before. The field of Computational Creativity was founded on the belief that computers could be used to simulate, enhance and investigate aspects of creativity, and researchers have created many complex pieces of software by hand.
We believe that the time is ripe to move this a step further, and to turn the ideas we have developed on our own creations; to reconsider our artificial artists, composers and soup chefs as pieces of code that can be assessed, altered and improved at the same level of granularity at which they were created. In order to do so, however, we may need to challenge some assumptions we hold about certain creative mediums and the relationship the general public has with them. Acknowledgements We would like to thank the reviewers for their input and suggestions, particularly regarding the discussion section. 2013_19 !2013 A model for evaluating interestingness in a computer-generated plot Rafael Pérez y Pérez, Otoniel Ortiz Departamento de Tecnologías de la Información Universidad Autónoma Metropolitana, Cuajimalpa Av. Constituyentes 1054, México D. F. {rperez/oortiz}@correo.cua.uam.mx Abstract This paper describes a computer model for evaluating the interestingness of a computer-generated plot. In this work we describe a set of features that represent some of the core characteristics of interestingness. Then, we describe our computer model in detail and explain how we implemented our first prototype. We assess four computer-generated narratives using our system and present the results. For comparison, we asked a group of subjects to give their opinions about the interestingness of the same four stories. The outcome suggests that we are headed in the right direction, although much more work is required. Introduction Evaluation is a core aspect of the creative process, and if we are interested in building creative systems we need to develop mechanisms that allow them to evaluate their own outputs. The purpose of this project is to contribute in that direction. This paper describes a model for evaluating the interestingness of a computer-generated plot. It is part of our research project in computer models of narrative generation. Some time ago we developed a computer model of narrative generation (Pérez y Pérez and Sharples 2001; Pérez y Pérez 2007). Our model distinguished three core characteristics: coherence, novelty and interestingness. To test our model we built an agent that generated plots. Now, we are interested in developing a model to evaluate the coherence, novelty and interestingness of a computer-generated narrative. In this way, our storyteller agent will be able to evaluate its own outputs. We thus expect to understand better how the evaluation process works and, as a consequence, how the creative process works. Due to space limitations this document only discusses the central features of our model for the evaluation of interestingness (the reader can find published work describing the main characteristics of our model for the evaluation of novelty in Pérez y Pérez et al. 2011). We are aware that human evaluation of interestingness is a very complex task and we are far from understanding how it works. Nevertheless, we believe that computer models, like the one we describe in this text, can shed some light on this challenging aspect of human creativity. Related Work There have been several discussions about how to assess computational creativity. For example, Ritchie (2007) suggests criteria for evaluating the products of a creative process (the process is not taken into consideration); in general terms, such criteria evaluate how typical and how valuable the product is.
Colton (2008) considers that skill, imagination and appreciation are characteristics that a computer model needs to be perceived to have. Jordanous (2012) suggests having a set of human experts evaluate characteristics like Spontaneity and Subconscious Processing, Value, Intention and Emotional Involvement, and so on, in a computer-generated product. All these are interesting ideas, although some are too general and difficult to implement (e.g. see Pereira et al. 2005). Some work has been done on the evaluation of plot generation: "A computer model might be considered as representing a creative process if it generates knowledge that does not explicitly exist in the original knowledge base of the system and which is relevant to (i.e. is an important element of) the produced output. Note that this definition involves inspection of both the output of the program and its initial data structures... we refer to this type of creativity as computerised creativity (c-creativity)" (Pérez y Pérez and Sharples 2004). Peinado et al. (2010) have also worked on the evaluation of stories, although their work was oriented to assessing novelty. An area that some readers might consider related to this work is interactive drama and drama managers. A good example of this type of system is the work by Weyhrauch (1997). However, rather than evaluating the plot and the creative process, drama managers focus on evaluating the user's experience while playing the game. Some other systems might employ different techniques, e.g. case-based systems (Sharma et al. 2010), but the goal is the same: to provide a pleasant experience to the user. Description of the Model This work describes a model to evaluate the interestingness of a computer-generated plot. Such a plot is known as the new story or the new narrative. For the purpose of this project, we consider a narrative interesting when it is recounted in a correct manner and when it generates new knowledge. A story is recounted in a correct manner when it follows the classical Aristotelian structure of a story: introduction, development, climax and resolution (or setup, conflict and resolution). Some previous work has shown the relation between the Aristotelian structure and the evaluation of interestingness in computer-generated plots (Pérez y Pérez and Sharples 2001). We are particularly interested in evaluating the opening and the closure of a story. We consider that a story has a correct opening when at the beginning there are no active dramatic tensions in the tale and then the tension starts to grow. We consider that a story has a correct closure if all the dramatic tensions in the story are solved when the last action is performed. An important characteristic of the recountal of a story is the introduction of unexpected obstacles. In this work an obstacle is unexpected when the story seems to finish (final part of the resolution section) and then new problems arise. Following Pérez y Pérez and Sharples (2004), we believe that the generation of new knowledge contributes to a narrative being considered interesting. Some studies in motivation, curiosity and learning seem to support this claim (e.g. see Deckers 2005). In the same way, writers have pointed out how good narratives are a source of new knowledge (e.g. see Lodge 1996). In this work a new story generates new knowledge when:
- It generates knowledge structures that did not exist previously in the knowledge base of the system and that can be employed to build novel narratives.
- It generates a knowledge widening, i.e.
when existing knowledge structures incorporate unknown information obtained from the new story. This information can be employed to build novel narratives. Our computer model of evaluation is based on expectations. So, the assessment of the new knowledge structures and the knowledge widening is performed by analysing how much the new story modifies the knowledge base, and then comparing whether such modifications satisfy the given expectations. In the same way, the evaluations of unexpected obstacles and of the correctness of the narrative's recountal are performed by analysing the structure of the new narrative, and then assessing whether such a structure fulfils these expectations. Finally, all these partial results are considered to obtain a final evaluation of interestingness. The following lines elaborate these ideas. Generating Original Structures One of the key aspects of c-creativity is the generation of novel and relevant knowledge structures. That is, a storyteller must develop narratives that increment its knowledge base (in this work we focus on how the knowledge base of the evaluator is incremented). Thus, a storyteller must include mechanisms that allow: 1) incorporating within its knowledge base the new information generated by its outputs, i.e. it must include a feedback process; 2) comparing its knowledge base before and after feeding back a new tale (an interesting point for further discussion is to compare the processes that different systems might employ to perform these tasks). In this way, the first part of the model focuses on determining the proportion of new structures. It requires a parameter known as the Minimal Value of New Structures (Min-NS); it represents the minimum number of new structures expected to be created by the new story. In this way, the Proportion of New Structures (PNS) is defined by the ratio between the number of new structures (NNS) created by the new narrative and the Minimal Value of New Structures (Min-NS). If the number of new structures is bigger than its minimal value, the Proportion of New Structures is set to 1:
PNS = NNS / Min-NS if NNS ≤ Min-NS; PNS = 1 if NNS > Min-NS
Besides calculating the number of new structures, it is necessary to determine how novel they are, i.e. to verify whether they are similar to the information that already exists in the knowledge base. For this purpose we define a parameter known as the Limit of Similitude (LS), which represents the maximum percentage of alikeness allowed between two knowledge structures. So, all those new structures that are too alike to already existing structures must be eliminated. In other words, one must get rid of all new structures that are at least LS% equal to any existing structure. The number of surviving structures is known as the Original Value (O-Value); they represent new structures that are not similar to any old structures. As in the previous case, the model requires an expected Minimum Original Value (Min-OV) to calculate the Proportion of the Original Value (POV), and, as in the previous case, this proportion can never be bigger than 1:
POV = O-Value / Min-OV if O-Value ≤ Min-OV; POV = 1 if O-Value > Min-OV
So, POV represents the percentage to which the new narrative satisfies the expected number of original new structures. The Novelty of the Knowledge Structures (NKS) is defined as the ratio between the O-Value and the number of new structures (NNS):
NKS = O-Value / NNS
It represents the percentage of the new structures that is original. In this way, if the O-Value is identical to the number of new structures, the NKS is equal to 1 (100%). That means that all new structures satisfy the requirement of novelty.
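The three proportions just defined translate directly into code; the following is our transcription of the formulas above, not the authors' implementation (the guard against NNS = 0 is our addition).

def pns(nns, min_ns):
    """Proportion of New Structures, capped at 1."""
    return min(nns / min_ns, 1.0)

def pov(o_value, min_ov):
    """Proportion of the Original Value, capped at 1."""
    return min(o_value / min_ov, 1.0)

def nks(o_value, nns):
    """Novelty of the Knowledge Structures: share of new structures that are original."""
    return o_value / nns if nns else 0.0  # zero-NNS guard is ours

# Example, using the expected minimums set later in the paper (Min-NS = 7, Min-OV = 5):
print(pns(nns=9, min_ns=7), pov(o_value=4, min_ov=5), nks(o_value=4, nns=9))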
A variant of the process of creation of knowledge structures is known as knowledge widening. It occurs when existing knowledge structures incorporate within their own structure unknown information obtained from the new story. This concept is inspired by Piaget's ideas about accommodation and assimilation (Piaget 1952). So, the model requires knowing the amount of unknown information incorporated into the knowledge base; we refer to it as the number of new elements. So, in order to calculate the Proportion of Knowledge Widening (PKW), it is necessary to know the Number of New Elements (NNE) and an expected Minimum value of New Elements (Min-NE):
PKW = NNE / Min-NE if NNE ≤ Min-NE; PKW = 1 if NNE > Min-NE
Thus, PNS, POV, NKS and PKW provide information to evaluate how much new knowledge is generated. Analysing the Story's Structure We defined earlier that a story is recounted in a correct manner when it follows the classical Aristotelian structure: setup, conflict and resolution. The story's structure in this work is represented by the graphic of the curve of the dramatic tensions in the tale. Tensions represent conflicts between characters. When the number of conflicts grows, the value of the tension rises; when the number of conflicts decreases, the value of the tensions goes down; when the tension is equal to zero, all conflicts have been solved. Thus, we analyse the characteristics of the graphic of tension to evaluate the presence of unexpected obstacles and how well recounted the story is. In this way, our evaluation model requires a mechanism to depict the dramatic tension in the tale. There are four basic cases of graphics of tensions that we consider in this work: one complete curve (see figure 1-a); several complete curves (see figure 1-b); one incomplete curve (see figure 1-c); several incomplete curves (see figure 1-d). It is also possible to find combinations of these cases. A curve is defined as complete when its final amplitude is zero; that is, all tensions are resolved. By contrast, the final amplitude of an incomplete curve never reaches the value of zero.
Figure 1. Examples of graphics of tensions: a) one complete curve; b) several complete curves; c) one incomplete curve; d) several incomplete curves.
The peak of a curve represents the climax of a narrative; if we have a sequence of curves, we refer to the peak with the highest amplitude as the main climax. So, in a sequence, first the story reaches a situation with high levels of tension, after which tensions start to loosen up and then rise again; this cycle can be repeated. Each peak is a climax; each loosening is a resolution of such a climax. We refer to the situation where a narrative has a resolution and then tensions start to rise again as reintroducing-complications. We can find variations of the basic graphics of tensions we enumerated earlier. For example, the depth of each valley in a sequence of incomplete curves might be different for each instance; in the same way, the amplitude of the peaks of sequences of complete or incomplete curves might change between them; and so on. The difference between having a single curve and having a sequence of curves is that in the former there is only one high point in the story, while in the latter we have two or more high points, i.e. new obstacles for the characters are initiated, reintroducing complications in this way.
The difference between a sequence of complete curves and a sequence of incomplete curves is that in the former all tensions are solved before new tensions arise; in the latter, new tensions emerge before the current ones are worked out. An incomplete curve is very similar to a complete curve if the fall of the tensions is close to 100% with respect to its peak, i.e. if the amplitude is close to the value of zero. On the other hand, if the fall of the tensions is close to 0% with respect to its peak, i.e. if the amplitude is close to the value of its peak, we practically do not have an incomplete curve. In this work we appreciate narratives that seem to end and then reintroduce new problems for the characters. In other words, we want narratives where all tensions are solved (complete curves) or are almost solved (incomplete curves with deep valleys) and then rise again. This formula can be observed in several examples of narratives like films, television series and novels (nevertheless, the model allows experimenting with different values of valley depth). Thus, different graphics of tensions produce different characteristics in the narrative. We hypothesize that a story that includes more curves of tension is more exciting than a story that includes fewer curves, because the former reintroduces more complications. However, too many curves make the story inadequate. So, it is necessary to find a balance. In this way, our model requires setting a number that represents the ideal number of complete curves that a story should comprise. We refer to this number as the Ideal Value of Complete Curves (Ideal-CC). So, because we can calculate the number of complete curves (Num-CC) in any new narrative, and because we have defined an ideal number for them, it is possible to estimate how close the number of curves is to its ideal value. We refer to this number as the Proportion of Complete Curves (PCC):
PCC = Num-CC / Ideal-CC if Num-CC ≤ Ideal-CC; PCC = 1 − (Num-CC − Ideal-CC) / Ideal-CC if Ideal-CC < Num-CC ≤ Ideal-CC · 2; PCC = 0 if Num-CC > Ideal-CC · 2
It is important to explain how Num-CC is calculated. As will be explained below, a story must include at least one complete curve to be considered properly recounted. But this curve itself does not reintroduce problems. The reintroduction of complications occurs when the current ones are sorted out and then new complications (i.e. new complete curves) emerge. In this way, Num-CC only registers those complete curves that actually reintroduce new conflictive situations. The process to calculate the incomplete curves is a little different. The goal is to calculate how close the set of incomplete curves is to its ideal value. Remember that too many curves or too few curves produce inadequate results. It is necessary to know the number of incomplete curves (Num-IC) and the Ideal Value of Incomplete Curves (Ideal-IC) to calculate the Proportion of Incomplete Curves (PIC):
PIC = Num-IC / Ideal-IC if Num-IC ≤ Ideal-IC; PIC = 1 − (Num-IC − Ideal-IC) / Ideal-IC if Ideal-IC < Num-IC ≤ Ideal-IC · 2; PIC = 0 if Num-IC > Ideal-IC · 2
Now, it is necessary to analyse each of the curves to see how close they are to their ideal value. One starts by getting the amplitude of the first peak and the amplitude of the bottom part of its valley; the ratio between the valley and the peak indicates the percentage, with respect to its peak, that the valley needs to be expanded to reach zero. So, if the peak's amplitude is 10 and the valley's is 4, the valley needs to be expanded 40% to reach zero. The process is repeated for all incomplete curves.
The summation of these results is known as the Summation of Incomplete Curves (SIC):
SIC = Σ_{i=1..Num-IC} (Amplitude-Valley_i / Amplitude-Peak_i) + (Ideal-IC − Num-IC) if Num-IC ≤ Ideal-IC; SIC = Σ_{i=1..Num-IC} (Amplitude-Valley_i / Amplitude-Peak_i) if Num-IC > Ideal-IC
Notice that, if the number of incomplete curves is smaller than the ideal value of incomplete curves, the difference between them is added to the summation. So, the value of SIC represents how far the set of incomplete curves is from its ideal value. So, if SIC ≈ 0, the new narrative totally satisfies the requirement for reintroducing complications (all curves have deep valleys); if SIC ≈ Ideal-IC, the valleys are so small that we practically do not have incomplete curves. Now, given an Ideal Number of Incomplete Curves (Ideal-IC), it is possible to calculate to what percentage the amplitude of all incomplete curves is similar to its ideal value. We refer to this value as the Total Amplitude of Incomplete Curves (TAI), which is defined as follows:
TAI = (Ideal-IC − SIC) / Ideal-IC if SIC ≤ Ideal-IC; TAI = 0 if SIC > Ideal-IC
If SIC > Ideal-IC, we have too many incomplete curves whose amplitudes do not provide useful information for the evaluation. Regarding the recountal of a story, we consider that a narrative follows the classical Aristotelian structure when its graph of tension includes at least one complete curve, i.e. the tension at the beginning and at the end of the story is zero, and at least once the value of the tension between these two points is different from zero. So, in this project we analyse whether the story under evaluation has an adequate opening and an adequate closure in terms of tensions. A story has an adequate opening (A-Opening) when the tension in the story goes from zero at the beginning of the story to some value greater than zero at the first peak:
A-Opening = (Amplitude of First Peak − Amplitude at t=1) / Amplitude of First Peak
In this way, because our goal is a continuous growth of tension from zero to the first peak, this formula indicates what percentage of this goal is achieved. One common mistake, particularly among inexperienced writers, is to finish a story leaving loose ends. Thus, following Pérez y Pérez and Sharples, a story "should display an overall integrity and closure, for example with a problem posed in an early part of the text being resolved by the conclusion" (Pérez y Pérez and Sharples 2004). In this way, in order to have an Adequate Closure (A-Closure), all conflicts must be worked out at the end of the story. That is, the value of the tension at the last action must be equal to zero. So, it is necessary to perform a process similar to the one employed to calculate the incomplete curves: one needs to get the amplitude of the curve's main peak and the amplitude of the bottom part of the last valley, and then calculate by what percentage the tension goes down:
A-Closure = 1 − Amplitude of Last Valley / Amplitude of Main Peak
If the final amplitude of the curve is zero, i.e. if it goes down 100%, the Adequate Closure is set to 1; if the curve goes down 30%, the Adequate Closure is set to 0.3; and so on.
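A sketch of these structural scores, under our reading of the formulas above. It assumes the peak and valley amplitudes have already been extracted from the tension curve; the segmentation of the curve into complete and incomplete curves is not shown.

def a_opening(first_peak, initial_tension):
    """A-Opening: how much of the rise from zero to the first peak is achieved."""
    return (first_peak - initial_tension) / first_peak

def a_closure(main_peak, last_valley):
    """A-Closure: how far tension falls from the main peak by the last action."""
    return 1.0 - last_valley / main_peak

def sic(valleys, peaks, ideal_ic):
    """Summation of Incomplete Curves: valley/peak ratios, padded when
    there are fewer incomplete curves than the ideal number."""
    total = sum(v / p for v, p in zip(valleys, peaks))
    num_ic = len(valleys)
    return total + (ideal_ic - num_ic) if num_ic <= ideal_ic else total

def tai(sic_value, ideal_ic):
    """Total Amplitude of Incomplete Curves."""
    return (ideal_ic - sic_value) / ideal_ic if sic_value <= ideal_ic else 0.0

# Example: one incomplete curve with peak 10 and valley 4 (the 40% ratio
# from the text), against an ideal of one incomplete curve:
s = sic(valleys=[4], peaks=[10], ideal_ic=1)
print(s, tai(s, ideal_ic=1), a_opening(10, 0), a_closure(10, 0))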
Calculation of Interestingness Thus, our model employs the following characteristics:
- Proportion of New Structures (PNS)
- Proportion of the Original Value (POV)
- Novelty of the Knowledge Structures (NKS)
- Proportion of Knowledge Widening (PKW)
- Adequate Opening (A-Opening)
- Adequate Closure (A-Closure)
- Proportion of Complete Curves (PCC)
- Proportion of Incomplete Curves (PIC)
- Total Amplitude of Incomplete Curves (TAI)
The first six characteristics (PNS, POV, NKS, PKW, A-Opening and A-Closure) are known as the core characteristics (CoreC); the last three are known as the complementary characteristics (ComplementaryC). This distinction emerged after talking to some experts in the science of human communication, who pointed out to us that a story can be interesting even if there are no reintroductions of complications (that is, even if there are no extra complete or incomplete curves). The experts agreed that the reintroduction of problematic situations might add interest to the story, but that it is not essential to it. So, we decided that they would complement the evaluation of the core characteristics (a kind of extra points). It is necessary to set a weight for each of the core characteristics. The sum of all weights must be equal to 1. Thus, the Evaluation of Interestingness (I) is equal to the summation of the value of each core characteristic (CoreC) multiplied by its weight (W):
I = Σ_{i=1..6} CoreC_i · W_i
The Complement (Com) is equal to the summation of the value of each complementary characteristic multiplied by its complementary weight (w). The sum of all complementary weights ranges from zero to 1:
Com = Σ_{i=1..3} ComplementaryC_i · w_i
Thus, the total value of interest (TI) is given by:
TI = I + Com if (I + Com) ≤ 1; TI = 1 if (I + Com) > 1
If we combine the values obtained from the correct recountal of a story and the reintroduction of complications, then we can calculate a parameter that we refer to as excitement (E):
E = A-Closure · W_A-Closure + A-Opening · W_A-Opening + PCC · w_PCC + PIC · w_PIC + TAI · w_TAI
Thus, E assigns a value to the increments and decrements of tension during the story.
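A sketch of this final combination step; the weights follow Table 1 below, expressed here as fractions of 1, and the dictionary keys are our own naming.

CORE_W = {"PNS": 0.10, "POV": 0.10, "NKS": 0.15,
          "PKW": 0.15, "A-Opening": 0.25, "A-Closure": 0.25}
COMP_W = {"PCC": 0.05, "PIC": 0.0, "TAI": 0.05}

def interestingness(core, comp):
    i = sum(core[k] * CORE_W[k] for k in CORE_W)    # I
    com = sum(comp[k] * COMP_W[k] for k in COMP_W)  # Com
    ti = min(i + com, 1.0)                          # TI is capped at 1
    return i, com, ti

def excitement(core, comp):
    """E combines the tension-based scores with their weights."""
    return (core["A-Closure"] * CORE_W["A-Closure"]
            + core["A-Opening"] * CORE_W["A-Opening"]
            + sum(comp[k] * COMP_W[k] for k in COMP_W))

core = {"PNS": 1.0, "POV": 0.8, "NKS": 1.0, "PKW": 0.5,
        "A-Opening": 1.0, "A-Closure": 0.9}
comp = {"PCC": 1.0, "PIC": 0.0, "TAI": 0.6}
print(interestingness(core, comp), excitement(core, comp))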
Implementation of the Prototype We have implemented a prototype to test our model. Our prototype evaluates the interestingness of four stories generated by our storyteller. Details of our computer model for plot generation can be found in (Pérez y Pérez and Sharples 2001; Pérez y Pérez 2007). In this document we only mention two characteristics that are important to learn in order to understand how the prototype of the evaluator works: 1. Our plot generator employs a set of stories, known as the previous stories, to construct its knowledge base. Such narratives are provided by the user of the system. Any new story generated by the storyteller can be included as part of the previous stories. 2. As part of the process of developing a new story, the storyteller keeps a record of the dramatic tension in the story. The following are examples of situations that trigger tensions: when the life of a character is at risk; when the health of a character is at risk; when a character is made a prisoner; and so on. Every tension has a value assigned to it. So, each time an action is performed, the system calculates the value of all active tensions and records it. With this information the storyteller graphs the curve of tension of the story (see figure 3). Now we explain some details of the implementation of the prototype for the evaluation of interestingness. The model includes several parameters that provide flexibility. The first step is to set those parameters. We start with the expected or ideal values: Minimal Value of New Structures (Min-NS), Minimum value of New Elements (Min-NE), Minimum Original Value (Min-OV), Ideal Value of Complete Curves (Ideal-CC) and Ideal Value of Incomplete Curves (Ideal-IC). To determine the value of these parameters we employ the previous stories as a reference. (The previous stories employed in this work were made a long time before this project started. They represent well-formed and interesting narratives, so they are a good source of information.) The process works as follows. We select seven previous stories. With six of them we create the knowledge base; the 7th is considered a new story (as if it had been produced by our storyteller). Then, we analyse how many new structures, new elements, new original-value structures, and new complete and incomplete curves are generated by the 7th previous story, and record these results. We repeat the same process for each of the previous stories. Then, after eliminating the highest and lowest values, we calculate the mean of each result obtained (a sketch of this procedure appears after Table 1). Following this procedure we conclude that the parameters should be set as follows: Min-NS = 7; Min-NE = 4; Min-OV = 5; Ideal-CC = 1; Ideal-IC = 1. That is, on average each previous story generates seven new knowledge structures, four new elements, five original structures, one complete curve and one incomplete curve. The next step is to set the weights. Based on the empirical experience of experts in human communication, the weight of the generation of new knowledge is set to 50% and the weight of the correctness of the way the narrative is recounted is set to the other 50%. The characteristics that define the generation of new knowledge are: Proportion of New Structures (PNS), Proportion of the Original Value (POV), Novelty of the Knowledge Structures (NKS) and Proportion of Knowledge Widening (PKW). Table 1 shows their assigned weights. We considered novel knowledge structures more important than knowledge-widening structures. The correctness of the way the narrative is recounted is defined by the parameters A-Opening and A-Closure. Both are important and both received the same weight. Finally, the LS was set to 85%. Regarding the complementary parameters and weights, they contribute a maximum extra value of 10%, distributed as follows: 5% for the complete curves and 5% for TAI. This decision is based on our own experience.
Core Characteristic                         Weight
Proportion of New Structures (PNS)          10
Proportion of the Original Value (POV)      10
Novelty of the Knowledge Structures (NKS)   15
Proportion of Knowledge Widening (PKW)      15
Adequate Opening (A-Opening)                25
Adequate Closure (A-Closure)                25
Complementary Characteristic                Weight
Proportion of Complete Curves (PCC)         5
Proportion of Incomplete Curves (PIC)       0
Total Amplitude of Incomplete Curves (TAI)  5
Table 1. Weights of the characteristics
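The leave-one-out parameter-setting procedure sketched below is our reading of the process just described: each previous story in turn plays the role of a new story against a knowledge base built from the others, and a trimmed mean of the measured counts sets the parameter. The measure_counts argument stands in for the real analysis, which is not shown.

def trimmed_mean(values):
    """Mean after dropping the single highest and lowest value."""
    vals = sorted(values)[1:-1]
    return sum(vals) / len(vals)

def estimate_parameter(previous_stories, measure_counts):
    counts = []
    for i, held_out in enumerate(previous_stories):
        knowledge_base = previous_stories[:i] + previous_stories[i + 1:]
        counts.append(measure_counts(knowledge_base, held_out))
    return trimmed_mean(counts)

# e.g. estimating Min-NS from seven stories, with a dummy count function:
stories = [f"story_{i}" for i in range(7)]
print(estimate_parameter(stories, lambda kb, s: 5 + len(s) % 4))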
Finally, if the value of the correct recountal of the story (A-Closure + A-Opening) does not reach at least 50% of its highest possible value, the story is considered unsatisfactory. In this way we avoid evaluating stories that lack enough quality (the reader must remember that in this paper we do not evaluate coherence; that is a different part of the project. However, this constraint in the prototype helps to avoid processing pointless stories). Testing the Model To test our model, our storyteller generated four narratives known as short-1, short-2, long-1 and long-2 (see Figure 2). Figure 3 shows their graphics of tension. The following lines describe the main characteristics of each narrative. Short-1 lacks an introduction; it starts with a violent action. One gets the impression that everything occurs very fast. It is not clear what happens to the virgin once she escapes and has an accident. The fate of the enemy is also unclear. Short-2 has a brief introduction and then the conflict starts to grow (the killing of the knight). The end is tragic and all tensions are sorted out. Long-1 has a nice long introduction. The conflict between the princess and the lady grows nicely and slowly until it reaches a climax. However, at the end, we do not know the destiny of the characters. Who got the knight? So, the story has an inadequate conclusion.
SHORT 1
The enemy kidnapped the virgin
The virgin laugh at the enemy
The enemy attacked the virgin
The virgin wounded the enemy
The virgin ran away
The virgin had an accident
The End
SHORT 2
Jaguar knight was a citizen
The artist prepared to sacrifice the jaguar knight
The jaguar knight became free
The jaguar knight fought the artist
The artist killed the jaguar knight
The artist committed suicide
The End
LONG 1
Jaguar knight was a citizen
The princess was a citizen
The princess was fond of jaguar knight
The princess fell in love with jaguar knight
The lady was in love with jaguar knight
The princess got jealous of the lady
The jaguar knight was in love with the princess
The princess attacked the lady
The lady wounded the princess
The lady ran away
The lady had an accident
The End
LONG 2
Jaguar knight was a citizen
The enemy was a citizen
The enemy got intensely jealous of jaguar knight
The enemy attacked jaguar knight
The jaguar knight fought the enemy
The enemy wounded jaguar knight
The enemy ran away
The enemy went to Texcoco lake
The enemy did not cure jaguar knight
The farmer prepared to sacrifice the enemy
The enemy ran away
The jaguar knight died because of its injuries
The End
Figure 2. Four computer-generated stories
Figure 3. Graphics of Tensions for the four stories
Long-2 starts by introducing the characters of the narrative. The tension grows fast until the story reaches a climax when the enemy wounds the knight. The tension decreases when the enemy decides to run off; however, it increases again when the enemy returns and the farmer attempts to kill him. Finally, he escapes again and the knight dies. Based on our personal taste, our favourite narrative was short-2, then long-2, long-1 and finally short-1. We evaluated these four stories with our prototype. Table 2 shows the results; figure 4 shows the normalised values for the following features: generation of new knowledge, adequate closure, excitement and the total value of interestingness. Against our prediction, the system selected Long-2 as the most interesting story.
There were two main reasons why Long-2 beat Short-2: 1) Long-2 generated more knowledge structures than Short-2; 2) Long-2's complements were evaluated slightly better than Short-2's. So, Short-2 obtained the second-best result.
         Long-1   Long-2   Short-1          Short-2
PNS      10       8.57     4.29             4.29
POV      0        10       6                6
NKS      0        15       15               15
PKW      3.8      3.75     11.25            3.75
A-Op     25       25       15               25
A-Clo    13       20.83    10               25
I        51.25    83.15    Unsatisfactory   79.04
Com      3.4      3.15     3                1.65
TI       54.6     86.30    Unsatisfactory   80.69
E        41.4     48.98    28               51.65
Table 2. Numerical values of the evaluation.
In third place was Long-1; it did not produce any original structure and therefore its characteristic NKS got a value of zero. Its closure was also poor. In last place was Short-1. The system evaluated Short-1 as an unsatisfactory story; i.e., it did not satisfy the minimum requirements of a correct recountal of a story (as we can see in Table 2, the opening only got 15 points and the closing 10!). Nevertheless, we included the value of Short-1's closure and excitement in figure 4.
Figure 4. Graphics of the results of the evaluation.
We thought it could be interesting to compare the opinion of a group of subjects about the four stories under analysis to the results generated by our computer evaluator. Thus, we decided to conduct a survey by applying two questionnaires: 22 subjects answered questionnaire 1 and 22 subjects answered questionnaire 2; 25% were female and 75% were male; 13% had a PhD degree, 29% had a master's degree, 27% had a bachelor's degree and 29% had other types of degree. We decided to group the narratives by their length. So, the first questionnaire included the two short narratives while the second questionnaire included the two long narratives. In both questionnaires we asked subjects to evaluate the adequateness of the closure and the interestingness of the stories. Subjects could rank each feature with a value ranging from 1 to 5, where 1 represented the lowest assessment and 5 the highest. Figure 5 shows the results of the evaluation of interestingness. Short-2 was considered the most interesting narrative; Long-2 seemed to be in second position, followed closely by Short-1 and Long-1. These last results were not conclusive. We were surprised that Short-1 was not clearly in the last position. We speculated that the human capacity for filling gaps when reading a narrative might contribute to this result. Although our computer agent assigned a higher evaluation to Long-2 than to Short-2, both stories got a very similar score (the difference was less than 6%; cf. the score of Long-1). So, we felt that the subjects' opinion about these two narratives was close to the results we obtained from our computer prototype. However, by contrast, our system clearly rejected Short-1 and left Long-1 in a clear third position, while the subjects' evaluation was unclear.
Figure 5. Subjects' evaluation of interestingness.
Figure 6. Subjects' evaluation of closure.
Figure 6 shows the results for the evaluation of closure. Subjects ranked Short-2 as the story with the best closure, followed by Long-2, Long-1 and Short-1. There was a total coincidence between the computer agent's evaluation and the human evaluation. Discussion and Conclusions This paper reports a computer model for the evaluation of interestingness. It is part of a bigger project that attempts to evaluate the interestingness, coherence and novelty of computer-generated narratives.
The model presented in this paper emphasises two properties: the generation of new knowledge and the correctness of the recountal of a story. Regarding the generation of new knowledge, we developed a process to calculate how much new information was produced by a computer-generated story. In the same way, motivated by Piaget's ideas about accommodation and assimilation, we defined two different types of knowledge structures: new knowledge and widening knowledge. We went further by identifying those new knowledge structures which were very different from the existing ones. Regarding the recountal of a story, we built on previous research that had illustrated the relation between the dramatic tension of a story and its interestingness. In this work we expanded this idea by analysing the opening and closure of a story, and verifying whether new obstacles were introduced along the plot. Thus, we have been able to create a model that allows a computerised agent to perform a detailed evaluation of the stories it produces. The implementation of our prototype has allowed us to test the ideas behind the model. We are satisfied with the results. But we are more excited about what we expect to achieve with this new characteristic. The capacity to evaluate its own outputs allows a storyteller to distinguish positive and negative qualities in a narrative and therefore to learn from its own creative work; it also incorporates the possibility of evaluating and learning from narratives generated by other systems. In our case, we expect that our storyteller agent will be able to determine autonomously which stories, whether produced by itself, by other systems or by humans, should become part of its set of previous stories. That is our next goal. We have compared the results produced by our automatic evaluator to the results obtained from a questionnaire answered by a group of 44 human evaluators. In general terms, the results obtained from both approaches were similar. This suggests that the subjects who answered the questionnaire might consider the outputs produced by our system acceptable. Nevertheless, it is intriguing why the story Short-1 got a relatively high evaluation from the subjects. We need to analyse this result further and see whether we need to adjust our model. As has been shown in this work, we consider the generation of new knowledge an important characteristic of computational creativity. So, it is not enough to evaluate the creative product and/or the creative process, as has been suggested by some researchers. We believe that it is also necessary to consider how much such products and/or processes modify the characteristics of the storyteller agent and the evaluator agent (which in our case is the same). So, any evaluation process must consider this aspect. This idea is inspired by the fact that any creative act performed by humans will influence their future creative acts. We need to represent this feature in our computer models. The qualities that make a story interesting, coherent and novel are complex and many times overlap each other. Our work seems to illustrate part of this overlapping complexity. For example, the generation of new structures might be employed to evaluate novelty; the adequate opening and closure might be employed to evaluate coherence; however, at the same time, they are essential elements in evaluating interestingness.
This seems to confirm our idea that a general model of evaluation of narratives must at least contemplate coherence, novelty and interestingness. We are currently working on producing such a general model. Hopefully this model will be useful not only to those working on plot generators but also to researchers working in similar areas (e.g. interactive fiction). We are aware that many features not considered in this work might contribute to making a story interesting (e.g. suspense, intrigue). As mentioned earlier, human evaluation is very complex and we do not yet comprehend how it works. Nevertheless, we expect this research to contribute to a better understanding of the mechanisms behind it. 2013_2 !2013 Generating Apt Metaphor Ideas for Pictorial Advertisements Ping Xiao, Josep Blat Department of Information and Communication Technologies Pompeu Fabra University C./ Tànger, 122-140 Barcelona 08018 Spain {ping.xiao, josep.blat}@upf.edu Abstract Pictorial metaphor is a popular way of expression in creative advertising. It attributes a certain desirable quality to the advertised product. We adopt a general two-stage computational approach in order to generate apt metaphor ideas for pictorial advertisements. The first stage looks for concepts which have high imageability and the selling premise as one of their prototypical properties. The second stage evaluates the aptness of the candidate vehicles (found in the first stage) with regard to four metrics: affect polarity, salience, secondary attributes and similarity with the tenor. These four metrics are conceived based on the general characteristics of metaphor and its specialty in advertisements. We developed a knowledge extraction method for the first stage and utilized an affect lexicon and two semantic relatedness measures to implement the aptness metrics of the second stage. The capacity of our computer program is demonstrated in a task of reproducing the pictorial metaphor ideas used in three real advertisements. All three original metaphors were replicated, and a few other vehicles were recommended which, we consider, would make effective advertisements as well. Introduction A pictorial advertisement is a short discourse about the advertised product, service or idea (all referred to as ‘product' afterwards). Its core message, namely the selling premise, is a proposition that attributes a certain desirable quality to the product (Maes and Schilperoord 2008). A single proposition can be expressed in a virtually unlimited number of ways, among which some are more effective than others. The ‘how to say' of an ad is conventionally called the ‘idea'. ‘Pictorial metaphor' is the most popular way of expression in creative advertising (Goldenberg, Mazursky and Solomon 1999). A pictorial metaphor involves two dimensions, ‘structural' and ‘conceptual' (Forceville 1996; Phillips and McQuarrie 2004; Maes and Schilperoord 2008). The structural dimension concerns how visual elements are arranged in a 2D space. The conceptual dimension deals with the semantics of the visual elements and how they together construct a coherent message. We see that the operations in the structural and conceptual dimensions are quite different issues. In either of these two dimensions, computational creativity is not a trivial matter. In this paper, we are focusing on only one dimension, the conceptual one. The conceptual dimension of pictorial metaphors is not very different from that of verbal metaphors (Foss 2005).
A metaphor involves two concepts, namely ‘tenor' and ‘vehicle'. The most widely acknowledged effect of metaphor is highlighting a certain aspect of the tenor or introducing some new information about the tenor. Numerous theories have been proposed to account for how metaphors work. The interaction view is the dominant view of metaphor, which we also follow. It was heralded by Richards (1936) and further developed by Black (1962). According to Black, the principal and subsidiary subjects of metaphor are regarded as two systems of "associated commonplaces" (commonsense knowledge about the tenor and vehicle). Metaphor works by applying the system of associated commonplaces of the subsidiary subject to the principal subject, "to construct a corresponding system of implications about the principal subject". Any associated commonplaces of the principal subject that conform to the system of associated commonplaces of the subsidiary subject will be emphasized, and any that do not will be suppressed. In addition, our view of the subsidiary subject is also altered. Besides theories, more concrete models have been proposed, mainly the salience imbalance model (Ortony 1979), the domain interaction model (Tourangeau and Sternberg 1982), the structure mapping model (Gentner 1983; Gentner and Clement 1988), the class inclusion model (Glucksberg and Keysar 1990, 1993) and the conceptual scaffolding and sapper model (Veale and Keane 1992; Veale, O'Donoghue and Keane 1995). Furthermore, these models suggest what makes a good metaphor, i.e. metaphor aptness, which is defined as "the extent to which a comparison captures important features of the topic" (Chiappe and Kennedy 1999). In the rest of this paper, we first specify the problem of generating apt metaphor ideas for pictorial advertisements. Then, the relevant computational approaches in the literature are reviewed. Next, we introduce our approach to the stated problem and the details of our implementation. Subsequently, an experiment with the aim of reproducing three pictorial metaphors used in real advertisements is described, and the results generated by our computer program are demonstrated. In the end, we conclude the work presented in this paper and give suggestions for future work.

Problem Statement The whole range of non-literal comparison, from mere appearance to analogy (in the terms of Gentner and Markman (1997)), is featured in pictorial advertisements. Analogies, however, are rare. What appear most frequently are metaphors with a mapping of a few attributes or relations. This type of pictorial metaphor is the target of this paper. To generate pictorial metaphors for advertisements, our specific problem is searching for concepts (vehicles), given the product (tenor), its selling premise (the property concept) and some other constraints specified in an advertising brief. The metaphor vehicles generated have to be easy to visualize and able to establish or strengthen the connection between the product and the selling premise. There are two notes specific to advertisements that we would like to mention. One is about the tenor of metaphor. In pictorial ads, not only the product, but also "the internal components of the product and the objects that interact with it" are often used as tenors (Goldenberg, Mazursky and Solomon 1999). The other note is about the selling premise. Metaphors in advertisements are more relevant to communicating intangible, abstract qualities than to stating concrete product facts (Phillips and McQuarrie 2009).
Therefore, we primarily consider abstract selling premises in this paper. In the next section, we review the computational approaches to metaphor generation that are related to the problem just stated.

Computational Approaches to Metaphor Generation Abe, Sakamoto and Nakagawa (2006) employed a three-layer feedforward neural network to transform adjective-modified nouns, e.g. ‘young, innocent, and fine character', into ‘A like B' style metaphors, e.g. ‘the character is like a child'. The nodes of the input layer correspond to a noun and three adjectives. The nodes of the hidden layer correspond to the latent semantic classes obtained by a probabilistic latent semantic indexing method (Kameya and Sato 2005). A semantic class refers to a set of semantically related words. Activation of the input layer is transferred to the semantic classes (and the words in each class) of the hidden layer. In the output layer, the words that receive the most activation (from different semantic classes) become metaphor vehicles. In effect, this method outputs concepts that are intermediate between the semantic classes to which the input noun and three adjectives are strongly associated. If the inputs are associated with very different semantic classes, this method produces irrelevant and hard-to-visualize vehicles. A variation of the above model was created by Terai and Nakagawa (2009), who made use of a recurrent neural network to explicitly implement feature interaction. It differs from the previous model at the input layer, where each feature node has a bidirectional edge to every other feature node. The performance of these two models was compared in an experiment of generating metaphors for two tenors; the model with feature interaction produced better results. In addition, Terai and Nakagawa (2010) proposed a method of evaluating the aptness of metaphor vehicles generated by the aforementioned two computational models. A candidate vehicle is judged based on the semantic similarity between the corresponding generated metaphor and the input expression: a candidate vehicle is more apt when the meaning of the corresponding metaphor is closer to the input expression. The semantic similarity is calculated based on the same language model used in the metaphor generation process. The proposed aptness measure was tested in an experiment of generating metaphors for one input expression, which demonstrated that it improved the aptness of the generated metaphors. Veale and Hao (2007) created a system called Sardonicus which can both understand and generate property-attribution metaphors. Sardonicus takes advantage of a knowledge base of entities (nouns) and their most salient properties (adjectives). This knowledge base is acquired from the web using linguistic patterns like ‘as ADJ as *' and ‘as * as a/an NOUN'. To generate metaphors, Sardonicus searches the knowledge base for nouns that are associated with the intended property. The aptness of the found nouns is assessed according to the category inclusion theory, i.e. "only those noun categories that can meaningfully include the tenor as a member should be considered as potential vehicles". For each found noun, a query in the format ‘vehicle-like tenor' is sent to a search engine. If the search returns at least one result, the noun is considered an apt vehicle; otherwise, it is considered either not apt or extremely novel. The metaphor-generation efforts reviewed above converge on a two-stage approach.
These two stages are:
Stage 1: Search for concepts that are salient in the property to be highlighted.
Stage 2: Evaluate the aptness of the found concepts as metaphor vehicles.
We adopt this two-stage approach to metaphor generation. We provide methods of searching for and evaluating metaphor vehicles which differ from those in the literature. In addition, special consideration is given to the aptness of metaphor in the advertising context.

An Approach to Generating Apt Metaphor Ideas for Pictorial Advertisements We adopt a general two-stage computational approach to metaphor generation (as introduced in the last section) to generate apt metaphor ideas for pictorial advertisements. At the first stage, we look for concepts which have high imageability (Paivio, Yuille and Madigan 1968; Toglia and Battig 1978) and the selling premise as one of their prototypical properties. At the second stage, we evaluate the aptness of the candidate vehicles using four metrics: affect polarity, salience, secondary attributes and similarity with tenor. Vehicles that are validated by all four metrics are considered apt for a specific advertising task. In the following sections, we explain the rationale of our approach and its computational details.

Stage 1: Search for Candidate Metaphor Vehicles To find entities which have the selling premise as one of their prototypical properties, our strategy is to search for concepts that are strong semantic associations of the selling premise. One note to mention is that the concepts sought after do not need to be the ‘absolute' associations, because the meaning of a metaphor, i.e. which aspect of the tenor and vehicle becomes prominent, does not depend only on the vehicle, but on the interaction between the tenor and vehicle. In the past, we developed an automatic knowledge extraction system, namely VRAC (Visual Representations for Abstract Concepts), for providing concepts of physical entities to represent abstract concepts (Xiao and Blat 2011). Here we give a brief introduction to this work. We look for semantic associations in three knowledge bases: word association databases (Kiss, Armstrong, Milroy and Piper 1973; Nelson, McEvoy and Schreiber 1998), a commonsense knowledge base called ConceptNet (Liu and Singh 2004) and Roget's Thesaurus (Roget 1852). The reason for using all three is that we want to exploit the sum of their capacities, in terms of both vocabulary and type of content. The nature of these three knowledge bases ensures that the retrieved concepts have a close association with the selling premise. Vehicles of pictorial metaphors should have high imageability, in order to be easily visualized in advertisements. Imageability refers to how easily a piece of text elicits a mental image of its referent. It is usually measured in psychological experiments. The available data about word imageability, at the scale of thousands of words, does not satisfy our need to handle arbitrary words and phrases. As imageability is highly correlated with word concreteness, we developed a method of estimating concreteness using the ontological relations in WordNet (Fellbaum 1998), as an approximation of imageability.
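The paper does not spell out the exact WordNet procedure. A minimal sketch of one plausible reading, in which a word counts as concrete when its dominant noun sense descends from WordNet's physical_entity subtree, might look as follows (Python with NLTK; the first-sense heuristic and the physical_entity test are our assumptions, not the authors' method):

from nltk.corpus import wordnet as wn

# Sketch only: approximate imageability by WordNet-based concreteness.
# The first-sense heuristic and the physical_entity test are assumptions,
# not the authors' exact method.
PHYSICAL = wn.synset('physical_entity.n.01')

def is_concrete(word):
    """True if the dominant noun sense of `word` descends from
    physical_entity, a rough proxy for high imageability."""
    senses = wn.synsets(word, pos=wn.NOUN)
    if not senses:
        return False  # no noun sense: treat as not imageable
    # hypernym_paths() lists root-to-sense chains; test membership
    return any(PHYSICAL in path for path in senses[0].hypernym_paths())

print(is_concrete('owl'))     # True: owl is a physical entity
print(is_concrete('wisdom'))  # False: an abstraction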
To evaluate the capacity of VRAC, we collected thirty-eight distinct visual representations of six abstract concepts used in past successful advertisements. These abstract concepts have varied parts of speech and word usage frequencies. We checked whether these visual representations were included in the concepts output by VRAC, with the corresponding abstract concept as input. On average, VRAC achieved a hit rate of 57.8%. The concepts suggested by VRAC are mostly single objects. It lacks concepts of scenes or emergent cultural symbols, which also play a role in mass visual communication.

Stage 2: Evaluate the Aptness of Candidate Vehicles The aptness of the candidate vehicles generated in Stage 1 is evaluated based on four metrics: affect polarity, salience, secondary attributes and similarity with tenor.

Affect Polarity Most of the time, concepts with negative emotions are avoided in advertising (Kohli and Labahn 1997; Amos, Holmes and Strutton 2008). Even in provocative advertisements, negative concepts are deployed with extreme caution (De Pelsmacker and Van Den Bergh 1996; Vézina and Paul 1997; Andersson, Hedelin, Nilsson and Welander 2004). In fact, negative concepts are often discarded in the first place (Kohli and Labahn 1997). Therefore, we separate candidate vehicles having negative implications from those having positive or neutral implications. For this purpose, affective lexicons, which provide affect polarity values of concepts, come in handy. We decided to use SentiWordNet 3.0 (Baccianella, Esuli and Sebastiani 2010), due to its broad coverage (56,200 entries) and fine-grained values. It provides both positive and negative valences, which are real values ranging from 0.0 to 1.0. If a candidate vehicle is found in SentiWordNet 3.0, its affect polarity is calculated by subtracting the negative valence from the positive valence. Candidate vehicles which are not included in SentiWordNet 3.0 are considered to be emotionally neutral.

Salience Salience refers to how strongly a symbol evokes a certain meaning in the human mind. The candidate vehicles found by VRAC have association strengths with the selling premise that vary from very strong to weaker. The vehicle of a metaphor has to be more salient in the intended property than the tenor (Ortony 1979; Glucksberg and Keysar 1990). We interpret salience as a kind of semantic relatedness (Budanitsky and Hirst 2006), which reflects how far apart two concepts are in the conceptual space of a society. We calculate the semantic relatedness between each candidate vehicle and the selling premise, and between the product and the selling premise. Candidate vehicles that are more remote from the selling premise than the product are discarded. We will say more about semantic relatedness and the specific measures we used in a later section.
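As an illustration of how these two ‘hard' filters might be wired together, here is a minimal sketch (Python). The SentiWordNet lookup uses NLTK's bundled corpus; averaging polarity over senses is our simplification, and relatedness() is a hypothetical stand-in for the PMI-IR measure introduced later:

from nltk.corpus import sentiwordnet as swn

def affect_polarity(word):
    """Positive minus negative valence; averaged over senses here
    (an assumption), with absent words treated as neutral (0.0)."""
    senses = list(swn.senti_synsets(word))
    if not senses:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in senses) / len(senses)

def passes_hard_filters(vehicle, tenor, premise, relatedness):
    """relatedness(a, b) is a hypothetical semantic relatedness helper."""
    if affect_polarity(vehicle) < 0:
        return False  # negative concepts are avoided in advertising
    # salience imbalance: the vehicle must be closer to the selling
    # premise than the tenor (product) is
    return relatedness(vehicle, premise) > relatedness(tenor, premise)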
Secondary Attributes Metaphors that capture an appropriate number of relevant features are considered especially apt (Glucksberg and Keysar 1990, 1993; Chiappe and Kennedy 1999). Phillips (1997) found that strong implicatures as well as weak implicatures are drawn from pictorial advertisements. Strong implicatures correspond to the selling premise of an ad, while we use ‘secondary attributes' to refer to the weak implicatures. We have not seen literature on the salience of the secondary attributes in metaphor vehicles. We think the candidate vehicles should, at least, not contradict the secondary attributes prescribed to a product. To this end, we use a semantic relatedness measure to filter out candidate vehicles that are very distant from the secondary attributes. This is ‘soft' filtering, in contrast to the ‘hard' filtering used in the previous two metrics (affect polarity and salience), in the sense that the current criterion might need to be tightened in order to ensure the aptness of the generated metaphors. We compare the above approach with an alternative, which is to use both the selling premise and the secondary attributes to search for candidate vehicles. This alternative method looks for concepts that are salient in all of these properties. This is possible, but rare; most of the time, no result will be returned. On the other hand, there is a natural distinction of priority among the attributes (for a product) desired by advertisers (recall the strong and weak implicatures just mentioned). To represent this distinction, a weighting of attributes is necessary. The computational model proposed by Terai and Nakagawa (2009) also uses multiple features to generate metaphors. The weights of the edges connecting the feature nodes in the input layer vary with the tenor. Specifically, the weight of an edge equals the correlation coefficient between the two features with respect to the tenor. The calculation is based on a statistical language model built on a Japanese corpus (Kameya and Sato 2005), which means the weighting of features (of a tenor) is intended to be close to reality. However, this idea does not suit advertising, because the features attributed to a product are much more arbitrary. Very often, a product is not thought of as possessing those features before the appearance of an advertisement.

Similarity with Tenor Good metaphors are those whose tenor and vehicle are neither too different nor too similar to each other (Aristotle 1924; Tourangeau and Sternberg 1981; Marschark, Kats and Paivio 1983). For this reason, we calculate the semantic relatedness between the product and each candidate vehicle. Firstly, candidate vehicles which have zero or negative semantic relatedness values are discarded, because they are considered too dissimilar to the product. Then, the candidate vehicles with positive relatedness values are sorted in descending order of relatedness. Among this series of values, we look for values that are noticeably different from the next value, i.e. turning points. Turning points divide relatedness values into groups. We use the discrete gradient to measure the change in value, and take the value with the biggest change as the turning point. Candidate vehicles with a relatedness value bigger than or equal to the turning point are abandoned, for being too similar to the tenor. Figure 1 shows the sorted relatedness values between the candidate vehicles and the tenor ‘child' in the ad of the National Museum of Science and Technology. The turning point in this graph corresponds to the concept ‘head'.

Figure 1: Similarity between candidate vehicles and ‘child'.
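A minimal sketch of this turning-point rule (Python; plain lists, no external dependencies):

def filter_by_tenor_similarity(scores):
    """scores: dict of candidate vehicle -> relatedness to the tenor.
    Drops vehicles that are too dissimilar (<= 0) or too similar
    (>= the turning point, the value just before the largest drop)."""
    kept = {v: s for v, s in scores.items() if s > 0}
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    values = [s for _, s in ranked]
    if len(values) < 2:
        return [v for v, _ in ranked]
    # discrete gradient between consecutive sorted values; the largest
    # drop marks the turning point
    drops = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    turning_value = values[drops.index(max(drops))]
    return [v for v, s in ranked if s < turning_value]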
Semantic Relatedness Measures In general, semantic relatedness is measured through distance metrics in some materialized conceptual space, such as knowledge bases or raw text. A number of semantic relatedness measures have been proposed; each measure has its own merits and weaknesses. We employed two different measures in the current work: PMI-IR (Pointwise Mutual Information and Information Retrieval) (Turney 2001) and LSA through Random Indexing (Kanerva, Kristofersson and Holst 2000). PMI-IR is used to compute salience, because we found it gives more accurate results than other available measures when dealing with concept pairs of high semantic relatedness. The relatedness between the selling premise and candidate vehicles is deemed high. Therefore, we use PMI-IR to give a fine-grained ordering of their association strength. LSA is employed for the metrics of secondary attributes and similarity with tenor. The motivation behind this choice is to capitalize on LSA's capacity for ‘indirect inference' (Landauer and Dumais 1997), i.e. discovering connections between terms which do not co-occur. Recall that candidate vehicles are assumed to have a strong association with the selling premise, but not necessarily with the secondary attributes. In most cases, the association between a candidate vehicle and a secondary attribute is not high. Thus, we need a measure which is sensitive to low-range semantic relatedness. LSA has demonstrated capacity in this respect (Waltinger, Cramer and Wandmacher 2009). For LSA, values close to 1.0 indicate very similar concepts, while values close to 0.0 and below 0.0 indicate very dissimilar concepts. In our computer program, we utilize the implementation of Random Indexing provided by the Semantic Vectors package1. Two-hundred-dimensional term vectors are acquired from the LSA process for computing semantic relatedness. In the present work, both PMI-IR and LSA are based on the Wikipedia corpus, an online encyclopedia of millions of articles. We obtained the English Wikipedia dumps, offered by the Wikimedia Foundation2, on October 10th, 2011. The compressed version of this resource is about seven gigabytes. 1 http://code.google.com/p/semanticvectors/ 2 http://download.wikipedia.org/
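For reference, Turney's PMI-IR estimates pointwise mutual information from document counts returned by an information retrieval system; a standard rendering (ours, not a formula reproduced from this paper) is:

\mathrm{PMI}(w_1, w_2) \;=\; \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)} \;\approx\; \log_2 \frac{N \cdot \mathrm{hits}(w_1 \wedge w_2)}{\mathrm{hits}(w_1)\,\mathrm{hits}(w_2)}

where hits(·) counts the documents matching a query over an N-document corpus (here, Wikipedia articles).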
An Example We intend to evaluate our approach to generating apt metaphor ideas for pictorial advertisements by checking whether this approach can reproduce the pictorial metaphors used in past successful advertisements. We have been collecting a number of real ads together with the information about the product, selling premise, secondary attributes, and the tenor and vehicle of the metaphor in these ads. Nonetheless, it is a tedious process. In this paper, we use the information of three real ads to show what our computer program generates. These three ads are for the Volvo S80 car, The Economist newspaper and the National Museum of Science and Technology in Stockholm respectively. Each of them has a pictorial metaphor as its centre of expression. All three ads have the same selling premise: ‘intelligence'. However, three different vehicles are used: ‘chess', ‘brain' and ‘Einstein' respectively. The selection of these particular ads aims at testing whether our aptness metrics are able to differentiate different tenors. Table 1 summarizes three aspects of the three ads: product, secondary attributes and the tenor of the metaphor. For both the car and newspaper ads, the tenors of the metaphor are the products; for the museum ad, the tenor is the target consumer, children. We found the secondary attributes of the Volvo S80 car in its product introduction [3]. For the other two ads, The Economist newspaper and the National Museum of Science and Technology, we have not found any secondary attributes specified. Instead, their subject matter is used to distinguish them from other products of the same categories. Furthermore, we think it is more accurate to use the Boolean operations ‘AND' and ‘OR' in describing the relation between multiple secondary attributes. As a consequence, candidate vehicles have to be reasonably related to both attributes on the two sides of an AND, and to at least one of the two attributes connected by an OR.

Product        Secondary Attributes                       Tenor
car [4]        elegance AND luxury AND sophisticated      car
newspaper [5]  international politics OR business news    newspaper
museum [6]     science OR technology                      child
Table 1: Information about the three real ads

For the concept ‘intelligence', VRAC provides eighty-seven candidate vehicles, including single words and phrases. We keep the single-word concepts and extract the core concept of each phrase, in order to reduce the complexity of calculating the aptness metrics at the later stage. An example of the core concept of a phrase is the word ‘owl' in the phrase ‘wise as an owl'. The core concepts are extracted automatically based on syntactic rules. This process introduces noise, i.e. concepts not related to ‘intelligence', such as ‘needle' from the phrase ‘sharp as a needle' and ‘button' from the phrase ‘bright as a button'. In total, there are thirty-four single-word candidate vehicles. All three metaphor vehicles used in the three real ads are included. 3 http://www.volvocars.com/us/all-cars/volvo-s80/pages/5things.aspx, retrieved on April 1st, 2012. 4 http://adsoftheworld.com/media/print/volvo_s80_iq 5 http://adsoftheworld.com/media/print/the_economist_brain 6 http://adsoftheworld.com/media/print/the_national_museum_of_science_and_technology_little_einstein

As to affect polarity, the majority of the candidate vehicles, thirty out of thirty-four, are emotionally neutral. In addition, ‘highbrow' is marked as positive, while ‘geek' and ‘serpent' are marked as negative. The ranking of the candidate vehicles by their salience in the selling premise is shown in Table 2. The semantic relatedness calculated by PMI-IR correctly captured the main trend of salience: ‘IQ', ‘Mensa' and ‘brain' are ranked at the top, while ‘needle', ‘button' and ‘table', which are noise introduced by the core concept extraction method, are ranked very low. The positions of the products/tenors are marked with an asterisk. Only candidate vehicles having higher salience than a product are considered valid. For instance, ‘horse', ranked twenty-sixth, is not selected for the Volvo S80 car ad, since car is judged as more intelligent than horse by PMI-IR. On the other hand, all the metaphor vehicles used in the original ads, i.e. chess, brain and Einstein, have higher rankings than the corresponding tenors, which supports Ortony's salience imbalance theory.

Rank  Vehicle       Rank  Vehicle
1     IQ            19    reader
2     Mensa         20    child*
3     brain         21    sage
4     computer      22    serpent
5     cerebrum      23    owl
6     alien         24    car*
7     mankind       25    whale
8     highbrow      26    horse
9     Einstein      27    pig
10    head          28    half
11    professor     29    needle
12    dolphin       30    button
13    chess         31    table
14    lecturer      32    uptake
15    geek          33    storey
16    headpiece     34    loaf
17    newspaper*    35    brainpan
18    atheist       36    latitudinarian
Table 2: Candidate vehicles sorted in descending order of salience
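A minimal sketch of the soft AND/OR secondary-attribute filter described earlier (Python; relatedness() and the cut-off value are illustrative assumptions, not values given in the paper):

THRESHOLD = 0.05  # illustrative cut-off, not a value from the paper

def not_contradictory(vehicle, attributes, op, relatedness):
    """Soft filter: with 'AND' the vehicle must be reasonably related
    to every attribute; with 'OR', to at least one of them."""
    scores = [relatedness(vehicle, a) for a in attributes]
    if op == 'AND':
        return all(s >= THRESHOLD for s in scores)
    return any(s >= THRESHOLD for s in scores)  # 'OR'

# e.g. not_contradictory('chess', ['elegance', 'luxury', 'sophisticated'],
#                        'AND', lsa_relatedness)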
Table 3 shows how candidate vehicles are filtered by the secondary attributes of the products; candidate vehicles that are not contradictory to the secondary attributes are presented. Table 4 shows the candidate vehicles that are neither too different from nor too similar to the tenors of the three ads respectively. For both results, the metaphor vehicles used in the original ads survived the filtering, which gives support to the domain interaction theory proposed by Tourangeau and Sternberg. Nevertheless, there are also flaws in the results produced by the LSA measure. For instance, regarding the museum row of Table 3, we suspected that ‘brain' should not be unrelated to ‘science'; we consulted several other semantic relatedness measures, which confirmed our skepticism.

car (elegance AND luxury AND sophisticated): chess, half, geek
newspaper (international politics OR business news): IQ, brain, computer, cerebrum, mankind, highbrow, head, professor, dolphin, chess, lecturer, geek, headpiece, atheist, reader, sage, owl, car, whale, horse, half, needle, button, table, uptake, storey, brainpan
museum (science OR technology): IQ, Mensa, computer, cerebrum, alien, mankind, highbrow, Einstein, head, professor, chess, lecturer, headpiece, atheist, reader, sage, owl, whale, half, needle, button, table, storey, loaf, brainpan
Table 3: Candidate vehicles NOT contradictory to the secondary attributes of the three products respectively

tenor car: pig, storey, mankind, uptake, button, half, serpent, whale, lecturer, chess, latitudinarian, sage, professor, alien, horse
tenor newspaper: IQ, professor, loaf, whale, table, atheist, geek, mankind, brainpan, head, Mensa, button, dolphin, brain, sage, pig, headpiece, uptake, storey
tenor child (museum): car, uptake, Einstein, loaf, button, headpiece, mankind, alien, sage, brainpan, highbrow, chess, owl, reader, serpent, cerebrum, professor
Table 4: Candidate vehicles that are neither too different from nor too similar to the tenors of the three ads respectively

We show in Table 5 the metaphor vehicles suggested by our computer program for each of the three ads after applying all four aptness metrics. For all three ads, the vehicles used in the original ads are included among the vehicles suggested by our computer program, as marked with an asterisk in Table 5. For the Volvo S80 car ad, the original metaphor vehicle is the only one recommended by our program. For the other two ads, our program also proposed five and seven other vehicles respectively. Considering that there are thirty-four candidate vehicles input to the second stage, we think the four aptness metrics together did an acceptable job. Regarding the generated vehicles other than the ones used in the original ads: are they equally effective? We will have a closer look at the metaphor vehicles generated for the ad of the National Museum of Science and Technology, since it has the most suggested vehicles. It is easy to spot a semantic cluster among these eight vehicles. Five out of eight are humans or human-like entities bearing high intellect: ‘Einstein', ‘mankind', ‘alien', ‘highbrow' and ‘professor'. ‘Einstein', as the most prototypical within this cluster, fits this specific advertising task best. Besides, the other vehicles in this cluster are also highly relevant to a setting like a museum, where people, especially children, increase their knowledge and encounter inspiration. They may be optimal for other advertising tasks with a slightly different focus. The only exception is ‘mankind', which is a very general concept. As to the rest of the suggested metaphor vehicles, a ‘headpiece' is possibly a kind of symbol of intelligence, playing ‘chess' shows that someone is intelligent, and ‘cerebrum' is strongly associated with intelligence.
It is not difficult to imagine a picture juxtaposing a headpiece and a child, a child playing chess, or a child whose cerebrum is emphasized, all of which would be effective in associating a child with intelligence. However, strictly speaking, these are not metaphors. On the other hand, the existence of candidate vehicles other than the ones used in the original ads may suggest, firstly, that our implementation of the four aptness metrics may not sufficiently weed out inapt vehicles and, secondly, that more metrics, representing other factors that affect metaphor aptness, may be necessary.

Ad                                          Tenor      Vehicle(s)
Volvo S80 car                               car        chess*
The Economist newspaper                     newspaper  professor, mankind, head, dolphin, brain*, headpiece
National Museum of Science and Technology   child      Einstein*, headpiece, mankind, alien, highbrow, chess, cerebrum, professor
Table 5: Metaphor vehicles considered apt for the three ads respectively (vehicles used in the original ads marked with an asterisk)

Conclusions In the work presented in this paper, we adopted a general two-stage computational approach to generate apt metaphor ideas for pictorial advertisements. The first stage looks for concepts which have high imageability and the selling premise as one of their prototypical properties. The second stage evaluates the aptness of the candidate vehicles (found in the first stage) with regard to four aspects: affect polarity, salience, secondary attributes and similarity with tenor. These four metrics are conceived based on the general characteristics of metaphor and its special role in advertising. For the first stage, we developed an automatic knowledge extraction method to find concepts of physical entities which are strongly associated with the selling premise. For the second stage, we utilized an affect lexicon and two semantic relatedness measures to implement the four aptness metrics. The capacity of our computer program is demonstrated in a task of reproducing the pictorial metaphors used in three real advertisements. All three original metaphors were replicated, and a few other vehicles were recommended which, we consider, would also make effective advertisements, though less optimal ones. In short, our approach and implementation are promising for generating diverse and apt pictorial metaphors for advertisements. On the other hand, to take a more critical view of our approach and implementation, a larger-scale evaluation is needed. Continuing the evaluation design introduced in this paper, more examples of pictorial metaphors used in real advertisements have to be collected and annotated. This corpus would not only contribute to building our metaphor generator, but would also be an asset for research on metaphor and creativity in general. Moreover, the results provided by our aptness metrics support both the salience imbalance theory and the domain interaction theory.

Future Work We intend to compute more of the ways of expression that appear in pictorial advertisements. Firstly, our current implementation can be readily adapted to generate visual puns. In a pun, the product (or something associated with it) also has the meaning of the selling premise. An example is an existing ad which uses the picture of an owl to convey the message ‘the zoo is a place to learn and gain wisdom'. As we all know, the owl is both a member of the zoo and a symbol of wisdom.
Secondly, we found that some other fields of study are very relevant to computing advertising expression, such as the research and computational modeling of humor (Raskin 1985; Attardo and Raskin 1991; Ritchie 2001; Binsted, Bergen, Coulson, Nijholt, Stock, Strapparava, Ritchie, Manurung, Pain, Waller and O'Mara 2006). Finally, we are especially interested in investigating hyperbole. Hyperbole has a nearly universal presence in advertisements, but its theoretical construction and computational modeling are minimal. Some ad hoc approaches exist: for instance, exaggerations of the selling proposition might be found via the AlsoSee relation in WordNet. Alternatively, we might first develop a cognitive or linguistic model of hyperbole.

2013_20 !2013 A Model of Heteroassociative Memory: Deciphering Surprising Features and Locations Shashank Bhatia and Stephan K. Chalup School of Electrical Engineering and Computer Science The University of Newcastle Callaghan, NSW 2308 Australia shashank.bhatia@uon.edu.au, stephan.chalup@newcastle.edu.au

Abstract The identification of surprising or interesting locations in an environment is an important problem in the fields of robotics (localisation, mapping and exploration), architecture (wayfinding, design), navigation (landmark identification) and computational creativity. Despite this familiarity, existing studies rely either on human studies (in architecture and navigation) or on complex, feature-intensive methods (in robotics) to evaluate surprise. In this paper, we propose a novel heteroassociative memory architecture that remembers input patterns along with features associated with them. The model mimics human memory by comparing and associating new patterns with existing patterns and features, and provides an account of the surprise experienced. The application of the proposed memory architecture is demonstrated by identifying monotonous and surprising locations present in a Google Sketchup model of an environment. An interdisciplinary approach combining the proposed memory model and isovists (from architecture) is used to perceive and remember the structure of different locations of the model environment. The experimental results reported describe the behaviour of the proposed surprise identification technique, and illustrate the universal applicability of the method. Finally, we also describe how the memory model can be modified to mimic forgetfulness.

Introduction Within the context of evaluating computational creativity, measures of accounting for surprise and identifying salient patterns have received great interest in the recent past. Known by different names, the problem of accounting for surprise has been applied in various research areas. Specifically, the problem of identifying locations that stimulate surprise has important applications in areas such as robotics, architecture, data mining and navigation. Robotics researchers, aiming towards robot autonomy, seek to identify locations that can potentially serve as landmarks for the localisation of a mobile robot (Cole and Harrison 2005; Siagian and Itti 2009). Architects, on the other hand, intend to design building plans that comprise sufficient salient/surprising locations in order to support wayfinding by humans (Carlson et al. 2010). Lastly, navigation experts mine existing maps to identify regions/locations that can serve to better communicate a route to users (Xia et al. 2008; Perttula, Carter, and Denoue 2009).
Common to all these applications is the underlying question: the problem of identifying patterns in raw data that appeal to or stimulate human attention. While the aim of these applications is the same, the underlying measure of accounting for surprise that each one follows has been designed to suit only the respective application. There are no domain-independent methods available that are flexible enough to be adapted universally. Itti (2009) and Baldi (2010) rely on Bayesian statistics, and their method would require considerable domain-specific alteration, as can be seen in (Ranganathan and Dellaert 2009; Zhang, Tong, and Cottrell 2009). On the one hand, designing domain-independent methods with the capacity to compare multi-dimensional data is a challenging task. On the other hand, the use of dimensionality reduction techniques to limit or reduce dimensionality is known to cause bias. The reduction of dimensions depends on the methods employed, and different methods may assign varying weights to each dimension (Brown 2012). This makes surprise measurement, which involves comparing multi-dimensional patterns, a challenging problem. Commonly known as outlier detection, novelty detection, saliency detection, etc., the question of detecting a "surprising event" has been raised in the past (Baldi and Itti 2010). Specifically, the methods that provide a domain-independent approach for discovering inherent surprise in perceived patterns aim for information maximisation. In an information-theoretic sense, patterns that are rare are known to contain maximum information (Zhang, Tong, and Cottrell 2009). More formally, patterns that lead to an increase in entropy are deemed unique, and are known to cause surprise (Shannon 2001). Another argument in the literature concerns the frequency of occurrence of such patterns. An event/pattern that has a lower probability of occurrence/appearance is deemed rare. Therefore, various proposals have been made that compare probabilities (Bartlett 1952; Weaver 1966) and identify the pattern with the lowest probability value. These techniques were further refined to consider the probabilities of all other patterns as well (Weaver 1966; Good 1956; Redheffer 1951). The most recent developments use Bayesian statistics to compare the probabilities of the occurrence of patterns or of features extracted from them. Baldi and Itti (2010) proposed to employ a distance metric to measure the differences between the prior and posterior beliefs of a computational observer, and argued that this can be interpreted as an account of surprise. The authors proposed the use of the Kullback-Leibler divergence (Kullback 1997) as the distance metric, and discussed its advantages over Shannon's entropy (Shannon 2001). They demonstrated the use of their proposed method by identifying surprising pixels in an input image. The complex mathematical constructs for modelling surprise that exist in the literature are difficult to adapt, and have therefore not found application across different domains. The concept of surprise can also be understood through its relationship to memory. Something that has not been observed before stimulates surprise. In this setting, if a computational agent remembers the percepts presented to it, a measure of surprise can be derived. Baldi and Itti (2010) follow this idea, but their perceptual memory is in the form of a probabilistic model.
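For reference, the Kullback-Leibler divergence between two discrete distributions P and Q has the standard form (our rendering):

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}

In the Bayesian-surprise setting just described, P and Q are the observer's posterior and prior beliefs over models.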
The patterns that have already been observed compose the prior model, and the model obtained after adding new percepts is the posterior. As noted previously, the patterns/features to be evaluated are most often available in the form of a vector quantity (Brown 2012). Conversion of this multi-dimensional quantity into a probabilistic model not only requires specific expertise, but is also sensitive to the method employed to update the model's parameters. Even after substantial design effort, the memory remains sensitive to the parameters employed for the model. These shortcomings of the state-of-the-art methods form one part of the motivation behind the current paper. Another aspect that is ignored in most contemporary methods is the associative nature of memory. Human memory has a natural tendency to relate/associate newly perceived objects/patterns with those perceived in the past. Recent research in cognitive science supports the influence perceptual inference has on previous memory (Albright 2012). A classical example is the problem of handwritten digit recognition: multiple handwriting patterns corresponding to the same digit are labelled and associated via the same label. Since the memory is always trying to associate new patterns with previous experience, it is to be expected that a strong association will lead to lower surprise and vice versa. This property of association, though well recognised, has not been incorporated in the state-of-the-art methods of measuring surprise. This forms the second motivation of the current paper. Inspired by the discussed shortcomings of existing methods, this paper presents a computational memory framework that can memorise multi-dimensional patterns (or features derived from them) and account for inherent surprise after attempting to associate and recall a new pattern against those already stored in the memory. The uniqueness of the memory model is two-fold. Firstly, it can be employed without converting the perceived patterns into complex probabilistic models. Secondly, for the purpose of accounting for surprise, the memory model not only aims to match and recall the new pattern, but also attempts to associate its characteristics/features before deeming it surprising. To illustrate these advantages and their usage, the proposed method is employed to identify monotonous and surprising structural features/locations present in an environment. As noted previously, this is an important problem in the field of robotics as well as architecture, and we therefore use a Google Sketchup (Trimble 2013) based architectural model for the demonstration. An isovist, a way of representing the visible space from a particular location (Benedikt 1979), is used for the purpose of perceiving a location in the form of a multi-dimensional pattern. This paper points towards the methods of extracting isovists from the respective environments (section: Spatial Perception), and provides details of the neural network based memory architecture (section: Associative Memory). Experimental results compare the degree to which identified monotonous locations associate with each other, and illustrate the isovist shape of those that stimulate computational surprise (section: Experiments & Results). Additionally, we describe how the proposed memory model can be modified to mimic forgetfulness, thereby forgetting patterns that have not been seen for a given length of time.
To conclude, the paper provides a discussion of prospective applications of the proposed framework, and demonstrates its universality by evaluating its performance in a classification task on various pattern classification datasets (section: Conclusions & Discussions).

Spatial Perception This work utilises multi-dimensional isovist patterns to perceive/represent a location. Conceptually, an isovist is a geometric representation of the space visible from a point in an environment. If a human were to stand at a point and take a complete 360° rotation, all that was visible would form an isovist. In practice, however, this 3D visible space is sliced horizontally to obtain a vector that describes the surrounding structure from the point of observation, also known as the vantage point. This 2D slice is essentially a vector composed of the lengths of rays projected from the vantage point and incident on the structure surrounding the point. Therefore, if a 1° resolution were utilised, an isovist would be a 360-dimensional vector, I = [r_1, r_2, ..., r_360], where r_θ represents the length of the ray starting from the vantage point and incident on the first object intersected in the direction θ. In this way, an isovist records a profile of the surrounding structure (illustrated in figure 1). In an environment, multiple isovists can be generated from different vantage points. Each isovist can be represented as a 360-dimensional pattern describing the structure visible from the vantage point. In this paper, an indexed collection of isovist patterns extracted from an existing model of the environment is used.

Figure 1: A hypothetical 2D plan of an environment, showing a vantage point (black dot) and the corresponding isovist generated from the vantage point.
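The isovists in this paper are extracted from Sketchup models, as described next. Purely to illustrate the data structure, a minimal 2D ray-casting sketch producing the 360-dimensional vector defined above might look like this (Python; the segment-based wall representation and the maximum range are our assumptions):

import math

# Sketch: compute a 2D isovist by casting one ray per degree against
# line-segment walls. The wall format and MAX_RANGE are illustrative.
MAX_RANGE = 100.0

def ray_segment_distance(px, py, dx, dy, wall):
    """Distance along the ray (px,py)+t*(dx,dy) to segment `wall`,
    or None if the ray misses it."""
    (x1, y1), (x2, y2) = wall
    ex, ey = x2 - x1, y2 - y1
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:
        return None  # ray parallel to the wall
    t = ((x1 - px) * ey - (y1 - py) * ex) / denom  # along the ray
    u = ((x1 - px) * dy - (y1 - py) * dx) / denom  # along the wall
    return t if t > 0 and 0.0 <= u <= 1.0 else None

def isovist(px, py, walls):
    """360-dimensional isovist vector: nearest hit per 1-degree ray."""
    rays = []
    for deg in range(360):
        dx, dy = math.cos(math.radians(deg)), math.sin(math.radians(deg))
        hits = [d for w in walls
                if (d := ray_segment_distance(px, py, dx, dy, w)) is not None]
        rays.append(min(hits, default=MAX_RANGE))
    return rays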
Isovist Extraction The method of extraction of isovists employed in this paper is derived from our previous work (Bhatia, Chalup, and Ostwald 2012), where we employed a Ruby script that executes on the Google Sketchup platform and extracts 3D isovists from a Google Sketchup model. This script records isovists while the "walk through" tool provided in Google Sketchup is used. The "walk through" tool allows a user to walk through a 3D model of an architectural building plan. In this work, however, we utilise a modified version of the Ruby script that extracts a 2D slice of the perceived 3D isovist. The model of a famous architectural building, the Villa Savoye, is used to extract the isovists and identify the surprising locations present. The building is known for the uniqueness of its structure, and therefore provides good examples for the evaluation of surprising locations.

Inputs and association patterns An isovist records a spatial profile, and can be used by a computational memory to memorise a location. This is an advantage when trying to recognise/identify a location by its isovist; however, it becomes a drawback when the aim is to infer surprise through association. A simple example is the case of two rectangular rooms that are similar in shape but have different side lengths. While the isovists recorded at the central points of these rooms would differ considerably, the number of straight edges, and the angles they make, remain the same (90°). Therefore, for the purpose of associating and finding similarities between two locations, in this paper we employ a 3-dimensional feature vector derived from the isovist pattern. We compute (i) the area of the isovist, (ii) an eccentricity value, and (iii) a circularity value to form the elements of the 3-dimensional associated feature pattern. This feature pattern is used to associate two isovist patterns. The perceived isovist pattern, therefore, comprises a 360-dimensional vector, and the derived associated pattern is a 3-dimensional feature vector. The isovist of a location and the feature vector are presented as a pair to the memory model proposed in this paper. The memory model remembers essential patterns and computes surprise after associating new patterns and comparing them with existing ones. Due to the association task that the memory performs, such memories are known as associative memories (Palm 2013).

Associative Memory Associative memories are computational techniques capable of storing a limited set of input and associated pattern pairs (x_1, y_1), (x_2, y_2), ..., (x_m, y_m). Depending on the size of the input vector x_i, its associated pattern y_i, and the methods of association, various types of such memories have been proposed. Kosko (1988) was the first to introduce Bidirectional Associative Memories (BAM), which provide a two-way association search mechanism termed heteroassociation. A BAM can take either the input or the associated pattern as its input and has the capacity to recall the respective association. Despite the utility BAM can offer, its usage has been limited due to many existing challenges, such as limited capacity and conditions of instability. Importantly, existing variations of BAM can only memorise binary patterns. Many other variations of BAM have been offered; the present note is provided only as a basis for the following discussion and is by no means an exhaustive account of the developments on this topic. A detailed review can be found in (Palm 2013). The proposed memory model offers similar functionality without requiring the input patterns to be binary in nature.

Overview of the architecture The architecture of the proposed memory model consists of two memory blocks, and can be divided into three parts: (a) Input Memory Block (IMB), the block that stores input patterns; (b) Associated Memory Block (AMB), the block that stores associated feature vectors/patterns; and (c) Association Weights, a matrix that maintains a mapping between the two memory blocks. The complete architecture of the memory is represented in figure 2.

Figure 2: Memory Architecture: comprises two memory blocks and association weights, all linked through one or more data/processing units, presented in white and grey colour respectively.

The memory blocks are the storage units responsible for memorising input and associated patterns. In concept, this memory model works similarly to traditional BAMs, except that it provides additional many-to-many mapping functionality on real-valued vectors. Input patterns (which in the case of this application are isovist vectors), when presented to the memory model, are compared in two respects: (a) similarity of shape, and (b) similarity of the features derived from them. The detailed construction and working of each block and of the overall memory model is provided in the following subsections.

Memory Blocks The smallest unit of storage in this memory model is a radial basis activation unit, also known as a Radial Basis Function (RBF). Typically, an RBF is a real-valued function whose response monotonically decreases/increases with distance from a central point.
The parameters that describe an RBF include the central point c, a distance metric ||·||_d and the shape of the radial function. A Gaussian RBF with Euclidean distance metric and centre c_i is defined in equation 1. The parameters c_i and radius σ_i decide the activation level of the RBF unit. Any input x lying inside the circle centred at c_i with radius less than or equal to σ_i will result in a positive activation, with the level of activation decreasing monotonically as the distance between the input and the centre increases.

    φ_i(x) = exp( −(x − c_i)² / σ_i² )    (1)

The realisation of a memory element in our approach is done by saving the input as the centre c_i, and adjusting the value of the radius σ_i to incorporate values that lie close to each other. Mathematically, this memory element will have an activation φ_i(x) > 0 for all values of x that fall in a σ_i-neighbourhood of the point c_i, defined in equation 2. Further, lim_{x→c_i} φ_i(x) = 1. This condition ensures that the activation unit with the centre c_i closest to the current input x activates the most.

    B(c_i; σ_i) = { x ∈ X | d(x, c_i) < σ_i }    (2)

In a collection of multiple RBF units, each having a different centre c_i and radius σ_i, multiple values can be remembered. If an input x is presented to this collection, the unit with the highest activation will be the one with the best-matching centre c_i. In other words, for the presented input value, the memory block can be said to recall the nearest possible value c_i. For one input pattern, there will be one corresponding recall value. This arrangement of multiple RBF units can thus work as a memory unit. The memory blocks described previously comprise multiple RBF units. As an example, a memory block comprising n RBF units can be represented as in figure 3.

Figure 3: RBF Memory Block: each RBF unit stores one data value in the form of its centre c_i; the range of values for which the unit has positive activation is defined by the value of σ_i according to equation 2. c_max is the value that the memory recalls as the best match to the input, and φ_max represents the confidence in the match.

So far we have described the use of the RBF unit in a memory block having a scalar-valued centre c_i. In order to memorise a multi-dimensional pattern (in this application an isovist pattern, comprising 360 ray lengths), we modify the traditional RBFs to handle a multi-dimensional input isovist vector x by replacing the scalar-valued centre with a 360-dimensional vector c_i. While the Euclidean distance and the dot product of two multi-dimensional vectors are also scalar and do not disrupt the working of standard RBFs, their capacity to capture the difference in shape between two isovist patterns is minimal. Therefore, in order to account for differences in shape, we replace the Euclidean distance metric with the Procrustes distance (Kendall 1989). The Procrustes distance is a statistical measure of shape similarity that accounts for the dissimilarity between two shapes while ignoring factors of scaling and transformation. For two isovist vectors x_m and x_n, the Procrustes distance <x_m, x_n>_p first identifies the optimum translation, rotation, reflection and scaling required to align the two shapes, and finally provides a minimised, scaled value of the dissimilarity between them. An example of a similar and a non-similar isovist pair with their Procrustes-aligned isovists is shown in figure 4.
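A minimal sketch of such a scalar Gaussian RBF memory block with nearest-centre recall (equations 1 and 2) might look as follows (Python; illustrative only, preceding the Procrustes extension described next):

import math

# Sketch of a scalar Gaussian-RBF memory block (equations 1 and 2):
# each stored value is a centre c_i with radius sigma_i; recall returns
# the centre with the highest activation for the presented input.
class RBFMemoryBlock:
    def __init__(self):
        self.centres = []  # c_i
        self.radii = []    # sigma_i

    def activation(self, x, i):
        return math.exp(-((x - self.centres[i]) ** 2) / self.radii[i] ** 2)

    def store(self, x, sigma=0.1):  # 0.1 is the paper's default radius
        self.centres.append(x)
        self.radii.append(sigma)

    def recall(self, x):
        """Return (best centre c_max, confidence phi_max) for input x;
        assumes at least one value has been stored."""
        scores = [self.activation(x, i) for i in range(len(self.centres))]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.centres[best], scores[best]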
Utilising the Procrustes distance with the multidimensional centre c_i, we term this a Multidimensional Procrustes RBF, which is defined as:

    φ_i(x) = exp( −<x, c_i>_p² / σ_i² )    (3)

The Procrustes distance provides a dissimilarity measure ranging between 0 and 1. A zero Procrustes distance therefore leads to maximum activation, and vice versa. A Multidimensional Procrustes RBF has the capacity to store a multi-dimensional vector in the form of its centre. It is important to note that, for the application described in this paper, the difference between two multi-dimensional vectors, viz. the isovists, was recorded using the Procrustes distance. In general, however, the memory model can be adapted to any suitable distance metric, or used with the simple Euclidean distance. The use of the Procrustes distance as the distance metric was adopted specifically for the application of identifying surprising locations in an environment.

Figure 4: Two isovist pairs (illustrated in red and blue) and the corresponding aligned isovists (black dashed), one with a high Procrustes distance (left) and the other with a low Procrustes distance (right).

IMB and AMB The IMB and AMB are in principle collections of one or more Multidimensional Procrustes RBF and multidimensional RBF units respectively, grouped together as a block (such as the one represented in figure 3). Each block is initialised with a single unit that stores the first input vector (for the IMB) and the derived features (for the AMB). The feature vector employed to associate two input patterns (in this application, isovists) comprises (i) area, (ii) circularity and (iii) eccentricity, together making up a 3-dimensional vector. Initially, each block is created with a single memory unit having a default radius of 0.1. Thereafter, the memory block adopts one of two behaviours. For new patterns that lie far from the existing centres, the memory block grows by incorporating a new RBF unit whose centre is the presented pattern. For patterns that lie close to existing patterns, on the other hand, the radii of the RBF units are adjusted in order to obtain a positive activation. Adjustment of the radii is analogous to the adjustment of weights performed during the training of a neural network. The procedure followed to expand the block or adjust the radii can be understood by following algorithms 1 and 2. Consider a memory block comprising k neural units, with centres c_1, c_2, ..., c_k, radii σ_1, σ_2, ..., σ_k and distance metric <·>_d. Let the model be presented with a new input vector x. Algorithm 1 first computes the <·>_d distance (the Procrustes distance in the case of an isovist block) between each central vector and the presented pattern, and compares the distance with prespecified best and average match threshold values Θ_best and Θ_avg. If the distance value is found to satisfy d ≤ Θ_best, the corresponding central vector is returned, as this signifies that a similar pattern already exists in memory. In the case where Θ_avg ≤ d < Θ_best, however, the radius of the corresponding best-match unit is updated. This updating ensures that the memory responds with a positive activation when next presented with a similar pattern.
Algorithm 1 Memory Block Update
Require: x, [c_1, c_2, ..., c_k], Θ_best, Θ_avg, Σ
 1: for all centre vectors c_i do
 2:   d_i(x) ← <x, c_i>_d
 3: end for
 4: bestScore ← min_i(d_i)
 5: bestIndex ← argmin_i(d_i)
 6: blockUpdated ← false
 7: if bestScore ≤ Θ_best then
 8:   r ← c_bestIndex
 9:   blockUpdated ← true
10: else if Θ_avg ≤ bestScore < Θ_best then
11:   if σ_bestIndex < Σ then
12:     [c_bestIndex, σ_bestIndex] ← computeCenter()
13:     blockUpdated ← true
14:   end if
15: end if
16: if blockUpdated == false then
17:   add a new neural unit with
18:   c_{k+1} = x
19:   σ_{k+1} = 0.1
20: end if

Algorithm 2 Centre vector and radius calculation
Require: c_bestIndex, Θ_best, x
1: c_old ← c_bestIndex
2: c_bestIndex ← (c_bestIndex + x) / 2
3: d_new ← <x, c_bestIndex>_d² / (−2 · log(Θ_best))
4: d_old ← <c_old, c_bestIndex>_d² / (−2 · log(Θ_best))
5: σ_bestIndex ← max(d_new, d_old)

The network expands on the presentation of patterns that cannot be incorporated by adjusting the weights/radii of the RBF units. This feature provides three advantages over traditional BAMs. The first is that no a-priori training of the memory block is required: the memory is updated as new patterns are presented, and the training is online. Secondly, the adjustment of weights ensures that similar patterns are remembered through a common central vector, thereby reducing the number of neural units required to remember multiple patterns. Thirdly, despite the averaging process, a high level of recall accuracy is guaranteed by keeping all radii σ_i ≤ Σ. The values of Θ_best, Θ_avg and Σ are application-specific parameters that require adjustment. For the purpose of associating and remembering isovists, however, we determined them using equations 4, 5 and 6. Here, D_ij is an n×n matrix containing the <·>_p distances between all central vectors, S_d denotes the collection of its off-diagonal entries, and std(·) stands for the standard deviation.

    D_ij = <c_i, c_j>_p,  S_d = { D_ij | i ≠ j }
    Θ_best = percentile(S_d, 95)    (4)
    Θ_avg = percentile(S_d, 50)    (5)
    Σ = min(std(D_ij)) / max(std(D_ij))    (6)

Association Weights Association weights act as a separate layer of the network architecture, and play the role of mapping the input patterns to their associated features. For a case of m isovist patterns and n associated feature vectors stored in the IMB and AMB respectively, the association weights comprise an (m × (n + 1)) matrix. The first column of the matrix contains the indices of each central vector c_i, and the remaining columns contain mapping weights. On initialisation, the mapping weights are set to zero. Once each memory block is updated, the corresponding best-match index obtained as an output of the memory block is used to configure the values of the matrix. Let q be the index returned from the IMB, and r the index obtained from the AMB. The weight update simply increments the value at the qth row and (r + 1)th column of the weight matrix. If such a row or column does not exist (signifying a new addition to the memory block), a new row/column is added. When the memory model is used to recall the associated vector from a presented input vector, assuming an index p was returned, the pth row is selected, and the index of the column containing the highest score is obtained. Let this index be k. If the highest score lies in the kth column, this implies that the centre of the kth activation unit of the AMB is most strongly associated with the current input.
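A runnable sketch of this update logic is given below (Python). The distance function is pluggable, as the paper allows; reading Θ_best as the tighter threshold and Θ_avg as the looser one is our interpretation of the three-way branching, so treat the ordering as an assumption:

import math

# Sketch of Algorithms 1 and 2: recall-or-adapt-or-grow update of a
# memory block. dist() is pluggable (e.g. Procrustes for isovists).
# The threshold ordering (theta_best < theta_avg) is our assumption.
class MemoryBlock:
    def __init__(self, dist, theta_best, theta_avg, sigma_cap):
        self.dist, self.centres, self.radii = dist, [], []
        self.theta_best, self.theta_avg = theta_best, theta_avg
        self.sigma_cap = sigma_cap

    def update(self, x):
        """Present pattern x; return the index of the matching/new unit."""
        if not self.centres:
            self.centres.append(x); self.radii.append(0.1)
            return 0
        d = [self.dist(x, c) for c in self.centres]
        best = min(range(len(d)), key=d.__getitem__)
        if d[best] <= self.theta_best:
            return best                    # best match: already stored
        if d[best] <= self.theta_avg and self.radii[best] < self.sigma_cap:
            old = self.centres[best]       # average match: adapt the unit
            new = [(a + b) / 2 for a, b in zip(old, x)]  # Algorithm 2
            self.centres[best] = new
            scale = -2.0 * math.log(self.theta_best)
            self.radii[best] = max(self.dist(x, new) ** 2 / scale,
                                   self.dist(old, new) ** 2 / scale)
            return best
        self.centres.append(x); self.radii.append(0.1)   # no match: grow
        return len(self.centres) - 1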
This kind of mapping look-up can also be performed in the reverse direction, and provides an efficient bi-directional many-to-many mapping functionality that is hard to implement in traditional memory models.

Surprise Calculation

The Kullback-Leibler (KL) divergence (Kullback 1997) is a measure of the difference between two probabilistic models of current observations. To estimate KL divergence, an application-specific probabilistic model of the current data is normally required, and in most cases the design of such a model requires specific expertise. In our approach, each memory model computes surprise without the need to train, estimate or design any probabilistic model. This is achieved by using the activation scores that each memory unit outputs on presentation of a pattern. These scores are obtained through RBF activation units, so each score is in principle a probabilistic estimate of the similarity between the input vector and the centre of the corresponding memory unit. Exploiting this property, we measure the KL divergence on activation scores. On presentation of a new input vector x to a memory block, the activation scores are first computed. Since these scores are calculated before the block updates (using Algorithms 1 and 2), they are termed a-priors, A = [a_1, a_2, ..., a_n]. After the execution of Algorithm 1, the memory block will either remain the same (in the case of a best match), change one of its radius values (for an average match), or gain an additional neural unit (no match). Accordingly, the activation scores obtained after the update may differ from the a-priors. Scores obtained after the updating of memory are termed posteriors, P = [p_1, p_2, ..., p_m].

A comparison is modeled as Feature → Operator → Value; for example, feature1 > value1, where greater-than is the operator. A prediction is modeled as Feature → Operator. For example, a prediction can be feature1 >, where it is expected that feature1 will increase after the action is performed. Comparisons can detect the presence or absence of a feature, and the change in the size of a feature (<, ≤, =, ≥, >). If an observed feature does not match its predicted value, then the system recognizes surprise. This model does not make any explicit reference to time and uses surprise as a flag to update the rule base. Maher and Fisher (2012) have used clustering algorithms to compare a new design to existing designs, to identify when a design is novel, valuable, and surprising. The clustering model uses distance (e.g., Euclidean distance) to assess the novelty and value of product designs (e.g., laptops) that are represented by vectors of attributes (e.g., display area, amount of memory, CPU speed). In this approach, a design is considered surprising when it is so different from existing designs that it forms its own new cluster. This typically happens when the new design makes explicit an attribute that was not previously explicit, because all previous designs had the same value for that attribute. Maher and Fisher use the example of the Bloom laptop, which has a detachable keyboard (i.e., detachable keyboard = TRUE), where all previous laptop designs had the value FALSE along what was a previously implicit, unrecognized attribute. Thus, like one of Rissland's black swans, the Bloom transformed the design space.
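As a toy rendering of this cluster-based test (our own sketch, not Maher and Fisher's code), a design can be flagged as surprising when its distance to every existing cluster centre exceeds a threshold, so that it would have to found a new cluster of its own. The attribute vectors and the threshold below are illustrative assumptions.

import numpy as np

def is_surprising(design, centroids, threshold):
    """Flag a design (attribute vector) that is too far from every
    existing cluster centre to join any of them (hypothetical rule)."""
    dists = np.linalg.norm(np.asarray(centroids, float) - np.asarray(design, float), axis=1)
    return bool(dists.min() > threshold)

# Laptops as [display_area, memory_gb, detachable_keyboard] vectors (toy data).
existing = [[15.6, 8, 0], [13.3, 16, 0], [14.0, 8, 0]]
bloom = [12.5, 8, 1]   # a detachable keyboard makes an implicit attribute explicit
print(is_surprising(bloom, existing, threshold=0.9))   # True: it forms a new cluster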
In Maher and Fisher, the established clusters of designs effectively represent the expectation that the next new design will be associated with one of the clusters of existing designs; when a new design forms its own cluster, it is surprising and changes our expectations for the next generation of new designs. Maher and Fisher (2012) focused on the evaluation of creativity on the part of an observer, not an active designer. Brown (2012) investigates many aspects of surprise in creative design, such as who gets surprised: the designer or the person experiencing or evaluating the design. Brown (2012) also presents a framework for understanding surprise in creative design by characterizing different types of expectations (active expectations, active knowledge, and not-active knowledge) as alternative situations in which expectations can be violated in exploratory and transformative design. To varying extents, many of the computational approaches above model surprise as a deviation from expectation, where the expectation is an expected value estimated from data distributions or a prediction made by simulating a rule-based model. In these, however, there is no explicit representation of time as a continuum, nor explicit concern with projecting into the future.

Recognizing Surprising Designs

Our approach to projecting designs into the future assumes that each product design is represented by a vector of ordinal attributes (aka variables). For each attribute, a mathematical function of time can be fit to the attribute values of existing (past) designs, showing how the attribute's values have varied with time in the past. This best-fitting function, obtained through a process of regression, can be used to predict how the attribute's values will change in the future as well. Our approach to projecting into the future is inspired by earlier work by Frey and Fisher (1999) on projecting machine learning performance curves into the future (thereby allowing cost-benefit analyses of collecting more data for purposes of improving prediction accuracy); that work was not concerned with creativity and surprise assessment per se. While Frey and Fisher used a variety of functional forms, most notably power functions, as well as linear, logarithmic, and exponential, we have thus far used only linear functions (i.e., univariate linear regression) for projecting designs into the future for purposes of surprise assessment. In this paper we focus on regression models for recognizing a surprising design: a regression analysis of the attributes of existing designs against a temporal dimension is used to predict the "next" value of the attributes, and the distance from the observed value to the predicted value identifies a surprising attribute-value pair. We illustrate our use of regression models for identifying surprising designs on an automobile design dataset composed of 572 cars produced between 1878 and 2009 (Dowlen, 2012). Each car is described by manufacturer, model, type, year, and nine numerically-valued attributes related to the mechanical design of the car. In this dataset only 190 entries contain values for all nine attributes. These complete entries all occur after 1934 and are concentrated between 1966 and 1994. A summary of the number of designs and the number of attributes in our dataset is shown in Table 1.
Table 1: The mechanical design attributes and the number of automobile design records with an entry for each of the nine attributes in our dataset.

Attribute             Number of Designs
Engine Displacement   438
Bore Diameter         407
Stroke Length         407
Torque Force          236
Torque Displacement   235
Weight                356
Frontal Area          337
Maximum Speed         345
Acceleration          290

A variety of linear regression models are considered. The first model uses linear regression over the entire time period of the design data and fits a line to each attribute as a function of time. The results for one attribute, maximum speed, are shown in Figure 1. This analysis identifies outliers, and therefore potentially surprising designs. For example, the Ferrari 250LM had a surprising maximum speed in 1964, and the Bugatti Type 41 Royale has a surprising engine size (another attribute, and another regression analysis) in 1995. This first model works well for identifying outliers across a time period, but it does not identify trendsetters (or 'black swans' as Rissland might call them), since data points that occurred later in the timeline were included in the regression analysis when evaluating the surprise of a design. A trendsetter is a surprising design that changes the expectations for designs in the future, and is not simply an outlier for all time. In other words, using the entire timeline to identify surprising automobile designs does not help us identify those designs that influenced future designers. A design that is an outlier in its own time, but inspires future generations of designers to do something similar, can only be found if designs released after the design being measured are excluded from the training data.

Figure 1: Regression analysis for maximum speed over the entire time period of car design data.

Thus, we considered a second strategy that performs a linear regression only on previously created designs and measures the surprise of a new design as the distance from that design's attribute value to the projection of the line at the year of the design in question. This second regression strategy, where the time period used to fit the line for a single attribute was limited to the time before each design was released (see Figure 2), found roughly the same surprising designs as the first model (over the entire time period) for most attributes, with two exceptions: torque displacement and maximum speed. In these exceptions, outliers earlier in time were sufficiently extreme to move the entire regression line significantly between the period before the early outliers and the period after, whereas in the other cases the rough form of the regression lines created over time did not change much.

Figure 2: Using strategy 2, linear models are constructed using all previous-year designs. The circles show the predicted (or projected) values for each year from the individual regression lines; the dots show actual values. We show three sample regression lines, each ending at the year (circle) it is intended to predict, but there is actually one regression line for each year.

When training this second model, designs from every previous year were weighted equally when predicting future designs. Thus, outliers at the beginning of the dataset perpetually shifted the model and skewed the surprise measurements for all subsequent designs. And why shouldn't they - these early designs correspond roughly to what Rissland called black swans, which understandably diminish the surprise value of subsequent 'grey cygnets'.
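The sketch below illustrates the second strategy, and, via the optional window argument, the third strategy introduced next: for each design, a line is fit to one attribute using only strictly earlier designs, and surprise is scored as the gap between the observed value and the projection. The function name and the residual-scaled scoring are our own assumptions, not the authors' code.

import numpy as np

def surprise_scores(years, values, window=None):
    """Surprise of each design's attribute value relative to a linear
    projection fit on strictly earlier designs (sketch of strategies 2/3)."""
    years, values = np.asarray(years, float), np.asarray(values, float)
    scores = np.full(len(years), np.nan)
    for i in range(len(years)):
        past = years < years[i]
        if window is not None:                    # strategy 3: sliding window
            past &= years >= years[i] - window
        if past.sum() < 3:                        # need a few points to fit a line
            continue
        slope, intercept = np.polyfit(years[past], values[past], 1)
        predicted = slope * years[i] + intercept
        resid_sd = np.std(values[past] - (slope * years[past] + intercept))
        scores[i] = abs(values[i] - predicted) / (resid_sd + 1e-9)
    return scores

# e.g. for maximum speed: surprise_scores(year_list, max_speed_list, window=10)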
However, it is also the case that, when using model 2 and taking into account all past history, a large mass of 'bland' earlier designs can exaggerate the perceived surprise of a design, even when that design arrives in the midst of a spurt of like designs. These observations inspired a third linear regression strategy that makes predictions (or sets expectations) by including only designs within a specified time range before the design being measured. We use a sliding window, rather than disjoint bins. In either case, though, limited time intervals can mimic the perception of surprise of an observer with limited memory, who remembers only up to a myopic horizon into the past. The window (aka interval) size used for the cars dataset was ten years. This number was chosen because histograms of the data revealed that all ten-year periods after 1934 contained at least one design with all nine attributes, while smaller periods were very sparsely populated in the 1950s. Larger window sizes converged to the second regression model as window size increased. In general, the size of the window has a large influence on the results. Though we won't delve into the results of this final strategy here, its sensitivity has appeal. In fact, relative to our longer-term goal of modeling human surprise, this sensitivity to window size may map nicely onto different perceptions by people with different experiences. An older adult may have a very different surprise reaction than a young person, depending on past experience. In general, the selection of an appropriate range of years for the third regression model can be correlated with the typical period of time over which a person can remember. That is, if we want to compare our computational model of surprise with human expectations, we should use time intervals that are meaningful to people rather than intervals based on the distribution of the data. People will be surprised when expectations based on a time period relevant to their personal knowledge and experience of a series of designs are not met, rather than on the entire time period for all designs.

Directions for Further Research

This paper presents an approach to evaluating whether a design is surprising, and therefore creative, by including a temporal analysis of the conceptual space of existing designs and using regression analysis projected into the future to identify surprising designs. There are a number of directions we plan to follow. 1. We want to further develop the regression models, and in particular move beyond linear regression to other functional forms such as polynomial, power, and logarithmic. After all, a design might be regarded as surprising if we used linear regression to project into the future, but not at all surprising if we used a higher-order polynomial regression into the future! Identifying means of determining when one functional form is more appropriate than another for regression will be a key challenge. 2. We want to move beyond our current univariate assessments of surprise through univariate regression, to holistic, multivariate model assessments of surprise through multivariate regression. We can apply multivariate regression methods to designs as a function of time, or combine our earlier work on clustering approaches (Maher and Fisher, 2012) with our regression approaches, perhaps by performing multivariate regression over multivariate summaries of design clusters (e.g., centroids). 3.
We have thus far been investigating novelty and value (Maher and Fisher, 2012) and surprise as decoupled characteristics of creativity, but an important next step is to consider how measures of these three characteristics can be integrated into a single holistic measure of creativity, probably parameterized to account for individual differences among observers. 4. Assessments of creativity are conditioned on individual experiences; such individual differences in measures of surprise, novelty, and value are critical - what is surprising to one person is hardly so to another. We made only the barest beginning of this study in Maher and Fisher (2012), where we viewed clustering as the means by which an agent organized its knowledge base, and against which creativity would be judged. The methods for regression that we have presented in this paper will allow us to build an "imagining" capacity into an agent, adding expectations for designs that do not yet exist to the knowledge base of agents responsible for assessing creativity. 5. In all the variants that we plan to explore, we want to match the results of our models in identifying surprising designs to human judgments of surprise, and of course to assessments of the creativity (novelty, value, surprise) of the designs generally. 6. Finally, our work to date assumes that designs are represented as attribute-value vectors; these propositional representations are clustered in Maher and Fisher (2012), or subjected to time-based regression in this paper. We want to move to relational models, however, perhaps first-order representations and richer representations still. Relational representations would likely be required in Rissland's legal domain, if in fact that domain were formalized. A domain that we find very attractive for exploring relational representations is the domain of computer programs, which follow a formal representation and for which a number of well-established tools exist for evaluating novelty, value, and surprise. For example, tools for identifying plagiarism in computer programs measure "deep" similarity between programs (and can be adapted as novelty detectors), and could serve for assessing surprise as well. An ability to measure the creativity of "generic" computer programs will allow us to move into virtually any (computable) domain that we want. For example, consider mathematical reasoning in students. In an elementary course, we can imagine seeing a large number of programs designed to compute the variance of data values, composed of two sequential loops - the first to compute the mean of the data, and a subsequent loop to compute the variance given the mean. These programs will be very similar at a deep level. Imagine then seeing a program that computes the variance (and mean) with ONE loop, relying on a mathematical "simplification." These are the kinds of assessments of creativity that we can expect in more sophisticated relational domains, all enabled by capabilities to assess computer programs. Acknowledgements: We thank our anonymous reviewers for helpful comments, which guided our revision. 2013_22 !2013 Less Rhyme, More Reason: Knowledge-based Poetry Generation with Feeling, Insight and Wit Tony Veale Web Science & Technology Division, KAIST / School of Computer Science and Informatics, UCD Korea Advanced Institute of Science & Technology, South Korea / University College Dublin, Ireland.
Tony.Veale@gmail.com Abstract Linguistic creativity is a marriage of form and content in which each works together to convey our meanings with concision, resonance and wit. Though form clearly influences and shapes our content, the most deft formal trickery cannot compensate for a lack of real insight. Before computers can be truly creative with language, we must first imbue them with the ability to formulate meanings that are worthy of creative expression. This is especially true of computer-generated poetry. If readers are to recognize a poetic turn-of-phrase as more than a superficial manipulation of words, they must perceive and connect with the meanings and the intent behind the words. So it is not enough for a computer to merely generate poem-shaped texts; poems must be driven by conceits that build an affective worldview. This paper describes a conceit-driven approach to computational poetry, in which metaphors and blends are generated for a given topic and affective slant. Subtle inferences drawn from these metaphors and blends can then drive the process of poetry generation. In the same vein, we consider the problem of generating witty insights from the banal truisms of common-sense knowledge bases.

Ode to a Keatsian Turn

Poetic licence is much more than a licence to frill. Indeed, it is not so much a licence as a contract, one that allows a speaker to subvert the norms of both language and nature in exchange for communicating real insights about some relevant state of affairs. Of course, poetry has norms and conventions of its own, and these lend poems a range of recognizably "poetic" formal characteristics. When used effectively, formal devices such as alliteration, rhyme and cadence can mold our meanings into resonant and incisive forms. However, even the most poetic devices are just empty frills when used only to disguise the absence of real insight. Computer models of poem generation must model more than the frills of poetry, and must instead make these formal devices serve the larger goal of meaning creation. Nonetheless, it is often said that we "eat with our eyes", so that the stylish presentation of food can subtly influence our sense of taste. So it is with poetry: a pleasing form can do more than enhance our recall and comprehension of a meaning - it can also suggest a lasting and profound truth. Experiments by McGlone & Tofighbakhsh (1999, 2000) lend empirical support to this so-called Keats heuristic, the intuitive belief - named for Keats' memorable line "Beauty is truth, truth beauty" - that a meaning which is rendered in an aesthetically-pleasing form is much more likely to be perceived as truthful than if it is rendered in a less poetic form. McGlone & Tofighbakhsh demonstrated this effect by searching a book of proverbs for uncommon aphorisms with internal rhyme - such as "woes unite foes" - and by using synonym substitution to generate non-rhyming (and thus less poetic) variants such as "troubles unite enemies". While no significant differences were observed in subjects' ease of comprehension for rhyming/non-rhyming forms, subjects did show a marked tendency to view the rhyming variants as more truthful expressions of the human condition than the corresponding non-rhyming forms. So a well-polished poetic form can lend even a modestly interesting observation the lustre of a profound insight. An automated approach to poetry generation can exploit this symbiosis of form and content in a number of useful ways.
It might harvest interesting perspectives on a given topic from a text corpus, or it might search its stores of commonsense knowledge for modest insights to render in immodest poetic forms. We describe here a system that combines both of these approaches for meaningful poetry generation. As shown in the sections to follow, this system - named Stereotrope - uses corpus analysis to generate affective metaphors for a topic on which it is asked to wax poetic. Stereotrope can be asked to view a topic from a particular affective stance (e.g., view love negatively) or to elaborate on a familiar metaphor (e.g. love is a prison). In doing so, Stereotrope takes account of the feelings that different metaphors are likely to engender in an audience. These metaphors are further integrated to yield tight conceptual blends, which may in turn highlight emergent nuances of a viewpoint that are worthy of poetic expression (see Lakoff and Turner, 1989). Stereotrope uses a knowledge-base of conceptual norms to anchor its understanding of these metaphors and blends. While these norms are the stuff of banal clichés and stereotypes, such as that dogs chase cats and cops eat donuts, we also show how Stereotrope finds and exploits corpus evidence to recast these banalities as witty, incisive and poetic insights.

Mutual Knowledge: Norms and Stereotypes

Samuel Johnson opined that "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it." Traditional approaches to the modelling of metaphor and other figurative devices have typically sought to imbue computers with the former (Fass, 1997). More recently, however, the latter kind has gained traction, with the use of the Web and text corpora to source large amounts of shallow knowledge as it is needed (e.g., Veale & Hao 2007a,b; Shutova 2010; Veale & Li, 2011). But the kind of knowledge demanded by knowledge-hungry phenomena such as metaphor and blending is very different to the specialist "book" knowledge so beloved of Johnson. These demand knowledge of the quotidian world that we all tacitly share but rarely articulate in words, not even in the thoughtful definitions of Johnson's dictionary. Similes open a rare window onto our shared expectations of the world. Thus, the as-as-similes "as hot as an oven", "as dry as sand" and "as tough as leather" illuminate the expected properties of these objects, while the like-similes "crying like a baby", "singing like an angel" and "swearing like a sailor" reflect intuitions of how these familiar entities are tacitly expected to behave. Veale & Hao (2007a,b) thus harvest large numbers of as-as-similes from the Web to build a rich stereotypical model of familiar ideas and their salient properties, while Özbal & Stock (2012) apply a similar approach on a smaller scale using Google's query completion service. Fishelov (1992) argues convincingly that poetic and non-poetic similes are crafted from the same words and ideas. Poetic conceits use familiar ideas in non-obvious combinations, often with the aim of creating semantic tension. The simile-based model used here thus harvests almost 10,000 familiar stereotypes (drawing on a range of ~8,000 features) from both as-as and like-similes. Poems construct affective conceits, but as shown in Veale (2012b), the features of a stereotype can be affectively partitioned as needed into distinct pleasant and unpleasant perspectives.
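A minimal sketch of this kind of simile harvesting is shown below. The regular expression and the toy snippet stand in for Web-scale text, and the function name is our own; a real harvester would of course need far more robust pattern matching.

import re
from collections import defaultdict

# Match "as PROPERTY as (a|an|the) NOUN"; longest article first so "an" wins.
AS_AS = re.compile(r"\bas (\w+) as (?:an|a|the)?\s*(\w+)", re.IGNORECASE)

def harvest_stereotypes(text):
    """Map each noun to the salient properties asserted of it in
    'as X as Y' similes (toy stand-in for Web-scale harvesting)."""
    stereotypes = defaultdict(set)
    for prop, noun in AS_AS.findall(text):
        stereotypes[noun.lower()].add(prop.lower())
    return stereotypes

snippets = "as hot as an oven ... as dry as sand ... as tough as leather"
print(dict(harvest_stereotypes(snippets)))
# {'oven': {'hot'}, 'sand': {'dry'}, 'leather': {'tough'}}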
We are thus confident that a stereotype-based model of common-sense knowledge is equal to the task of generating and elaborating affective conceits for a poem. A stereotype-based model of common-sense knowledge requires both features and relations, with the latter showing how stereotypes relate to each other. It is not enough then to know that cops are tough and gritty, or that donuts are sweet and soft; our stereotypes of each should include the cliché that cops eat donuts, just as dogs chew bones and cats cough up furballs. Following Veale & Li (2011), we acquire inter-stereotype relationships from the Web, not by mining similes but by mining questions. As in Özbal & Stock (2012), we target query completions from a popular search service (Google), which offers a smaller, public proxy for a larger, zealously-guarded search query log. We harvest questions of the form "Why do Xs Ys", and assume that since each relationship is presupposed by the question (so "why do bikers wear leathers" presupposes that everyone knows that bikers wear leathers), the triple of subject/relation/object captures a widely-held norm. In this way we harvest over 40,000 such norms from the Web.

Generating Metaphors, N-Gram Style!

The Google n-grams (Brants & Franz, 2006) is a rich source of popular metaphors of the form Target is Source, such as "politicians are crooks", "Apple is a cult", "racism is a disease" and "Steve Jobs is a god". Let src(T) denote the set of stereotypes that are commonly used to describe a topic T, where commonality is defined as the presence of the corresponding metaphor in the Google n-grams. To find metaphors for proper-named entities, we also analyse n-grams of the form stereotype First [Middle] Last, such as "tyrant Adolf Hitler" and "boss Bill Gates". Thus, e.g.:

src(racism) = {problem, disease, joke, sin, poison, crime, ideology, weapon}
src(Hitler) = {monster, criminal, tyrant, idiot, madman, vegetarian, racist, …}

Let typical(T) denote the set of properties and behaviors harvested for T from Web similes (see previous section), and let srcTypical(T) denote the aggregate set of properties and behaviors ascribable to T via the metaphors in src(T):

(1) srcTypical(T) = ∪_{M ∈ src(T)} typical(M)

We can generate conceits for a topic T by considering not just obvious metaphors for T, but metaphors of metaphors:

(2) conceits(T) = src(T) ∪ ∪_{M ∈ src(T)} src(M)

The features evoked by the conceit T as M are given by:

(3) salient(T, M) = [srcTypical(T) ∪ typical(T)] ∩ [srcTypical(M) ∪ typical(M)]

The degree to which a conceit M is apt for T is given by:

(4) aptness(T, M) = |salient(T, M) ∩ typical(M)| / |typical(M)|

We should focus on apt conceits M ∈ conceits(T) where:

(5) apt(T, M) = |salient(T, M) ∩ typical(M)| > 0

and rank the set of apt conceits by aptness, as given in (4). The set salient(T, M) identifies the properties / behaviours that are evoked and projected onto T when T is viewed through the metaphoric lens of M. For affective conceits, this set can be partitioned on demand to highlight only the unpleasant aspects of the conceit ("you are such a baby!") or only the pleasant aspects ("you are my baby!"). Veale & Li (2011) further show how n-gram evidence can be used to selectively project the salient norms of M onto T.

Once More With Feeling

Veale (2012b) shows that it is a simple matter to filter a set of stereotypes by affect, to reliably identify the metaphors that impart a mostly positive or negative "spin".
But poems are emotion-stirring texts that exploit much more than a crude two-tone polarity. A system like Stereotrope should also model the emotions that a metaphorical conceit will stir in a reader. Yet before Stereotrope can appreciate the emotions stirred by the properties of a poetic conceit, it must model how properties reinforce and imply each other. A stereotype is a simplified but coherent representation of a complex real-world phenomenon. So we cannot model stereotypes as simple sets of discrete properties - we must also model how these properties cohere with each other. For example, the property lush suggests the properties green and fertile, while green suggests new and fresh. Let cohere(p) denote the set of properties that suggest and reinforce p-ness in a stereotype-based description. Thus e.g. cohere(lush) = {green, fertile, humid, dense, …} while cohere(hot) = {humid, spicy, sultry, arid, sweaty, …}. The set of properties that coherently reinforce another property is easily acquired through corpus analysis - we need only look for similes where multiple properties are ascribed to a single topic, as in e.g. "as hot and humid as a jungle". To this end, an automatic harvester trawls the Web for instances of the pattern "as X and Y as", and assumes for each X and Y pair that Y ∈ cohere(X) and X ∈ cohere(Y). Many properties have an emotional resonance, though some evoke more obvious feelings than others. The linguistic mapping from properties to feelings is also more transparent for some property / feeling pairs than others. Consider the property appalling, which is stereotypical of tyrants: the common linguistic usage "feel appalled by" suggests that an entity with this property is quite likely to make us "feel appalled". Corpus analysis allows a system to learn a mapping from properties to feelings for these obvious cases, by mining instances of the n-gram pattern "feel P+ed by", where P can be mapped to the property of a stereotype via a simple morphology rule. Let feeling(p) denote the set of feelings that is learnt in this way for the property p. Thus, feeling(disgusting) = {feel_disgusted_by} while feeling(humid) = {}. Indeed, because this approach can only find obvious mappings, feeling(p) = {} for most p. However, cohere(p) can be used to interpolate a range of feelings for almost any property p. Let evoke(p) denote the set of feelings that are likely to be stirred by a property p. We can now interpolate evoke(p) as follows:

(6) evoke(p) = feeling(p) ∪ ∪_{c ∈ cohere(p)} feeling(c)

So a property p also evokes a feeling f if p suggests another property c that evokes f. We can predict the range of emotional responses to a stereotype S in the same way:

(7) evoke(S) = ∪_{p ∈ typical(S)} evoke(p)

If M is chosen from conceits(T) to metaphorically describe T, the metaphor M is likely to evoke these feelings for T:

(8) evoke(T, M) = ∪_{p ∈ salient(T, M)} evoke(p)

For purposes of gradation, evoke(p) and evoke(S) denote a bag of feelings rather than a set of feelings. Thus, the more properties of S that evoke f, the more times that evoke(S) will contain f, and the more likely it is that the use of S as a conceit will stir the feeling f in the reader. Stereotrope can thus predict that both feel disgusted by and feel thrilled by are two possible emotional responses to the property bloody (or to the stereotype war), and also know that the former is by far the more likely response of the two.
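The set algebra in (1), (3), (4), (6) and (8) is compact enough to sketch directly over Python dictionaries. The dictionaries typical, src, cohere and feeling below are toy stand-ins for the harvested resources, and the function names are our own illustrative assumptions, not Stereotrope's code.

# Toy stand-ins for the harvested resources.
typical = {"love": {"hot", "blind"}, "fire": {"hot", "bright"}}
src = {"love": {"fire"}, "fire": set()}
cohere = {"hot": {"burning"}, "bright": set(), "blind": set(), "burning": set()}
feeling = {"burning": {"feel_burned_by"}, "hot": set(), "bright": set(), "blind": set()}

def src_typical(t):                      # (1): properties reachable via metaphors
    return set().union(set(), *(typical[m] for m in src[t]))

def salient(t, m):                       # (3): features evoked by the conceit T as M
    return (src_typical(t) | typical[t]) & (src_typical(m) | typical[m])

def aptness(t, m):                       # (4): how apt the conceit M is for T
    return len(salient(t, m) & typical[m]) / len(typical[m])

def evoke_property(p):                   # (6): feelings stirred by one property
    return feeling[p] | set().union(set(), *(feeling[c] for c in cohere[p]))

def evoke_metaphor(t, m):                # (8), kept as a bag (list) for gradation
    return [f for p in salient(t, m) for f in evoke_property(p)]

print(aptness("love", "fire"), evoke_metaphor("love", "fire"))
# 1.0 ['feel_burned_by']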
The set evoke(T, M) for the metaphorical conceit T is M can serve the goal of poetry generation in different ways. Most obviously, it is a rich source of feelings that can be explicitly mentioned in a poem about T (as viewed through M). Alternately, these feelings can be used in a meta-text to motivate and explain the viewpoint of the poem. The act of crafting an explanatory text to showcase a poetry system's creative intent is dubbed framing in Colton et al. (2012). The current system puts the contents of evoke(T, M) to both of these uses: in the poem itself, it expresses feelings to show its reaction to certain metaphorical properties of T; and in an accompanying framing text, it cites these feelings as a rationale for choosing the conceit T is M. For example, in a poem based on the conceit marriage is a prison, the set evoke(marriage, prison) contains the feelings bored_by, confined_in, oppressed_by, chilled_by and intimidated_by. The meta-text that frames the resulting poem expresses the following feelings (using simple NL generation schema): "Gruesome marriage and its depressing divorces appall me. I often feel disturbed and shocked by marriage and its twisted rings. Does marriage revolt you?"

Atoms, Compounds and Conceptual Blends

If linguistic creativity is chemistry with words and ideas, then stereotypes and their typical properties constitute the periodic table of elements that novel reactions are made of. These are the descriptive atoms that poems combine into metaphorical mixtures, as modeled in (1) … (8) above. But poems can also fuse these atoms into nuanced compounds that may subtly suggest more than the sum of their parts. Consider the poetry-friendly concept moon, for which Web similes provide the following descriptive atoms: typical(moon) = {lambent, white, round, pockmarked, shimmering, airless, silver, bulging, cratered, waning, waxing, spooky, eerie, pale, pallid, deserted, glowing, pretty, shining, expressionless, rising}. Corpus analysis reveals that authors combine atoms such as these in a wide range of resonant compounds. Thus, the Google 2-grams contain such compounds as "pallid glow", "lambent beauty", "silver shine" and "eerie brightness", all of which can be used to good effect in a poem about the moon. Each compound denotes a compound property, and each exhibits the same linguistic structure. So to harvest a very large number of compound properties, we simply scan the Google 2-grams for phrases of the form "ADJ NOUN", where ADJ and NOUN must each denote a property of the same stereotype. While ADJ maps directly to a property, a combination of morphological analysis and dictionary search is needed to map NOUN to its property (e.g. beauty → beautiful). What results is a large poetic lexicon, one that captures the diverse and sometimes unexpected ways in which the atomic properties of a stereotype can be fused into nuanced carriers of meaning. Compound descriptions denote compound properties, and those that are shared by different stereotypes reflect the poetic ways in which those concepts are alike. For example, shining beauty is shared by over 20 stereotypes in our poetic lexicon, describing such entries as moon, star, pearl, smile, goddess and sky. A stereotype suggests behaviors as well as properties, and a fusion of both perspectives can yield a more nuanced view. The patterns "VERB ADV" and "ADV VERB" are used to harvest all 2-grams where a property expressed as an adverb qualifies a related property expressed as a verb.
For example, the Google 2-gram "glow palely" unites the properties glowing and pale of moon, which allows moon to be recognized as similar to candle and ghost because they too can be described by the compound glow palely. A ghost, in turn, can noiselessly glide, as can a butterfly, which may sparkle radiantly like a candle or a star or a sunbeam. Not every pairing of descriptive atoms will yield a meaningful compound, and it takes common-sense - or a poetic imagination - to sense which pairings will work in a poem. Though an automatic poet is endowed with neither, it can still harvest and re-use the many valid combinations that humans have added to the language trove of the Web. Poetic allusions anchor a phrase in a vivid stereotype while shrouding its meaning in constructive ambiguity. Why talk of the pale glow of the moon when you can allude to its ghostly glow instead? The latter does more than evoke the moon's paleness - it attributes this paleness to a supernatural root, and suggests a halo of other qualities such as haunting, spooky, chilling and sinister. Stereotypes are dense descriptors, and the use of one to convey a single property like pale will subtly suggest other readings and resonances. The phrase "ghostly glow" may thus allude to any corpus-attested compound property that can be forged from the property glowing and any other element of the set typical(ghost). Many stereotype nouns have adjectival forms - such as ghostly for ghost, freakish for freak, inky for ink - and these may be used in corpora to qualify the nominal form of a property of that very stereotype, such as gloom for gloomy, silence for silent, or pallor for pale. The 2-gram "inky gloom" can thus be understood as an allusion either to the blackness or wetness of ink, so any stereotype that combines the properties dark and wet (e.g. oil, swamp, winter) or dark and black (e.g. crypt, cave, midnight) can be poetically described as exhibiting an inky gloom. Let compounds(…) denote a function that maps a set of atomic properties such as shining and beautiful to the set of compound descriptors - such as the compound property shining beauty or the compound allusion ghostly glow - that can be harvested from the Google 2-grams. It follows that compounds(typical(S)) denotes the set of corpus-attested compounds that can describe a stereotype S, while compounds(salient(T, M)) denotes the set of compound descriptors that might be used in a poem about T to suggest the poetic conceit T is M. Since these compounds will fuse atomic elements from the stereotypical representations of both T and M, compounds(salient(T, M)) can be viewed as a blend of T and M. As described in Fauconnier & Turner (2002), and computationally modeled in various ways in Veale & O'Donoghue (2000), Pereira (2007) and Veale & Li (2011), a "blend" is a tight conceptual integration of two or more mental spaces. This integration yields more than a mixture of representational atoms: a conceptual blend often creates emergent elements - new molecules of meaning - that are present in neither of the input representations but which only arise from the fusion of these representations. How might the representations discussed here give rise to emergent elements? We cannot expect new descriptive atoms to be created by a poetic blend, but we can expect new compounds to emerge from the re-combination of descriptive atoms in the compound descriptors of T and M.
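The sketch below illustrates compounds(…) over a toy 2-gram table, together with the emergent-compound test that is formalised as (9) just below. The 2-gram table, the noun-to-property map and the function names are illustrative assumptions of ours, not the system's resources.

# Toy corpus of attested "ADJ NOUN" 2-grams, with nouns mapped to properties.
two_grams = {("pallid", "glow"), ("lambent", "beauty"), ("silver", "shine"),
             ("ghostly", "glow"), ("romantic", "stillness")}
noun_to_property = {"glow": "glowing", "beauty": "beautiful",
                    "shine": "shining", "stillness": "still"}

def compounds(properties):
    """Corpus-attested compound descriptors whose two atoms both lie
    in the given property set (sketch of compounds(...))."""
    props = set(properties)
    return {(adj, noun) for adj, noun in two_grams
            if adj in props and noun_to_property.get(noun) in props}

def emergent(t_props, m_props, salient_props):
    """Compounds available from the blend but from neither input alone,
    anticipating definition (9) below."""
    return compounds(salient_props) - compounds(t_props) - compounds(m_props)

# Mirrors the paper's "love is the grave" example: "romantic stillness"
# emerges only when the two property sets are fused.
print(emergent({"romantic", "sweet"}, {"still", "dark"},
               {"romantic", "sweet", "still", "dark"}))
# {('romantic', 'stillness')}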
Just as we can expect compounds(typical(T) ∪ typical(M)) to suggest a wider range of descriptive possibilities than compounds(typical(T)) ∪ compounds(typical(M)), we say:

(9) emergent(T, M) = {p ∈ compounds(salient(T, M)) | p ∉ compounds(typical(T)) ∧ p ∉ compounds(typical(M))}

In other words, the compound descriptions that emerge from the blend of T and M are those that could not have emerged from the properties of T alone, or from M alone, but can only emerge from the fusion of T and M together. Consider the poetic conceit love is the grave. The resulting blend - as captured by compounds(salient(T, M)) - contains a wide variety of compound descriptors. Some of these compounds emerge solely from the concept grave, such as sacred gloom, dreary chill and blessed stillness. Many others emerge only from a fusion of love and grave, such as romantic stillness, sweet silence, tender darkness, cold embrace, quiet passion and consecrated devotion. So a poem that uses these phrases to construct an emotional worldview will not only demonstrate an understanding of its topic and its conceit, but will also demonstrate some measure of insight into how one can complement and resonate with the other (e.g., that darkness can be tender, passion can be quiet and silence can be sweet). While the system builds on second-hand insights, insofar as these are ultimately derived from Web corpora, such insights are fragmentary and low-level. It still falls to the system to stitch these into its own emotionally coherent patchwork of poetry. What use is poetry if we or our machines cannot learn from it the wild possibilities of language and life?

Generating Witty Insights from Banal Facts

Insight requires depth. To derive original insights about the topic of a poem, say, of a kind an unbiased audience might consider witty or clever, a system needs more than shallow corpus data; it needs deep knowledge of the real world. It is perhaps ironic then that the last place one is likely to find real insight is in the riches of a structured knowledge base. Common-sense knowledge-bases are especially lacking in insight, since these are designed to contain knowledge that is common to all and questioned by none. Even domain-specific knowledge-bases, rich in specialist knowledge, are designed as repositories of axiomatic truths that will appear self-evident to their intended audience of experts. Insight is both a process and a product. While insight undoubtedly requires knowledge, it also takes work to craft surprising insights from the unsurprising generalizations that make up the bulk of our conventional knowledge. Though mathematicians occasionally derive surprising theorems from the application of deductive techniques to self-evident axioms, sound reasoning over unsurprising facts will rarely yield surprising conclusions. Yet witty insights are not typically the product of an entirely sound reasoning process. Rather, such insights amuse and provoke via a combination of over-statement, selective use of facts, a mixing of distinct knowledge types, and a clever packaging that makes maximal use of the Keats heuristic. Indeed, as has long been understood by humor theorists, the logic of humorous insight is deeply bound up with the act of framing. The logical mechanism of a joke - a kind of pseudo-logical syllogism for producing humorous effects - is responsible for framing a situation in such a way that it gives rise to an unexpected but meaningful incongruity (Attardo & Raskin, 1992; Attardo et al., 2002).
To craft witty insights from innocuous generalities, a system must draw on an arsenal of such logical mechanisms to frame its observations of the world in appealingly discordant ways. Attardo and Raskin view the role of a logical mechanism (LM) as the engine of a joke: each LM provides a different way of bringing together two overlapping scripts that are mutually opposed in some pivotal way. A joke narrative is fully compatible with one of these scripts and only partly compatible with the other, yet it is the partial match that we, as listeners, jump to first to understand the narrative. In a well-structured joke, we only recognize the inadequacy of this partially-apt script when we reach the punchline, at which point we switch our focus to its unlikely alternative. The realization that we can easily be duped by appearances, combined with the sense of relief and understanding that this realization can bring, results in the AHA! feeling of insight that often accompanies the HA-HA of a good joke. LMs suited to narrative jokes tend to engineer oppositions between narrative scripts, but for purposes of crafting witty insights in one-line poetic forms, we will view a script as a stereotypical representation of an entity or event. Armed with an arsenal of stereotype "scripts", Stereotrope will seek to highlight the tacit opposition between different stereotypes as they typically relate to each other, while also engineering credible oppositions based on corpus evidence. A sound logical system cannot brook contradictions. Nonetheless, uncontroversial views can be cleverly framed in such a way that they appear sound and contradictory, as when the columnist David Brooks described the Olympics as a "peaceful celebration of our warlike nature". His form has symmetry and cadence, and pithily exploits the Keats heuristic to reconcile two polar opposites, war and peace. Poetic insights do not aim to create real contradictions, but aim to reveal (and reconcile) the unspoken tensions in familiar ideas and relationships. We have discussed two kinds of stereotypical knowledge in this paper: the property view of a stereotype S, as captured in typical(S), and the relational view, as captured by a set of question-derived generalizations of the form Xs Ys. A blend of both these sources of knowledge can yield emergent oppositions that are not apparent in either source alone. Consider the normative relation bows fire arrows. Bows are stereotypically curved, while arrows are stereotypically straight, so lurking beneath the surface of this innocuous norm is a semantic opposition that can be foregrounded to poetic effect. The Keats heuristic can be used to package this opposition in a pithy and thought-provoking form: thus compare "curved bows fire straight arrows" (so what?) with "straight arrows do curved bows fire" (more poetic) and "the most curved bows fire the straightest arrows" (most poetic). While this last form is an overly strong claim that is not strictly supported by the stereotype model, it has the sweeping form of a penetrating insight that grabs one's attention. Its pragmatic effect - a key function of poetic insight - is to reconcile two opposites by suggesting that they fill complementary roles. In schematic terms, such insights can be derived from any single norm of the form Xs Ys where X and Y denote stereotypes with salient properties - such as soft and tough, long and short - that can be framed in striking opposition.
For instance, the combination of the norm cops eat donuts with the clichéd views of cops as tough and donuts as soft yields the insight "the toughest cops eat the softest donuts". As the property tough is undermined by the property soft, this may be viewed as a playful subversion of the tough cop stereotype. The property of toughness can be further subverted, with an added suggestion of hypocrisy, by expressing the generalization as a rhetorical question: "Why do the toughest cops eat the softest donuts?" A single norm represents a highly simplified script, so a framing of two norms together often allows for opposition via a conflict of overlapping scripts. Activists, for example, typically engage in tense struggles to achieve their goals. But activists are also known for the slogans they coin and the chants they sing. Most slogans, whether designed to change the law or sell detergent, are catchy and uplifting. These properties and norms can now be framed in poetic opposition: "The activists that chant the most uplifting slogans suffer through the most depressing struggles". While the number of insights derivable from single norms is a linear function of the size of the knowledge base, a combinatorial opportunity exists to craft insights from pairs of norms. Thus, "angels who fight the foulest demons play the sweetest harps", "surgeons who wield the most hardened blades wear the softest gloves", and "celebrities who promote the most reputable charities suffer the sleaziest scandals" all achieve conflict through norm juxtaposition. Moreover, the order of a juxtaposition - positive before negative or vice versa - can also sway the reader toward a cynical or an optimistic interpretation. Wit portrays opposition as an inherent part of reality, yet often creates the oppositions that it appears to reconcile. It does so by elevating specifics into generalities, to suggest that opposition is the norm rather than the exception. So rather than rely wholly on stereotypes and their expected properties, Stereotrope uses corpus evidence as a proxy imagination to concoct new classes of individuals with interesting and opposable qualities. Consider the Google 2-gram "short celebrities", whose frequency and plurality suggest that shortness is a noteworthy (though not typical) property of a significant class of celebrities. Stereotrope already possesses the norm that "celebrities ride in limousines", as well as a stereotypical expectation that limousines are long. This juxtaposition of conventions allows it to frame a provocatively sweeping generalization: "Why do the shortest celebrities ride in the longest limousines?" While Stereotrope has no evidence for this speculative claim, and no real insight into the status-anxiety of the rich but vertically-challenged, such an understanding may follow in time, as deeper and subtler knowledge-bases become available for poetry generation. Poetic insight often takes the form of sweeping claims that elevate vivid cases into powerful exemplars. Consider how Stereotrope uses a mix of n-gram evidence and norms to generate these maxims: "The most curious scientists achieve the most notable breakthroughs" and "The most impartial scientists use the most accurate instruments". The causal seeds of these insights are mined from the Google n-grams in coordinations such as "hardest and sharpest" and "most curious and most notable".
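As a rough sketch of this framing step (our own rendering of the schema, not Stereotrope's generator), a single norm X-relation-Y plus opposed salient properties of X and Y can be slotted into a superlative template; the naive "-est" suffixing below only handles regular one-syllable adjectives.

def frame_insight(x, relation, y, x_prop, y_prop, as_question=False):
    """Render a norm (e.g. cops eat donuts) plus opposed properties
    (tough/soft) as a sweeping superlative claim, per the schema above."""
    claim = f"the {x_prop}est {x} {relation} the {y_prop}est {y}"
    return f"Why do {claim}?" if as_question else claim.capitalize()

print(frame_insight("cops", "eat", "donuts", "tough", "soft"))
# The toughest cops eat the softest donuts
print(frame_insight("cops", "eat", "donuts", "tough", "soft", as_question=True))
# Why do the toughest cops eat the softest donuts?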
These coordination-derived n-gram relationships are then projected onto banal norms - such as scientists achieve breakthroughs and scientists use instruments - for whose participants these properties are stereotypical (e.g. scientists are curious and impartial, instruments are accurate, breakthroughs are notable, etc.). Such claims can be taken literally, or viewed as vivid allusions to important causal relationships. Indeed, when framed as explicit analogies, the juxtaposition of two such insights can yield unexpected resonances. For example, "the most trusted celebrities ride in the longest limousines" and "the most trusted preachers give the longest sermons" are both inspired by the 4-gram "most trusted and longest." This common allusion suggests an analogy: "Just as the most trusted celebrities ride in the longest limousines, the most trusted preachers give the longest sermons". Though such analogies are driven by superficial similarity, they can still evoke deep resonances for an audience. Perhaps a sermon is a vehicle for a preacher's ego, just as a limousine is an obvious vehicle for a celebrity's ego? Reversing the order of the analogy significantly alters its larger import, suggesting that ostentatious wealth bears a lesson for us all.

Tying it all together in Stereotrope

Having created the individual pieces of form and meaning from which a poem might be crafted, it now falls to us to put the pieces together in some coherent form. To recap, we have shown how affective metaphors may be generated for a given topic, by building on popular metaphors for that topic in the Google n-grams; shown how a tight conceptual blend, with emergent compound properties of its own, can be crafted from each of these metaphors; shown how the feelings evoked by these properties may be anticipated by a system; and shown how novel insights can be crafted from a fusion of stereotypical norms and corpus evidence. We view a poem as a summarization and visualization device that samples the set of properties and feelings that are evoked when a topic T is viewed as M. Given T, an M is chosen randomly from conceits(T). Each line of the text renders one or more properties in poetic form, using tropes such as simile and hyperbole. So if salient(T, M) contains hot and compounds(salient(T, M)) contains burn brightly - for T=love and M=fire, say - this mix of elements may be rendered as "No fire is hotter or burns more brightly". It can also be rendered as an imperative, "Burn brightly with your hot love", or a request, "Let your hot love burn brightly". The range of tropes is best conveyed with examples, such as this poetic view of marriage as a prison:

The legalized regime of this marriage
My marriage is an emotional prison
Barred visitors do marriages allow
The most unitary collective scarcely organizes so much
Intimidate me with the official regulation of your prison
Let your sexual degradation charm me
Did ever an offender go to a more oppressive prison?
You confine me as securely as any locked prison cell
Does any prison punish more harshly than this marriage?
You punish me with your harsh security
The most isolated prisons inflict the most difficult hardships
O Marriage, you disgust me with your undesirable security

Each poem obeys a semantic grammar, which minimally indicates the trope that should be used for each line.
Since the second line of the grammar asks for an apt , Stereotrope constructs one by comparing marriage to a collective; as the second-last line asks for an apt , one is duly constructed around the Google 4-gram "most isolated and most difficult". The grammar may also dictate whether a line is rendered as an assertion, an imperative, a request or a question, and whether it is framed positively or negatively. This grammar need not be a limiting factor, as one can choose randomly from a pool of grammars, or even evolve a new grammar by soliciting user feedback. The key point is the pivotal role of a grammar of tropes in mapping from the properties and feelings of a metaphorical blend to a sequence of poetic renderings of these elements. Consider this poem, from the metaphor China is a rival:

No Rival Is More Bitterly Determined
Inspire me with your determined battle
The most dogged defender scarcely struggles so much
Stir me with your spirited challenge
Let your competitive threat reward me
Was ever a treaty negotiated by a more competitive rival?
You compete with me like a competitively determined athlete
Does any rival test more competitively than this China?
You oppose me with your bitter battle
Can a bitter rival suffer from such sweet jealousies?
O China, you oppress me with your hated fighting

Stereotypes are most eye-catching when subverted, as in the second-last line above. The Google 2-gram "sweet jealousies" catches Stereotrope's eye (and ours) because it up-ends the belief that jealousy is a bitter emotion. This subversion nicely complements the stereotype that rivals are bitter, allowing Stereotrope to impose a thought-provoking opposition onto the banal norm rivals suffer from jealousy. Stereotrope emphasises meaning and intent over sound and form, and does not (yet) choose lines for their rhyme or metre. However, given a choice of renderings, it does choose the form that makes best use of the Keats heuristic, by favoring lines with alliteration and internal symmetry.

Evaluation

Stereotrope is a knowledge-based approach to poetry, one that crucially relies on three sources of inspiration: a large roster of stereotypes, which maps a slew of familiar ideas to their most salient properties; a large body of normative relationships which relate these stereotypes to each other; and the Google n-grams, a vast body of language snippets. The first two are derived from attested language use on the web, while the third is a reduced view of the linguistic web itself. Stereotrope represents approx. 10,000 stereotypes in terms of approx. 75,000 stereotype-to-property mappings, where each of these is supported by a real web simile that attests to the accepted salience of a given property. In addition, Stereotrope represents over 50,000 norms, each derived from a presupposition-laden question on the web. The reliability of Stereotrope's knowledge has been demonstrated in recent studies. Veale (2012a) shows that Stereotrope's simile-derived representations are balanced and unbiased, as the positive/negative affect of a stereotype T can be reliably estimated as a function of the affect of the contents of typical(T). Veale (2012b) further shows that typical(T) can be reliably partitioned into sets of positive or negative properties as needed, to reflect an affective "spin" imposed by any given metaphor M.
Moreover, Veale (ibid) shows that copula metaphors of the form T is an M in the Google n-grams - the source of srcTypical(T) - are also broadly consistent with the properties and affective profile of each stereotype T. So in 87% of cases, one can correctly assign the label positive or negative to a topic T using only the contents of srcTypical(T), provided it is not empty. Stereotrope derives its appreciation of feelings from its understanding of how one property presupposes another. The intuition that two properties X and Y found in the pattern "as X and Y as" evoke similar feelings is supported by the strong correlation (0.7) observed between the positivity of X and of Y over the many X/Y pairs harvested from the web using this acquisition pattern. The "fact" that bats lay eggs can be found over 40,000 times on the web via Google. On closer examination, most matches form part of a larger question, "do bats lay eggs?" The question "why do bats lay eggs?" has zero matches. So "Why do" questions provide an effective superstructure for acquiring normative facts from the web: they identify facts that are commonly presupposed, and thus stereotypical, and they clearly mark the start and end of each presupposition. Such questions also yield useful facts: Veale & Li (2011) show that when these facts are treated as features of the stereotypes for which they are presupposed, they provide an excellent basis for classifying different stereotypes into the same ontological categories, as would be predicted by an ontology such as WordNet (Fellbaum, 1998). Moreover, these features can be reliably distributed to close semantic neighbors to overcome the problem of knowledge sparsity. Veale & Li demonstrate that the likelihood that a feature of stereotype A can also be assumed of stereotype B is a clear function of the WordNet similarity of A and B. While this is an intuitive finding, it would not hold at all if not for the fact that these features are truly meaningful for A (and B). The problem posed by "bats lay eggs" is one faced by any system that does not perceive the whole context of an utterance. As such, it is a problem that plagues the use of n-gram models of web content, such as Google's n-grams. Stereotrope uses n-grams to suggest insightful connections between two properties or ideas, but if these n-grams are mere noise, not even the Keats heuristic can disguise them as meaningful signals. Our focus is on relational n-grams, of a kind that suggests deep tacit relationships between two concepts. These n-grams obey a pattern in which two content words X and Y - adjectives or nouns - are joined by a linking phrase, such as a verb, a preposition, a coordinator, etc. To determine the quality of these n-grams, and to assess the likelihood of extracting genuine relational insights from them, we use this large subset of the Google n-grams as a corpus for estimating the relational similarity of the 353 word pairs in the Finkelstein et al. (2002) WordSim-353 data set. We estimate the relatedness of two words X and Y as the PMI (pointwise mutual information) score of X and Y, using the relational n-grams as a corpus for the occurrence and co-occurrence frequencies of X and Y. A correlation of 0.61 is observed between these PMI scores and the human ratings reported by Finkelstein et al. (2002). Though this is not the highest score achieved for this task, it is considerably higher than any that has been reported for approaches that use WordNet alone.
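The PMI estimate used here can be sketched as follows; the counting interface over the relational n-gram corpus is an assumption of ours for illustration, not the paper's code.

import math

def pmi(x, y, unigram_counts, cooccur_counts, total):
    """Pointwise mutual information of x and y, with occurrence and
    co-occurrence frequencies taken from a relational n-gram corpus."""
    p_x = unigram_counts[x] / total
    p_y = unigram_counts[y] / total
    p_xy = cooccur_counts.get((x, y), 0) / total
    return math.log(p_xy / (p_x * p_y), 2) if p_xy > 0 else float("-inf")

# e.g. pmi("tiger", "jaguar", unigrams, pair_counts, corpus_size) would then be
# correlated against the WordSim-353 human ratings.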
The point here is that this relational subset of the Google n-grams offers a reasonably faithful mirror of human intuitions for purposes of recognizing the relatedness of different ideas. We thus believe these n-grams to be a valid source of real insights. The final arbiters of Stereotrope's poetic insights are the humans who use the system. We offer the various services of Stereotrope as a public web service, via this URL: http://boundinanutshell.com/metaphor-magnet. We hope these services will also allow other researchers to reuse and extend Stereotrope's approaches to metaphor, blending and poetry. Thus, for instance, poetry generators such as that described in Colton et al. (2012) - which creates topical poems from fragments of newspapers and tweets - can use Stereotrope's rich inventories of similes, poetic compounds, feelings and allusions in its poetry.

Summary and Conclusions

Poets use the Keats heuristic to distil an amorphous space of feelings and ideas into a concise and memorable form. Poetry thus serves as an ideal tool for summarizing and visualizing the large space of possibilities that is explored whenever we view a familiar topic from a new perspective. In this paper we have modelled poetry as both a product and an expressive tool, one that harnesses the processes of knowledge acquisition (via web similes and questions), ideation (via metaphor and insight generation), emotion (via a mapping of properties to feelings), integration (via conceptual blending) and rendering (via tropes that map properties and feelings to poetic forms). Each of these processes has been made publicly available as part of a comprehensive web service called Metaphor Magnet. We want our automated poets to be able to formulate real meanings that are worthy of poetic expression, but we also want them to evoke much more than they actually say. The pragmatic import of a creative formulation will always be larger than the system's ability to model it accurately. Yet the human reader has always been an essential part of the poetic process, one that should not be downplayed or overlooked in our desire to produce computational poets that fully understand their own outputs. So for now, though there is much scope, and indeed need, for improvement, it is enough to know that an automated poem is anchored in real meanings and intentional metaphors, and to leave certain aspects of creative interpretation to the audience.

Acknowledgements

This research was supported by the WCU (World Class University) program under the National Research Foundation of Korea (Ministry of Education, Science and Technology of Korea, Project no. R31-30007).

2013_23 !2013 Harnessing Constraint Programming for Poetry Composition

Jukka M. Toivanen, Matti Järvisalo, and Hannu Toivonen
HIIT and Department of Computer Science, University of Helsinki, Finland

Abstract

Constraints are a major factor shaping the conceptual space of many areas of creativity. We propose to use constraint programming techniques and off-the-shelf constraint solvers in the creative task of poetry writing. We show how many aspects essential in different poetical forms, and partially even at the level of language syntax and semantics, can be represented as interacting constraints. The proposed architecture has two main components. One takes input or inspiration from the user or the environment, and based on it generates a specification of the space and aesthetic of a poem as a set of declarative constraints.
The other component explores the specified space using a constraint solver. We provide an elementary set of constraints for composition of poetry, we illustrate their use, and we provide examples of poems generated with different sets of constraints.

Introduction

Rules and constraints can be seen as an essential ingredient of creativity. First, there typically are strong constraints on the creative artefacts. For instance, consider traditional western music. In order for a composition to be recognized as (western) music in the first place, it must meet a number of requirements concerning, e.g., timbre, scale, melody, harmony, and rhythm. For any specific genre of western music, the constraints usually become much tighter. Similarly, the composition of many types of poetry is governed by numerous rules specifying such things as strict stress and syllable patterns, rhyming and alliteration structures, and selection of words with certain associations — in addition to the basic constraints of syntax and semantics that are needed to make the expressions understandable and meaningful. However, constraints are not just a nuisance that creative agents need to cope with in order to produce plausible results. On the contrary, constraints are often considered to be an essential source of creativity for humans. For instance, composer Igor Stravinsky associated constraints with creating freedom, not containment: "The more constraints one imposes, the more one frees one's self of the chains that shackle the spirit." (Stravinsky 1947) Constraints can also be used as computational tools for studies of creativity or creative artefacts. Artificial intelligence researcher Marvin Minsky suggested that a good way to learn about how music "worked" was to represent musical compositions as interacting constraints, then modify these constraints and study their effects on the musical structures (Roads 1980). This essential idea has since been explored extensively in the field of computer music research. Our domain of interest in this paper is composition of poetry. We envision a computational environment where formally expressed constraints and constraint programming methods are used to (1) specify a conceptual search space, (2) define an aesthetic of concepts in the space, and (3) explore the space to find the most aesthetic concepts in it. Any given set of (hard) constraints on poems specifies a space of possible poems. For instance, the number of lines and the number of syllables per line could be such constraints, contributing to the style of poetry. Soft constraints, in turn, can be used to indicate (aesthetical) preferences over poems and to rank poems that match the hard constraints. For instance, rhyme could be a soft constraint, giving preference to poems that follow a given rhyme structure but not absolutely requiring it. In this paper we study and illustrate the power of constraint programming for creating poems. In our current setup, the creative system consists of two subcomponents. One takes input from the user or from some other source of inspiration, and based on it specifies the space and poetical aesthetic (as a set of constraints). The other subcomponent explores the specified space using the aesthetic, i.e., produces optimally aesthetic poems in the space (using a constraint solver). We show how poems can be generated by applying different kinds of constraints and constraint combinations using an off-the-shelf constraint programming tool.
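The hard/soft distinction can be illustrated with a tiny generate, filter, and rank sketch in Python; this is our own toy illustration of the idea, not the paper's ASP machinery, and all names and constraints here are invented for the example.

import itertools

# Candidate words per position of a one-line, three-word "poem".
candidates = [["moon", "sun"], ["sails", "burns"], ["slow", "bright"]]

def hard_ok(poem):
    # Hard constraint: no word may be repeated within the line.
    return len(set(poem)) == len(poem)

def soft_score(poem):
    # Soft constraint: prefer alliteration between the first two words.
    return 1 if poem[0][0] == poem[1][0] else 0

# Hard constraints carve out the space; soft constraints rank within it.
space = [p for p in itertools.product(*candidates) if hard_ok(p)]
best = max(space, key=soft_score)
print(best)  # a line starting "sun sails ..." wins via alliteration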
The elegance of this approach is that it is not based on specifying a step sequence to produce a certain kind of poem, but rather on declaring the properties of a solution to be found using mathematical constraints. An empirical evaluation of the obtained poetry is left for future work. We next briefly review some related work on constraint programming in creative applications, and on poetry generation. Then we provide a description of a constraint model for composing poems, illustrating the ideas with examples. We discuss the results and conclude by outlining future work.

Related Work

Constraint-based methods have been applied in various fields such as configuration and verification, planning, and evolution of language, to name a few. In the area of computational creativity, constraints have been used mostly to describe the composition of various aspects of music. For example, Boenn et al. (2011) have developed an extensive music composition system called Anton which uses Answer Set Programming to represent the musical knowledge and the rules of the system. Anton describes a model of musical composition as a collection of interacting constraints. The system can be used to compose short pieces of music as well as to assist the composer by making suggestions, completions, and verifications to aid in the music composition process. On the other hand, composition of poetry with constraint programming techniques has received little if any attention. Several different approaches have been used (Manurung, Ritchie, and Thompson 2000; Gervás 2001; Manurung 2003; Díaz-Agudo, Gervás, and González-Calero 2002; Wong and Chun 2008; Netzer et al. 2009; Colton, Goodwin, and Veale 2012; Toivanen et al. 2012), many involving constraints in one form or another, but we are not aware of any other work systematically based on constraints and implemented using a constraint solver. The system developed by Manurung (2003) uses a grammar-driven formulation to generate metrically constrained poetry out of a given topic. This approach performs stochastic hill-climbing search within an explicit state-space, moving from one solution to another. The explicit representation is based on a hand-crafted transition system. In contrast, we employ constraint-programming methodology based on searching for optimal solutions over an implicit representation of the conceptual space. Our approach should scale better to large numbers of constraints and a large input vocabulary than explicit state-space search. The ASPERA poetry composition system (Gervás 2001), on the other hand, uses a case-based reasoning approach. This system generates poetry out of a given input text via composition of poetic fragments retrieved from a case-base of existing poetry. These fragments are then combined together by using additional metrical rules. The Full-FACE poetry generation system (Colton, Goodwin, and Veale 2012) uses a corpus-based approach to generate poetry according to given constraints on, for instance, meter and stress. The system is also argued to invent its own aesthetics and framings of its work. In contrast to our system, this approach uses constraints to shape only some aspects of the poetry composition procedure whereas our approach is fully based on expressing various aspects of poetry as mutually interacting constraints and using a constraint solver to efficiently search for solutions. The approach of this paper extends and complements our previous work (Toivanen et al. 2012).
We proposed a method where a template is extracted randomly from a given corpus, and words in the template are substituted by words related to a given topic. Here we show how such basic functionality can be expressed with constraints, and more interestingly, how constraint programming can be used to add control for rhyme, meter, and other effects. Simpler poetry generation methods have been proposed, as well. In particular, Markov chains have been widely used to compose poetry. They provide a clear and simple way to model some syntactic and semantic characteristics of language (Langkilde and Knight 1998). However, the resulting poetry tends to have rather poor sentence and poem structures due to only local syntax and semantics.

Overview

The proposed poetry composition system has two subcomponents: a conceptual space specifier and a conceptual space explorer. The former determines what poems can be like and what kind of poems are preferred, while the latter assumes the task of producing such poems. The modularity and the explicit specification of the conceptual search space have great potential benefits. Modularity allows one to (partially) separate the content and form of poetry from the computation needed to produce matching poetry. An explicit, declarative specification, in turn, gives the creative system a potential to introspect and modify its own goals and intermediate results (a topic to which we will return in the conclusion). A high-level view of the internal structure of the poetry composition system considered in this work is shown in Figure 1. In this paper, our focus is on the explorer component and on the interface between the components. Our specifier component is built on the principles of Toivanen et al. (2012), but ideas from many other poetry generation systems (Gervás 2001; Manurung 2003; Colton, Goodwin, and Veale 2012) could be used in the specifier component as well. The assumption in the model presented here is that the specifier can generate a large number of mutually dependent choices of words for different positions in the poem, as well as dependencies between them. The specifier uses input from the user and potentially other sources as its inspiration and parameters and automatically generates the input for the explorer component, shielding the user from the details of constraint programming. The automatically generated "data" or "facts" are conveyed to the explorer component that consists of a constraint solver and a static library of constraints. The library is provided by the system designers, i.e., by us, and any constraints that the specifier component wishes to use are triggered by the data it generates. The user of the system does not need to interact directly with the constraint library (but the specifier component may offer the user options for choosing which constraints to use).

Figure 1: Overview of the poetry composition workflow. The user provides some inspiration and parameters, based on which the space specifier component generates a set of constraints, used as "data" by the constraint solver in the explorer component. The explorer component additionally contains a static library of constraints that are dynamically triggered by the data. The explorer component then outputs a poem that best fulfills the wishes of the user.

Our focus in this paper is on the explorer component, and on the constraint specifications that it receives from the specifier component or from the static library:
• The number of lines, and the number of words on each line (we call this the skeleton of the poem).
• For each word position in the skeleton, a list of words that potentially can be used in the position (collectively called the candidates).
• Possible additional requirements on the desired form of the poem (e.g., rhyming structure).
• Possible additional requirements on the syntax and contents of the poem (e.g., interdependencies between words to make valid expressions).
We will next describe these in more detail.

Poetry Composition via Answer Set Programming

The explorer component takes as input specifications dynamically generated by the specifier, affecting both the search space and the aesthetic. In addition, it uses a static constraint library. Together, the dynamic specifications and the constraint library form a constraint satisfaction problem (or, by extension, an optimization problem; see the end of the section). The constraint satisfaction problem is built so that the solutions to the problem are in one-to-one correspondence with the poems that satisfy the requirements imposed by the specifier component of the system (as potentially instructed by the user). Highly optimized off-the-shelf constraint satisfaction solvers can then be used to find the solutions, i.e., to produce poems. In this work, we employ answer set programming (ASP) (Gelfond and Lifschitz 1988; Niemelä 1999; Simons, Niemelä, and Soininen 2002) as the constraint programming paradigm, since ASP allows for expressing the poem construction task in an intuitively appealing way. At the same time, state-of-the-art ASP solvers, such as Clasp (Gebser, Kaufmann, and Schaub 2012), provide an efficient way of finding solutions to the poem construction task. Furthermore, ASP offers in-built support for constraint optimization, which allows for searching for a poem of high quality with respect to different imposed quality measures. We will not provide formal details on answer set programming and its underlying semantics; the interested reader is referred to other sources (Gelfond and Lifschitz 1988; Niemelä 1999; Simons, Niemelä, and Soininen 2002) for a detailed account. Instead, we will in the following provide a step-by-step intuitive explanation on how the task of poetry generation can be expressed in the language of ASP. For more hands-on examples on how to express different computational problems in ASP, we refer the interested reader to Gebser et al. (2008). Answer set programming can be viewed as a data-centric constraint satisfaction paradigm, in which the input data, represented via predicates, expresses the problem instance. In our case, this dynamically generated data will express, for example, basic information on the poem skeleton (such as length of lines), and the candidate words within the input vocabulary that can be used in different positions within the poem. The actual computational problem (in our case poetry generation) is expressed via rule-based constraints which are used for inferring additional knowledge based on the input data, as well as for imposing constraints over the solutions of interest. The rule-based constraints constitute the static constraint library: once written, they can be reused in any instances of poem generators just by generating data that activates the constraints. Elementary constraints are an integral part of the system — comparable to program code. More rule-based constraints can be added by the specifier component if needed. The end-user does not need to write any constraints.
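To make this data-centric division concrete, here is a short Python sketch (ours, not the authors' implementation) of how a specifier might emit the dynamically generated facts; the fact format follows the rows/positions/candidate predicates of the basic model described next.

def emit_facts(skeleton, candidates):
    """Render a poem skeleton and candidate words as ASP facts.

    skeleton:   list of ints, words per row, e.g. [6, 8, 8, 5, 6, 6]
    candidates: dict mapping (row, position) -> list of (word, syllables)
    """
    facts = [f"rows({len(skeleton)})."]
    for row, n_words in enumerate(skeleton, start=1):
        facts.append(f"positions({row},{n_words}).")
    for (row, pos), words in candidates.items():
        for word, syllables in words:
            facts.append(f'candidate("{word}",{row},{pos},{syllables}).')
    return "\n".join(facts)

print(emit_facts([6, 8], {(1, 2): [("melt", 1), ("kidnap", 2)]}))

The resulting data file, together with the static constraint library, can then be passed to an ASP grounder and solver such as clingo to enumerate the answer sets, i.e., the poems.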
The Basic Model

We next describe a constraint library, starting with elementary constraints. We also illustrate dynamically generated specifications. While these are already sufficient to generate poetry comparable to that of Toivanen et al. (2012), we remind the reader that these constraints are examples illustrating the flexibility of constraint programming in computational poetry composition, and different sets of constraints can be used for different effects. We will first give a two-line basic model of the constraint library that takes the skeleton and candidates as input. This model simply states that exactly one of the given candidate words must be selected for each word position of the poem.

Table 1: The predicates used in the basic ASP model
rows(X): the poem has X rows
positions(X,Y): the poem contains Y words on row X
candidate(W,I,J,S): the word W, containing S syllables, is a candidate for the Jth word of the Ith line
word(W,I,J,S): the word W, containing S syllables, is at position J on row I in the generated poem

% Generator part
{ word(W,I,J,S) } :- candidate(W,I,J,S).  (G1)
% Testing part: the constraints
:- not 1 { word(W,I,J,S) } 1, rows(X), I = 1..X, positions(I,Y), J = 1..Y.  (T1)

Figure 2: Answer set program for generating poetry: the basic model

Predicates The predicates used in the basic answer set program are listed in Table 1, together with their intuitive interpretations. The input predicates rows/1 and positions/2 characterize the number of rows and the number of words allowed on the individual rows of the generated poems. The input predicate candidate/4 represents the input vocabulary, i.e., the words that may be considered as candidates for words at specific positions. The output predicate word/4 represents the solutions to the answer set program, i.e., the individual words and their positions in the generated poem.

Example. The following is an example of the basic structure of a data file representing a possible input to the basic ASP model:
rows(6).
positions(1,6). positions(2,8). positions(3,8).
positions(4,5). positions(5,6). positions(6,6).
candidate("I",1,1,1). candidate("melt",1,2,1).
candidate("weed",1,2,1). candidate("teem",1,2,1).
candidate("kidnap",1,2,2). candidate("perspire",1,2,2).
candidate("shut",1,2,1). candidate("eclipse",1,2,1).
candidate("sea",1,2,1). candidate("plan",1,2,1).
candidate("hang",1,2,1). candidate("police",1,2,2).
candidate("revamp",1,2,2). candidate("flip",1,2,1).
candidate("wring",1,2,1). candidate("sting",2,2,2).
...

Rules The answer set program that serves as our basic model for generating poetry is shown in Figure 2. The program can be viewed in two parts: the generator part (Rule G1) and the testing part (Rule T1). The testing part consists of rule-based constraints that filter out poems that do not satisfy the conditions for acceptable poems characterized by the program. In the generator part, Rule G1 states that each candidate word for a specific position of the poem may be considered to be chosen as the word at that position in the generated poem (expressed using the so-called choice construct { word(W,I,J,S) }). In the testing part, Rule T1 imposes the most fundamental constraint that exactly one candidate word should be chosen for each word position in the poem: the empty left-hand side of the rule is interpreted as falsum, a contradiction.
The rule then states that, for each row and each position on the row, it is a contradiction if it is not the case that exactly one word is chosen for that position (expressed as the cardinality construct 1 { word(W,I,J,S) } 1).

Example. Given the data presented above, these basic rules are now grounded as follows. There are six lines in the poem as described by the rows predicate, and each of these lines has a certain number of positions to be filled with words as described by the positions predicate. The candidate predicates specify which words are suitable choices for these positions. During grounding the solver tries to find a suitable candidate for each position, which is trivial in the basic model that lacks any constraints between the words. We consider more interesting models next.

Controlling the Form of Poems

We will now describe examples of how the form of the poems being generated can be further controlled in a modular fashion by introducing additional predicates, and rules over these predicates, to the basic ASP model. The additional predicates introduced for these examples are listed in Table 2. Using these predicates, rules that refine the basic model are shown in Figure 3 (Rules G2, G3, and T2-T4).

Table 2: Predicates used in extending the basic ASP model
must_rhyme(I,J,K,L): the word at position J on row I and the word at position L on row K are required to rhyme
rhymes(X,Y): the words X and Y rhyme
nof_syllables(I,C): the Ith row of the poem is required to contain C syllables
min_occ(W,L): L is the lower bound on the number of occurrences of the word W
max_occ(W,U): U is the upper bound on the number of occurrences of the word W

% Generator part
{ word(W,I,J,S) } :- candidate(W,I,J,S).  (G1)
rhymes(Y,X) :- rhymes(X,Y).  (G2)
syllables(W,S) :- candidate(W,_,_,S).  (G3)
% Testing part: the constraints
:- not 1 { word(W,I,J,S) } 1, rows(X), I = 1..X, positions(I,Y), J = 1..Y.  (T1)
:- word(W,I,J,S), word(V,K,L,Q), must_rhyme(I,J,K,L), not rhymes(W,V).  (T2)
:- Sum = #sum [ word(W,I,J,S) = S ], Sum != C, nof_syllables(I,C), I = 1..X, rows(X).  (T3)
:- not L { word(W,_,_,_) } U, min_occ(W,L), max_occ(W,U).  (T4)

Figure 3: Answer set program for generating poetry: extending the basic model

Rhyming The predicate must_rhyme/4 is used for providing pairwise word positions that should rhyme. Knowledge on the pairwise relations of the candidate words, namely, which pairs of candidate words rhyme, is provided via the rhymes/2 predicate. Rule G2 enforces that rhyming of two words is a symmetric relation. In the testing part, Rule T2 imposes the constraint that, in case two words chosen for specific positions in a poem must rhyme, but the two chosen words do not rhyme, a contradiction is reached.

Numbers of Syllables The basic model can also be extended to generate poetical structures with more specific constraints. As an example, one can consider forms of poetry that have strict constraints on the numbers of syllables in every line, such as haikus, tankas, and sonnets. We use the additional predicate nof_syllables/2 for providing as input the required number of syllables on the individual rows. At the same time, Rule G3 projects the information on the number of syllables of each candidate word to the syllables/2 predicate.
Rule T3 can then be used to ensure that the number of syllables on each row (line) of the poem (computed and stored in the Sum variable using the counting construct Sum = #sum [ word(W,I,J,S) = S ]) matches the number of syllables specified for the row by the nof_syllables/2 predicate.

Word Occurrences The simple model above does not control possible repetitions of words at all. Such control can be easily added by introducing input predicates min_occ(W,L) and max_occ(W,U), which are then used to state for each word W the minimum L (respectively, maximum U) number of occurrences allowed for the word. Using these additional predicates, Rule T4 then constrains the number of occurrences to be within these lower and upper bounds (expressed by the cardinality constraint L { word(W,_,_,_) } U).

Further Possibilities for Controlling Form The possibilities of controlling poetical forms are of course not limited to simple requirements for fulfilment of certain syllable structures or rules for rhyming and alliteration. Besides strict constraints on the number of syllables in a verse, classical forms of poetry usually obey a specific stress pattern as well. Stress can be handled with constraints similar to the ones governing syllables. Metric feet like iamb, anapest, and trochee can be used by specifying constraints that describe positions where the syllable stress must lie in every line of verse. Controlling poetical form also provides interesting possibilities for using constraint optimization techniques (to be described below). As an example, consider different forms of lipograms, i.e., poems that avoid a particular letter like e, or univocal poems where the set of possible vowels in the poem is restricted to only one vowel. Similarly, more complex optimisations of the speech sound structure can be handled depending on whether the desired poetry is required to have a soft or brutal sound, or to have characteristics of a tongue-twister.

Controlling the Contents and Syntax of Poems While the example constraints presented above focus on controlling the form of poems, linguistic knowledge of phonology, morphology, and syntax (as examples) can similarly be incorporated by introducing additional constraints in a modular fashion. This includes rules of syntax that specify how linguistic elements are sequenced to form valid statements, and rules of semantics which specify how valid references are made to concepts. Consider, for example, transitive and intransitive verbs, i.e., verbs that either require or do not require an object to be present in the same sentence. Here one can impose additional constraints for declaring which words can or cannot be used in the same sentence where a transitive verb requiring a certain preposition and an object has been used. Similarly, other constraints not directly related to poetical forms but rather to linguistic structures, like idioms where several words are always bundled together, can be effectively declared as constraints.
The same holds for syntactic aspects such as rules governing the constituent structure of sentences (Lierler and Schüller 2012). As a simple, more concrete example, consider the following. In order to declare that the poems of interest start with the word "I", the fact word("I",1,1,1). can be added to the constraint model. In order to ensure that all verbs associated with the first person should be in past tense, the additional predicate in_past_tense/1 can be introduced, and specified for each past-tense verb in the data. Combining the above, one can as an example declare that the word following any "I" is in past tense, using the following two rules.
:- word("I",I,J,1), word(W,I,J+1,_), not in_past_tense(W).
:- word("I",I,J,1), positions(I,J), word(W,I+1,1,_), not in_past_tense(W).
Here the first rule handles the case that the occurrence of "I" is not the last word on a row. The second rule handles the case that "I" is the last word on a row, in which case the first word on the following row should be in past tense. More generally, one can pose constraints that ensure that two (or more) words within a poem are compatible (in some specified sense), even if the words are not next to each other. For an example, consider the additional predicates pronoun/1 and verb/1 that hold for words that are pronouns and verbs, respectively, and the predicate person/2 that specifies the grammatical person, expressed as an integer value, of a given word: person(W,P) is true if and only if the word W has person P. Using these predicates, one can enforce that, for the first verb following any pronoun (not necessarily immediately after the pronoun), the pronoun and the verb have to have the same person. For instance, after the pronoun "she" the first following verb has to be in the third person singular form. This can be expressed as the following rule:
:- word(W,I,J,_), pronoun(W), person(W,P), word(V,I,M,_), verb(V), person(V,Q), M > J, P != Q, 0 { word(U,I,L,_) : verb(U) : L > J : L < M } 0.
Similarly, by specifying the additional predicate verb/1 for each verb in the input data, one can require that the whole poem should be in past tense:
:- word(W,_,_,_), verb(W), not in_past_tense(W).

Specifying an Aesthetic via Optimization

Up to now, we have only considered hard constraints, and did not address how to assess the aesthetics of generated poems, or how to generate poems that are maximally aesthetic by some measures. In the constraint programming framework, an aesthetic can be specified using soft constraints. The constraint solver then attempts to look for poems which maximally satisfy the soft constraints. In ASP, this is achieved by using optimization statements offered by the language. As concrete examples, we will now explain how Rules T2-T4 can be turned into soft constraints. The soft variants, Rules T2'-T4', are shown in Fig. 4, together with the associated optimization statements O2-O4.

failed_rhyme(I,J,K,L) :- word(W,I,J,S), word(V,K,L,Q), must_rhyme(I,J,K,L), not rhymes(W,V).  (T2')
failed_syllable_count(I) :- Sum = #sum [ word(W,I,J,S) = S ], Sum != C, nof_syllables(I,C), I = 1..X, rows(X).  (T3')
failed_occount(W) :- not L { word(W,_,_,_) } U, min_occ(W,L), max_occ(W,U).  (T4')
#minimize [ failed_rhyme(I,J,K,L) @3 ].  (O2)
#minimize [ failed_syllable_count(I) : I = 1..X : rows(X) @2 ].  (O3)
#minimize [ failed_occount(W) @1 ].  (O4)

Figure 4: Handling inconsistencies by relaxing the constraints and introducing optimization criteria

Taking Rule T3 as an example, the idea is to introduce a new predicate failed_syllable_count/1 with the following interpretation: predicate failed_syllable_count(I) is true for row I if and only if the number of syllables on the row was not made to match the required number. In contrast to Rule T3, which rules out all solutions of the model immediately in such a case, Rule T3' simply results in assigning failed_syllable_count(I) to true. Thus the predicate failed_syllable_count/1 acts as an indicator of failing to have the required number of syllables on a specific row. The optimization statement associated with Rule T3' is Rule O3.
This minimize statement declares that the number of rows I for which failed_syllable_count(I) is assigned to true should be minimized, or equivalently, that the numbers of syllables should conform to the required numbers of syllables for as many rows as possible. The optimization variants T2' and T4' and the associated optimization statements follow a similar scheme. When multiple such optimization statements are introduced to the model, the relative importance of the statements is declared using the @i priority attached to each of the optimization statements. In the example of Figure 4, the primary objective is to minimize the number of rhyming failures (specified using @3). The secondary objective is then to find, among the set of poems that minimize this primary objective, a poem that has a minimal number of lines with a wrong number of syllables (using @2), and so forth.

Examples

We will now illustrate the results and effects of some combinations of constraints. In the data generation phase (the specifier component) we use the methodology by Toivanen et al. (2012), including the Stanford POS-tagger and the morpha & morphg inflectional morphological analysis and generation tools (Toutanova et al. 2003; Minnen, Carroll, and Pearce 2001). The poem templates are extracted automatically from a corpus of human-written poetry. The only input by the user is a topic for the poem, and some other parameters as described below. As a test case for our current system we study how the approach manages to produce different types of quatrains. A quatrain is a unit of four lines of poetry; it may either stand alone or be used as a single stanza within a larger poem. The quatrain is the most common type of stanza found in traditional English poetry, and as such is fertile ground on which to test theories of the rules governing poetry patterns. The specifier component randomly picks a quatrain from a given corpus of existing poetry. It then automatically analyses its structure, to generate a skeleton for a new poem. The following poem skeleton is marked with the required part-of-speech for every word position (PR = pronoun, VB = verb, PR_PS = possessive pronoun, ADJ = adjective, N_SG = singular noun, N_PL = plural noun, C = conjunction, ADV = adverb, DT = determiner, PRE = preposition):

N_SG VB, N_SG VB, N_SG VB!
PR_PS ADJ N_PL ADJ PRE PR_PS N_SG: -
C ADV, ADV ADV DT N_SG PR VB!
DT N_SG PRE DT N_PL PRE N_SG!

The specifier component then generates a list of candidate words for each position. If we give "music" as the topic of the poem, the specifier specifically uses words related to music as candidates, where possible (Toivanen et al. 2012). A large number of poems are possible, in the absence of other constraints, and the constraint solver in the explorer component outputs this one (or any number of alternative ones, if required):

Music swells, accent practises, traditionalism marches!
Her devote narrations bent in her improvisation: -
And then, vivaciously directly a universe she ventilates!
An anthem in the seasons of radio!

This example does not yet have any specific requirements for the prosodical form. Traditional poetry often has its prosodic structure advertised by one or more of several poetic devices, with rhyming and alliteration being the best-known of these. Let the specifier component hence generate the additional constraints that the first and the third line must rhyme, as well as the second and fourth line.
As a result of this more constrained specification we now get a very similar poem, but with some words changed to rhyme.

Music swells, accent practises, traditionalism hears!
Her devote narrations bent in her chord: -
And then, vivaciously directly a universe she disappears!
An anthem in the seasons of record!

Addition of this simple constraint adds rhyme to the poem, which in turn draws attention to the prosodic structure of the poem. Use of prosodic techniques to advertise the poetical nature of a given text can also enhance coherence of the poetry as the elements are linked together more tightly. For example, a rhyme scheme of ABAB would give the listener a strong sense that the first and third as well as the second and fourth lines belong together as a group, heightening the saliency of the alternating structure that may be present in the content, as well. The constraint on rhyming reflects the intuition that rhyme works by creating expectation and satisfaction of that expectation. Upon hearing one line of verse, the listener expects to hear another line that rhymes with it. Once the second rhyme is heard, the expectation is fulfilled, and a sense of closure is achieved. Similarly, adding constraints that specify a more sophisticated prosodic structure or content-related aspects may lead to improved quality of the generated poetry. Let us conclude this section with an example of an aesthetic, an optimization task concerning the prosodic structure of poetry. Consider composition of lipograms, i.e., poems avoiding a particular letter. (Also univocalisms or more complex optimizations of the occurrence of certain speech sounds can be composed in a similar fashion.) The following poem is an example of a lipogram that avoids the letter o. As a result, all words that contained that letter in the previous example are changed to match the strengthened constraints:

Music swells, accent practises, theatre hears!
Her delighted epiphanies bent in her universe: -
And then, singing directly a universe she disappears!
An anthem in the judgements after verse!

Empirical results of Toivanen et al. (2012) indicate that in Finnish, already the basic mechanism produces poems of surprisingly high quality. The sequence of poems above illustrates how their quality can be substantially improved by the relatively simple addition of new, declarative constraints.

Discussion and Conclusions

We have proposed harnessing constraint programming for composing poetry automatically and flexibly in different styles and forms. We believe constraint programming also has high potential for describing other creative phenomena. A key benefit is the declarativity of this approach: the conceptual space is explicitly specified, and so is the aesthetic, and both are decoupled from the algorithm for exploring the search space (an off-the-shelf constraint solver). Due to its modular nature, the presented approach can be an effective building block of more sophisticated poetry generation systems. An interesting next step for this work is to build an interactive poetry composition system which makes use of constraint programming in an iterative way. In this approach the constraint model is refined and re-solved based on user feedback. This can be seen as an iterative abstraction-refinement process, in which the first abstraction specifies a very large search space that is iteratively pruned by refining the constraint model with more intricate rules that focus search on the most interesting parts of the conceptual space.
Another promising research direction is to consider a self-reflective creative system. Since the search space and aesthetic are expressed in an explicit manner as constraints, they can also be observed and manipulated. We can envision a creative system that controls its own constraints. For instance, after observing that a large number of good results is obtained with the current constraints, it may decide to add new constraints to manipulate its own internal objectives. Modification of the set of constraints may lead to different conceptual spaces and eventually to transformational creativity (Boden 1992). Development of metaheuristics and learning mechanisms that enable such self-supported behavior is a great challenge indeed.

Acknowledgements

This work has been supported by the Academy of Finland under grants 118653 (JT, HT), and 132812 and 251170 (MJ).

2013_24 !2013 Slant: A Blackboard System to Generate Plot, Figuration, and Narrative Discourse Aspects of Stories

Nick Montfort, The Trope Tank, MIT, 77 Mass Ave, 14N-233, Cambridge, MA 02139 USA, nickm@nickm.com
Rafael Pérez y Pérez, División de Ciencias de la Comunicación y Diseño, Universidad Autónoma Metropolitana, Cuajimalpa, México D. F., rperez@correo.cua.uam.mx
D. Fox Harrell, Imagination, Computation, & Expression Laboratory, MIT, 77 Mass Ave, 14N-207, Cambridge, MA 02139 USA, fox.harrell@mit.edu
Andrew Campana, Department of East Asian Languages & Civilizations, Harvard University, Cambridge, MA 02138 USA, campana@fas.harvard.edu

Abstract

We introduce Slant, a system that integrates more than a decade of research into computational creativity, and specifically story generation, by connecting subsystems that deal with plot, figuration, and the narrative discourse using a blackboard. The process of integrating these systems highlights differences in the representation of story and has led to a better understanding of how story can be usefully abstracted. The plot generator MEXICA and a component of Curveship are used with little modification in Slant, while the figuration subsystem Fig-S and the template generator GRIOT-Gen, inspired by GRIOT, are also components. The development of the new subsystem Verso, which deals with genre, shows how different genres can be computationally modeled and applied to in-development stories to generate results that are surprising in terms of their connections and valuable in terms of their relationship to cultural questions. Example stories are discussed, as is the potential of the system to allow for broader collaboration, the empirical testing of how subsystems interrelate, and possible contributions in literary and artistic contexts.

Introduction

Slant is a system for creative story generation that integrates different types of expertise and creativity; the framework it provides also means that other systems, implementing other approaches to story generation, can be integrated into it in the future. The development of Slant has involved formalizing, reworking, and testing ideas about creative storytelling and what is important to writing stories—specifically, the poetics of figuration, the poetics of plot development, and the poetics of narrating. The system incorporates a new perspective on genre and integrates components from three existing systems: D. Fox Harrell's GRIOT, Rafael Pérez y Pérez's MEXICA, and Nick Montfort's Curveship.
Story generation systems have not yet used an architecture of this sort to encapsulate different expertise and different aspects of creativity; nor have they incorporated major components that are based on existing, proven systems by different researchers. Slant is a blackboard system in which different subsystems, each of them informed by and modeling humanistic theories, collaborate, working incrementally to fully specify a story. An alternative, simpler process involves making decisions in a "pipeline," in which one system offers, for instance, a plot and another system determines how the narrative discourse will be arranged. Although this approach seems to be a poor model of human creativity, it is a reasonable first step toward a "blackboard" system. Two of the Slant collaborators previously developed such a pipelined system with two stages (Montfort and Pérez y Pérez 2008). The current project involves five major subsystems rather than two and uses a blackboard architecture, allowing any of the subsystems that work during the main phase of generation to augment the story representation at any point. The generation of stories in Slant begins with minimal, partial proposals from a simple unit, the Seeder. In turn, the subsystems MEXICA, Verso, and Fig-S read and add to this set of proposals, each according to its focus. When the proposals are complete, the finished story specification is sent to GRIOT-Gen so conceptual blending can be applied to the relevant templates and then to the three-stage pipelined text generation component of Curveship. Curveship-Gen realizes a finished story in the form of a text file that can be read and considered by human readers. This paper introduces the architecture of the system and describes the subsystems that build and realize stories together. It includes a discussion of what was learned by integrating three different lines of research on story generation. Reflections are also offered on the first set of stories produced by the system, and some discussion of the potential of the system is included as well. Slant will undergo more refinement and development, but the work that has been done so far is of relevance to those working to implement large-scale computational creativity systems that integrate heterogeneous subsystems, to those developing representations of story and other creative representations, and to those working specifically in story generation.

Creativity and the Architecture of Slant

Boden holds that creativity involves the production of new, surprising, and valuable results (Boden 2004). In the case of story generation and other literary endeavors, being new involves not repeating what has been done before (by the system or in the wider culture); surprise often manifests itself in unusual juxtapositions that are effective, though one would not have guessed it; and value, rather than indicating that the story is of didactic or economic value, means that a story accomplishes some imaginative or poetic purpose—it connects in some way to cultural or psychological issues or questions and allows the reader to think about them in new ways. Stories that surprise readers by bringing unusual elements together and which provide for this sort of reflection, but which do so in the same way as existing stories, are not new. Stories that are innovative and could allow for reflection, but which do not involve unusual juxtapositions or connections, are not surprising.
Stories that are fresh and involve unusual combinations of elements, but do not ultimately seem to have a point of any sort, are not of value. Taking value to indicate relevance within culture means that the value of a story is similar to what has been called, with regard to conversational stories of the sort that are uttered all the time by people, its "point" (Polanyi 1989). While the point of a story is understood in the context of a specific conversation, the ability of a story to have a point at all can be understood within the context of culture. Valuable stories are those that have a point to at least some readers when they encounter them in some context. Beyond Boden's three components of creativity, we also consider a higher level of creativity. Namely, the various cognitive processes for conceptualization that enable people to recognize and generate new, surprising, and valuable cultural content are forms of everyday creativity. Cognitive scientist Gilles Fauconnier has referred to these processes of meaning construction as "backstage cognition" and asserts that backstage cognition includes specific phenomena such as "viewpoints and reference points, figure-ground/profile-base/landmark-trajector organization, metaphorical, analogical, and other mappings, idealized models, framing, construal, mental spaces, counterpart connections, roles, prototypes, metonymy, polysemy, conceptual blending, fictive motion, [and] force dynamics" (Fauconnier 1999). These cognitive processes are especially important to note here because the notion of creativity informing Fig-S and GRIOT-Gen is based on a model of the creative backstage cognition phenomenon of metaphorical mapping, most prominently, but also mental spaces, counterpart connections, metaphor, analogy, and metonymy in the case of the GRIOT system that inspired them. To succeed repeatedly and reliably at creativity, a storytelling system must have mechanisms relevant to each of these aspects of creativity. It must have some model of what has happened before to prompt novelty, somehow provide for stories that join aspects together in unusual and effective ways, and somehow provide for stories that relate to culture and have a point. The means of accomplishing these aspects of creativity do not have to be abstracted into separate components of a system, but they do need to be somehow realized by a creative system. A simple way that systems can connect and to some extent collaborate involves organizing them in a pipeline. This can model a regimented assembly-line process or "waterfall" model in which each subsystem participates in one phase and interfaces only with the systems before and after it. For certain processes, this may be adequate, but for the nuanced process of creativity, which involves making interesting connections, the components of a system probably need to interact in a less constrained and unidirectional manner. This was the rationale for the blackboard architecture used in Slant.

The Blackboard and Subsystems

In Slant, the three major story-building subsystems can write to and read from a blackboard representation of the story in progress. Currently, the systems function in practice much as a pipeline does, with each of the three subsystems augmenting the story representation once. The systems can influence each other "backwards" only via Verso examining the current plot and proposing a new action (not just a specification of narrative discourse, which is always proposed).
MEXICA can then incorporate that expanded plot into its next engagement/reflection (ER) cycle that it uses to elaborate the plot. Although the interactions between subsystems are not intricate at this point, the framework is in place for more elaborate blackboard interaction in future versions of Slant.

Figure 1: The architecture of Slant.

Currently, MEXICA contributes an initial, partial plot - a minimal, random one will eventually be provided at the first step by the Seeder. Then, Verso assigns a genre and a specification of the narrative discourse, and MEXICA further elaborates the plot until it is complete. Verso may specify constraints on how the story is to be developed. For instance, it may specify that a particular character, who has been designated as the narrator of the story, should not die. MEXICA will respect these in elaborating the story. Finally, Fig-S determines what figuration will be used. Eventually, another system, the Harvester, will check to see if all aspects of the story are complete, allowing the subsystems to augment the story in several different orders. After the story representation is complete, it is realized. GRIOT-Gen determines how to realize figurative representations and Curveship-Gen does content selection, microplanning, and surface realization to produce the final text. The MEXICA subsystem has the most explicit model of an aspect of creativity; it explicitly evaluates the novelty and interestingness of the component of story that it develops, the plot. Verso and Fig-S both aim to add surprise by combining conventional genres and metaphors in unusual ways. They do not currently measure how surprising their results are, but they embody techniques for choosing appropriate combinations that may be seen as creative by readers.

Foundational Systems

MEXICA. This system generates plots or frameworks for short stories about the Mexicas, the old inhabitants of what today is México City, also known as the Aztecs. MEXICA's process is based on the engagement/reflection cycle, a cognitive account of writing by Mike Sharples (Pérez y Pérez and Sharples 1999, 2001, 2004). During engagement the system focuses on generating sequences of actions driven by content and rhetorical constraints and avoids the use of explicit goals or predefined story-structures. During reflection MEXICA evaluates the novelty and interestingness of the material produced so far and verifies the coherence of the story (see also Pérez y Pérez et al. 2011). The design of the system is based on structures known as Linguistic Representations of Actions (LIRAs), which are sets of actions that any character can perform in the story and whose consequences produce some change in the storyworld context. There are two types of possible pre-conditions and post-conditions in MEXICA: emotional links between characters and dramatic tensions in the story. MEXICA is incorporated as the generator of plot. It generates plot in stages, allowing other systems to interact with the story representation as it does so. In the current system, it can be influenced by actions added to the story by Verso.

GRIOT. This is a system that is the basis for interactive and generative text and multimedia works using Harrell's Alloy algorithm for conceptual blending. These works include poetic, animated, and documentary systems that themselves produce different output each time they are run.
While GRIOT allows authors to implement narrative and poetic structures (e.g., plots), a major contribution of GRIOT is its orientation toward the dynamic generation of content resulting from modeling aspects of figurative thought that can be described formally. That is, GRIOT allows authors to fix elements such as narrative structure while varying output in terms of theme, metaphor, emotional tone, and related types of what is here called "figuration" (results of figurative thought). Rather than being based on a single knowledge base or ontology, as is the case with many classic AI systems, GRIOT creates blends between different ontologies (Harrell 2006, 2007). Indeed, a key feature of GRIOT is the ability of authors to construct subjective ontologies based in specific authorial worldviews, elements of which are then blended in a manner that maintains coherence based on several formal optimality principles inspired by a subset of those proposed by Gilles Fauconnier and Mark Turner (1999). This approach allows for novel, surprising, and valuable content to be generated that retains conceptual coherence. GRIOT, like MEXICA, has also been used to implement cultural forms of narrative that are not often privileged in computer science, in this case oral traditions of narrative from the African diaspora (Harrell 2007a). This is important because some forms of oral narrative have more in common with narratives in virtual worlds than the graphocentric (text-biased) forms of narrative privileged in most research in the field of narratology in literary studies. The implemented GRIOT system, and experience with it, have informed the development of Fig-S, a component of Slant that proposes what types of figuration, mainly metaphor, will be used in telling the story. GRIOT also inspires GRIOT-Gen, the component that generates natural language representations for figuratively enriched versions of particular actions after the story representation is completely developed (see also Goguen and Harrell 2008).

Curveship. This is an interactive fiction system that provides a world model (of characters, objects, locations, and things that happen) while also modeling the narrative discourse, so that the narration and description of the simulated world can change (Montfort 2009, 2011). Curveship can tell events out of order, using flashback and other techniques, and can tell the story from the standpoint of particular characters and their perceptions and understandings. It is based on Genette's theories (Genette 1983) and incorporates other ideas from narratology. The architecture of Curveship draws on well-established techniques for simulating an IF world, separating these from the subsystem for narrating, which includes a standard three-stage natural language generation pipeline. To make use of the system, either for interactive fiction authoring or story generation, one specifies high-level narrative aspects; the system does appropriate content selection, works out grammatical specifics, and realizes the text with, for instance, proper verb formation. Some world simulation abilities and the narrative text generation capabilities of Curveship are used directly in Slant in Curveship-Gen, the component that outputs the finished, realized story.

The Slantstory XML Format

Connecting different systems so that they can work together means establishing shared representations. For Slant, that representation is an XML format called Slantstory.
It contains all of the information that is needed in the final steps to represent each action and realize the story, meaning that it must contain sufficiently granular information about the plot, the narrative discourse, and the types of conceptual blending that are to be done. This information is not only needed at the last stage, where the generation of text is done. It can also be read by the different subsystems during story generation, when the story is not yet complete, and can influence the next stage of story augmentation. Because of this, Slantstory is a format not only for representing entire, complete stories but also for representing partial stories, the composition of which is in progress. In the current implementation, subsystems can augment a story and declare it complete, but cannot revise or remove what has already been contributed. To declare a common representation for (both partial and complete) stories, an agreement had to be reached between different perspectives on what the elements of a story are, what is to be represented about each, and how granular the representation of each element is. The Slantstory DTD specifies five elements that occur within the root. A story cannot be complete without all five of these present, but only existents and actions are required at every stage of story development. The existents are of three types: locations, characters, and things. Actions each have a verb (which might be a phrase such as "try to flee") and may have any or all of agent, direct object, and indirect object specified. The "instantaneous" tension level, or change in the tension associated with an action, is also represented there. The actions also have a unique ID number which indicates their chronological order in the story world. One challenge in developing and using this blackboard representation involves the different models of existents and actions that the three foundational systems use. Characters and locations, but nothing like props or "things," are represented in MEXICA, while Curveship represents all three sorts of existents to provide the type of simulation that is typical in interactive fiction, where objects can typically be acquired, given to other characters, placed on surfaces and in containers, and so on. MEXICA was modified for use in Slant to produce appropriate representations of whatever things were mentioned in actions. The representation of action was also not consistent between the foundational systems. Curveship has a typology of four actions: Configure (move some existent into, onto, out of, off, or to a different location), Modify (change the state of some existent), Sense (gain information about the world from sensing), and Behave (any other action, not resulting in any change of state in the world). Although they may be quite different, all actions are meant to correspond to a sentence with a single verb phrase when realized. MEXICA's actions, on the other hand, are not categorized in this way and include many different sorts of representations. There are, for instance, complex actions such as FAKED_STAB_INSTEAD_HURT_HIMSELF, indications that an action was not taken such as NOT_CURE, and indications that a state is to be described at a certain point such as WAS_BROTHER_OF. The first of these issues, the granularity of action, was handled by developing a mapping between MEXICA actions and Slantstory actions.
A limitation of this approach is that actions cannot be inserted into the middle of a series of Slantstory actions that correspond to a single MEXICA action; this is enforced by giving the actions consecutive IDs, so that there is no room to add further actions. Ideally, however, other subsystems would be able to modify the Slantstory representation of actions in any way. The second of these issues brings up the interesting phenomenon of disnarration (Prince 1988): it is possible in a story not only to tell what has happened but also to tell what did not happen, and doing so can have an interesting effect on the reader. Disnarration is not the representation of action, however, so it cannot be represented in a straightforward way in a list of actions, and should be handled elsewhere, in the spin element, for instance. Resolving the final issue related to stative information also requires further work, since the system should both represent facts about the story world (probably in the existents element) and when to mention them (probably in the spin element). GRIOT transforms, for instance, the "agent" and "direct" attributes of an action into conceptual categories. While Slantstory uses a grammatical-sounding model of actions, with direct and indirect objects, Curveship can in fact realize sentences out of these where the agent is the direct object and the "direct" existent is the subject (when it realizes a sentence in the passive, for instance). So, both GRIOT and Curveship treat the seemingly grammatical attributes of action in slightly different ways. Furthermore, the templates that are used to represent sentences in Curveship, which is designed for narrative variation, are not well-designed for the generation of figurative text. Curveship's templates are set up to allow a slot for an agent, for example, which might eventually be filled with "the jaguar knight," "I," "he," or "you," depending upon how narrator and narratee are set and whether the noun phrase is pronominalized. Fig-S, however, may determine that the adjective "enflamed" should be used with this noun phrase because it will participate in the conventional metaphor LOVE IS FIRE. In this case, Curveship-Gen should generate either "the enflamed jaguar knight," "I, enflamed," "he, enflamed," or "you, enflamed." All the possibilities for combinations of figuration (not just the use of an adjective) and all the existing ways that Curveship can generate noun phrases need to be implemented in the next version of Slant.

Verso: Augmenting a Story Based on Genre

Verso, like MEXICA and Fig-S, reads a Slantstory XML file from the blackboard and outputs an updated one. While MEXICA is focused on plot and Fig-S selects an appropriate domain for blending particular representations of action, Verso's operation is based on a model of genre. This subsystem operates by:

1. Detecting particular aspects of the in-progress story (typically actions with particular verbs, although possibly series of actions or sets of characters) that indicate the story's suitability to a particular genre, for all known genres.

2. Selecting the genre that is most appropriate.

3. Updating the story using rules specific to that genre.

The narrative discourse is always updated by specifying attributes of and elements within "spin." This determines elements such as the focalizer, narrator, time of narrating, rhetorical style, and beginning and/or ending phrases to frame the story.
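The three-step procedure can be summarized in a skeleton like the following, where the rule objects and their methods are assumed names rather than Verso's actual interface.

def run_verso(story, genre_rules):
    # 1. Score the in-progress story against every known genre.
    scores = {name: rule.suitability(story)
              for name, rule in genre_rules.items()}
    # 2. Select the most appropriate genre.
    best = max(scores, key=scores.get)
    # 3. Update the story using that genre's rules, which always
    #    specify the "spin" (narrative discourse) attributes.
    return genre_rules[best].update(story)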
The update can also contribute new actions to the story, which can influence the way that MEXICA continues to develop the plot. This procedure brings a model of genre awareness into Slant, but it is an unusual process from the standpoint of conventional human creativity. More often than not, an author chooses a genre and then writes or tells something within it, rather than beginning with a partial story and finding a genre that suits it. The overall effect, however, is to introduce sensitivity to an important aspect of human creativity. Verso's model does not seem completely aligned with the direction of genre studies in recent decades. This field has moved from a formalist definitional framework of genre to one that is semiotic, focusing in particular on the "rhetorical study of the generic actions of everyday readers and writers" (Devitt 2008). Recently, genre studies has de-emphasized and argued against the idea of genres as distinct categories with characteristic elements that identify them. Scholars now dispute the idea that characteristics can be identified and summed up to indicate the likelihood that a text is part of a certain genre. They note that few genres have true fundamental elements. Particularly in the case of literary genres (e.g. detective fiction, science fiction, horror, fantasy), even when there seem to be some core characteristics that all works within a category share, almost any "defining" characteristic could be countered by an example work which lacks that element but is still undeniably of that genre. Furthermore, a fundamental dilemma arises in the act of classification itself, the problem of "whether these units exist independently of the taxonomical scheme, or arise as a result of the attempt to classify" (Ryan 1981). However, these recent concerns pertain most directly to scholarly and critical work; they do not bear upon the way genre is used in literary creativity. Sharp definitions of genre that are developed through writing practice have served many authors well, including Raymond Queneau, who used 99 different genres, modes, or styles to retell the same simple story in Exercises in Style. The problem of whether classification compels texts into categories is a problem for analysis, but it is a productive idea for literary creativity. Additionally, as Steve Neale has pointed out, "genres are instances of repetition and difference"; it is precisely through the differentiation from the established norms of a genre that a work can become part of it (Neale 1984). Verso, while making use of those "instances of repetition," also aims to effectively model the production of this necessary difference. The genres that have been implemented so far are not literary, either in the sense of broad differentiations such as "prose" and "poetry," or in the sense of categories such as "romance," "cyberpunk," "noir," and so on. Instead, Verso uses a broader definition of what constitutes genre, one which includes categories that may very well be alternatively thought of as styles, modes, or even distinct media, and which relate to both fiction and non-fiction as well as to oral and written communication. In the introduction to Writing Genres, Devitt provides many examples of the influence of genre in our daily lives, including such wide-ranging categories as the joke, lecture, mystery novel, travel brochure, small talk, sales letter, and, most appropriately, the research paper (Devitt 2008).
It is this broader conception of genre, rather than a strictly literary one, that Verso aims to model. The genres implemented in Verso tend towards the stylistic rather than the thematic. In part due to the pre-existing capabilities of Curveship, and in part because of the domain in which MEXICA operates, the genres used are those that can be identified and produced through changes in the narrative discourse (focalization, time of narrating, order of events in the telling, etc.) rather than the story world domain (which could incorporate dragons, spaceships, magic, etc.). A concrete example is provided by the "confession" genre, which casts a story so that it sounds like it is being told to a priest at confession. To determine if this genre is applicable, the system checks to see if one or more actions are likely "sins" (robbing, killing, etc.) based on a list of these. Each "sin" raises the suitability of this genre. If "confession" is selected as the genre to use, the Slantstory XML representation is updated. A "sinner" is located: the agent of the last sinful action. This sinner is specified as the narrator (the "I" of the story). There is no narratee (or "you"), since we presume that the priest was not part of the events that were being told. The time of narrating is set to "after," which results in past-tense narration, and the "hesitant" style is used, injecting "um" and "er" into the story as if the speaker were nervous and reticent. Finally, a conventional opening ("Forgive me, Father, for I have sinned. It has been a month since my last confession.") and a conventional conclusion ("Ten Hail Marys? Thank you, Father.") are added. The "confession" genre produces plausible and amusing results. Some of this has to do with the formulaic nature of the genre. As one reads additional confessions, the rigid, repetitive opening and conclusion can be amusing, because they model the ritualized interaction of confession. Read in this light, it is only more amusing that ten Hail Marys are always given for penance, whether the penitent tried to swipe something or committed a murder. Finally, because Spanish conquerors came to the Americas and imposed Catholicism on the natives, MEXICA-generated plots that are told in this genre can be read as a comment upon, or at least a provocation about, the colonial history of Mexico. Importantly, these two subsystems did not invent this juxtaposition of the Mexica and Catholic ritual; rather, humans decided many years ago to develop a story generator about the Mexica and decided recently to develop a "confession" genre template. However, the subsystems' collaboration as part of Slant involves automatically finding occasions when the juxtaposition of these two is particularly effective. Verso's work and MEXICA's work combine in Slant to provide more cultural resonance, to be more surprising and also to be more valuable by virtue of being thought-provoking. In the current system the following genres have been implemented: confession, diary, dream, fragments, hangover, joke, letter, memento, memoir, play-by-play, prophecy, and the default "standard" story. These take advantage of only a limited range of Curveship's narrative variation capabilities. For instance, the focalization of a story can be varied, but we have not yet implemented genres that focalize stories based on particular characters; similarly, Curveship is already capable of narrating with flashbacks and making other more elaborate changes in order.
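Returning to the "confession" genre, here is a minimal sketch of a rule object implementing the suitability/update interface assumed in the earlier skeleton; the sin list, scoring, and field names are again illustrative assumptions rather than Verso's actual code.

class ConfessionRule:
    SINS = {"rob", "kill", "attack", "kidnap"}

    def suitability(self, story):
        # Each likely "sin" raises the suitability of this genre.
        return sum(1 for a in story["actions"] if a["verb"] in self.SINS)

    def update(self, story):
        sinful = [a for a in story["actions"] if a["verb"] in self.SINS]
        story["spin"] = {
            "narrator": sinful[-1]["agent"],  # agent of the last sinful act
            "narratee": None,                 # the priest was not in the events
            "time_of_narrating": "after",     # yields past-tense narration
            "style": "hesitant",              # injects "um" and "er"
            "opening": "Forgive me, Father, for I have sinned. "
                       "It has been a month since my last confession.",
            "conclusion": "Ten Hail Marys? Thank you, Father.",
        }
        return story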
There are now only two prose styles that are used, "excited" for play-by-play and "hesitant" for confession. It would also be straightforward to elaborate the Slantstory representation and to modify Curveship-Gen to allow for expression that better relates to a wider variety of genres. In discussions so far we have already listed more than 100 genres, most of which we believe will be to some extent recognizable and applicable to the short stories produced by Slant.

Fig-S and GRIOT-Gen for Figuration

Fig-S reads a Slantstory XML file from the blackboard and updates it to include metaphorical content. Metaphor here can be understood as an asymmetrical conceptual blend in which all content from one domain called the "target space" is integrated with a subset of content from another called the "source space" (Grady, Oakley, and Coulson 1999). Fig-S currently implements ontologies representing several domains empirically identified as important in poetry such as "death" and "love" (Lakoff and Turner 1989) that can be used to generate metaphors such as REJECTION IS DEATH or ADMIRATION IS LOVE. Fig-S begins by processing each of the actions from the Slantstory XML file to assess whether they will be replaced by metaphorical versions of the same action. Currently, there are two modes in which this processing can be done. If ONE-METAPHOR is set to true, then the Slantstory is analyzed to find which single source domain is appropriate to map onto the greatest number of actions in order to produce metaphors. Otherwise, each action will be analyzed individually in order to find an appropriate source domain to map onto it. The first mode typically results in more coherent output; the second mode typically results in a greater degree and variety of metaphorical output. As an example, an action can be mapped onto by the source domain LOVE in order to produce a metaphorical action: Fig-S processes the Slantstory action and adds the metaphorical version of it back to the Slantstory. While Fig-S currently implements a simple, metaphorical form of blending as a first step, it could be extended to use a more robust blending algorithm such as Alloy, or even to extend Alloy to result in even more novel, surprising, and/or culturally valued blends using an extended set of optimality principles. GRIOT-Gen is used to produce specific output templates from metaphorical actions in Curveship-Gen format. For example, the metaphorical action above could be realized in a number of ways. The default produced by GRIOT-Gen, for a story in which neither virgin nor princess is narrator or narratee, would be structured as:

'61': 'the burning virgin [become/v] jealous-of the incendiary princess',

however, it can alternatively be structured as:

'61': '[@virgin/s] like burning [get/v] jealous of the incendiary [princess/o]',

if there is a preference for a simile-oriented style for the subject. It is also possible to use a "source-element/target-element" structure as in:

'61': 'the burning/virgin [get/v] jealous of and [burn/v] for the incendiary/princess'

to be very explicit about every element that has been integrated. GRIOT-Gen currently has multiple such exposition forms implemented and is easily extensible.

Slant's First Stories

In the current system some spin (narrative discourse specification) is necessary, although it may simply involve the default settings, while figurative action representations are optional.
To begin with, this amusing but flawed story was generated without figuration, but with contributions from MEXICA and Verso:

Forgive me, Father, for I have sinned. It has been a month since my last confession. An enemy slid. The enemy fell. The enemy injured himself. I located a curative plant. I cured the enemy with the curative plant. The tlatoani kidnapped me. The enemy sought the tlatoani. The enemy travelled. The enemy, um, looked. The enemy found the tlatoani. The enemy observed, uh, the tlatoani. The enemy drew a weapon. The enemy attacked the tlatoani. The enemy killed the tlatoani with a dagger. The enemy rescued me. The enemy entranced, uh, me. I became jealous of the enemy. I killed the enemy with the dagger. I killed myself, uh, with the dagger. Ten Hail Marys? Thank you, Father.

The "sinner" who narrates the story dies, a problem which can also crop up when the "diary" genre is used. Since Verso can assign the genre of the story before the plot is complete, there was initially no way for Verso to be sure that the character it selects as narrator would not die. This requires an interaction between the genre-selecting system, Verso, and the plot-generating system, MEXICA. We implemented an additional set of constraints on how the plotting could be done which either require or prohibit that a certain tension, as defined in MEXICA, arise. One of these tensions is "actor dead," letting Verso prohibit a narrator's death. A story with figuration follows. This one is generated without the constraint for a single conventional metaphor to be used (ONE-METAPHOR is false), so there is a colorful diversity of less consistent metaphors. The genre chosen is "play-by-play," based on sports commentary, which may be a suitable one for the range of metaphor that is used:

This is Ehecatl, live from the scene. The cold-wind eagle knight is despising the icy jaguar knight! The cold-wind jaguar knight is despising the chilling eagle knight! Yes, an eagle knight is fighting a jaguar knight! Look at this, the eagle knight is drawing a weapon! Look at this, the eagle knight is closing on the jaguar knight! The gardener eagle knight is wounding the weed jaguar knight! And now, the jaguar knight is bleeding! Yes, the consumed eagle-knight is panicking! And, eagle knight is hiding! Holy - the snowflake slave is despising the chilling jaguar knight! The freezing-wind jaguar knight is despising the cold slave! And, yes, the cold-wind slave is detesting the chilling jaguar knight! A slave is curing the jaguar knight! And, the slave is returning to the city! And, the jaguar knight is suffering! The frozen jaguar knight is dying! Back to you!

MEXICA's stative descriptions of characters could probably be mentioned more rapidly, or perhaps not at all, to keep the action going. This could be done with an existing facility in Slantstory for omitting actions when narrating. This story would also benefit from pronominalization, which Curveship-Gen is capable of but which would need to be either turned on for all stories or specified at an earlier stage.

Slant's Research Potential

We plan to further develop the system we have initiated to explore new ways that computational creativity researchers can collaborate, new models of storytelling that abstract different sorts of expertise and emphasis, and new ways to compare the importance of and interaction between different aspects of story.
We intend that the system will be used for empirical studies of how people receive generated stories and will also be brought into literary and artistic contexts. Using the Slantstory XML blackboard, many different subsystems can be developed for Slant, which will allow Slant to be run with any subset of them. For instance, if Verso is turned off so that the specification of the narrative discourse is not done by that subsystem, either a default narrative discourse specification could be used (as would be the case now, since Verso is the only subsystem that updates this aspect) or that specification can be built up by one or more other subsystems. This allows the effect of each subsystem, in the context of Slant overall, to be carefully examined. Readers of stories generated under different conditions could be asked not only to rank the outputs in terms of quality, but also to comment on what they thought about particular elements (such as characters) and high-level qualities (whether the story was funny, for instance, or whether it seemed plausible). The project can also facilitate a broader collaboration between researchers of story generation. As long as researchers find the Slantstory XML representation adequate for their purpose, they can develop new subsystems that help to build stories based on other theories or concerns. For instance, a researcher interested in how creativity occurs in social contexts could model the process in a unit that reads from and writes to the blackboard and models social influence and awareness. As just discussed, this new system could be tried in many combinations with existing systems and the outputs could be compared. This would help to show not only the importance of social creativity as modeled in this particular subsystem, but also how creativity of this sort interacts with plot generation using the engagement-reflection cycle, figuration based on conventional metaphors, and awareness of genre. We also anticipate that Slant will supply stories for exhibition and publication in arts contexts, and the functional system itself could be part of a digital media, electronic literature, or e-poetry exhibit. In this way, Slant can contribute to creative practice, and reactions and discussion in this context can help us further develop a system that relates to contemporary literary concerns.

Acknowledgements

Thanks to Clara Fernandez-Vara and Ayse Gursoy for their discussions of genre and of early ideas about Slant.

2013_25 !2013 Using Theory Formation Techniques for the Invention of Fictional Concepts
Flaminia Cavallo, Alison Pease, Jeremy Gow, Simon Colton
Computational Creativity Group, Department of Computing, Imperial College, London
ccg.doc.ic.ac.uk

Abstract

We introduce a novel method for the formation of fictional concepts based on the non-existence conjectures made by the HR automated theory formation system. We further introduce the notion of the typicality of an example with respect to a concept into HR, which leads to methods for ordering fictional concepts with respect to novelty, vagueness and stimulation. To test whether these measures are correlated with the way in which people similarly assess the value of fictional concepts, we ran an experiment to produce thousands of definitions of fictional animals. We then compared the software's evaluations of the fictional concepts with those obtained through a survey consulting sixty people. The results show that two of the three measures have a correlation with human notions.
We report on the experiment, and we compare our system with the well-established method of conceptual blending, which leads to a discussion of automated ideation in future Computational Creativity projects.

Introduction

Research in Artificial Intelligence has always been largely focused on reasoning about data and concepts which have a basis in reality. As a consequence, concepts and conjectures are generated and evaluated primarily in terms of their truth with respect to a given knowledge base. For instance, in machine learning, learned concepts are tested for predictive accuracy against a test set of real-world examples. In Computational Creativity research, much progress has been made towards the automated generation of artefacts (paintings, poems, stories, music and so on). When this task is performed by people, it might start with the conception of an idea, upon which the artefact is then based. Often these ideas consist of concepts which have no evidence in reality. For example, a novelist could write a book centered on the question ‘What if horses could fly?' (e.g., Pegasus), or a singer could write a song starting from the question ‘What if there were no countries?' (e.g., John Lennon's Imagine). However, in Computational Creativity, the automated generation and evaluation of such fictional concepts for creative purposes is still largely unexplored. The importance of evaluating concepts independently of their truth value has been highlighted by some cognitive science research. Some of the notions that often appear in the cognitive science and psychology literature are those of novelty, actionability, unexpectedness and vagueness. Novelty is used to calculate the distance between a concept and a knowledge base. In (Saunders 2002), interestingness is evaluated through the use of the Wundt Curve (Berlyne 1960), a function that plots hedonistic values with respect to novelty. The maximum value of the Wundt curve is located in a region close to the y-axis, meaning, as Saunders points out, that the most interesting concepts are those that are "similar-yet-different" to the ones that have already been explored (Saunders 2002). The notions of actionability and unexpectedness were first introduced in (Silberschatz and Tuzhilin 1996) as measurements of subjective interestingness. Actionability evaluates the number of actions or thoughts that an agent could undertake as a consequence of a discovery. Unexpectedness is a measurement inversely proportional to the predictability of a result or event. Finally, vagueness is referred to as the difficulty of making a precise decision. Several measurements have been proposed in the literature for the calculation of this value, particularly using fuzzy sets (Klir 1987). The importance of generating concepts which describe contexts outside of reality was underlined by Boden when she proposed her classification of creative activity. In particular, Boden identifies ‘three ways of creativity' (Boden 2003): combinational creativity, exploratory creativity and transformational creativity. Transformational creativity involves the modification of a search space by breaking its boundaries. One reading of this could therefore be the creation of concepts that are not supported by a given knowledge base; we refer to these as fictional concepts herein.
Conceptual blending (Fauconnier and Turner 2002) offers clear methods for generating fictional concepts, and we return to this later, specifically with reference to the Divago system which implemented aspects of conceptual blending theory (Pereira 2007). We propose a new approach to the formation and evaluation of fictional concepts. Our method is based on the use of the HR automated theory formation system (Colton 2002b) (reviewed below), and on cognitive science notions of concept representation. In particular, we explore how the notion of typicality can improve and extend HR's concept formation techniques. In the field of cognitive psychology, typicality is thought of as one of the key notions behind concept representation. Its importance was one of the main factors that led to the first criticisms of the classical view (Rosch 1973), which argues that concepts can be represented by a set of necessary and sufficient conditions. Current cognitive theories therefore take into account the fact that exemplars can belong to a concept with a different degree of membership, and the typicality of an exemplar with respect to a concept can be assessed. In the following sections, we discuss the methods and results obtained by introducing typicality values into HR. We argue that such typicality measures can be used to evaluate and understand fictional concepts. In particular, we propose calculations for three measures which might sensibly be linked to the level of novelty, vagueness and stimulation associated with a fictional concept. We generated definitions of fictional animals by applying our method to a knowledge base of animals and we report the results. We then compare the software's estimate of novelty, vagueness and stimulation with data obtained through a questionnaire asking sixty people to evaluate some concepts with the same measures in mind. The results were then used to test whether there is a correlation between our measurements and the usual (human) understanding of the terms novelty, vagueness and stimulation. We then compare our approach with the well-established methods of conceptual blending. Finally, we draw some conclusions and discuss some future work.

Automated Theory Formation

Automated theory formation concerns the formation of interesting theories, starting with some initial knowledge then enriching it by performing inventive, inductive and deductive reasoning. For our purposes, we have employed the HR theory formation system, which has had some success inventing and investigating novel mathematical concepts, as described in (Colton and Muggleton 2006). HR performs concept formation and conjecture making by applying a concise set of production rules and empirical pattern matching techniques respectively. The production rules take as input the definition of one or two concepts and manipulate them in order to output the definition of the new concept. For example, the compose production rule can be used to merge the clauses of the definitions of two concepts into a new definition. It could, therefore, be given the concept of the number of divisors of an integer and the concept of even numbers and be used to invent the concept of integers with an even number of divisors. The success set of the newly defined concept - the collection of all the tuples of objects which satisfy the definition - is then calculated. Once this is obtained, it is compared with all the previously generated success sets and used to formulate conjectures about the new concept.
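A toy sketch of the compose rule and the success-set calculation for the number-theory example just given (HR itself is a Java system; this fragment only mirrors the idea under simplified representations):

def tau(n):
    # Background concept: the number of divisors of an integer.
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def is_even(n):
    # Background concept: even numbers.
    return n % 2 == 0

def compose(f, p):
    # Merge the two definitions: "integers whose number of divisors
    # satisfies the second concept", i.e. is even.
    return lambda n: p(f(n))

new_concept = compose(tau, is_even)

# Success set: all objects in the domain satisfying the definition.
domain = range(1, 31)
success_set = [n for n in domain if new_concept(n)]
print(success_set)  # the non-squares: their divisor count is even

Comparing such success sets against previously recorded ones is what drives the conjecture making described next.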
These conjectures take the form of equivalence conjectures (when two success sets match), implication conjectures (when one success set is a subset of another), or non-existence conjectures (when a success set is empty). In domains where the user can supply some axioms, HR appeals to third-party theorem provers and model generators to check whether a conjecture follows from the axioms or not. HR follows a best-first non-goal-oriented search, dictated by an ordered agenda and a set of heuristic rules used to evaluate the interestingness of each concept. Each item in the agenda represents a theory formation step, which is an instruction about what production rule to apply to which existing concept(s) and with which parameters. The agenda is ordered with respect to the interestingness of the concepts in the theory, and the most interesting concepts are developed first. Overall interestingness is calculated as a weighted sum (where the weights are provided by the user) of a set of measurements, described in (Colton 2002b) and (Colton, Bundy, and Walsh 2000). These were developed to evaluate non-fictional concepts, but some of them could be modified to evaluate fictional concepts for our system, and we plan to do this in future work. HR was developed to work in mathematical domains, but different projects have demonstrated the suitability of this system to work in other domains such as games (Baumgarten et al. 2009), puzzles (Colton 2002a), HR's own theories (Colton 2001) and visual art (Colton 2008).

Using HR to Generate Fictional Concepts

We are interested in the generation and evaluation of concepts for which it is not possible to find an exemplar in the knowledge base that completely meets the concept's definition. Throughout this paper we use the term fictional concepts to refer to this kind of concept. We use the HR system for the generation of such fictional concepts. To do so, after it has formed a theory of concepts and conjectures in a domain, we look at all the non-existence conjectures that it has generated. These are based on the concepts that HR constructs which have an empty success set. Hence, the concepts that lie at the base of these conjectures are fictional with respect to the knowledge base given to HR as background information. For example, from the non-existence conjecture

¬∃x (Reptile(x) & HasWings(x))

we extract the fictional concept

C0(x) = Reptile(x) & HasWings(x)

To see whether typicality values can be used for the evaluation of these fictional concepts, we have introduced this notion into HR. Typicality values are obtained by calculating the degree of membership of each user-given constant (i.e., animals in the above example) with respect to every fictional concept which specialises the concept of the type of object under investigation (which is the concept of being an animal in this case). This is done by looking at the proportion of predicates in a concept definition that are satisfied by each constant. Hence, for each constant aj and for each fictional concept Ci in the theory, we will have Typicality(aj, Ci) = t, where 0 ≤ t < 1.
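The calculation just defined can be sketched directly; the toy knowledge base below is an assumption chosen to match the worked example that follows.

# Typicality as the proportion of predicates in a concept definition
# satisfied by a constant; concepts are sets of predicate labels here.
KB = {
    "Lizard":  {"Reptile"},
    "Dog":     {"Mammal"},
    "Dolphin": {"Mammal", "LivesInWater"},
    "Bat":     {"Mammal", "HasWings"},
}

def typicality(constant, concept):
    return len(KB[constant] & concept) / len(concept)

C1 = {"Mammal", "HasWings", "LivesInWater"}  # a fictional concept
for a in KB:
    print(a, round(typicality(a, C1), 2))
# Lizard 0.0, Dog 0.33, Dolphin 0.67, Bat 0.67 -- matching the worked
# example below up to rounding.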
For example, for the concept definition

C1(x) = Mammal(x) & HasWings(x) & LivesIn(x, Water)

the typicality values for the constants in the set {Lizard, Dog, Dolphin, Bat} are as follows:

Typicality(Lizard, C1) = 0; Typicality(Dog, C1) = 0.3; Typicality(Dolphin, C1) = 0.6; Typicality(Bat, C1) = 0.6.

We see that the constant ‘Dolphin' has typicality of 0.6 with respect to C1 because a dolphin is a mammal which lives in water but which doesn't have wings - hence it satisfies two of the three predicates (≈ 66.6%) in the definition of C1. It is important to note that for each fictional concept C there are at least n constants a1, ..., an such that ∀j, 0 < Typicality(aj, C) < 1, where n is the number of predicates in the concept definition. We refer to these as the atypical exemplars of fictional concept C, and we denote this set of constants as atyp(C). The atypical exemplars of C have typicality bigger than zero because they partly belong to C, and less than one because the concept is fictional, and hence by definition it doesn't have any real life examples. The number of atypical exemplars of a fictional concept is always more than or equal to the number of predicates in the concept definition because fictional concepts originate from the manipulation of non-fictional concepts, and hence, given a well-formed knowledge base, each predicate in a fictional concept definition will correspond to a non-fictional concept with at least one element in its success set.

Evaluating Concepts Based on Typicality

We explain here how typicality can be used to evaluate fictional concepts along three axes which we claim can be sensibly used to estimate how people will assess such concepts in terms of vagueness, novelty and stimulation respectively. This claim is tested experimentally in the next section. To define the measures for a fictional concept C produced as above, we use E to represent the set of constants (examples) in the theory, e.g., animals, and we use NF to denote the set of non-fictional concepts produced alongside the fictional ones. We use |C| to denote the number of conjunct predicates in the clausal definition of concept C. We further re-use atyp(C) to denote the set of atypical exemplars of C and the Typicality measure we introduced above. It should be noted that the proposed methods of evaluation of fictional concepts have not been included in the HR program to guide concept formation. It is, however, our ambition to turn these measurements into measures of interest for ordering HR's agenda.

Using Atypical Exemplars

Our first measure, MV, of fictional concept C, is suggested as an estimate of the vagueness of C. It calculates the proportion of constants which are atypical exemplars of C, factored by the size of the clausal definition of C, as follows:

MV(C) = |atyp(C)| / (|E| · |C|)

As previously discussed, vagueness is a measurement that has been widely studied in the context of fuzzy sets. Klir (1987) emphasises the difference between this measurement and the one of ambiguity, and underlines how vagueness should be used to refer to the difficulty of making a precise decision. While several more sophisticated measurements have been proposed in the literature, as explained in (Klir 1987), we chose the above straightforward counting method, as this is consistent with the requirement that if concept Ca is intuitively perceived as more vague than concept Cb, then MV(Ca) > MV(Cb).
To see this, suppose we have the following two concepts:

C1(x) = Animal(x) & has(x, Wings)
C2(x) = Reptile(x) & has(x, Wings)

In this case, we can intuitively say that an animal with wings is more vague than a reptile with wings, because for the first concept, we have a larger choice of animals than for the second. In terms of typicality, this can be interpreted as the fact that C1 has a larger number of atypical exemplars than C2, and it follows that MV(C1) > MV(C2).

Using Average Typicality

Our second measure, MN, of fictional concept C, is suggested as an estimate of the novelty of C. It calculates the complement of the average typicality of the atypical exemplars of C, as follows:

MN(C) = 1 − (1 / |atyp(C)|) · Σ_{a ∈ E} Typicality(a, C)

Novelty is a term largely discussed in the literature, and can be attached to several meanings and perspectives. In our case, we interpret novelty as a measurement of distance to the real world, as inferred in previous work in computational creativity research, such as (Saunders 2002). As an example of this measure, given the concepts:

C1(x) = Bear(x) & Furniture(x) & Has(x, Wings)
C2(x) = Bear(x) & Furniture(x) & Brown(x)

then, in a domain where all the constants are either exclusively bears or furniture (but not both), and assuming that all the bears and all the furniture are brown, we calculate:

MN(C1) ≈ 0.67
MN(C2) ≈ 0.33

This is because for C1, all exemplars will satisfy just one of the three clauses (1/3) in the definition, hence this will be their average typicality, and C1 will score 1 − 1/3 ≈ 0.67 for MN. In contrast, all exemplars will satisfy two out of the three clauses in C2, and hence it scores approximately 0.33 for MN. Hence we can say that C1 is more distant from reality, and hence more novel, than C2. Consistent with the literature, and in particular with the Wundt Curve (which compares novelty with the hedonic value), we assume that the most interesting concepts have an average typicality close to 0.5. Note that this implies that fictional concepts whose definition contains two conjuncts are always moderately interesting in terms of novelty, as their average typicality is always equal to 0.5.

Using Non-Fictional Concepts

Our final measure, MS, of fictional concept C is suggested as an estimate of the stimulation that C might elicit when audiences are exposed to it (i.e., the amount of thought it provokes). It is calculated as the weighted sum of all the non-fictional concepts, r, in NF that HR formulates for which their success set, denoted ss(r), has a non-empty intersection with atyp(C). The weights are calculated as the sum of the typicalities over atyp(C) with respect to C. MS(C) is calculated as follows:

MS(C) = Σ_{r ∈ NF} ( Σ_{a ∈ atyp(C) ∩ ss(r)} Typicality(a, C) )

This calculation is motivated by Ward's path-of-least-resistance model (Ward 2004). This states that when people approach the task of developing a new idea for a particular domain, they tend to retrieve basic level exemplars from that domain and select one or more of those retrieved instances as a starting point for their own creation. Having done so, they project most of the stored properties of those retrieved instances onto the novel ideas they are developing. As an example, the fictional concept:

C1(x) = Horse(x) & Has(x, Wings)

could lead to the following questions: Is it a mammal? Can humans ride it? Does it live in a farm? Does it fly? Does it lay eggs?
Each of these questions can be derived from the corresponding HR generated concepts which have in their success set a large number of the atypical exemplars of C1.

Experimental Results

To evaluate our approach, we started with a knowledge base of animals, based on similar inputs to those used for the conceptual blending system Divago (Pereira 2007), which is described in the next section. The concept map for a horse was taken from (Pereira and Cardoso 2003) and reapplied to each animal from a list of 69 animals reported on the National Geographic Kids website (kids.nationalgeographic.co.uk/kids/animals/creaturefeature). The relations were maintained when relevant, and extended when necessary according to the Generalized Upper Model hierarchy, as instructed in (Pereira 2007). Figure 1 illustrates a small part of the information we provided as background knowledge for HR to form a theory with.

Figure 1: Details from the knowledge base for animals.

To generate fictional concepts with HR, we used a random-search setup and ran the system for 100,000 steps, which took several hours. We limited the HR system to use only the compose, exists and split production rules, as described in (Colton 2002b). Extracting them from non-existence conjectures, the system produced 4623 fictional concepts, which were then automatically ranked in terms of their MV, MN and MS values, as described above. From each of the ranked lists, a sub-list of 14 fictional concepts was created. The fictional concepts were taken at regular intervals so that they were evenly distributed numerically over the sub-lists, from highest scoring to lowest scoring. For the MN sub-list, all the fictional concepts with two clauses in the definition were first filtered out. For the MV and MS sub-lists, all the fictional concepts with more than two clauses in the definition were filtered out instead. The resulting sub-lists are given in tables 2, 3 and 4 of the appendix respectively. We performed a survey of sixty people who were shown these lists and asked to rank them from 1 to 14 with respect to their own interpretations of the fictional concepts and their values. The aim of the survey was to verify how the measurements MV, MN and MS described above correlate with common (human) understanding of vagueness, novelty and stimulation respectively. The survey was composed of four parts. The first three parts asked people to rank the three sets of 14 concepts in terms of vagueness, novelty and stimulation. We didn't include an explanation of our interpretation of these words in the questions, to encourage participants to use their own understanding of the three terms. The fourth part of the survey asked for a qualitative written definition of each of the three criteria of evaluation: vagueness, novelty and stimulation. Tables 2, 3 and 4 in the appendix report the three sub-lists of fictional concepts and the ranking (1 to 14) that our software assigned to them, along with the rankings obtained from the survey. In order to establish whether our ranking and the survey rankings are correlated, we calculated Pearson's correlation, r, between the system's ranking and an aggregated ranking. The aggregated ranking was calculated by ordering the fictional concepts 1 to 14, according to the mean rank from the participants. We then calculated the respective 95% Confidence Intervals (CI) and p-values, using the alternative hypothesis that the correlations are greater than zero.
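An analysis of this kind can be reproduced along the following lines (assuming SciPy 1.9 or later for the one-sided alternative); the survey ranking vector here is, of course, a made-up placeholder rather than the study's data.

import numpy as np
from scipy import stats

system_rank = np.arange(1, 15)  # the 14 concepts, ranked by the software
survey_rank = np.array([2, 1, 4, 3, 6, 5, 8, 7, 9, 11, 10, 13, 12, 14])

# One-sided test of the alternative that the correlation is greater
# than zero, as used in the paper.
result = stats.pearsonr(system_rank, survey_rank, alternative="greater")
print(round(result.statistic, 3), round(result.pvalue, 3))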
We obtained the following results (quoted to 3 decimal places):

MV/vagueness: r = 0.552, p = 0.020, 95% CI = [0.124, 1]
MN/novelty: r = 0.697, p = 0.003, 95% CI = [0.350, 1]
MS/stimulation: r = -0.029, p = 0.059, 95% CI = [-0.481, 1]

We can therefore conclude that there is a strong and highly statistically significant correlation between the software rankings given by MN and the survey rankings for novelty. We have similarly found a significant and moderate correlation with the survey rankings for MV.

Figure 2: Word clouds for vagueness, novelty and stimulation.

Hence it appears that the novelty and vagueness measurements we suggested offer sensible calculations for the general understanding of these two terms for fictional concepts. We found no correlation between the survey rankings for the stimulation value and the software measure MS. This could be due to two reasons. Firstly, looking at the general descriptions of the word ‘stimulating' given by people in the last section of the survey, they present a broader range of meanings than the word ‘novel' or ‘vague'. Moreover, these meanings are often very distant from the interpretation of the term ‘stimulation' that we used in deriving the MS measure. In figure 2, we present word clouds obtained from the definitions that people in the survey gave of the words vagueness, novelty and stimulation respectively. We can see that the word cloud for vagueness includes words such as ‘description', ‘unclear' and ‘difficult' as might be expected, and the word cloud for novelty includes words such as ‘different', ‘unusual' and ‘original', also as expected. However, the word cloud for ‘stimulation' includes words such as ‘emotion', ‘exciting' and ‘imagination'. This suggests a second reason that could explain the lack of correlation: our measure MS lacks factors to estimate emotions and surprisingness elements, which will be studied in future work. To explore the question of stimulation further, we looked at another measure of fictional concepts which might give us a handle on this property. Table 1 portrays the non-fiction concepts found (during the experimental session with HR described above) to have examples overlapping with the atypical exemplars of this fictional concept:

Cp(A) = isa(A, equine), pw(A, wings)

[noting that pw(A, X) means that animal A has a body (p)art (w)ith aspect X]. These non-fiction concepts comprised the subset of NF that was used to calculate MS(Cp). The non-fiction concepts overlapping with Cp are given along with a calculation which was intended to capture an essence of Cp as the likelihood of additional features being true of the fictional animals described by Cp. The calculation takes the sum of the typicalities of the atypical exemplars of the fictional concept which are also true of the non-fiction concept. We see that it is more likely for the winged horse to have feathers than to have claws, as pw(A,feathers) scores 10, while pw(A,claws) scores just 1. In future, we plan to use these likelihood scores at the heart of new measures.
For instance, we can hypothesise that the inverse of average likelihood over all the associated non-fiction concepts might give an indication of how thinking about Cp could lead to less likely, more imaginative and possibly more stimulating real world concepts.

Table 1: Non-fiction concepts with success sets overlapping with atypical exemplars of the given concept, along with their actionability.

CONCEPT: isanimal(A,horse), pw(A,wing)

Non-fictional concept      Likelihood
isa(A,bird)                6.5
isa(A,bug)                 3.0
isa(A,mammal)              1.0
pw(A,lung)                 8.5
pw(A,mane)                 0.5
pw(A,tail)                 7.0
pw(A,claws)                1.0
pw(A,teeth)                1.0
pw(A,eye)                  10.5
pw(A,legs)                 10.5
pw(A,fur)                  1.0
pw(A,feathers)             10.0
pw(A,beak)                 10.0
pw(A,hoof)                 0.5
pw(A,claw)                 5.5
existence(A,mountain)      2.5
hasAbility(A,carry)        1.0
hasAbility(A,hunt)         1.5
hasAbility(A,flying)       8.0

A Comparison with Conceptual Blending

We compare our system to the well-established conceptual blending technique, as this technique performs fictional concept formation and evaluation, as defined above. We therefore present a comparison of our system with Divago (Pereira 2007), which is a conceptual blending system implemented on the basis of the theory presented in (Fauconnier and Turner 2002). It applies the notions suggested by this theory in order to combine two concepts into a stable solution called a blend. Blends are novel concepts that derive from the knowledge introduced via the inputs, but which also acquire an emerging structure of their own (Pereira 2007). Divago has been successfully tested in both visual and linguistic domains (Pereira 2007). It comprises six different modules: the knowledge base, the mapper, the blender, the factory, the constraints module and the elaboration module. The knowledge base contains the following elements: concept maps that are used to define concepts through a net of relations; rules that are used to explain inherent causalities; frames that provide a language for abstract or composite concepts; integrity constraints that are used to assess the consistency of a concept; and instances that are optional sets of examples of the concepts. The mapper takes two random or user selected concepts and builds a structural alignment between the two respective concept maps. It then passes the resulting mapping to the blender, which produces a set of projections. Each element is projected either to itself, to nothing, to its counterpart (the element it was aligned with by the mapper), or to a compound of itself and its counterpart. The blender therefore implicitly defines all possible blends that constitute the search space for the factory. The factory consists of a genetic algorithm used to search for the blend that is evaluated as the most satisfactory by the constraints module. The algorithm uses three reproduction rules: asexual reproduction, where the blend is copied; crossover, where two blends exchange part of their lists of projections; and mutation, where a random change in one of the projections in a blend is applied. The factory interacts both with the elaboration module and the constraints module. The elaboration module is used to complete each blend by applying context-dependent knowledge provided by the rules in the knowledge base. The constraints module is used for the evaluation of each blend. It does this by measuring its compatibility with the frames, integrity constraints, and a user-specified goal (Pereira 2007).
The first high-level difference between Divago and our system derives from the motivations behind their implementations. Divago was constructed to test the cognitive plausibility of a computational theory of conceptual blending, and hence its aims were to construct complete and stable concepts, i.e., the blends. Details of the system's reasoning process, used for the formation and elaboration of such concepts, are therefore presented in the final output. Our system was instead constructed to generate fictional ideas of value. These are concise concepts which are purposely left in a simple and ambiguous form. The aim is in fact to find the concepts that stimulate the highest amount of thought and interest in an audience. The system's reasoning process is hence hidden from the outputs, and used only for evaluation purposes. In the following paragraphs, we describe the parallels between Divago's modules and the different components of our system. In doing so, we identify the consequences of using each methodology. The first comparison that can be made is between the structures of the user-provided knowledge bases. In HR, the knowledge base is used only to define a set of concepts. It is hence equivalent in functionality to Divago's concept maps. The rules, frames and integrity constraints that need to be user-specified in Divago are instead automatically learned in HR. They take the form of conjectures, non-fictional concepts and function specifications respectively. On one hand, this implies that HR has a greater degree of autonomy. On the other hand, HR is more prone to errors, as the constructed conjectures, non-fictional concepts and functions may not be relevant for the construction of fictional concepts. For example, given an appropriate knowledge base, HR could construct the concept of an animal being amphibious, which is defined as an animal that lives in water and lives on earth. The same frame can be manually defined and used in Divago. However, HR will simultaneously construct other similar concepts: for example, the concept of animals that live in water and are red, or the concept of animals that live on earth and have four legs. If we assume that these concepts could be used for the evaluation of fictional concepts (as we plan to do in the future), then there is currently no way to differentiate between them in terms of the relevance they might have on the definition of a fictional concept (i.e., the system couldn't itself determine that an amphibian is more relevant than a water-living red animal). Moreover, HR is not capable of constructing all the rules, frames and constraints that Divago uses, but we believe that a similar functionality could be achieved through the use of typicality-based exemplar membership, and we plan to explore this possibility. Despite the evident differences between their internal mechanisms, we can make a comparison between the blends produced by Divago's mapper and blender modules, and HR's non-existence conjectures. The first observation regards the range of the potential outputs. For HR, we only consider the concepts that are empirically known to be fictional. Divago's blends could instead be fictional, non-fictional, or exact copies of the two initial inputs. Moreover, Divago focuses only on one of the possible bijections between the elements in the concept maps. Pereira recognises that this restriction narrows the creative potential of the system (Pereira 2007, p. 117). HR is instead able to consider all possible structural alignments.
Furthermore, Divago works on the blend of two randomly selected or user specified concepts, while HR can consider multiple concepts at once. A component to develop and elaborate on HR's fictional concepts is still missing from our system, which we are planning to implement soon. In order to do so, we will take inspiration from Divago's factory and elaboration modules, while also taking into consideration the typicality values discussed above. However, as explained before, in our case this reasoning module will be used to calculate the potential reasoning that can originate from a fictional concept. In Divago, the factory and elaboration modules are instead used for the completion of a blend. Finally, Divago's constraints module can be compared with the measures MV, MN and MS introduced above. Divago's constraints module aims to evaluate a completed blend, while our system rates fictional concepts. Nevertheless, a correspondence between the evaluation methods can be noted. For example, the topology constraint used in Divago measures the novelty of a blend, like the MN measure for fictional concepts investigated above, and the integration constraint used in Divago measures how well-defined a blend is, which is similar to the MV measurement we have found is correlated with vagueness.

Conclusions and Further Work

We have proposed a method for generating and evaluating fictional concepts, using the HR theory formation system enhanced with typicality values. With the experiments above, we have shown that it is possible to create fictional concepts by using this process and that it is possible to meaningfully order the fictional concepts in terms of interestingness-oriented measurements. We have compared the automatically achieved evaluations with a ranking obtained through the analysis of a survey consulting sixty people. This showed that our MV and MN measures are correlated positively with common understandings of vagueness and novelty respectively. Finally, we compared our approach to the one based on conceptual blending in the Divago system, which placed our work in context and highlighted comparisons which will inform future implementations. Our system is still at the developmental stage. The experiment above, however, indicates that it is capable of creating fictional concepts that could be of interest to an audience. Moreover, this ideation process could be used at the heart of more sophisticated artefact generation systems, e.g., for poems or stories. As previously discussed, the methods used to rank such fictional concepts have been shown to be useful, but also present some issues. Our next steps will therefore be to refine our current approach and implement new measures to estimate the interestingness of fictional concepts. To start this process, we will take inspiration from the notions analysed in (Colton, Bundy, and Walsh 2000) and used in the HR system, and modify them as appropriate. We will also look at other measurements suggested and used in the Computational Creativity literature, such as Ritchie's criteria (Ritchie 2007). These, for example, could be used to assess the novelty of a fictional concept with respect to other fictional concepts. We will then refine our measurement of typicality. To do so we hope to take inspiration from the theories proposed in cognitive science on the evaluation of the prototype theory and the weighting of category features. Each feature will be given a value called salience, used to indicate how important it is for the concept's definition.
The salience values will then be used to calculate the typicality values with more accuracy. Ultimately, we aim to introduce the notion of the distortion of reality. This measurement will serve to calculate how many real world constraints a fictional concept breaks. We will start by studying two methods for the calculation of values related to this. The first method is inspired by (Pease 2007) and will be based on the number of conjectures that each atypical exemplar of a fictional concept breaks. The second method is based on the scale of the distortion that an ontology would be subject to in order to include a fictional concept. We will also implement further methods for reasoning with fictional concepts. These methods will be used to estimate actionability; for the elaboration of fictional concepts; and for potential renderings of ideas in cultural artefacts such as poems and stories. We also plan to study how the different methods of measurement could be related to a rendering choice and vice versa. For example, non-vague concepts could be suitable for paintings, while actionable concepts might be more suitable for storytelling. We hope that such studies will help usher in a new era of idea-centric approaches in Computational Creativity as we hand over the creative responsibility for ideation to our software and address high level issues such as imagination in software.

Acknowledgments

We would like to thank the anonymous reviewers for the comments and suggestions we received. This research was funded by EPSRC grant EP/J004049.

2013_26 !2013 e-Motion: a system for the development of creative animatics
Santiago Negrete-Yankelevich and Nora Morales-Zaragoza
División de Ciencias de la Comunicación y Diseño, Universidad Autónoma Metropolitana (Cuajimalpa), Av. Constituyentes 1054, México D.F. 11950, México
{snegrete/nmorales}@correo.cua.uam.mx

Abstract

This paper introduces e-Motion, a software system for the creation of animatics,1 which are important tools within the process of creation of animated graphics for TV. This type of animation, generated by the system from plots in plain text, allows production teams to envision how a final motion graphics piece can be developed. We argue that our system plays a creative role within the generative process. Specifically, our work is linked to a real production team, involved in the creation of animated shorts, called Imaginantes, for Mexican television.

Introduction

Computer systems intervene more and more in creative practices. They play different roles in teams of people working on projects that produce innovative work and whose overall process can be deemed creative by a suitably selected group of human experts. Just as in the case of creative teams formed strictly by human members, the blame for creativity can be distributed amongst the team members, including computer systems (Maher, M.L. 2012). In this paper we describe e-Motion, a computer system that builds animatics for a pre-production process to create motion graphics. In order to test the system we embed it in the process of Imaginantes, a TV production of a series of animated shorts (one minute long) based on texts of different authors aimed at encouraging viewers to get involved in Music, Literature, Fine Arts and Film. The first season (12 shorts), launched in October 2006, captivated young audiences who shared and published them through different social media and the web; some people even created their own shorts.
Since then, Imaginantes has won numerous awards, and it is now in its fourth season, with a total of 46 shorts produced and delivered in several media (Imaginantes, 2006).
1 An animatic is a visualization tool used in the pre-production process of an animation that informs the production team about movement, narrative structure, framing aspects and visual effects before the animation is actually done.
e-Motion is part of a research project on computational creativity where we take a proven creative process that produces a recognized, valuable product and use it as an environment to test our systems. We are interested in studying how the overall creativity is affected if computer systems take over different roles within the pre-production stage. The Imaginantes team and process are well defined, as are the work products that must be produced. We hypothesize that all stages and work products contribute to the overall creativity, but we test our system's creativity by inserting it into the human process, observing how it affects the outcome, and asking human members of the team to assess the system's performance. There are two main advantages to this approach:
1. The system is assessed within a recognized real-world creative process, so we can avoid the toy-world generalization problem.
2. Our system plays a role within a human process, so it is easier for the human members of the team to assess the system's performance. They are experts in the area and know very well what to expect.
The following are the main motivations for our work:
• Study computational creativity within real-world creative practices.
• Understand the creative process of multi-sensorial content, sound and movement in visual narratives.
• Develop a computational system that works collaboratively in the creative production process of visual narratives.
• Develop sound user criteria to evaluate the system as a valuable tool.
• Experiment with automatically-created motion graphics to study how different frame, color and graphic element combinations transmit emotional content to an audience.
We want to maximize narrative appeal while preserving the logical structure suggested by the original plot (Malamed, 2009). Animatics constitute an intermediate step towards a full motion graphics piece and, although they depict simple representations, they have a complex structure and most of the high-level architecture and elements of the final product. They are built from storyboards,2 which, in turn, are assembled from scripts and constitute an important tool within the process of developing motion graphics. They convey decisions about editing, camera framing and special effects. A production team can discuss several of these options using various animatics before embarking on the production stage, saving resources in this costly process. Hart actually describes animatics as the "future of motion control" to stress their importance (Hart, J. 2008). As a starting point for our research into the nature of the creation of animated stories, we use the output of Mexica (Pérez y Pérez and Sharples, 2001), a computer system that generates story-plots about characters, places and themes of pre-Hispanic folklore; in particular, that of the Mexicas (most commonly known as Aztecs). These stories were originally represented in codices: pictographic documents in which cultures from Mesoamerica used to write their history and other important aspects of their lives (Galarza, J. 1997).
Mexica plots are useful for our purpose because they have very well-defined and simple syntactic and narrative structures, yet an immense potential for expression. In fact, most of the themes of classical literature can be represented by Mexica plots: betrayal, sacrifice, courtly love, deceit, loyalty conflict, etc. The basic visual elements to assemble the animatic are provided by another system: Visual Narrator (VN) (Pérez y Pérez et al. 2012). This program illustrates story-plots from Mexica by producing sequences of still images composed of characters and scenes that literally represent the input plot, following a set of rules used in some pre-Hispanic codices in a pictographic fashion. The rules specify how characters are presented according to their rank in society, activity, gender, and tension (emotional links represented by facial expression). They also specify how locations must be represented, as well as action conventions. For instance, the rules describe how to represent a person who has a high social rank, who is talking to the people and who is angry. All characters used by e-Motion are built by VN. Within the context of the process to produce a full motion graphics piece, the sequence produced by VN can be considered a rough storyboard. e-Motion generates animatics that follow a set of conventions for the representation of characters and locations, but also depict the dynamics of the action and emotion found in the original plot.
2 Sequential drawings adapted from the script, depicted as concept drawings that illuminate and augment the script narrative (Hart, J. 2008).
This paper is divided into four sections besides this introduction. In the first section we describe the Imaginantes project and why we think the use of computer systems can improve it. In the next section we explain how the system works. Then we propose a set of criteria to evaluate the system. Finally, we present some conclusions and the current state of our project.

Building Motion Graphics for Imaginantes
Motion graphics are already present in everyday life. They are used in a variety of media: TV identities, film titles and credits, DVDs, videogames, smartphone interfaces, advertising displays and multiple other media. The creation of motion graphics is considered a special skill, usually handled by artists or graphic designers focused on the combination of design and television broadcast or film (Frantz, M. 2003). The term is an abbreviation of "Motion Graphic Design". Kook refers to it as the use of graphics, video footage and animation technology to create the illusion of motion or rotation, usually combined with audio (Kook, E. 2011). The Imaginantes team consists of 8 to 10 people, including an executive producer, an art director, a design and animation coordinator, animators, illustrators, and a musician or audio designer. The total time spent on the creation of a short ranges from 10 to 12 weeks. The team starts with an original script that provides the general structure of the story. In some cases, this script has some extra indications describing shots, special effects, sound, etc.
Concept creation. The team collects all kinds of reference material related to the theme and author. It is a collaborative and exploratory work.
Pre-visualization. At this stage, the team develops two main tools: first, the storyboard, whose purpose is to show the key moments of the story in a sequence, suggest framing of the scenes and inform other specifics, like lighting, camera movements and special effects.
It gives the entire pre-production team a visual sequential breakdown of the main scenes in the narrative. The other tool is the animatic, which brings the storyboard alive with motion, visual effects and a visual style for the animation. It is a very effective tool for pacing the narrative and timing, and for later adding music and dialog (Hart, J. 2008).
Production. After the animatic is developed, illustrations are created, digitized, and rendered; sound and music are also added to produce the final piece.
In a process like the one described above, a system like e-Motion, which suggests a variety of animatics with some camera-direction decisions based on the dramatic content the director wants to pursue, would be of great value for the production team. In the regular process, the team receives only limited feedback, from just one animatic per motion graphics project. It would also open new communication channels between the team members by expanding the discussion to new options, and save time and work resources.

e-Motion
Plots, in Mexica, are built by selecting characters and structure from a repository of previous plots, combining them in a way that makes sense, story-wise, and trying to preserve well-known, successful emotional tensions. Emotional tensions are collected during the process in an emotional-tension profile for the story. This can be viewed as a chart where overall emotion varies against time. Emotional tension preservation in Mexica is a key factor in the guidance towards the selection and combination of elements for a successful plot. A plot is a sequence of events in the order they occur in the story. It is the skeleton, the structure that tells the main events that occur in the story in a sequence of short action descriptions. Before the story is complete and ready for a final reader, it would have to be further developed to include all aspects expected of a creative piece of literary work. Yet, for our purposes it constitutes a good starting point, because in the Imaginantes process the starting point is a text script (similar to a plot) with a few very structured actions or events in sequential order. An example story plot can be seen below. Emotional tensions are inserted between brackets as they occur (Lc = love conflict; Lr = life at risk; Hr = health at risk; Ad = actor dead):

Jaguar Knight was in Texcoco Lake
Enemy was in Texcoco Lake
Enemy got intensely jealous of Jaguar Knight (Lc)
Enemy attacked Jaguar Knight (Lr)
Jaguar Knight fought Enemy
Enemy wounded Jaguar Knight (Hr)
Enemy ran away
Enemy went back to Texcoco Lake
Enemy did not cure Jaguar Knight (Lr)
Farmer prepared to sacrifice Enemy
Enemy ran away
Jaguar Knight died by injuries (Ad)

Figure 1. A plot from Mexica and its tensions.

In e-Motion, a story plot with its emotional profile is taken as input, as well as a set of characters generated by VN. An example character from VN can be seen in Figure 2. It depicts the 'enemy' character from the story being angry as an emotional response to the fight it held with Jaguar Knight (see the plot above in Figure 1). Each line in the plot is an event and these, in turn, are incorporated into scenes by e-Motion. A scene has a set of performers. A performer is a character in action, that is, a character associated with an action to be performed. Characters in the animation include anything that appears on the screen and can be animated: humans, locations, emotional tokens, etc. They are all images and can be modified by 'moods'.
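The plot notation in Figure 1 is simple enough to parse mechanically. As an illustration only (the function and tag table below are our own assumptions, not part of e-Motion or Mexica), each line can be split into an event and its tension tags:

import re

# Illustrative tension tags taken from Figure 1.
TENSIONS = {"Lc": "love conflict", "Lr": "life at risk",
            "Hr": "health at risk", "Ad": "actor dead"}

def parse_plot_line(line):
    """Split a Mexica-style plot line into event text and tension labels."""
    tags = re.findall(r"\((\w+)\)", line)
    event = re.sub(r"\s*\(\w+\)", "", line).strip()
    return event, [TENSIONS.get(t, t) for t in tags]

print(parse_plot_line("Enemy got intensely jealous of Jaguar Knight (Lc)"))
# ('Enemy got intensely jealous of Jaguar Knight', ['love conflict'])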
Figure 2. Enemy in angry mood.

A character can have several moods depending on the representational variations available to it. A human can be looking right or left (there are only two dimensions, so far); he/she can be normal, angry or sad, etc. A location can have rain, sunshine, etc. Actions encode the movements of the characters on screen; they have a name and are a combination of the following basic animation operations: translation, scaling and rotation. Performers can realize an action from the plot, like 'fight', or enact the manifestation of an emotional tension like 'got intensely jealous of'. Emotions in animation may be expressed in different ways: character moods, textures flying as clouds across the scene, icons depicting specific feelings (similar to the ones presented in codices), scene elements or characters appearing as text, etc. In the latter case, the font, size and color of the text are used to manifest different emotions too. We call these emotional tokens. e-Motion builds scenes by following cinematic rules of composition: transition, character distribution, motion trajectories, framing and color. There are several options for each, and the system builds the scenes of the animatic by selecting combinations of them that reflect the emotional profile of the original plot. Emotional tensions may be of different kinds: love, hate, danger, anger, etc. As a story progresses, each event may bring new emotional tensions into consideration. Some of them may reinforce others previously introduced or may counteract them. Each new tension introduced in a plot manifests itself in the composition of a scene by affecting, to a certain degree, the dramatic quality of the scene. For every character there is an emotional profile consisting of three parameters (Table 1): affect (the level of acceptance/rejection the character feels), health (a level of well-being) and excitement (a measurement of the degree of arousal of the character). Each occurrence of a tension in the original plot contributes a certain predefined amount to some of the emotional profile parameters just mentioned. These parameters range from -3 to 3. Hence, whenever a character is to be integrated into a scene, its emotional profile defines how it appears: a character's affect, health and excitement values determine its performance parameters in the scene: mood, speed, and emotional tokens (Table 1). There is also a set of global rules that determine framing, transition and trajectory for the characters and scenes. Framing refers to decisions about how closely to frame an action of the story and how far to pull back so the audience can see where the action is taking place. e-Motion chooses its framing from a range of four types, based on camera angles from photography and film (McCloud, S. 2006): first plane, middle shot, middle close-up and extreme close-up. A story will always start with a first plane view (a), making a zoom-in camera movement, followed by a middle shot (b). As the story continues and the events carry on, the characters change their affect and health levels. e-Motion always looks at the highest tension value to change the framing: a value of +2 or -2 changes the framing to a middle close-up (c), and +3 or -3 to an extreme close-up (d). After every middle close-up, the program sets back to a middle-shot view until the character's tension levels reach +3 or -3 again. The system will always end the story with a first plane view, unless the overall level of excitement shows -3 or +3, in which case the system selects an extreme close-up and tilts the object in the frame.
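The framing rules just described amount to a small state machine. The sketch below is our own reading of those rules, with assumed names and thresholds; it is not code from e-Motion.

# Hypothetical sketch of the framing rules described above.
def next_frame(current, tension):
    """Pick the next framing from a character's highest tension value."""
    t = abs(tension)
    if t >= 3:
        return "extreme_close_up"   # frame (d)
    if t == 2:
        return "middle_close_up"    # frame (c)
    if current == "middle_close_up":
        return "middle_shot"        # fall back until tension rises again
    return current

def frame_story(tensions):
    """Open on a first plane, zoom to a middle shot, follow the tension
    profile, and close on a first plane unless excitement is extreme."""
    frames = ["first_plane", "middle_shot"]
    current = "middle_shot"
    for t in tensions:
        current = next_frame(current, t)
        frames.append(current)
    frames.append("extreme_close_up" if abs(tensions[-1]) >= 3 else "first_plane")
    return frames

print(frame_story([1, -2, 1, 3, 0]))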
Trajectory refers to the path that a character follows to enter and exit a scene. Each character has an entrance to the scene and a position to move to in the plane. It can also continue towards the edges of the plane. Emotional tokens have two main trajectory paths: clouds follow a curve; stains and pictograms stay in their initial position of appearance while varying their scale according to the character's level of excitement. Transition refers to the sequence of movements from one key scene of the story to another, thus establishing the flow of the story (McCloud, S. 1993). The example plot shown in Figure 1 produces the animatic presented in Figure 3, as a sequence of images that show some of its scenes. The third frame shows the character Enemy in an angry mood because he is jealous of Jaguar Knight. There is a green cloud traveling across the frame from left to right, showing that feeling in the scene. The eagle and the cactus compose the name of "Texcoco Lake", the place where the scene takes place. The angry mood is due to a low level of affect (-1) and excitement (1). The cloud is one of many possible manifestations of jealousy; this one in particular was selected randomly. The framing of that scene is solved with a middle-shot angle. In the last scene Jaguar Knight is dead. The tension refers to actor death: health (-3), excitement (0) and affect (0). There is an emotional token (a red stain) which depicts the "being hurt" action, and an up-scaling black cloud covers the body, dissolving into the final scene.

Figure 3. Six still frames extracted from an animatic as a means of illustration.

Dimension  | -3     | -2       | -1       | 0       | 1        | 2         | 3
affection  | hate   | contempt | envy     | neutral | sympathy | affection | love
health     | death  | illness  | hurt     | neutral | sane     | welfare   | happiness
excitement | horror | dread    | cautious | neutral | surprise | joy       | bliss

Table 1. Dimensions of emotional profile and their discrete values.

Assessing Animatics
To evaluate an animatic, we have designed a questionnaire to measure its efficiency as a valuable tool and how new and surprising the results are to the team (Boden, M. 1992). The questions were designed following interviews with actual members of the team, where they set the parameters for effectiveness. The questionnaire is aimed at members of the Imaginantes team, and responses are weighted according to their experience. Each question offers 5 levels of agreement (1 means "totally disagree", while 5 means "totally agree"):
1. Does the animatic show a logical selection of key moments from the text script?
2. Does the animatic give you general information on how many characters/objects/locations need to be drawn? Does it help you visualize a particular graphic style?
3. Does the animatic give you a general direction on time, special effects, movements and transitions you need to consider for the animation?
4. Does the animatic show an appropriate selection of camera angles according to the dramatic content of the script?
5. Is it likely that anyone on your team would have come up with a similar solution?
We end the questionnaire with a question that asks the participant to locate in a chart the balance of intensity/clarity and creativity of the animatic they have just seen (McCloud, S. 2006). We expect that e-Motion should ideally be ranked within the upper-right quadrant (Figure 4).
All team members have access to all work products, so they can relate them to the initial plot.

Figure 4. Efficiency and creativity chart. Its quadrants are labelled: Clarity (transmits the logical narrative structure of the text script); Intensity (effects and techniques to attract and excite viewers according to the dramatic content); Very creative (suggests relevant ideas, some never thought of before); Not so creative (suggests a number of ideas, but not particularly good ones).

Conclusions and Future Work
e-Motion is a system that contributes to the development of motion graphics by creating animatics. By doing so, it plays an important role in the creative process, since it determines a great deal of the structure and action of the final motion graphics piece. We consider our system important because it allows us to experiment with computational creativity in a proven creative process in the real world. This setting is particularly appropriate, we find, for evaluating the system's performance, since the human part of the team can do so with well-defined parameters. As far as the authors of this paper are aware, there is no other work involving computational creativity and animation. At the time of writing, the first version of e-Motion is in beta testing. We have run it with a few plots, but the evaluation process, although already designed, has not yet been applied. It will be applied as soon as the system is ready. In the first stage of the project we use Mexica plots as a starting point. This allows us to use a standard for plots that also contains information about their emotional tensions. In the Imaginantes project, the starting point is a script derived from an art piece. In subsequent versions we will use the experience with Mexica plots to standardize scripts taken from other sources and provide them with emotional descriptions. We are also planning to develop other systems that take over other aspects of the process and to study their effects. The system currently works with a set of tension values; these contribute cumulatively to characters' emotional profiles, which determine how characters are animated. The general rules about framing, transition and trajectory follow basic cinematic standards, but we would also like them to be determined by emotional parameters, an aspect that needs to be further investigated.

2013_27 !2013 An Emerging Computational Model of Flow Spaces to Support Social Creativity
Shiona Webster, Konstantinos Zachos & Neil Maiden
Centre for Creativity in Professional Practice, City University London, Northampton Square, London, EC1V 0HB, UK
{shiona.webster.2, k.zachos, N.A.M.Maiden}@city.ac.uk

Abstract
This position paper reports an emerging computational model of flow spaces in social creativity and learning that can be applied to guide human-centered creative cognition in social groups. In particular, we are planning for the model to be applied to inform creative goal setting, creativity technique selection and adaptation, and guided social interaction during creative problem solving and learning.

Social Creativity and Learning
Social creativity and learning are increasingly important and related phenomena. Indeed, fostering creativity in learning is seen as a key direction with which to transform promising ideas into new processes, products or services (Retalis and Sloep, 2010).
The explosion of information made available through the advancement of Web 2.0 has resulted in publicly available content that is continuously (re)created over the social media universe at an ever-increasing speed (Kaplan and Haenlein, 2010). Such rich content resources can provide a wealth of useful information that can support creativity and learning in both informal and formal social groups. Technologies are available to support such social creativity and learning, and they support many different techniques that can be applied to solve problems creatively. However, one outstanding challenge is which techniques to use to support different forms of social creativity and learning. The techniques can be categorized by the creative outcome that each can deliver when applied effectively, for example, the distinction between transformational, exploratory and combinatorial creativity (Boden, 1990), yet these categories offer few insights into effective processes that lead to social creativity and learning. We argue that the success of social creative processes can depend on the extent to which people in the process are able to collect and relate information as well as create ideas collaboratively (Shneiderman, 2002), and whether these people experience flow and can create and learn, as opposed to becoming bored or anxious (Csikszentmihalyi, 1974). For example, consider the following three different creativity techniques that could be deployed in a social creative process: (i) creativity triggers for business services, an exploratory creativity technique which directs the problem solver to solutions associated with creative ideas with qualities such as convenience and trust; (ii) constraint removal, a transformational creativity technique that removes or reduces perceived constraints to increase the possible search space, e.g. (Onarheim, 2012); and (iii) analogical problem solving, an exploratory creativity technique that transfers a network of interrelated facts from a mapped source domain to the target domain, e.g. (Gick and Holyoak, 1983). Each of the techniques has different strengths and weaknesses. Analogical reasoning from a source domain necessitates information about the domain to be collected and related before ideas can be generated. Analogical knowledge transfer can then trigger the problem solver to generate multiple and more radical new ideas and concepts, but is cognitively difficult to do (Gick and Holyoak, 1983), and can lead to anxiety rather than flow and learning through the formation of new problem schemata. Constraint removal also necessitates information to be collected beforehand, and can lead to the generation of more ideas than with analogical problem solving (Jones et al., 2008). We argue that criteria and mechanisms for selecting the most effective creativity technique at the right time in a social creative process are currently lacking. Whilst some experienced human consultants demonstrate an ability to select and adapt techniques to changing situations in social processes, such work is best categorized as craft, with little externalization of the knowledge and mechanisms applied. Moreover, if we are to embed such knowledge and mechanisms in computational environments that will guide and support people in the use of Web 2.0 creativity support tools during such processes, then new research is needed to discover and describe this knowledge and these mechanisms - new research that we are undertaking in the COLLAGE consortium.
COLLAGE is an EU-funded Integrated Project to inform and enable the design of effective Web 2.0 social creativity and learning technologies and services. The focus is to design, develop and validate an innovative cloud-enabled social creativity service-set that will support the interlinking of learning processes and systems with (i) social computational services for inspiring learners, (ii) social affinity spaces for leveraging expression and exploration, and (iii) social game mechanics for supporting social evaluation and appreciation of creative behaviour. The new computational environment that we are developing to invoke different services in this set will need new capabilities to select between and recommend services, then adapt guidance to the social group during the social process. To deliver these capabilities, the approach adopted in COLLAGE is to develop a descriptive model of the desirable creative processes that is derived from existing theories and models of creativity and learning. In this paper we report a first version of the model that describes how creativity and learning might be associated within a social process. The focus of the model is on descriptions of conceptual spaces in which flow, creativity and learning can be achieved. This model will, we anticipate, enable the design of effective social creativity and learning technologies and computational services with which to inform the selection and use of different creativity techniques and support tools.

Initial version of the COLLAGE Social Creativity and Learning Model
The COLLAGE Social Creativity and Learning (SCL) model is being developed to inform the principled selection and use of different techniques and computational services that support creative idea generation based on inspiration and recommendation engines, game mechanics and affinity spaces. To develop the model, we have drawn on Shneiderman's GENEX framework and Boden's concept of conceptual space to support social creativity and collaborative learning in workplaces. The use of each is reported in turn.

GENEX Framework
The SCL model is based on the GENEX framework (Shneiderman, 2002) - an established situationalist model of social creative processes. The GENEX framework identifies four key processes during social creativity: (i) collecting information from public domain and available digital sources; (ii) relating, interacting, consulting and collaborating with colleagues and teams; (iii) creating, exploring, composing, and evaluating solutions; and (iv) disseminating and communicating solutions in a team and storing them in digital sources. These phases may occur in any order and may repeat and cycle iteratively.

Boden's Theory of Search Spaces
In COLLAGE we use Boden's model of creativity (Boden, 1990) to support creative work by exposing novel information spaces to problem solvers and, in turn, to recommend creativity techniques that can be used to discover novel ideas for problem solving. Creativity is seen as a search of solution possibilities in a space, based on measures of dissimilarity between possibilities as proxies for solution novelty (Ritchie, 2007). The search task is to find a complete solution among a set of partial and complete solutions that make up the search space. Hence, we assert that the problem at hand can be mapped to a problem of searching a space of solution possibilities.
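To illustrate the idea of dissimilarity as a novelty proxy, the sketch below scores a candidate solution by its distance to the nearest known solution; the set-of-features representation and the Jaccard-style measure are our own assumptions, not COLLAGE's.

# Hypothetical sketch: dissimilarity to known solutions as a novelty proxy.
def dissimilarity(a, b):
    """1 minus the Jaccard overlap between two feature sets."""
    union = a | b
    return (1.0 - len(a & b) / len(union)) if union else 0.0

def novelty(candidate, known_solutions):
    """Novelty proxy: distance to the nearest known solution."""
    return min(dissimilarity(candidate, k) for k in known_solutions)

known = [{"wheel", "frame", "pedal"}, {"wheel", "motor", "battery"}]
print(novelty({"wheel", "sail", "frame"}, known))   # fairly novel: 0.5
print(novelty({"wheel", "frame", "pedal"}, known))  # not novel: 0.0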
The SCL model extends both the GENEX framework and Boden's concept of conceptual space to incorporate three capabilities that are critical to support social creativity and learning: (i) to reason about a new solution in order to discover the spaces in which novel and useful ideas are most possible; (ii) to guide the use of creativity techniques to search these spaces in order to discover novel and useful ideas; (iii) to engage the problem solver in such a way that he is fully immersed, feeling involved and successful in exploring the space of possible ideas. To deliver these capabilities the SCL model includes: (a) a theory of goal-driven creative search spaces that computes novel search spaces and recommends creativity techniques to discover novel ideas; (b) a collaborative learning model for creativity that exploits a problem solver's real learning capacity in a collaborative and creative setting. The next section describes our use of the theory of goal-driven creative search spaces and a new collaborative learning model that combines Csíkszentmihályi's notion of 'flow' with Vygotsky's Zone of Proximal Development.

Theory of Goal-driven Creative Search Spaces
Since search spaces have an implicit modularity in their structure (Johnson, 2005) and are often too large to search in a single search activity, the SCL model supports the discovery and exploitation of modular building blocks in the space. In COLLAGE we see the SCL model as a search-based creative process, i.e. a process of breaking down an initial, bigger problem into sub-problems, working out how those sub-problems fit together, and then tackling those sub-problems.

Figure 1. The overall search space divided into sub-spaces.

Figure 1 shows a representation of two types of search space that we are seeking to describe and enable the search of, and discovery of ideas within. The first is the larger overall search space that includes all of the ideas in the space. Since the space is too large to search in a single creative search activity, it is searched through a series of creative search activities, each of which searches the local part of the space expressed by the current goal, related to the ideas already discovered in the space. We can express a creative search activity in terms of a current subspace in a wider design space, and apply search-based techniques and theories to it. One characteristic of creative search processes is that the criteria for evaluating where to make the moves in the search space are not easy to capture in rule-bound form. Therefore, in COLLAGE we will employ game mechanics as a means to set intermediate goals in the overall search space that will both guide and engage problem solvers in further creative activities. Just as a game has levels that one tries to achieve, so should each creative search activity be informed by specific goals; game mechanics are used to provide these goals, which can be in the form of awards, credits and acknowledgements, in order to motivate and engage learners further in the creative problem solving process. Each subspace reveals a new goal that compels the problem solver to continue their creative search activity.

Collaborative Learning Model
The fundamental idea of how a subspace is traversed can be illustrated through an approach that combines Csíkszentmihályi's notion of 'flow' (Csíkszentmihályi, 1996) with Vygotsky's notion of the Zone of Proximal Development (Vygotsky, 1978).
By combining both ideas, we introduce the concept of the collaborative learning model. Csíkszentmihályi suggests that a person (or group) can experience 'flow' when fully immersed in an activity, feeling full involvement, an energized focus and success. Creativity is more likely to result from flow states (Csikszentmihalyi, 1996). Csíkszentmihályi identified three things that must be present to enter a state of flow:
• Goals - Goals add motivation and structure to the task; therefore, the person must be working towards a goal to experience flow.
• Balance - There must be a good balance between a person's perceived skill and the perceived challenge of the task. If one weighs more heavily than the other, flow probably won't occur.
• Feedback - A person must have clear, immediate feedback, so that he can make changes and improve his performance. This can be feedback from other people, or the awareness that progress is being made.
Vygotsky's conceptualisation of the zone of proximal development (ZPD) is designed to capture the continuum between the things that a learner can do without help and the things that a learner can do when given guidance, or in collaboration with more knowledgeable others. According to Vygotsky, learning occurs in this zone. Therefore, for learning to occur, people in a creative social process must be presented with tasks that are just out of reach of their present abilities. Tasks that are in the ZPD are tasks we can almost do ourselves, but need help from others to accomplish. After receiving help from others we will eventually be able to do the tasks on our own, thus shifting them out of our ZPD; in other words, we have learned something. In COLLAGE we combine flow and the zone of proximal development in the collaborative learning model depicted graphically in Figure 2. The concentric circles represent the subspaces and goals that make up the larger overall search space. The horizontal axis represents a problem solver's domain-specific knowledge of the task at hand and the vertical axis represents the level of the task challenge. As the problem solver's acquisition of knowledge advances in response to the challenges, an ideal path in the flow region would progress from the origin towards the upper right. The transition from starting point (A) to destination point (B) indicates the increase of knowledge and challenge that naturally traverses the ZPD, but under control and with the expectation that the problem solver will return to the flow zone again. We can see how a problem solver can move from bored (when their domain-specific knowledge exceeds their challenges) into the flow zone (where everything is in balance), but can easily move into a space where he needs some help. Most importantly, if we move upwards and out of the ZPD by increasing the challenge too soon, we reach the point where a problem solver starts to realize that he is well beyond his comfort zone.

Figure 2. The collaborative learning model.

In COLLAGE, we seek to characterize each path connecting a knowledge/challenge space by the goal, balance and feedback needed to encourage flow:
• Game mechanics can provide achievable goals;
• Balance between a problem solver's domain-specific knowledge and skills and the perceived challenge of the task will be sought;
• Specific COLLAGE creativity-supported feedback services will provide clear and immediate feedback.
The next section describes how we are developing computational guidance for social creative processes.
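As a rough illustration of the regions in the collaborative learning model, the sketch below classifies a problem solver's state from the skill/challenge balance; the thresholds and the scalar representation are our own assumptions, not values from COLLAGE.

# Hypothetical sketch: classifying a problem solver's state from the
# balance of domain-specific knowledge (skill) and task challenge.
def classify_state(skill, challenge, flow_margin=0.5, zpd_margin=1.5):
    """Both inputs on an arbitrary common scale, e.g. 0-10."""
    gap = challenge - skill
    if abs(gap) <= flow_margin:
        return "flow"      # skill and challenge in balance
    if flow_margin < gap <= zpd_margin:
        return "zpd"       # achievable with help from others
    if gap > zpd_margin:
        return "anxiety"   # challenge raised too soon
    return "boredom"       # knowledge exceeds the challenge

for skill, challenge in [(5, 5.2), (5, 6.2), (5, 8.0), (5, 3.0)]:
    print(skill, challenge, classify_state(skill, challenge))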
Providing Guidance for Creative Processes
Our vision in COLLAGE is to utilize the emerging model, with its concepts of information search for idea discovery, individual and social flow, and zones of proximal development, to recommend and adapt the use of different computational services and affinity spaces during a social creative process. The ambition is deliberately high: to develop a computational environment to propose and adapt different services and spaces to maximize search, and achieve flow and learning. Indeed, according to Amabile, one of the single most important factors that induces creativity is a sense of making progress on a meaningful task (Amabile and Kramer, 2011); therefore the guidance will provide catalysts that induce progress, for example by setting achievable goals, providing resources, offering help and enabling users to learn from knowledge gained during previous creative activities. The guidance is being developed to direct users along paths that connect a knowledge/challenge starting point (A) with a destination point (B) in the collaborative learning model depicted in Figure 2. We see the role of the creative process guidance as directing problem solvers to use the different creativity techniques effectively, depending on the situation, to bring balance to the knowledge/challenge relationship. The creativity-supported feedback component incorporates all four processes from the GENEX framework. The first version of the model identifies at least the following characteristics of social creativity and learning:
1. Defining and searching conceptual spaces of possible ideas;
2. The setting of goals that render effective periods of individual and group flow achievable, without risking boredom and/or anxiety;
3. The maintenance of group flow in groups of distributed individuals who are often collaborating asynchronously;
4. Guiding individual learners into zones of proximal development to encourage and then support learning about creativity techniques and/or the problem domain as part of the flow process.
COLLAGE creativity services and affinity spaces need to support people in undertaking creativity and learning activities with these characteristics. Moreover, we argue that each of these characteristics indicates one or more affordances of creativity services and affinity spaces for social creativity and learning. Consider each of the characteristics in turn.

Defining and Searching Conceptual Spaces of Possible Ideas
Any creativity service and affinity space should afford:
• One or more members of the social group undertaking explicit information search and idea discovery in a conceptual space of possible ideas;
• These members explicitly implementing creativity services and affinity spaces that support different forms of transformational, exploratory and combinational creativity in a conceptual space.
An example of an established creativity service that affords exploratory information search and idea discovery is a creativity trigger. A creativity trigger is a generic desirable quality of a future solution that the social group is directed to discover new ideas to deliver - in software-based solutions, these qualities can include convenience, choice and trust.
For example, use of the creativity trigger convenience guides one or more members of the social group to undertake explicit information search and idea discovery in a space of ideas that can deliver the quality of convenience - and the search can be supported through the retrieval of information related to that quality.

Setting of Goals that Render Effective Periods of Individual and Group Flow Achievable
Any creativity service and affinity space should have assigned to it:
• A rating of the prototypical distance between the current set of ideas and the set goal that can be achieved through effective application of the creativity service or affinity space - the creative potential of the service or space;
• A rating of the prototypical distance between the content of the current set of ideas and the set goal content that can be achieved through effective application of the creativity service or affinity space - the creative potential of the service's or space's content;
• A difficulty rating indicating the potential level of difficulty that one person or a social group might encounter when learning and/or applying the service or space.
An example of a creativity service that demonstrates goal setting for individual and group flow is analogical reasoning. Analogical reasoning is the systematic transfer of a network of related information from a source domain to a target domain in order to generate new ideas in the target domain based on the transferred information (Gentner, 1983). Analogical reasoning has considerable potential to reconceptualise problem and solution spaces, hence the service's creative potential is high. Key to its success is the selection of source domain(s) from which to transfer knowledge for idea generation. Source domains semantically close to the target domain are easier for people to map to, but can lead to less new idea generation, and can risk boredom. In contrast, source domains semantically further from the target domain can lead to greater idea generation, but are more difficult for people to map to and risk anxiety. Moreover, empirical evidence has revealed that people find analogical reasoning difficult (Gick & Holyoak 1983), hence they are likely to encounter difficulties during its use compared with creativity services that are easier to use, such as creativity triggers.

The maintenance of group flow in groups of distributed individuals
Any creativity service and affinity space should afford:
• Collaborative creativity and learning by the members of the social group;
• The externalization of new ideas and knowledge that can be shared effectively with the members of the social group as part of a creative process;
• Explicit support for turn taking by members of the social group during the collaborative creative process.
An example of an affinity space that can afford the maintenance of group flow is design storyboarding. A storyboard is a graphic organizer in the form of illustrations or images displayed in sequence for the purpose of pre-visualizing a motion picture, animation, motion graphic, interactive media sequence or, for COLLAGE, a business or service design.
Developing a storyboard from a set of existing concepts and ideas can afford collaborative creativity and learning by members of a social group through focused work on individual storyboard frames - the new ideas and knowledge generated from this creative work are shared with other members of the social group through the emerging storyboard, which acts as common ground in the collaborative creative process. Moreover, the development of discrete storyboard frames by individual members of the social group can afford turn taking based on game mechanics.

Guiding Individual Learners into Zones of Proximal Development
Any creativity service and affinity space should afford:
• The acquisition and learning of new knowledge in order to achieve flow as part of the individual and collaborative creative processes;
• The adaptation of any creativity service and affinity space in real time to guide one or more members of the social group into the zone of proximal development to support learning during creative flow.
An example of a creativity service that guides learners into zones of proximal development to encourage learning is the constraint removal service reported earlier. During the create activity, one or more members of the social group are required to envision a future version of the domain in which a constraint no longer applies or has been significantly relaxed. For example, during the exploration of new, more environmentally friendly operational concepts for an airport management system, one constraint that was removed was the variability of the weather. To generate new ideas, each member of the social group was required to envision an alternative reality of the domain in which weather was predictable. This required learning by the social group.

Future Work
Clearly we have only reported preliminary research in this paper, and much work remains to be done to develop, implement and validate the concepts proposed. The next stages of the research are to complete a first description of the model and build a first computational model of creative search spaces to which the model will be applied. We have a set of available computational creativity services that can be applied to search the space, as a basis for prototypical development of first versions of the computational model. We look forward to reporting these advances in the near future.

Acknowledgements
The research reported in this paper is supported by the EU-funded COLLAGE integrated project 318536, 2012-15.

2013_28 !2013 Idea in a bottle - A new method for creativity in Open Innovation
Matthias R. Guertler, Christopher Muenzberg, Udo Lindemann
Institute of Product Development, Technische Universität München, Boltzmannstr. 15, 85748 Garching
guertler@pe.mw.tum.de

Abstract
This paper presents an approach to increase the creativity of ideas/solutions in an idea contest. Analogous to a letter in a bottle, tasks are distributed in a randomized way to potential problem solvers. The idea contest is a method from Open Innovation which opens a company's innovation process to its environment (e.g. customers, suppliers). By using idea contests, the creative potential of a large crowd of people can be used for developing innovative solutions for a specific task. Nevertheless, based on experience from industry projects, we found that creativity is often limited. This paper presents an approach for increasing the creative potential of participants.
The new integrated method combines idea contests with lead user methods and aspects of synectics and communication theory.

Introduction
Open Innovation integrates a company's environment into its innovation process, e.g. in terms of customers or suppliers, and enables new innovations (Chesbrough et al. 2006). A popular Open Innovation method is the usually web-based idea contest, which allows companies to publish a specific issue/task to a large crowd of people, who develop and post potential solutions for the issue. The idea behind it is to use the diversity of the crowd to generate creative and innovative solutions (Keinz et al. 2012). By giving participants/users the possibility to review other posts, they can evaluate them as well as advance them. However, in industry projects we found that submitted solutions are often relatively homogeneous, few in number and of a low degree of creativity. In order to improve participants' creativity and the quality of posts, we developed the approach "Idea in a Bottle", based on the creativity method synectics and Shannon's model of communication. The idea is to break up entrenched processes within an idea contest, where users/problem solvers choose the tasks they contribute to. This is done by allocating the four phases of synectics (see next chapter) to different persons or groups, and by instrumentalizing the primarily negative "noise source" of Shannon's model in a positive manner. We propose that, by randomly allocating issues from idea-seekers to other users, their creativity is stimulated. The confrontation with an unexpected, non-self-chosen task counteracts the behaviour we assume to be usual: users choosing issues they are already familiar with. To direct the randomized process into efficient channels, the Pyramiding method from the lead user concept is utilized. Thus, the first recipients of the issue do not solve it but act as agents and forward it to users they consider to be suitable and experienced in the specific field. These users submit suggested solutions to the idea-seeker, who evaluates their usefulness. The proposed approach is applicable for issues/tasks of low and medium complexity. This means the improvement or new development of everyday products, or the solution of medium-complexity problems. All issues should be processable without the need for highly specialized expertise or know-how. The paper starts with a rough overview of the state of the art of Open Innovation, different user integration concepts, synectics, and Shannon's communication model. Based on this, we present our I²aB approach. We close the paper with a discussion of the planned evaluation of our approach by integrating I²aB into a web-based idea contest platform.

State of the art
This chapter briefly explains the underlying concepts of the proposed approach "Idea in a Bottle". The basic elements are Open Innovation, synectics, Shannon's communication model, analysis of stimulus words and pyramiding.

Open Innovation and Crowdsourcing
Open Innovation opens a company's innovation process to its environment (Chesbrough et al. 2006). The interaction with the environment enables innovations inside and outside the company. A concept focusing on the innovative potential of a large group of people is Crowdsourcing (Sloane 2011). The crowd can help elaborate and solve specific issues and tasks, using the diversity of persons with their individual backgrounds, mindsets, abilities and knowledge (Keinz et al. 2012). A popular Crowdsourcing method is the usually web-based idea contest.
Companies or individuals can publish issues on a web platform. Users of the platform look at the issues and post ideas for solutions. Other users review these posts, advance them or get inspiration for new ideas. The goal is to obtain a large number of advanced ideas.

Lead User
According to von Hippel et al. (2006), lead users are characterized by (1) their capability for innovation, as they are ahead of the market, and (2) their motivation for contribution. Several methods were developed to identify these innovative users. One method, based on the snowball effect, is Pyramiding. It is based on the assumption that people who are interested in a topic know other people who are more expert than themselves. Thus, Pyramiding starts with an initial group of people who name other people they consider to be more expert. These persons again name persons considered to be more expert. After some iterations, potential lead users are identified (von Hippel et al. 2006).

Synectics
Synectics is a creativity technique based on brainstorming, developed by W.J.J. Gordon in 1960 (Daenzer and Huber 2002). By postulating analogies from different fields, e.g. literature, nature, or symbols, users of this method are supported in finding new solution spaces for a stated problem. Synectics is a group technique with a proposed maximum of 10 participants, who are instructed by a skilled moderator (Daenzer and Huber 2002). Synectics is structured into four phases which are passed through sequentially:
1. In the Analysis phase the group exposes the problem and states a problem definition. First solutions are also gathered and documented. Finally, the problem should be restated.
2. The second phase, Incubation, is characterized by taking one step back with the help of building analogies. For example, the group tries to build personal analogies by thinking about how the object of interest feels. The outcomes of this phase are abstract solutions to the problem.
3. In the third step the stated analogies are analyzed, and an attempt is made to transfer the solutions to the original problem. This can also be done with the help of force fit, i.e. forceful reshaping of the analogies. The results of the Illumination phase are new solution approaches.
4. In the Verification phase the proposed approaches are used to elaborate solution concepts.

Presentation of Communication by Shannon
The communication process within an idea contest or synectics, e.g. the problem description formulated by the idea-seeker and interpreted by the problem solver, is one of the success factors for developing appropriate solutions. In 1963, Shannon and Weaver proposed a schematic diagram of a communication system (Shannon 1998). The diagram consists of five essential parts: information source, transmitter, channel, receiver, and destination, depicted in Figure 1.

Figure 1: Schematic diagram of communication by Shannon and Weaver (Shannon 1998), showing information source, transmitter, channel with noise source, receiver and destination.

The information source produces messages or sequences of messages which should be communicated. These messages can be of various kinds, e.g. letters or functions (Shannon 1998). The transmitter produces a suitable signal for transportation. The channel is the medium which transmits the signal to the receiver. The receiver reconstructs the signal and transports it to the destination, i.e. the person for whom the message is intended.
An important factor in Shannon and Weaver's diagram is the noise source introduced in the channel. This source has an impact on the communication: it can change the original message through new interpretations, extension, reduction, or adaptation (Lindemann 2009).

Analysis of stimulus words
This is a creativity method for developing new ideas by confronting participants with words not related to the actual topic. Participants analyze these words spontaneously according to relevant criteria and build links to the original topic (Lindemann 2009).

A new method for creativity in idea contests: Idea in a bottle (I²aB)
In order to increase the creativity and quality of ideas developed during an idea contest, we suggest redesigning the present communication process on an idea contest platform. So far, in analogy to Figure 1, an idea-seeker describes his issue (information source) by a problem description/task (transmitter) and publishes it on the platform. Here, other users (receivers) can select this task, read it and derive their understanding of the task (destination). The subsequent posting of solution ideas proceeds in an analogous way. Our approach splits up the four steps of synectics and distributes each to a different group in order to increase efficiency and creativity. The analysis (1) is performed by the idea-seeker as "owner" of the problem. His analysis and statement of the issue affect the entire following I²aB process. The incubation (2) is carried out by users of the platform who read and interpret the problem statement. Based on their understanding, they link the issue to other users they consider suitable for it. The illumination (3) is conducted by the recommended users. They develop solution ideas for the given task based on their own interpretation of the issue and their personal background. The final verification (4) of the created solution ideas is performed by the idea-seeker himself again. Because the incubation and illumination stages are not executed by the idea-seeker but by other users, we term them "external". The I²aB approach instrumentalizes the "noise source" in terms of a randomized distribution of tasks to users. Instead of selecting familiar issues, users receive new tasks. Receiving unfamiliar topics shall support out-of-the-box thinking by providing an external perspective on a topic. To prevent demotivating users by giving them too many unfamiliar topics, the distribution and solving steps are separated by an intermediary Pyramiding step. The primary receivers of a task forward it to other users they consider able to contribute value to solving the problem. The process of the idea-seeker putting his issue onto the platform without knowing who will receive it is comparable to a letter in a bottle thrown into the sea; hence, the approach was named "Idea in a Bottle" in analogy. Figure 2 illustrates the concept of Idea in a Bottle (I²aB). It consists of four stages, analogous to the synectics approach, as mentioned previously. In Stage 1, "analysis", idea-seekers phrase their problem/issue in a written task statement. It can also be enhanced by a picture or sketch. However, the intention is to obtain a compact description of the issue which focuses on relevant aspects. This increases the comprehensibility and thereby the users' motivation to deal with the issue. Thus, the number of words will be limited to abstract length, ca. 250 words, in the beginning. Adding characterizing keywords supports the later forwarding process by the so-called agents.
All issues are stored on the web-based idea contest platform. In Stage 2, "external incubation", the Idea in a Bottle (I²aB) system distributes the issue in a randomized way to three registered users on the platform. These users act as agents: they examine and, due to its shortness, interpret the issue. They are allowed to reply with a potential solution idea. However, their primary function is to forward the issue to another user they consider able to contribute added value to solving the issue, e.g. due to their experience/behavior in other idea contests on the platform. This forwarding process is based on the Pyramiding method of the lead user concept. The optimal number of agents and problem solvers needs to be evaluated in practical tests. The randomized distribution and interpretation of the issue by the agents correspond to the noise source of Shannon's model. In summary, the randomization stimulates the creativity of problem solvers in terms of the analysis of stimulus words (Lindemann 2009).

Figure 2: Model of Idea in a Bottle, showing the four stages a) analysis, b) external incubation, c) external illumination and d) verification, and the task distribution between idea-seeker, system, agents and problem solvers (randomized allocation, pyramiding, suggestion of solutions/inspiration for the idea-seeker).

For users receiving forwarded issues, we assume an increased motivation due to the honor of being recommended by other users. Stage 3 is called "external illumination" due to the interpretation of the issue by other users. As described before, the potential problem solvers receive a random issue with the request to solve it. Since the problem solver does not know the real problem, only the problem statement, he builds new analogies of the given problem by interpreting the issue. These new analogies, combined with the randomized distribution, should lead to creative solutions which were not considered by the idea-seeker. As in Stage 1, the solution ideas can consist of text, photos or sketches, and their size is limited, too. The problem solver is expected to contribute solution ideas. Alternatively, it is also possible to submit advice/hints which might indirectly draw the idea-seeker's attention toward alternative potential sources and directions for a solution. Both the solution ideas and the hints are submitted electronically via the I²aB system. In Stage 4, "verification", the idea-seeker receives potential solution ideas and evaluates them regarding their applicability to his problem. In comparison to "classical" idea contests, which involve a high effort in evaluating the gathered ideas (Kain et al. 2012), we assume the verification effort for ideas created by I²aB to be lower, since the solutions were elaborated by qualified system users. If no appropriate idea is found, the idea-seeker can submit his issue for a second loop.
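Read as a pipeline, one pass through the four stages could be simulated as below; all names, counts and selection rules here are our own illustrative assumptions, not the actual platform implementation.

import random

def run_iab(issue, users, contacts, n_agents=3):
    """One hypothetical pass through I²aB. The issue string stands in
    for the Stage 1 task statement written by the idea-seeker."""
    # Stage 2, external incubation: random allocation as the 'noise source'.
    agents = random.sample(users, n_agents)
    # Pyramiding: each agent forwards to a user they consider suitable.
    solvers = {contacts.get(agent, random.choice(users)) for agent in agents}
    # Stage 3, external illumination: each solver submits an idea
    # (placeholder strings stand in for real submissions).
    ideas = [f"{solver}'s idea for: {issue}" for solver in solvers]
    # Stage 4, verification, is left to the idea-seeker.
    return ideas

users = ["u1", "u2", "u3", "u4", "u5"]
contacts = {"u1": "u4", "u2": "u5", "u3": "u4"}
print(run_iab("Make a bicycle lock easier to use", users, contacts))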
Here, the confrontation with unfamiliar topics acts as an analysis of stimulus words and stimulates the problem solver's creativity. Additionally, being regarded as a kind of expert by other users should raise the motivation to contribute a solution. I²aB enhances, transfers and implements classical creativity methods for new media and distributed product development activities. Synectics, however, was developed in the 1960s as a classical creativity method for use in teams; we try to adapt this method for today's multi-media society. To evaluate and prove these advantages, we plan a web-based implementation of I²aB. The basis will be an idea contest platform at the institute which is being implemented at the moment and is specifically designed for testing new methods in the field of Open Innovation. This platform allows Open Innovation contests with students as well as industry as evaluation partners. Here, we have the possibility to assess I²aB in direct comparison to a "classical" idea contest. The user pool of the platform can be used for this, as a sufficiently large community is seen as a crucial success factor. Among others, the following questions need to be addressed: 1. Do the satisfaction and motivation of problem solvers increase? 2. Are there differences regarding the number of replies to an issue, the quality and usefulness of ideas, the creativity of ideas, and the evaluation effort of the idea-seeker? 3. Are the chosen limitations on the issue description useful? 4. Are there any specific patterns within the forwarding process, e.g. frequently involved users? In sum, the expected key contributions of I²aB are (1) a higher creativity, (2) a higher motivation of problem solvers and (3) a higher resulting quality of solution ideas.

2013_29 !2013 Multilevel Computational Creativity Ricardo Sosa Singapore University of Technology and Design Singapore 138682 ricardo_sosa@sutd.edu.sg John S. Gero Krasnow Institute for Advanced Study George Mason University Fairfax, VA, 22030 john@johngero.com

Abstract
Creativity can hardly be understood in isolation from a context where values such as novelty and usefulness are ascribed. This paper presents a multi-level perspective for the study of creativity and formulates a framework for computational creativity that consists of 1) Culture; 2) Society; 3) Groups; 4) Products; 5) Personality; 6) Cognition; 7) Neural processes; and 8) CC processes. This model enables the definition of functional relationships among these levels. As an initial step to illustrate its usefulness, an analysis is made of the ICCC'12 proceedings in view of this model.

Introduction
The assessment of creativity is increasingly being recognized as an important direction in the research program of computational creativity (Jordanous 2011; Indurkhya 2012; Maher 2010, 2012). One of the main arguments is that creativity is in fact defined via the evaluation or ascription of values such as novelty and utility by third parties beyond the creator(s). In other words, a creative product, person or process can hardly be understood in isolation from a context where such values are ascribed. Rather than a binary property, we consider that the composite value of creativeness is easier to define as a relative value, ascribed by weak-to-strong levels of agreement or consensus, to a range of products, persons or processes ranging from non-creative or routine to transformative or disruptive creativity (Gero 1990; Kaufman and Beghetto 2009).
Because creativity is defined through the ascription of values (novelty, utility, expectation) in a system where creators and evaluators interact, this paper regards creativity as an eminently psycho-socio-cultural phenomenon; its aim is to frame computational creativity from that perspective. Computational creativity (CC) has inherited an emphasis on individual processes, performance and products from the mainstream Artificial Intelligence worldview. In that paradigm, the agent architecture consists of autonomous individuals interacting with an external environment (Russell and Norvig 2005). CC has assumed that understanding individual behavior is a sufficient way of modeling creativity. A social-psychology approach to creativity began to illustrate the interaction between individual and external factors (Hennessey 2003). More recently, cultural-psychology approaches to creativity seek to extend that work by shifting the architecture from a view of individual behavior "conditioned" by social factors towards a more integrated view in which interdependent relationships co-constitute a complex creative system (Glăveanu 2010). This paper presents a multi-level perspective for the study of creativity and formulates a framework for computational creativity (CC). The aims of this work include: to enable new ways of thinking about CC from different disciplines, to support communication between research traditions, and to start mapping the units of analysis, variables and interactions between levels. The paper is organized as follows: Section 2 introduces key concepts and draws from the theoretical bases of this approach; Section 3 presents our framework and explains structural and functional aspects of our model. Section 4 evaluates this model using the 34 papers presented at the previous International Conference on Computational Creativity (ICCC'12). Section 5 closes the paper, presenting modeling strategies and guidelines as well as discussing potential approaches to CC.

Background
Integrating scientific disciplines goes back to Comte's hierarchy of sciences according to the scale and complexity of theoretical tools (Mayer and Lang 2011). The role of cultural mediation in the development of cognitive functions has its origins in the tradition of cultural psychology since Vygotsky (Moran and John-Steiner 2003). Ecological models of creative problem solving integrate cognitive, personality, and situational factors (Isaksen et al 1993). Views of creativity as a social construct have been formulated elsewhere (Sawyer 2010; Westmeyer 2009). Multilevel models that capture the interactions between psychological, social and cultural factors enable two complementary research directions. On the one hand, holistic explanations are possible by going up in the hierarchy, drawing upon higher levels that moderate lower effects. On the other hand, reductionistic explanations go down in the hierarchy to inspect lower-level factors that account for high-level phenomena (Koestler and Smythies 1969). For example, accounting for cultural constructs can be essential to understand individual attitudes to altruism (Sheldon et al). Likewise, the characterization of individual cognitive styles helps explain and manage group conflict (Kim et al 2012). Despite the disciplinary divides between psychology, anthropology and sociology, a phenomenon such as creativity may require a cross-disciplinary perspective that includes the interplay between levels of causality (Sternberg and Grigorenko 2001).
Computational creativity has the potential to embark on cross-disciplinary modeling. Contemporary personality research is a relevant example, as it provides empirical support for the irreducibility postulate: i.e., "no scientific discipline is likely to subsume the others, all are needed" (Sheldon 2004). In the field of personality and well-being, multilevel approaches show the complex interactions and effects among factors located within and between levels of organization - from cultural to social, personality, cognition and neural processes (West et al 2010). Such integrated and interdisciplinary models account for moderator relationships between levels of organization. The Multilevel Personality in Context (MPIC) model (Sheldon et al 2011) and the Cognitive-Affective Personality System (CAPS) (Mischel and Shoda 1995) are two examples of how multiple levels of analysis can be integrated for a more reliable and complete understanding of complex human behavior - such as creativity. The MPIC model specifies the following levels: Culture, Social relations, and four levels of Personality: Self-Narratives, Goals/Motives, Traits/Dispositions, and Needs/Universals (Sheldon et al 2011). Reviewers of the MPIC model further suggest the addition of Situations to account for contextual factors beyond the bio-psycho-social (Mayer and Lang 2011). In computational creativity, Indurkhya (2012) identifies the interplay between system levels by framing the following dilemma: when non-conscious or unintentional processes generate artifacts deemed creative by an audience (i.e., works of art by a schizophrenic, but also the ubiquitous cases of unexpectedly successful products), "where is the creativity?". A similar point can be made when considering the attribution of creativity to designs by Nature (McGrew 2012). Understanding the interplay between generative and evaluative processes of creativity has the potential to transcend this apparent paradox, where at a given level it may seem like "there is nothing distinctive […] that we can label as creative" (Indurkhya 2012). Maher (2012) frames the need for evaluation criteria that are independent of the generative process. Jordanous (2011) suggests a standardized approach to evaluation where key components are identified, clear metrics are defined and tests are implemented. The work presented in this paper is aligned with these aims and puts forward a structural and functional framework for an integrated cross-disciplinary study of computational creativity.

Multi-level Computational Creativity
The Multi-level Computational Creativity (MLCC) model builds upon the Ideas-Agent-Society (IAS) framework, which maps three dimensions of creative systems: epistemological, individual and social dynamics (Sosa et al 2009). That structural framework synthesizes constructs from five influential theories related to creativity and innovation, i.e.: exemplars, proponents, and communities (Kuhn); innovations, entrepreneurs and markets (Schumpeter); noosphere, strong spirit and culture (Morin); domain, individual and field (Csikszentmihalyi); and logic, genius and zeitgeist (Simonton). MLCC specifies eight separate levels of analysis: 1) Culture; 2) Society; 3) Groups; 4) Products; 5) Personality; 6) Cognition; 7) Neural processes; 8) CC processes. In addition, MLCC goes beyond the mapping of systemic dimensions and enables the definition of functional relationships among these levels.
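To make that structure concrete, the eight levels and the independent/interdependent distinction introduced next can be captured in a short sketch. The naming (MLCCLevel, Relation) is our own, hypothetical rendering, not part of the published model.

from enum import IntEnum
from dataclasses import dataclass

class MLCCLevel(IntEnum):
    # Ordered from macro (Culture) to micro (Neural processes); level 8
    # (CC processes) sits outside that macro-to-micro scale in this sketch.
    CULTURE = 1
    SOCIETY = 2
    GROUPS = 3
    PRODUCTS = 4
    PERSONALITY = 5
    COGNITION = 6
    NEURAL = 7
    CC_PROCESSES = 8

@dataclass(frozen=True)
class Relation:
    """A functional relationship: 'independent' processes stay within one
    level; 'interdependent' ones connect two levels."""
    source: MLCCLevel
    target: MLCCLevel

    @property
    def independent(self) -> bool:
        return self.source == self.target

    @property
    def direction(self) -> str:
        if self.independent:
            return "within-level"
        return "bottom-up" if self.source > self.target else "top-down"

# e.g. a cognitive function assumed to emerge from neural processes:
r = Relation(MLCCLevel.NEURAL, MLCCLevel.COGNITION)
assert not r.independent and r.direction == "bottom-up"

Encoding the levels as an ordered enum lets the holistic (top-down) versus reductionistic (bottom-up) reading of a cross-level relation be derived rather than stored.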
These relationships can be defined as independent or interdependent, i.e., the former represent processes that occur only within a single level in isolation, whilst the latter represent processes that are connected between levels. For example, a range of cognitive functions can be studied in a CC system: some can be assumed to emerge from explicit lower-level neural processes, others are defined only within the cognitive level, and a third type leads to higher-level personality or group processes.

MLCC level 1, Culture, refers to processes that either aim to model or draw from knowledge bases and corpora, cultural evolution, cultural dimensions, organizational culture, language and semiotics, economic impacts, taste and traditions, public policy, mass media, intellectual property, creative environments, planned obsolescence, aggregate search trends, market trends and anomalies.

Table 1. The eight levels of our multi-level model of computational creativity (MLCC) and exemplary creativity models
1: Culture - Cultural dimensions in creativity (Lubart 2010); peer-reviewed repositories (Duflou and Verhaegen 2011); IP law (Lessig 2008); built environment (McCoy and Evans 2002).
2: Society - Gatekeeping (Sosa and Gero 2005a); creative class (Florida); migration (Hansen and Niedomysl 2009); social capital (Fischer et al 2004).
3: Groups - Group conformity (Kaplan et al 2009); team diversity (Bassett-Jones 2005); group brainstorming (Sosa and Gero 2012).
4: Products - Rogers' (1995) five factors (relative advantage, compatibility, complexity, trialability, observability).
5: Personality - Extraversion and dominance (Anderson and Kilduff 2009); openness (Dollinger 2004).
6: Cognition - Creative cognition (Finke et al 1996); bilingualism (Adesope et al 2010).
7: Neural processes - Neuroanatomy (Jung et al 2010); NN models (Iyer et al 2009).
8: CC processes - Machine creativity (Cohen 1999; Maher et al 2012); computational models of innovation (Young 2009; Sosa and Gero 2005b); tools and support systems (Liu et al 2004).

MLCC level 2, Society, captures processes that account for the influence of, or seek to produce effects on, demographics, networks, migration, social influence and authority, roles and occupations, class structure, social capital, crowdsourcing, market segmentation, reputation and popularity, ethnic diversity, gender and aging, diffusion of innovations, crowd behavior. MLCC level 3, Groups, refers to team dynamics, communities of practice, family and peer support, co-creation, artist collectives, art commissions, brainstorming, change management and leadership, deliberation, collaboration/competition strategies, workplace, groupthink, game theory, adopter categories. MLCC level 4, Products, captures intrinsic properties of creative artifacts largely determined by domain characteristics, techniques and processes, but also by technological or functional features, life-cycle, etc. MLCC level 5, Personality, covers personality types, motivation, curiosity, extroversion, mental health, addictions, emotions, risk aversion, well-being, lifestyle, charisma, habit, expertise. MLCC level 6, Cognition, includes all processes related to creative cognition (intuition, insight, incubation, problem framing and solving, memory, concept formation, representation, fixation, association, analogy, divergent thinking, abductive reasoning, visual and spatial reasoning), perception, cognitive and attribution biases, heuristics.
MLCC level 7, Neural processes, covers processes related to creativity including neuroanatomy (brain asymmetry), neuromodulation (risk, arousal, novelty), brain stimulation, and neural network models of creative reasoning. The final MLCC level refers to CC methods and techniques aimed at solving problems or generating creative solutions with no direct claims to model, or to be inspired by, the other levels. The MLCC model accounts for multiple levels at which creativity can be studied; none of these levels is strictly new - Table 1 in fact includes references to multiple existing research programs that address creativity from each of the disciplinary traditions that specialize in such scales and units of analysis. The MLCC model brings them together and enables CC researchers to explore top-down and bottom-up connections between these levels. The directionality of cross-level interactions in the MLCC model opens up a double opportunity in CC. On the one hand, it allows the study of generative processes between levels, i.e., how individuals create in isolation or in teams, how societal and cultural norms provide the bases for change cycles, what neural and cognitive processes help explain creative behavior, etc. On the other hand, it supports the less-explored study of evaluative processes between levels, i.e., how individuals, teams and society attribute creativeness to an artifact or a process, how cultures or subcultures accommodate new additions or transformations, what neural or cognitive processes help explain the assessment of novel stimuli, etc. Figure 1 provides a graphical depiction of an MLCC-inspired system showing a conventional organization of levels, i.e.: culture provides a general epistemological background where creators (individuals and teams) generate new artifacts targeted to specific audiences, a process mediated by distributors or promoters of artifacts, which are distinguished from the creators (for example, producers, market and art agents as stakeholders separate from designers and artists). (Figure 1. A system architecture to study individual creators and social evaluators interacting in a shared culture.) However, the MLCC model supports a wide range of alternative modeling approaches, for example to study the 'maker' culture (Anderson 2012) or to focus on the cognitive processes of target audiences - for example, how people are primed to rate and comment on the novelty and originality of artifacts in online forums (Sosa and Dong 2013). This flexibility of the MLCC model accommodates various research traditions, including minimal models in which interactions between macro cultural and micro neural processes are explored - for example in cellular automata architectures (Sosa and Gero 2004). CC presents clear advantages as a tool to advance theory building and for the systematic examination of assumptions and extraction of principles in multilevel systems (Fontaine 2006). Nonetheless, associated risks include: loss of clarity in the definition of interactions and causal relationships between levels; misalignment between disciplinary divides (research methods, units of analysis, linguistic traditions); and limited cross-level understanding between specialists. Is the MLCC a creative artifact? It is not an entirely novel model - clear precedents were discussed going back as far as the seventeenth century. However, it does carry some novelty to the CC community.
Its usefulness will be defined by its suitability as a modeling framework, as determined firstly by the reviewers of ICCC'13 and ultimately by the entire ICCC community. As an initial step to evaluate its relevance, the following section presents an analysis of the ICCC'12 proceedings using the MLCC model. The aim is to demonstrate its role in the analysis and discovery of trends in current CC approaches, and to identify gaps and connections between recent models of creativity.

Mapping ICCC'12 contributions
The 34 full papers published in the ICCC'12 proceedings were selected for this exercise (Maher et al 2012). They were classified into one or more of the MLCC levels according to their research aims and claims as stated by the author(s), as well as the target research agendas mentioned as part of future work. In addition to the eight MLCC levels, a ninth category was added during the review of these papers, which we named "Tools"; it refers to work aimed at developing computational tools to support or enable human creativity (Gatti et al 2012; Hoover et al 2012). Table 2 presents the 34 papers (rows) and their relation to the MLCC levels (columns). Entries related to generative processes in existing CC systems are marked by ●, while entries related to evaluative processes in existing CC systems are marked by ○. Examples of generative processes include a memetic algorithm "capable of open-ended and spontaneous creation of analogous cases from the ground up" (Baydin et al 2012); an evolutionary art system that generates artwork that "has been accepted and exhibited at six major galleries and museums" (Gabora and DiPaola 2012); and a system "able to generate pleasing melodies that fit well with the text of the lyrics, often doing so at a level similar to that of human ability" (Monteith et al 2012). A paper may have multiple entries in different MLCC levels; for instance, Morris et al (2012) present a "recipe engine" that draws from a corpus of recipes published online (MLCC level 1), applies CC processes to generate new recipes (MLCC level 8), and these are subsequently analyzed by their typicality to a "recipe genre" (MLCC level 4). Examples of evaluative processes include plans to include "feedback from journalists, critics, peers and audiences" (Burnett et al 2012); models of the cultural tastes and preferences of audiences (Indurkhya 2012); and plans to study "the cognitive processes of the viewers as they look at […] pictures" (Ogawa et al 2012). A distinction is made when an entry refers to a future research approach that the authors identify as a valuable way forward, rather than an existing CC system. In such cases a plus sign qualifies the entry, respectively ●+ and ○+. Table 2 refers to the first author only due to space limitations. Some papers are rather comprehensive, such as Indurkhya (2012) and Maher (2012), which span five MLCC levels each, but the overall average is 2.18 levels per paper, indicating a reasonable distinction among types of CC models. Although these results require systematic validation, they suggest a focus on generative processes in ICCC'12 (60 entries, including 43 existing and 17 target processes). Evaluation processes constitute a minority (14 total entries, half of them referring to target processes). These results are consistent with the preceding finding that "only a third of systems presented as creative were actually evaluated on how creative they are" (Jordanous 2011).
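The proportions reported above are simple aggregates over the classification matrix. A sketch of the bookkeeping might look as follows; the per-paper entries are invented toy data, not the real ICCC'12 classification.

# Toy reconstruction of the Table 2 bookkeeping.
from collections import Counter

papers = {
    "Baydin":    {1: "gen", 6: "gen", 8: "gen"},
    "Burnett":   {1: "eval+", 3: "eval", 8: "gen"},
    "Indurkhya": {1: "eval+", 4: "eval+", 5: "gen+", 6: "eval+", 8: "gen+"},
}

entry_counts = Counter()
for levels in papers.values():
    for mark in levels.values():
        kind = "generative" if mark.startswith("gen") else "evaluative"
        status = "target" if mark.endswith("+") else "existing"
        entry_counts[(kind, status)] += 1          # the 60 vs 14 style tallies

avg_levels = sum(len(v) for v in papers.values()) / len(papers)
print(entry_counts)
print(f"average levels per paper: {avg_levels:.2f}")   # cf. the reported 2.18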
MLCC level 8 is the most prevalent: 40% of all papers discuss existing CC processes, and an additional 11% discuss target CC processes. Level 8 refers to methods and techniques aimed at solving problems or generating creative solutions with no direct claims to model, or to be inspired by, the other MLCC levels. Examples include association-based computational creative systems (Grace et al 2012); small-scale "creative text generators" (Montfort and Fedorova 2012); and a music generator "inspired by nonmusical audio signals" (Smith et al 2012). MLCC level 4 is present in 30% of the papers; these present - or discuss approaches to generate - concrete artifacts identified as creative. They include Visual Narrator, which constructs short visual narratives (Pérez y Pérez et al 2012); machine-composed music (Eigenfeldt et al 2012); and PIERRE, which produces new crockpot recipes (Morris et al 2012). (Table 2. Classification of the ICCC'12 papers in MLCC levels: a matrix of the 34 first authors against the eight MLCC levels plus the Tools category, with ● marking generative entries, ○ marking evaluative entries, and + marking target rather than existing processes.) More than 30% of all papers address MLCC level 6, cognition. Most of these refer to the cognitive processes involved in the generation of creative artifacts, but a few do suggest the study of cognitive processes related to the evaluation of creativity (Ogawa et al 2012; Linson et al 2012; Indurkhya 2012). MLCC level 1 is captured in 35% of all papers. In most, culture is used as a source in the creation of creative artifacts (as corpora or as evolutionary models at the cultural level). The remaining entries deal with culture as part of the evaluation of creativity. These include the application of "literary criticism and communication theory […] to develop evaluation methods" (Zhu 2012) and "conceptual mash-ups" evaluated against "semantic structures seeking to replicate the semantic categories" (Veale 2012). Notably, MLCC level 7 - neural models of creativity - is not represented in ICCC'12, although progress is being made elsewhere (Iyer et al 2009). Evaluation processes are scarce and gravitate mainly around MLCC levels 1 and 3 (Culture and Groups). 11% report assessment by small groups (audiences, experts) and the same number use culture as a metric for validating the results of a CC system (by comparison against, or recreation of, concrete cultural achievements). Only a couple of papers present potential ways of using societal factors or cognitive studies to understand how an artifact is ascribed creative value. From an evaluation viewpoint, the ICCC'12 papers do not address the following MLCC levels: products (level 4), personality (level 5), neural processes (level 7) and CC processes (level 8). In this way, the MLCC model helps suggest future research approaches, including: • Models that incorporate explicit CC processes of evaluation of creativity, for example "automated critics" or "automated audiences" capable of replicating the assessment patterns of human judges (at different scales and levels of domain expertise), as well as ultimately predicting the creativeness of computer-generated artifacts (Maher and Fischer 2012).
Sample research question: "How may a computational system distinguish a masterpiece from mediocre artworks?" • Models of the neuro-mechanisms behind the creation as well as the evaluation of creativity. Systems that capture the connections between neural and cognitive processes. Sample research question: "How do basic functions such as short-term memory or cognitive load moderate the evaluation of creative artifacts?" • Models of the role of personality and motivation in the creation as well as the evaluation of creativity, for example systems that create or evaluate artifacts based on emotional predispositions, gender distinctions, and other personality dimensions. Models where creative behavior is moderated by environmental cues. Sample research question: "How do extraversion traits such as assertiveness moderate the assessment of creativity?" • Models of intrinsic artifact properties identified in the evaluation of creativity according to intra- and cross-domain characteristics. Sample research question: "What common assessment criteria do people apply when ascribing creativity in music, literature and architectural works?" Beyond these "missing" levels (or ICCC gaps), this analysis leads to interesting new possibilities and distinctions in CC research: • Culture can be approached in several ways in both generative and evaluative models: as the source of knowledge and generative techniques; as the standards against which new artifacts are evaluated by the creator and by the evaluators; as the status quo that prevents or constrains acceptance of new artifacts; as factors exogenous to the domain on which creators can draw to introduce novelty into their creative process; as rules and regulations that incentivize/inhibit creative processes; as market or cultural outlets and vehicles of promotion of creative value; etc. • Societal and group levels can equally be considered in several ways: as large collectives or small groups (teams) collaborating in creative endeavors; as opinion leaders that influence both creators and evaluators; as cliques that provide support but may also polarize types of creators; as aggregate structures of behavior that lead to segmentation, migration, institutionalization; as temporal and spatial trends; etc. As noted before, cognitive modeling may apply both to the generation and the evaluation of creativity. Likewise, although current computational tools are conceived for the creation of creative artifacts, computational tools could also support the individual and collective evaluation of artificial and human-produced artifacts - for example through the automated extraction of evaluation functions from customer needs and requirements, which can then be used to guide either a computational system or human designers.

Discussion
How do works such as the Mona Lisa by Leonardo become icons of creativity? Elements to consider range from its intrinsic aesthetic and artistic qualities all the way to its distinctive history, including its theft from the Louvre in 1911 and the ensuing two years of international media notoriety (Scotti 2010). This illustrative case exemplifies the "entangled art-market complex" (Joy and Sherry 2003).
Two CC scenarios are compared here where MLCC modeling is demonstrated: 1) The "Next Mona Lisa" CC model: a computational generative system is pursued that captures MLCC levels 6, 7 and/or 8, implementing symbolic or neural techniques (inspired or not by human capabilities), and aims to create a work of art comparable to the Mona Lisa, i.e., one that receives the kind of appreciation and recognition that gains the status of a global cultural icon. The problem is that not only does this approach seem rather implausible given the current state of CC, it would also require a vast number of exogenous factors outside the reach of the system's authors - and would probably require very long time periods, considering that even La Gioconda's path to prominence took more than four centuries (Scotti 2010). 2) The "Mona Lisa System" CC model: a multilevel computational system is based on the MLCC levels of choice (two or more from 1 to 8), and aims to capture the creation of a large number of artifacts, some of which (most) fall into complete oblivion, some of which (very few) make it to the equivalent of mediocre galleries, local museums and the living rooms of elite audiences, and some of which (an absolute minority) are preserved, disseminated, and capture broad attention and consensus. Some works in this last category may gradually become part of the cultural heritage, may be used as exemplars in specialized domain training and in general education, may fetch high prices in auctions or be considered invaluable in monetary terms, and may ultimately play an influential role in shaping public taste as well as future artifacts within and beyond the domain of origin. The latter approach opens interesting intellectual paths: What types of processes are capable of generating such a diversity of artifacts? What commissioning, distribution and exchange mechanisms are sufficient to account for the observed skewed distributions of evaluation? What connections are possible, in principle, between intrinsic characteristics of artifacts and contextual conditions? What cross-level dynamics apply to creative systems from different domains and times? Such an MLCC model can include a large number of elements, possibly derived from published studies - for example, of art-market dynamics in this case (Debenedetti 2006; Joy and Sherry 2003). The output of such models may not be (only or necessarily) the creative artifact itself, but a deeper understanding of the principles that underlie creative generation and evaluation. This may include two or more MLCC levels and, over time, historical trajectories that are likely to be context- and time-dependent. Hence the high relevance of CC approaches for the study of systems based on stochastic processes, which can be re-run over sets of initial conditions in order to inspect causal relationships and long-term effects. Lastly, the following guidelines are provided for building MLCC models, somewhat extending the evaluation guidelines proposed by Jordanous (2011). 1) Identify levels to be modeled: a) Define primary and complementary levels: realistically, empirical validation or data may be relevant only for one or two levels, whilst computational explorations can target other levels of interest. b) Identify level variables (experimental and dependent) that represent target factors and observable behaviors or patterns of interest. c) Define inputs and outputs at target levels, establishing the bootstrapping strategies of the model.
2) Define relationships of interest between levels: a) Establish explicit connections above/below the primary levels in the model. b) Define irreducible factors and causal links, and whether the model is being used for holistic or reductionistic purposes. c) Identify factors internal or exogenous to the system. 3) Depending on modeling aims, define outputs: a) Define the type and range of outputs, identifying extreme points from non-creative to creative artifacts. b) Capture and analyze aggregate data for model tuning and refinement. 4) Evaluation of an MLCC system: a) Validity may be achievable in some models where relevant empirical data exists at the primary level(s) of interest, but this may be inaccessible and even undesirable for exploratory models.

Acknowledgements
This work was supported in part by the US National Science Foundation under grant number SBE-0915482. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

2013_3 !2013 Once More, With Feeling! Using Creative Affective Metaphors to Express Information Needs Tony Veale Web Science & Technology Division, KAIST / School of Computer Science and Informatics, UCD Korea Advanced Institute of Science & Technology, South Korea / University College Dublin, Ireland. Tony.Veale@gmail.com

Abstract
Creative metaphors abound in language because they facilitate communication that is memorable, effective and elastic. Such metaphors allow a speaker to be maximally suggestive while being minimally committed to any single interpretation, so they can both supply and elicit information in a conversation. Yet, though metaphors are often used to articulate affective viewpoints and information needs in everyday language, they are rarely used in information retrieval (IR) queries. IR fails to distinguish between creative and uncreative uses of words, since it typically treats words as literal mentions rather than suggestive allusions. We show here how a computational model of affective comprehension and generation allows IR users to express their information needs with creative metaphors that concisely allude to a dense body of assertions. The key to this approach is a lexicon of stereotypical concepts and their affective properties. We show how such a lexicon is harvested from the open web and from local web n-grams.

Creative Truths
Picasso famously claimed that "art is a lie that tells the truth." Fittingly, this artful contradiction suggests a compelling reason why speakers are so wont to use artfully suggestive forms of creative language - such as metaphor and irony - when less ambiguous and more direct forms are available. While literal language commits a speaker to a tightly fixed meaning, and offers little scope to the listener to contribute to the joint construction of meaning, creative language suggests a looser but potentially richer meaning that is amenable to collaborative elaboration by each participant in a conversation. A metaphor X is Y establishes a conceptual pact between speaker and listener (Brennan & Clark, 1996), one that says 'let us agree to speak of X using the language and norms of Y' (Hanks, 2006). Suppose a speaker asserts that "X is a snake". Here, the stereotype "snake" conveys the speaker's negative stance toward X, and suggests a range of talking points for X, such as that X is charming and clever but also dangerous, and is not to be trusted (Veale & Hao, 2008).
A listener may now respond by elaborating the metaphor, even when disagreeing with the basic conceit, as in "I agree that X can be charming, but I see no reason to distrust him". Successive elaboration thus allows a speaker and listener to arrive at a mutually acceptable construal of a metaphorical "snake" in the context of X. Metaphors achieve a balance of suggestiveness and concision through the use of dense descriptors, familiar terms like "snake" that evoke a rich variety of stereotypical properties and behaviors (Fishelov, 1992). Though every concept has the potential to be used creatively, casual metaphors tend to draw their dense descriptors from a large pool of familiar stereotypes shared by all speakers of a language (Taylor, 1954). A richer, more conceptual model of the lexicon is needed to allow any creative uses of stereotypes to be inferred as needed in context. We will show here how a large lexicon of stereotypes is mined from the web, and how stereotypical representations can be used selectively and creatively, to highlight relevant aspects of a given target concept in a specific metaphor. Because so many familiar stereotypes have polarizing qualities - think of the endearing and not-so-endearing qualities of babies, for instance - metaphors are ideal vehicles for conveying an affective stance toward a topic. Even stereotypes that are not used figuratively, as in the claim "Steve Jobs was a great leader", are likely to elicit metaphors in response, such as "yes, a true pioneer" or "what an artist!", or even "but he could be such a tyrant!". Proper names can also be used as evocative stereotypes, as when Steve Jobs is compared to the fictional inventor Tony Stark, or Apple is compared to Scientology, or Google to Microsoft. We use stereotypes effortlessly, and their exploitations are common currency in everyday language. Information retrieval, however, is a language-driven application where the currency of metaphor has little or no exchange value, not least because IR fails to discriminate literal from non-literal language (Veale 2004, 2011, 2012). Speakers use metaphor to provide and elicit information in casual conversation, but IR reduces any metaphoric query to literal keywords and key-phrases, which are matched near-identically in texts (Salton, 1968; Van Rijsbergen, 1979). Yet everyday language shows that metaphor is an ideal form for expressing our information needs. A query like "Steve Jobs as a good leader" can be viewed by an IR system as a request to consider all the ways in which leaders are stereotypically good, and then to consider all the metaphors that are typically used to convey these viewpoints. The IR staple of query expansion (Vernimb, 1977; Voorhees, 1994, 1998; Navigli & Velardi, 2003; Xu & Croft, 1996) can thus be made both affect-driven and metaphor-aware. In this paper we show how an affective stereotype-based lexicon can both comprehend and generate affective metaphors that capture or shape a user's feelings, and show how this capability can lead to more creative forms of IR.

Related Work and Ideas
Metaphor has been studied within computer science for four decades, yet it remains at the periphery of NLP research. The reasons for this marginalization are, for the most part, pragmatic ones, since metaphors can be as varied and challenging as human creativity will allow.
The greatest success has been achieved by focusing on conventional metaphors (e.g., Martin, 1990; Mason, 2004), or on very specific domains of usage, such as figurative descriptions of mental states (e.g., Barnden, 2006). From the earliest computational forays, it has been recognized that metaphor is essentially a problem of knowledge representation. Semantic representations are typically designed for well-behaved mappings of words to meanings - what Hanks (2006) calls norms - but metaphor requires a system of soft preferences rather than hard (and brittle) constraints. Wilks (1978) thus proposed his preference semantics model, which Fass (1991, 1997) extended into a collative semantics. In contrast, Way (1990) argues that metaphor requires a dynamic concept hierarchy that can stretch to meet the norm-bending demands of figurative ideation, though her approach lacks computational substance. More recently, some success has been obtained with statistical approaches that side-step the problems of knowledge representation by working instead with implied or latent representations that are derived from word distributions. Turney and Littman (2005) show how a statistical model of relational similarity can be constructed from web texts for retrieving the correct answer to proportional analogies, of the kind used in SAT tests. No hand-coded knowledge is employed, yet Turney and Littman's system achieves an average human grade on a set of 376 real SAT analogies. Shutova (2010) annotates verbal metaphors in corpora (such as "to stir excitement", where "stir" is used metaphorically) with the corresponding conceptual metaphors identified by Lakoff and Johnson (1980). Statistical clustering techniques are then used to generalize from the annotated exemplars, allowing the system to recognize and retrieve other metaphors in the same vein (e.g. "he swallowed his anger"). These clusters can also be analyzed to identify literal paraphrases for a metaphor (such as "to provoke excitement" or "to suppress anger"). Shutova's approach is noteworthy for operating with Lakoff & Johnson's inventory of conceptual metaphors without using an explicit knowledge representation. Hanks (2006) argues that metaphors exploit distributional norms: to understand a metaphor, one must first recognize the norm that is exploited. Common norms in language are the preferred semantic arguments of verbs, as well as idioms, clichés and other multi-word expressions. Veale and Hao (2007a) suggest that stereotypes are conceptual norms that are found in many figurative expressions, and note that stereotypes and similes enjoy a symbiotic relationship that has some obvious computational advantages. Similes use stereotypes to illustrate the qualities ascribed to a topic, while stereotypes are often promulgated via proverbial similes (Taylor, 1954). Veale and Hao (2007a) show how stereotypical knowledge can be acquired by harvesting "Hearst" patterns of the form "as P as C" (e.g. "as smooth as silk") from the web (Hearst, 1992). They show in (2007b) how this body of stereotypes can be used in a web-based model of metaphor generation and comprehension. Veale (2011) employs stereotypes as the basis of a new creative information retrieval paradigm, by introducing a variety of non-literal wildcards in the vein of Mihalcea (2002). In this system, @Noun matches any adjective that denotes a stereotypical property of Noun (so e.g. @knife matches sharp, cold, etc.) while @Adj matches any noun for which Adj is stereotypical (e.g.
@sharp matches sword, laser, razor, etc.). In addition, ?Adj matches any property or behavior that co-occurs with, and reinforces, the property denoted by Adj; thus, ?hot matches humid, sultry and spicy. Likewise, ?Noun matches any noun that denotes a pragmatic neighbor of Noun, where two words are neighbors if they are seen to be clustered in the same ad-hoc set (Hanks, 2005), such as "lawyers and doctors" or "pirates and thieves". The knowledge needed for @ is obtained by mining text from the open web, while that for ? is obtained by mining ad-hoc sets from Google n-grams. There are a number of shortcomings to this approach. For one, Veale (2011) does not adequately model the affective profile of either stereotypes or their properties. For another, the stereotype lexicon is static, and focuses primarily on adjectival properties (like sharp and hot). It thus lacks knowledge of everyday verbal behaviors like cutting, crying, swaggering, etc. So we build here on the work of Veale (2011) in several important ways. First, we enrich and enlarge the stereotype lexicon to include more stereotypes and behaviors. We determine an affective polarity for each property or behavior and for each stereotype, and show how polarized +/- viewpoints on a topic can be calculated on the fly. We show how proxy representations for ad-hoc proper-named stereotypes (like Microsoft) can be constructed on demand. Finally, we show how metaphors are mined from the Google n-grams, to allow the system to understand novel metaphors (like Google is another Microsoft or Apple is a cult) as well as to generate plausible metaphors for users' affective information needs (e.g., Steve Jobs was a great leader, Google is too powerful, etc.).

Once more, with feeling!
If a property or behavior P is stereotypical of a concept C, we should expect to frequently observe P in instances of C. In linguistic terms, we can expect to see collocations of "P" and "C" in a resource like the Google n-grams (Brants and Franz, 2006). Consider these 3-grams for "cowboy" (numbers in parentheses are Google database frequencies):
a lonesome cowboy (432)
a mounted cowboy (122)
a grizzled cowboy (74)
a swaggering cowboy (68)
N-gram patterns of the above form allow us to find frequent ascriptions of a quality to a noun-concept, but frequently observed qualities are not always noteworthy qualities (e.g., see Almuhareb and Poesio, 2004, 2005). However, if we also observe these qualities in similes - such as "swaggering like a cowboy" or "as grizzled as a cowboy" - this suggests that speakers see them as typical enough to anchor a figurative comparison. So for each hypothesis P is stereotypical of C that we derive from the Google n-grams, we generate the corresponding simile form: we use the "like" form for verbal behaviors such as "swaggering", and the "as-as" form for adjectival properties such as "lonesome". We then dispatch each simile as a phrasal query to Google: a hypothesis is validated if the corresponding simile is found on the web. This mining process gives us over 200,000 validated hypotheses for our stereotype lexicon. We now filter these hypotheses manually, to ensure that the contents of the lexicon are of the highest quality (investing just weeks of labor produces a very reliable resource; see Veale 2012 for more detail). We obtain rich descriptions for commonplace ideas, such as the dense descriptor Baby, whose 163 highly salient qualities - a set denoted typical(Baby) - include crying, drooling and guileless.
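The hypothesis-validation loop just described can be sketched as follows. This is a simplified, hypothetical rendering: the web_hits parameter stands in for whatever search API one has access to, and is not part of the authors' system.

# Sketch of the simile-based validation of stereotype hypotheses.
def simile_form(prop: str, concept: str, is_behavior: bool) -> str:
    # "like" form for verbal behaviors, "as-as" form for adjectives
    if is_behavior:
        return f'"{prop} like a {concept}"'
    return f'"as {prop} as a {concept}"'

def validate(hypotheses, web_hits):
    """Keep only (property, concept) pairs whose simile occurs on the web.
    web_hits(phrase) -> int must be supplied by the caller."""
    lexicon = {}
    for prop, concept, is_behavior in hypotheses:
        if web_hits(simile_form(prop, concept, is_behavior)) > 0:
            lexicon.setdefault(concept, set()).add(prop)
    return lexicon

# e.g. validate([("swaggering", "cowboy", True),
#                ("lonesome", "cowboy", False)], web_hits=lambda q: 1)

The manual filtering pass described in the text would then operate on the returned lexicon.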
After this manual phase, the stereotype lexicon maps 9,479 stereotypes to a set of 7,898 properties / behaviors, to yield more than 75,000 pairings.

Determining Nuanced Affect
To understand the affective uses of a property or behavior, we employ the intuition that those which reinforce each other in a single description (e.g. "as lush and green as a jungle" or "as hot and humid as a sauna") are more likely to have the same affect than those which do not. To construct a support graph of mutually reinforcing properties, we gather all Google 3-grams in which a pair of stereotypical properties or behaviors X and Y are linked via coordination, as in "hot and spicy" or "kicking and screaming". A bidirectional link between X and Y is added to the graph if one or more stereotypes in the lexicon contain both X and Y. If this is not so, we consider whether both descriptors ever reinforce each other in web similes, by posing the web query "as X and Y as". If this query has a non-zero hit set, we still add a link between X and Y. Next, we build a reference set -R of typically negative words, and a disjoint set +R of typically positive words. Given a few seed members for -R (such as sad, evil, monster, etc.) and a few seed members for +R (such as happy, wonderful, hero, etc.), we use the ? operator of Veale (2011) to successively expand these sets by suggesting neighboring words of the same affect (e.g., "sad and pathetic", "happy and healthy"). After three iterations in this fashion, we populate +R and -R with approx. 2,000 words each. If we can anchor enough nodes in the graph with + or - labels, we can interpolate a nuanced positive / negative score for all nodes in the graph. Let N(p) denote the set of neighboring terms to a property or behavior p in the support graph. Now, we define:
(1) N+(p) = N(p) ∩ +R
(2) N-(p) = N(p) ∩ -R
We assign positive / negative affect scores to p as follows:
(3) pos(p) = |N+(p)| / |N+(p) ∪ N-(p)|
(4) neg(p) = 1 − pos(p)
Thus, pos(p) estimates the probability that p is used in a positive context, while neg(p) estimates the probability that p is used in a negative context. The X and Y 3-grams approximate these contexts for us. Now, if a term S denotes a stereotypical idea that is described in the lexicon with the set of typical properties and behaviors denoted typical(S), then:
(5) pos(S) = Σp∈typical(S) pos(p) / |typical(S)|
(6) neg(S) = 1 − pos(S)
So we simply calculate the mean affect of the properties and behaviors of S, as represented in the lexicon via typical(S). Note that (5) and (6) are simply gross defaults. One can always use (3) and (4) to separate the elements of typical(S) into those which are more negative than positive (a negative spin on S) and those which are more positive than negative (a positive spin on S). Thus, we define:
(7) posTypical(S) = {p ∈ typical(S) | pos(p) > neg(p)}
(8) negTypical(S) = {p ∈ typical(S) | neg(p) > pos(p)}
For instance, the positive stereotype of Baby contains qualities such as smiling, adorable and cute, while the negative stereotype contains qualities such as crying, wailing and sniveling. As we'll see next, this ability to affectively "spin" a stereotype is key to automatically generating affective metaphors on demand.

Generating Affective Metaphors, N-gram style
The Google n-grams are also a rich source of copula metaphors of the form Target is Source, such as "politicians are crooks", "Apple is a cult", "racism is a disease" and "Steve Jobs is a god".
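Pulling together definitions (1) through (8), a minimal toy sketch of the affect machinery might look like this. All data is invented, and the 0.5 default for nodes with no anchored neighbors is our own simplification (the paper instead speaks of interpolating scores across the graph):

# Toy implementation of definitions (1)-(8); graph, plus_R, minus_R and
# lexicon would in practice be mined from the web as described above.
graph = {"hot": {"spicy", "humid"}, "spicy": {"hot"}, "humid": {"hot"}}
plus_R, minus_R = {"spicy"}, {"humid"}
lexicon = {"sauna": {"hot", "humid"}}           # typical(S) per stereotype S

def pos_p(p):                                   # (1)-(3)
    n_plus = graph.get(p, set()) & plus_R
    n_minus = graph.get(p, set()) & minus_R
    anchored = n_plus | n_minus
    return len(n_plus) / len(anchored) if anchored else 0.5  # assumed default

def neg_p(p):                                   # (4)
    return 1.0 - pos_p(p)

def pos_s(s):                                   # (5); neg(S) = 1 - pos(S) per (6)
    props = lexicon[s]
    return sum(pos_p(p) for p in props) / len(props)

def pos_typical(s):                             # (7): the positive spin on S
    return {p for p in lexicon[s] if pos_p(p) > neg_p(p)}

def neg_typical(s):                             # (8): the negative spin on S
    return {p for p in lexicon[s] if neg_p(p) > pos_p(p)}

print(pos_s("sauna"), pos_typical("sauna"), neg_typical("sauna"))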
Let src(T) denote the set of stereotypes that are commonly used to describe T, where commonality is defined as the presence of the corresponding copula metaphor in the Google n-grams. To also find metaphors for proper-named entities like "Bill Gates", we analyse n-grams of the form stereotype First [Middle] Last, such as "tyrant Adolf Hitler". For example:
src(racism) = {problem, disease, joke, sin, poison, crime, ideology, weapon}
src(Hitler) = {monster, criminal, tyrant, idiot, madman, vegetarian, racist, …}
We do not try to discriminate literal from non-literal assertions, nor indeed do we try to define literality at all. Rather, we assume each putative metaphor offers a potentially useful perspective on a topic T. Let srcTypical(T) denote the aggregation of all properties ascribable to T via metaphors in src(T):
(9) srcTypical(T) = ∪M∈src(T) typical(M)
We can also use the posTypical and negTypical variants of (7) and (8) to focus only on metaphors that place a positive or negative spin on a topic T. In effect, (9) provides a feature representation for topic T as viewed through the creative lens of metaphor. This is useful when the source S in the metaphor T is S is not a stereotype in the lexicon, as happens when one describes Rasputin as Karl Rove, or Apple as Scientology. When the set typical(S) is empty, srcTypical(S) may not be, so srcTypical(S) can act as a proxy representation for S in these cases. The properties and behaviors that are salient to the interpretation of T is S are given by:
(10) salient(T, S) = [srcTypical(T) ∪ typical(T)] ∩ [srcTypical(S) ∪ typical(S)]
In the context of T is S, the metaphorical stereotype M ∈ src(S) ∪ src(T) ∪ {S} is an apt vehicle for T if:
(11) apt(M, T, S) = |salient(T, S) ∩ typical(M)| > 0
and the degree to which M is apt for T is given by:
(12) aptness(M, T, S) = |salient(T, S) ∩ typical(M)| / |typical(M)|
We can now construct an interpretation for T is S by considering the stereotypes in src(T) that are apt for T in the context of T is S, and by also considering the stereotypes that are commonly used to describe S that are also potentially apt for T:
(13) interpretation(T, S) = {M ∈ src(S) ∪ src(T) ∪ {S} | apt(M, T, S)}
In effect, the interpretation of the creative metaphor T is S is itself a set of more conventional metaphors that are apt for T and which expand upon S. The elements {Mi} of interpretation(T, S) can be sorted by aptness(Mi, T, S) to produce a ranked list of interpretations (M1 … Mn). For a given interpretation M, the salient features of M are thus:
(14) salient(M, T, S) = typical(M) ∩ salient(T, S)
So if T is S is a creative IR query - to find documents in which T is viewed as S - then interpretation(T, S) is an expansion of T is S that includes the common metaphors that are consistent with T viewed as S. In turn, for any viewpoint Mi in interpretation(T, S), salient(Mi, T, S) is an expansion of Mi that includes all of the qualities that T is likely to exhibit when it behaves like Mi.

A Worked Example: Metaphor Generation for IR
Consider the creative query "Google is Microsoft", which expresses a user's need to find documents in which Google exhibits qualities typically associated with Microsoft. Now, both Google and Microsoft are complex concepts, so there are many ways in which they can be considered similar or dissimilar, whether in a good or a bad light. However, we can expect the most salient aspects of Microsoft to be those that underpin our common metaphors for Microsoft, i.e., the stereotypes in src(Microsoft). These metaphors will provide the talking points for an interpretation. The Google n-grams yield up the following metaphors, 57 for Microsoft and 50 for Google:
These metaphors will provide the talking points for an interpretation. The Google n-grams yield up the following metaphors, 57 for Microsoft and 50 for Google: ∪ 2013 19 src(Microsoft) = {king, master, threat, bully, giant, leader, monopoly, dinosaur …} src(Google) = {king, engine, threat, brand, giant, leader, celebrity, religion …} So the following qualities are aggregrated for each: srcTypical(Microsoft) = {trusted, menacing, ruling, threatening, overbearing, admired, commanding, …} srcTypical(Google) = {trusted, admired, reigning, lurking, crowned, shining, ruling, determined, …} Now, the salient qualities highlighted by the metaphor, namely salient(Google, Microsoft), are: {celebrated, menacing, trusted, challenging, established, threatening, admired, respected, …} Finally, interpretation(Google,Microsoft) contains: {king, criminal, master, leader, bully, threatening, giant, threat, monopoly, pioneer, dinosaur, …} Let's focus on the expansion "Google is king", since according to (12), aptness(king, Google, Microsoft) = 0.48 and this is the highest ranked element of the interpretation. Now, salient(king, Google, Microsoft) contains: {celebrated, revered, admired, respected, ruling, arrogant, commanding, overbearing, reigning, …} Note that these properties / behaviours are already implicit in our consensus perception of Google, insofar as they are highly salient aspects of the stereotypical concepts to which Google is frequently compared on the web. These properties / behaviours can now be used to perform query expansion for the query term "Google", to find documents where the system believes Google is acting like Microsoft. The metaphor "Google is Microsoft" is diffuse and lacks an affective stance. So let's consider instead the metaphor "Google is -Microsoft", where is used to impart a negative spin (and where + can likewise impart a positive spin). In this case, negTypical is used in place of typical in (9) and (10), so that: srcTypical(-Microsoft) = {menacing, threatening, twisted, raging, feared, sinister, lurking, domineering, overbearing, …} and salient(Google, -Microsoft) = {menacing, bullying, roaring, dreaded…} Now, interpretation(Google, -Microsoft) becomes: {criminal, giant, threat, bully, evil, victim, devil, …} In contrast, interpretation(Google, +Microsoft) is: {king, master, leader, pioneer, classic, partner, …} More focus is achieved with this query in the form of a simile: "Google is as -powerful as Microsoft". For explicit similes, we need to focus on just a sub-set of salient properties, as in this varient of (10): {p ∈ salient(Google, Microsoft) | p ∈ N-(powerful)} In this case, the final interpretation becomes: {bully, threat, giant, devil, monopoly, dinosaur, …} A few simple concepts can thus yield a wide range of options for the creative IR user who is willing to build queries around affective metaphors and similes. Empirical Evaluation The affective stereotype lexicon is the cornerstone of the current approach, and must reliably assign meaningful polarity scores both to properties and to the stereotypes that exemplify them. Our affect model is simple in that it relies principally on +/affect, but as demonstrated above, users can articulate their own expressive moods to suit their needs: for Stereotypical example, one can express disdain for too much power with the term -powerful, or express admiration for guile with +cunning and +devious. 
The Effect of Affect: Stereotypes and Properties
Note that the polarity scores assigned to a property p in (3) and (4) do not rely on any prior classification of p, such as whether p is in +R or -R. That is, +R and -R are not used as training data, and (3) and (4) receive no error feedback. Of course, we expect that for p ∈ +R, pos(p) > neg(p), and for p ∈ -R, neg(p) > pos(p), but (3) and (4) do not iterate until this is so. Measuring the extent to which these simple intuitions are validated thus offers a good evaluation of our graph-based affect mechanism. Just five properties in +R (approx. 0.4% of the 1,314 properties in +R) are given a positivity of less than 0.5 using (3), leading those words to be misclassified as more negative than positive. The misclassified property words are: evanescent, giggling, licking, devotional and fraternal. Just twenty-six properties in -R (approx. 1.9% of the 1,385 properties in -R) are assigned a negativity of less than 0.5 via (4), leading these to be misclassified as more positive than negative. The misclassified words are: cocky, dense, demanding, urgent, acute, unavoidable, critical, startling, gaudy, decadent, biting, controversial, peculiar, disinterested, strict, visceral, feared, opinionated, humbling, subdued, impetuous, shooting, acerbic, heartrending, ineluctable and groveling. Because +R and -R have been populated with words that have been chosen for their perceived +/- slants, this result is hardly surprising. Nonetheless, it does validate the key intuitions that underpin (3) and (4) - that the affective polarity of a property p can be reliably estimated as a simple function of the affect of the co-descriptors with which it is most commonly used in descriptive contexts. The sets +R and -R are populated with adjectives, verbal behaviors and nouns. +R contains 478 nouns denoting positive stereotypes (such as saint and hero) while -R contains 677 nouns denoting negative stereotypes (such as tyrant and monster). When these reference stereotypes are used to test the effectiveness of (5) and (6) - and thus, indirectly, of (3) and (4) and of the stereotype lexicon itself - 96.7% of the positive stereotype exemplars are correctly assigned a mean positivity of more than 0.5 (so pos(S) > neg(S)) and 96.2% of the negative exemplars are correctly assigned a mean negativity of more than 0.5 (so neg(S) > pos(S)). Though it may seem crude to assess the affect of a stereotype as the mean of the affect of its properties, this does appear to be a reliable measure of polar affect.

The Representational Adequacy of Metaphors
We have argued that metaphors can provide a collective representation of a concept that has no other representation in a system. But how good a proxy is src(S) or srcTypical(S) for an S like Karl Rove or Microsoft? Can we reliably estimate the +/- polarity of S as a function of src(S)? We can estimate these from metaphors as follows:
(15) pos(S) = ΣM∈src(S) pos(M) / |src(S)|
(16) neg(S) = ΣM∈src(S) neg(M) / |src(S)|
Testing this estimator on the exemplar stereotypes in +R and -R, the correct polarity (+ or -) is estimated 87.2% of the time. Metaphors in the Google n-grams are thus broadly consistent with our perceptions of whether a topic is positively or negatively slanted. When we consider all stereotypes S for which |src(S)| > 0 (there are 6,904 in the lexicon), srcTypical(S) covers, on average, just 65.7% of the typical properties of S (that is, of typical(S)).
This shortfall, however, is precisely why we use novel metaphors. Consider this variant of (9), which captures the longer reach of these novel metaphors:

(17) srcTypical²(T) = ∪_{S ∈ src(T)} srcTypical(S)

Thus, srcTypical²(T) denotes the set of qualities that are ascribable to T via the expansive interpretation of all metaphors "T is S" in the Google n-grams, since S can now project onto T any element of srcTypical(S). Using macro-averaging over all 6,904 cases where |src(S)| > 0, we find that srcTypical²(S) covers 99.2% of typical(S) on average. A well-chosen metaphor enables us to emphasize almost any quality of a topic T we might wish to highlight.

Affective Text Retrieval with Creative Metaphors

Suppose we have a database of texts {D1 … Dn} in which each document Di offers a creative perspective on a topic T. We might have texts that view politicians as crooks, popes as kings, or hackers as heroes. So given a query +T, can we retrieve only those texts that view T positively, and given -T, can we retrieve only the negative texts about T? We first construct a database of artificial figurative texts. For each stereotype S in the lexicon, and for each M ∈ src(S) ∩ (+R ∪ -R), we construct a text D_SM in which S is viewed as M. The title of document D_SM is "S is M", while the body of D_SM contains all the words in src(M). D_SM uses the typical language of M to talk about S. For each D_SM, we know whether D_SM conveys a positive or negative viewpoint on S, since M sits in either +R or -R. The affect lexicon contains 5,704 stereotypes S for which src(S) ∩ (+R ∪ -R) is non-empty. On average, each of these stereotypes is described in terms of 14 other stereotypes (5.8 are negative and 8.2 are positive, according to +R and -R) and we construct a representative document for each of these viewpoints. We construct a set of 79,856 artificial documents in total, to convey figurative perspectives on 5,704 different stereotypical topics.

Table 1. Macro-average P/R/F1 scores for affective retrieval of + and - viewpoints for 5,704 topics.

            Positive viewpoints   Negative viewpoints
Precision   .86                   .93
Recall      .95                   .78
F-Score     .90                   .85

For each document retrieved for T, we estimate its polarity as the mean of the polarity of the words it contains. Table 1 presents the results of this experiment, in which we attempt to retrieve only the positive viewpoints for T with a query +T, and only the negative viewpoints for T using -T. The results are sufficiently encouraging to support the further development of a creative text retrieval engine that is capable of ranking documents by the affective figurative perspective that they offer on a topic.

Concluding Thoughts: The Creative Web

Metaphor is a creative knowledge multiplier that allows us to expand our knowledge of a topic T by using knowledge of other ideas as a magnifying lens. We have presented here a robust, stereotype-driven approach that embodies this practical philosophy. Knowledge multiplication is achieved using an expansionary approach, in which an affective query is expanded to include all of the metaphors that are commonly used to convey this affective viewpoint. These viewpoints are expanded in turn to include all the qualities that are typically implied by each. Such an approach is ideally suited to a creative re-imagining of IR. An implementation of these ideas is available for use on the web.
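Before turning to that implementation, here is a toy version of the retrieval experiment just described. It extends the invented lexicon of the earlier sketches (typical, src, pos(), pos_ref and neg_ref are assumed to be defined there); the only new assumption is that a couple of toy vehicles are promoted to reference stereotypes so the example produces output.

```python
pos_ref |= {"king"}    # toy +R noun
neg_ref |= {"bully"}   # toy -R noun

def build_documents():
    """One artificial document per (stereotype S, reference vehicle M) pair:
    title 'S is M', body = the typical language of M."""
    docs = {}
    for S, sources in src.items():
        for M in sources & (pos_ref | neg_ref):
            docs[f"{S} is {M}"] = typical[M]
    return docs

def retrieve(topic, want_positive, docs):
    """Return titles about `topic` whose mean word polarity matches a +T / -T query."""
    hits = []
    for title, body in docs.items():
        if title.startswith(topic):
            p = sum(pos(w) for w in body) / len(body)
            if (p > 0.5) == want_positive:
                hits.append(title)
    return hits

docs = build_documents()
print(retrieve("Microsoft", want_positive=False, docs=docs))  # e.g. Microsoft is bully
```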
Named Metaphor Magnet, the system allows users to enter queries of the form shown here (such as "Google is -Microsoft", "Steve Jobs as Tony Stark", "Rasputin as Karl Rove", etc.). Each query is expanded into a set of apt metaphors mined from the Google n-grams, and each metaphor is expanded into a set of contextually apt qualities. In turn, each quality is expanded into an IR query that is used to retrieve relevant hits from Google. In effect, the system - still an early prototype - allows users to interface with a search engine like Google using metaphor and other affective language forms. The system can currently be accessed at this URL: http://boundinanutshell.com/metaphor-magnet

Metaphor Magnet is just one possible application of the ideas presented here, which constitute not so much a philosophical or linguistic theory of metaphor as an engineering-oriented toolkit of reusable concepts for imbuing a wide range of text applications with a robust competence in linguistic creativity. Human speakers do not view metaphor as a problem but as a solution. It is time our computational systems took a similarly constructive view of this remarkably creative cognitive tool. In this vein, Metaphor Magnet continues to evolve as a creative web service. In addition to providing metaphors on demand, the service now also provides a poetic framing facility, whereby the space of possible interpretations for a given metaphor is crystallized into a single poetic form. More generally, poetry can be viewed as a means of reducing information overload, by summarizing a complex metaphor - or the set of texts retrieved using that metaphor via creative IR - whose interpretation entails a rich space of affective possibilities. A poem can thus be seen in functional terms as both an information summarization tool and a visualization device. Metaphor Magnet adopts a simple, meaning-driven approach to poetry generation: given a topic T, a set of candidate metaphors with the desired affective slant is generated. One metaphor is chosen at random, and the elements of its interpretation are sampled to produce the different lines of the resulting poem. Each element, and the sentiment it best evokes, is rendered in natural language using one of a variety of poetic tropes. For example, Metaphor Magnet produces the following as a distillation of the space of feelings and associations that arise from the interpretation of "Marriage is a Prison":

The legalized regime of this marriage
My marriage is a tight prison
The most unitary federation scarcely organizes so much
Intimidate me with the official regulation of your prison
Let your close confines excite me
O Marriage, you disgust me with your undesirable security

Each time we dip into the space of possible interpretations, a new poem is produced. One can use Metaphor Magnet to sample the space at will, hopping from one interpretation to the next, or from one poem to another.
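A minimal sketch of this sampling scheme follows. The interpretation elements and rendering templates here are illustrative stand-ins, not Metaphor Magnet's actual tropes, and the system's sentiment-aware rendering is reduced to plain string templates.

```python
import random

# Each poem line: draw an element from the metaphor's interpretation and
# render it with one of a few fixed poetic tropes. Data is invented.

interpretation = ["tight prison", "legalized regime", "official regulation",
                  "close confines", "undesirable security"]

tropes = [
    "My {topic} is a {q}",
    "Intimidate me with the {q} of your {vehicle}",
    "O {topic}, you oppress me with your {q}",
]

def poem(topic, vehicle, n_lines=4):
    lines = []
    # sampling without replacement keeps each rendition varied but non-repetitive
    for q in random.sample(interpretation, k=min(n_lines, len(interpretation))):
        lines.append(random.choice(tropes).format(topic=topic, vehicle=vehicle, q=q))
    return "\n".join(lines)

print(poem("Marriage", "prison"))
```

Because each call re-samples the interpretation, every invocation yields a new rendition of the same metaphor, which is the behaviour described above.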
Returning to the example, here is an alternate rendition of the same metaphor in poetic form:

The official slavery of this marriage
My marriage is a legitimate prison
No collective is more unitary, or organizes so much
Intimidate me with the official regulation of your prison
Let your sexual degradation charm me
O Marriage, you depress me with your dreary consecration

In the context of our earlier worked example, which generated a space of metaphors to negatively describe Microsoft's perceived misuse of power, consider the following, which distills the assertion "Microsoft is a Monopoly" into an aggressive ode:

No Monopoly Is More Ruthless
Intimidate me with your imposing hegemony
No crime family is more badly organized, or controls more ruthlessly
Haunt me with your centralized organization
Let your privileged security support me
O Microsoft, you oppress me with your corrupt reign

Poetry generation in Metaphor Magnet is a recent addition to the service, and its workings are beyond the scope of the current paper (though they may be observed in practice by visiting the aforementioned URL). For details of a related approach to poetry generation - one that also uses the stereotype-bearing similes described in Veale (2012) - the reader is invited to read Colton, Goodwin & Veale (2012). Metaphor Magnet forms a key element in our vision of a Creative Web, in which web services conveniently provide creativity on tap to any third-party software application that requests it. These services include ideation (e.g. via metaphor generation & knowledge discovery), composition (e.g. via analogy, bisociation & conceptual blending) and framing (e.g. via poetry, joke & story generation). Since CC does not distinguish itself through distinct algorithms or representations, but through its unique goals and philosophy, such a pooling of services will not only help the field achieve a much-needed critical mass, it will also facilitate a greater penetration of CC ideas and approaches into the commercial software industry.

Acknowledgements

This research was supported by the WCU (World Class University) program under the National Research Foundation of Korea (Ministry of Education, Science and Technology of Korea, Project no. R31-30007).

2013_30 !2013 Evaluating Human-Robot Interaction with Embodied Creative Systems

Rob Saunders, Design Lab, University of Sydney, NSW 2006, Australia, rob.saunders@sydney.edu.au
Emma Chee, Small Multiples, Surry Hills, Sydney, NSW 2010, Australia, emma@small.mu
Petra Gemeinboeck, College of Fine Art, University of NSW, NSW 2021, Australia, petra@unsw.edu.au

Abstract. As we develop interactive systems involving computational models of creativity, issues around our interaction with these systems will become increasingly important. In particular, the interaction between human and computational creators presents an unusual and ambiguous power relation for those familiar with typical human-computer interaction. These issues may be particularly pronounced with embodied artificial creative systems, e.g., involving groups of mobile robots, where humans and computational creators share the same physical environment and enter into social and cultural exchanges. This paper presents a first attempt to examine these issues of human-robot interaction through a series of controlled experiments with a small group of mobile robots capable of composing, performing and listening to simple songs produced either by other robots or by humans.
Introduction

Creativity is often defined as the generation of novel and valuable ideas, whether expressed as concepts, theories, literature, music, dance, sculpture, painting or any other medium of expression (Boden 2010). But creativity, whether or not it is computational, does not occur in a vacuum; it is a situated activity that is connected with cultural, social, personal and physical contexts that determine the nature of the novelty and value against which creativity is assessed. The world offers opportunities, as well as presenting constraints: human creativity has evolved to exploit the former and overcome the latter, and in doing both, the structure of creative processes emerges (Pickering 2005). There are three major motivations underlying research on developing computational creativity: (1) to construct artificial entities capable of human-level creativity; (2) to better understand and formulate an understanding of creativity; and (3) to develop tools to support human creative acts (Pease and Colton 2011). The development of artificial creative systems is driven by a desire to understand creativity as interacting systems of individuals, social groups and cultures (Saunders and Gero 2002). The implementation of artificial creative systems using autonomous robots imposes constraints upon the hardware and software used. These constraints focus the development process on the most important aspects of the computational model needed to support an embodied and situated form of creativity. At the same time, embodiment provides opportunities for agents to experience the emergence of effects beyond the computational limits that they must work within. Following an embodied cognition stance, the environment may be used to offload internal representation (Clark 1996) and allow agents to take advantage of properties of the physical environment that would be difficult or impossible to simulate computationally, thereby expanding the behavioural range of the agents (Brooks 1990). Interactions between human and artificial creators within a shared context place constraints on the design of the human-robot interaction but provide opportunities for the transfer of cultural knowledge through the sharing of artefacts. Embodiment allows computational agents to be creative in environments that humans can intuitively understand. As Penny (1997) describes, embodied cultural agents, whose function is self-reflexive, engage the public in a consideration of the nature of agency itself. In the context of the study of computational creativity, this provides an opportunity for engaging a broad audience in the questions raised by models of artificial creative systems. The ‘Curious Whispers' project (Saunders et al. 2010) investigates the interaction between human and artificial agents within creative systems. This paper focuses on the challenge of designing one-to-one and one-to-many interactions within a creative system consisting of humans and robots, and provides a suitable method for examining these interactions. In particular, the research presented in this paper explores how humans interacting with an artificial creative system construe the agency of the robots, and how the embodiment of simple creative agents may prolong the production of potentially interesting artefacts through the interaction of human and artificial agents. The research adopts methods from interaction design to study the interactions between participants and the robots in open-ended sessions.
Background

Gordon Pask's early experiments with electromechanical cybernetic systems provide an interesting historical precedent for the development of computational creativity (Haque 2007). Through the development of "conversational machines" Pask explored the emergence of unique interaction protocols between the machine and musicians. MusiColour, seen in Figure 1, was constructed by Gordon Pask and Robin McKinnon-Wood in 1953. It was a performance system comprising coloured lights that illuminated in conjunction with audio input from a human performer. But MusiColour did more than transcode sound into light: it manipulated its coloured light outputs such that it became a co-performer with the musician, creating a unique (though non-random) output with every iteration (Glanville 1996). The sequence of the outputs depended not only on the frequencies and rhythms but also on repetition: if a rhythm became too predictable then MusiColour would enter a state of ‘boredom' and seek more stimulating rhythms, thereby provoking and stimulating improvisation. As such, it has been argued that MusiColour acted more like a jazz co-performer might when ‘jamming' with other band members (Haque 2007).

Figure 1: MusiColour: light display (left) and processing unit (right) (Glanville 1996).

The area of musical improvisation has since provided a number of examples of creative systems that model social interactions within creative activities, e.g., GenJam (Biles 1994) and MahaDeviBot (Kapur et al. 2009). The recent development of Shimon (Hoffman and Weinberg 2010) provides a nice example of the importance of modelling social interactions alongside the musical performance. ‘Performative Ecologies: Dancers' by Ruairi Glynn is a conversational environment, involving human and robotic agents in a dialogue using simple gestural forms (Glynn 2008). The Dancers in the installation are robots suspended in space by threads and capable of performing ‘gestures' through twisting movements. The fitness of gestures is evaluated as a function of audience attention, independently determined by each robot through face tracking. Audience members can directly participate in the evolution by manipulating the robots, twisting them to record a new gesture. Successful gestures, i.e., those observed to attract an audience, are shared between the robots over a wireless network.

Figure 2: Performative Ecologies: Dancers (Glynn 2008).

The robotic installation ‘Zwischenräume' employs embodied curious agents that transform their environment through playful exploration and intervention (Gemeinboeck and Saunders 2011). A small group of robots is embedded in the walls of a gallery space; they investigate their wall habitat and, motivated to learn, use their motorised hammer to introduce changes to the wall and thus novel elements to study. As the wall is increasingly fragmented and broken down, the embodied agents discover, study and respond to human audiences in the gallery space. Unlike the social models embodied in MusiColour and Performative Ecologies, the social interactions in Zwischenräume focus on those between the robots. Audience members still play a significant role in the robots' exploration of the world, but in Zwischenräume visitors are considered complex elements of the environment. In ‘The New Artist', Straschnoy (2008) explored issues of what robots making art for robots could be like.
In a series of interviews, the engineers involved in the development of The New Artist expressed different interpretations of the meaning and purpose of such a system. Some questioned the validity of the enterprise, arguing that there is no reason to construct robots to make art for other robots, while others considered it to be part of a natural progression in creative development:

"We started out with human art for humans, then we can think about machine art for humans, or human art for machines. But will we reach a point where there's machine art for machines, and humans don't even understand what they are doing or why they even like it." - Interview with Jeff Schneider, Associate Research Professor, Robotics Institute, Carnegie Mellon (Straschnoy 2008)

The following section describes the current implementation of ‘Curious Whispers', an embodied artificial creative system. The implemented system is much simpler than those described above, i.e., the robots employ a very simple generative system to produce short note sequences, but it provides a useful platform for the exploration of interaction design issues that arise with the development of autonomous creative systems involving multiple artificial agents.

Implementation

The current implementation of Curious Whispers (version 2.0) uses a small group of mobile robots equipped with speakers, microphones and a movable plastic hood; see Figure 3. Each robot is capable of generating simple songs, evaluating the novelty and value of a song, and performing those songs that it determines to be ‘interesting' to other members of the society - including human participants. Each robot listens to the performances of others and, if it values a song, attempts to compose a variation. Closing its plastic hood allows a robot to rehearse songs using the same hardware and software that it uses to analyse the songs of other robots, removing the need for simulation.

Figure 3: The implemented mobile robots and 3-button synthesiser.

A simple 3-button synthesiser allows participants to play songs that the robots can recognise, and if a robot considers a participant's songs to be interesting it will adopt them. Using this simple interface, humans are free to introduce domain knowledge, e.g., fragments of well-known songs, into the collective memory of the robot society. For more information on the technical details of the implementation see Chee (2011).

Methodology

To investigate the interactions between robots and human participants we adopted a methodology from interaction design and employed a ‘technology probe'. Technology probes combine methods for collecting qualitative information about user interaction, the field-testing of technology, and the exploration of design requirements. A well-designed technology probe should balance these different disciplinary influences (Hutchinson et al. 2003). A probe should be technically simple and flexible with respect to possible use: it is not a prototype but a tool to explore design possibilities and, as such, should be open-ended and explicitly co-adaptive (Mackay 1990). The probe used in this research involved three observational studies exploring different aspects of the human-robot interaction with the embodied creative system. The observational studies were conducted with different arrangements of robots and human participants, allowing us to observe how interaction patterns and user assessments of the system changed in each configuration.
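As background for the studies that follow, here is a speculative sketch of the adopt-and-vary behaviour described in the Implementation section above. The song similarity measure and the 'interestingness' thresholds are assumptions for illustration; the actual details are in Chee (2011).

```python
import random

# Songs are short note sequences; the synthesiser offers three buttons,
# and the 2:3 study suggests songs of exactly 8 notes.
NOTES = ("A", "B", "C")
SONG_LEN = 8

def distance(a, b):
    """Fraction of positions at which two equal-length songs differ."""
    return sum(x != y for x, y in zip(a, b)) / SONG_LEN

def novelty(song, memory):
    """Distance to the most similar song the robot already knows."""
    return min(distance(song, m) for m in memory) if memory else 1.0

def interesting(song, memory, low=0.2, high=0.7):
    """Hedonic middle ground: not too familiar, not too alien (assumed bounds)."""
    return low < novelty(song, memory) < high

def vary(song):
    """Compose a variation on an adopted song by mutating one note."""
    i = random.randrange(SONG_LEN)
    return song[:i] + (random.choice(NOTES),) + song[i + 1:]
```

Under this reading, a human playing an 8-note phrase on the synthesiser simply enters the same listen-evaluate-adopt-vary loop as another robot, which is what allows domain knowledge to seed the collective memory.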
Each session was video recorded, and at the end of each session the participants were interviewed using a series of open-ended questions. The interview was based on a similar one developed by Bernsen and Dybkjær (2005) in their study of conversational agents. Employing a ‘post-think-aloud' method at the end of each session, the participants were first asked to describe their experiences interacting with the robot. A similar method was used in the evaluation of the Sonic City project (Gaye, Mazé, and Holmquist 2003). The video recordings were transcribed and interaction events noted on a timeline. The ‘post-think-aloud' reports were correlated with events in the video recordings where possible. Six participants were observed in the studies. The participants came from a variety of backgrounds and included 2 interaction designers, 2 engineers, 1 linguist, and 1 animator. All participants were involved in the 1:1 (1 human, 1 robot) observation study. Two participants (Participants 5 and 8) went on to be part of the 1:3 (1 human, 3 robots) observation study; the other four (Participants 6, 7, 9 and 10) were involved in the 2:3 (2 humans, 3 robots) observation study.

1:1 Interaction Observation Study

The purpose of the first study was to observe the participants' behaviour whilst interacting with a single robot. Each participant was given a 3-button synthesiser to communicate with the robot and allowed to interact for as long as they wished, i.e., no time limit was given.

1:3 Interaction Observation Study

The second observational study involved each participant interacting with the group of 3 robots, to examine how participants interacted with multiple creative agents at the same time and how the participants were influenced by the interactions between robots. This study involved 2 participants, both of whom had previously completed the first observation study.

2:3 Interaction Observation Study

The third observational study involved pairs of participants interacting with the system of 3 working robots. This study allowed the participants not only to interact with and observe the working system but also to interact with each other and share their experiences. This study involved 4 participants working in two groups of two. The 4 participants were chosen from those who completed the 1:1 study but were not involved in the 1:3 observation study.

Results

This section presents a brief summary of the observational studies; a more detailed account can be found in Chee (2011).

1:1 Interaction

The 1:1 interaction task allowed the participants to form individual theories on how single robots reacted to them; most learned that the robots did not respond to individual notes but to sequences of them. Participants spent between 2 and 4 minutes interacting with the robot, and much of that time was spent experimenting to determine how the robot reacted to different inputs: "[I] first tried to see how it would react, pressed a single button and then tried a sequence of notes" (Participant 6). Several of the participants learned to adopt a turn-taking behaviour with the robots, e.g., "when it started to play I stopped to watch, I only tried to play when it stopped" (Participant 5). Some of the participants interpreted the opening and closing of the hood as a cue for when they could play a song for the robot to learn, as Participant 9 commented: "I played a noise and it took that song and closed up and was like ‘alright I'm gonna think of something better'.
It sounded like it was repeating what I did but like a bit different. Like it was working out what I'd done." Most of the participants assumed the role of teacher and attempted to get the robot to repeat a simple song. But in the case of Participant 8 the roles were reversed, as the participant began copying the songs played by the robot.

1:3 Interaction

For the 1:3 interaction studies the group of robots was placed on a table in a quiet location, as shown in Figure 4. The participants interacted with the group of robots for approximately 5 minutes. Both participants already knew the robots were responsive to them from the 1:1 study, but they found it difficult to determine which robot they were interacting with: "you knew you could interact but you were not really aware of the reaction as a group" (Participant 5). The participants noticed that the robots were different: "the green robots song was slightly different to blue and purple" (Participant 5); and that they exhibited social behaviour amongst themselves: "Noticed they didn't rely just on the [synthesiser], the 3 of them were communicating. I thought they sang in a certain order as one started and the others would reply" (Participant 8). Both participants came to realise that the system would continue to evolve new songs without their input, and spent time towards the end of their sessions observing the group behaviour.

Figure 4: An example of the interaction in the 1:3 study.

2:3 Interaction

Working together, the participants in the third study quickly arrived at the conclusion that they needed to take turns in order to interact with the robots. Participant 6 saw that the robots moved towards Participant 7 and asked to be given one of the robots; Participant 7 replied "No, they have to go to you on their own", suggesting that Participant 7 recognised that the robots could not be commanded. Later, the participants became competitive in their attempts to attract the robots away from each other. As the participants shared observations about the system, they explored the transference of songs. By observing the interactions between Participant 7 and the robots, Participant 6 was able to determine that the robots responded to songs of exactly 8 notes and that a robot would repeat a song 3 times while it learned it. At one point Participant 9 commented: "...when I pressed it like this ‘beep beep beep beep' it went ‘beep beep boop beep' so it was like changing what I played". These observations suggest that over time the participants were able to build relatively accurate ‘mental' models of the processes of the robotic agents.

Figure 5: An example of the interaction in the 2:3 study.

Discussion

Unlike traditional interactive systems that react to human participants (Dezeuze 2010), the individual agents within artificial creative systems are continuously engaged in social interactions: the robots in our study would continue to interact and share songs without the intervention of the participants. Though initially confused by this, participants discovered through extended observation and interaction that they could inject songs into the society by teaching them to a single robot. Participants sometimes also assumed the role of learner and copied the songs of the robots, consequently adopting an interaction strategy more like that of a peer.
The autonomous nature of the embodied creative system runs counter to typical expectations of human-robot interactions, making interacting with a group of robots significantly more difficult than interacting with a single one. The preliminary results presented here suggest that simple social policies in artificial creative systems, e.g., the turn-taking behaviour, coupled with cues that indicate state, e.g., closing the hood while practicing and composing songs, allow for conversational interactions to emerge over time.

Conclusion

The development of embodied creative systems offers significant opportunities and challenges for researchers in computational creativity. This paper has presented a possible approach for the study of interaction design issues surrounding the development of artificial creative systems. The Curious Whispers project explores the possibility of developing artificial creative systems that are open to these types of peer-to-peer interactions through the construction of a ‘common ground' based on the expression and perception of artefacts. The research presented has shown that even a simple robotic platform can be designed to exploit its physical embodiment as well as its social situation, using easily obtained components. The implemented system, while simple in terms of the computational ability of the agents, has provided a useful platform for studying interactions between humans and artificial creative systems. The technical limitations of the robotic platform place an emphasis on the important role that communication plays in the evolution of creative systems, even with the restricted notion of what constitutes a ‘song' in this initial exploration. Above all, the technology probe methodology used in our observational studies has illustrated the usefulness of implementing simple policies in artificial creative systems to allow human participants to adapt to the unusual interaction model.

Acknowledgements

The research reported in this paper was supported as part of the Bachelor of Design Computing Honours programme in the Faculty of Architecture, Design and Planning at the University of Sydney.

2013_31 !2013 The role of motion dynamics in abstract painting

Alexander Schubert and Katja Mombaur, Interdisciplinary Center for Scientific Computing, University of Heidelberg, {alexander.schubert, katja.mombaur}@iwr.uni-heidelberg.de

Abstract. We investigate the role of dynamic motions performed by artists during the creative process of art generation. We are especially interested in modern artworks inspired by the Action Painting style of Jackson Pollock. Our aim is to evaluate and model the role of these motions in the process of art creation. We use mathematical approaches from optimization and optimal control to capture the essence of these movements (the cost functions of an optimal control problem), study it, and transfer it to feasible motions for a robot arm. Additionally, we performed studies of human responses to paintings, assisted by an image analysis framework which computes several image characteristics. We asked people to sort and cluster different action-painting images and performed PCA and cluster analysis in order to determine image traits that cause certain aesthetic experiences in contemplators. By combining these approaches, we can develop a model that allows our robotic platform to monitor its painting process using a camera system and - based on an evaluation of its current status - to change its movement to create human-like paintings.
This way, we enable the robot to paint in a human-like way without any further control from an operator.

Introduction

The cognitive processes of generating and perceiving abstract art are - in contrast to figurative art - largely unknown. When processing representational art works, the effect of meaning is highly dominant. In abstract art, which lacks this factor, the processes of perception are much more ambiguous, relying on a variety of more subtle qualities. In this work, we focus on the role of dynamic motions performed during the creation of an art work as one specific trait that influences our perception and aesthetic experience.

Action Paintings: modern art works created by dynamic motions

The term "action painting" was first used in the essay "The American Action Painters" (Rosenberg 1952). While the term "action painting" is commonly used in public, art historians sometimes also use the term "Gestural Abstraction". Both terms emphasize the process of creating art, rather than the resulting art work, which reflects the key innovation that arose with this new form of painting in the 1940s to the 1960s.

Figure 1: An action painting in the style of Jackson Pollock, painted by "JacksonBot".

The style of painting includes dripping, dabbing and splashing paint on a canvas, rather than applying it carefully and in a controlled way. Art encyclopedias describe these techniques as "depending on broad actions directed by the artist's sense of control interacting with chance or random occurrences." The artists often consider the physical act of painting itself as the essential aspect of the finished work. Regarding the contemplators, action paintings intend to connect to them on a subconscious level. In 1950, Pollock said "The unconscious is a very important side of modern art and I think the unconscious drives do mean a lot in looking at paintings" (Ross 1990), and later he stated "We're all of us influenced by Freud, I guess. I've been a Jungian for a long time" (Rodman 1961). Clearly, artists like Pollock do not think actively about the dynamic motions performed by their bodies the way mathematicians from the area of modeling and optimal control do. But for us it is very exciting that one of the main changes they applied to their painting style, in order to achieve their aim of addressing the subconscious mind, was a shift in the manner in which they carry out their motions during the creational process.

Understanding the perception and generation of action paintings

Since a human possesses many more degrees of freedom than are needed to move, human motions can often be seen as a superposition of goal-directed motions and implicit, unconscious motions. The assumption that elements of human motions can be described in this manner has been widely applied and verified, particularly for walking and running motions (Felis and Mombaur 2012; Schultz and Mombaur 2010), but also (very recently) regarding emotional body language during human walking (Felis, Mombaur, and Berthoz 2012). If we transfer this approach to an artist, the goal-directed motions are those carried out to direct his hand (or rather a brush or tool) to the desired position, while the implicit, unconscious motions are the result of an implicitly solved optimal control problem with a certain cost function, such as maximizing stability or minimizing energy costs. When looking at action paintings, we note that this form of art generation is a very extreme form of this superposition model, with a largely negligible goal-directed part.
Therefore, it is a perfect basis for studying the effect of (unconscious) motion dynamics on a resulting art work. Jackson Pollock himself expressed similar thoughts when he said "The modern artist... is working and expressing an inner world - in other words - expressing the energy, the motion, and other inner forces" or "When you're working out of your unconscious, figures are bound to emerge... Painting is a state of being" (Rodman 1961). However, the role of motion dynamics in the embodied expression of artists has been poorly described so far, presumably due to the lack of an adequate method for the acquisition of quantitative data. The goal of our project is to use state-of-the-art tools from scientific computing to analyze the impact of motion dynamics on both the creational and the perceptual side of action-painting art works. To this end, we perform perception studies with contemplators and experimental studies concerning motion generation, which are linked by a robotic platform as a tool that can precisely reproduce different motion dynamics. Using this approach, we want to determine key motion types influencing a painting's perception.

Models of art perception

The perception of art, especially abstract art, is still an area of ongoing investigation. Therefore, no generally accepted theory covering all facets of art perception exists. There are, however, different theories that can explain different aspects of art perception. One example is the model of aesthetic judgment presented in (Leder et al. 2004) (see Figure 2).

Figure 2: Overview of the aesthetic judgment model by (Leder et al. 2004).

In the past, resulting from an increasing interest in embodied cognition and embodied perception, there has been a stronger focus on the nature of human motion and its dynamics in neuroscience - or rather neuroaesthetics - as well as in psychology and the history of art. Several results show that we perceive motion and actions with a strong involvement of those brain regions that are responsible for motion and action generation (Buccino et al. 2001). The mirror neurons located in these brain regions fire both when an action is actively performed and when the same action is being observed. These findings support the theory that the neural representations for action perception and action production are identical (Buxbaum, Kyle, and Menon 2005). The relation between perception and embodied action simulation also exists for static scenes (Urgesi et al. 2006), and extends even to the degree where the motion is implied only by its static result. For example, (Knoblich et al. 2002) showed that the observation of a static graph sign evokes in the brain a motor simulation of the gesture required to produce it. Finally, in (Freedberg and Gallese 2007) it was proposed that this effect of reconstructing motions by embodied simulation mechanisms will also be found when looking at "art works that are characterized by the particular gestural traces of the artist, as in Fontana and Pollock".

Mathematical background

To perform mathematical computations on motion dynamics, we first need to create models of the human and robot arms. Both can be considered as systems of rigid bodies, connected by different types of joints (prismatic or revolute). By "model", we mean a mathematical description, in terms of differential equations, of the physical characteristics of the human arm and of the robot, respectively.
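To make this concrete, here is a minimal sketch of such a rigid-body model: a planar two-link arm with point masses at the link ends, a textbook stand-in for the 9-DOF arm model used in the paper, implementing the manipulator equation (1) stated formally below. The parameters are illustrative, not the authors' values.

```python
import numpy as np

m1, m2, l1, l2, g = 1.0, 0.8, 0.3, 0.25, 9.81   # illustrative masses/lengths

def inertia(q):
    """Joint-space inertia matrix M(q) for a planar two-link arm with point
    masses at the link ends (joint angles measured from the horizontal)."""
    c2 = np.cos(q[1])
    m11 = (m1 + m2) * l1**2 + m2 * l2**2 + 2 * m2 * l1 * l2 * c2
    m12 = m2 * l2**2 + m2 * l1 * l2 * c2
    return np.array([[m11, m12], [m12, m2 * l2**2]])

def nonlinear(q, qd):
    """Generalized non-linear effects N(q, q̇): Coriolis, centrifugal, gravity."""
    s2 = np.sin(q[1])
    n1 = (-m2 * l1 * l2 * s2 * (2 * qd[0] * qd[1] + qd[1]**2)
          + (m1 + m2) * g * l1 * np.cos(q[0])
          + m2 * g * l2 * np.cos(q[0] + q[1]))
    n2 = m2 * l1 * l2 * s2 * qd[0]**2 + m2 * g * l2 * np.cos(q[0] + q[1])
    return np.array([n1, n2])

def inverse_dynamics(q, qd, qdd):
    return inertia(q) @ qdd + nonlinear(q, qd)        # τ = M(q)q̈ + N(q, q̇)

def forward_dynamics(q, qd, tau):
    return np.linalg.solve(inertia(q), tau - nonlinear(q, qd))   # solve for q̈
```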
Depending on the number of bodies and joints, we end up with a certain number of degrees of freedom. For each body, we get a set of generalized variables q (coordinates), q̇ (velocities), q̈ (accelerations), and τ (joint torques). Given such a model, we can fully describe its dynamics by means of

(1)  M(q)q̈ + N(q, q̇) = τ

where M(q) is the joint-space inertia matrix and N(q, q̇) contains the generalized non-linear effects. Once we have such a model, we can formulate our optimal control problem using x = [q, q̇]ᵀ as states and u = τ as controls. The OCP can be written in its general form as:

(2)  min_{x,u,T}  ∫₀ᵀ L(t, x(t), u(t), p) dt + Φ_M(T, x(T))

subject to:
ẋ = f(t, x(t), u(t), p)
g(x(t), p) = 0
h(t, x(t), u(t), p) ≥ 0

Note that all the dynamics computed from our model are contained in the right-hand side of the differential equation ẋ = f(t, x(t), u(t), p). The first part of the objective function, ∫₀ᵀ L(t, x(t), u(t), p) dt, is called the Lagrange term; Φ_M(T, x(T)) is called the Mayer term. The former is used to address objectives that have to be evaluated over the whole time horizon (such as minimizing jerk), the latter objectives that only need to be evaluated at the end of the time horizon (such as overall time). In our case, we will often only use the Lagrange term. To solve such a problem numerically, we apply a direct multiple shooting method as implemented in the software package MUSCOD-II. For a more detailed description of the algorithm, see (Bock and Plitt 1984; Leineweber et al. 2003).

Experimental Data

Perception experiments

We performed two pre-studies to find out whether human contemplators can distinguish robot paintings from human-made paintings, and how they evaluate robot paintings created with different mathematical objective functions. In the first study, we showed nine paintings to 29 participants, most of whom were laymen in arts and only vaguely familiar with Jackson Pollock. Seven paintings were original art works by Jackson Pollock and two paintings were generated by the robot platform JacksonBot. We asked the participants to judge which of the paintings were original paintings by Pollock and which were not, but we intentionally did not inform them about the robotic background of the "fake" paintings. As might be expected, the original works by Pollock had a higher acceptance rate, but, very surprisingly, the difference between Pollock's and JacksonBot's paintings was not very high (2.74 ± 0.09 vs. 2.85 ± 0.76, on a scale of 1-5). In the second study, the participants were shown 10 paintings created solely by the robot platform, but with two opposite objective functions (maximum and minimum overall angular velocity in the robot arm) in the optimal control problem. The participants easily distinguished the two different painting styles. Since the pre-studies were only conducted to get a rather rough idea of this aspect, we developed a more sophisticated web-based platform for further, more detailed investigations. The data obtained from this tool can be used to enhance the robot's ability to monitor its painting process. The set of stimuli used for our studies consists of original action-art paintings by Pollock and other artists, and images that were painted by our robot platform.
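Returning briefly to the optimal control formulation in (2): the following is a deliberately crude single-shooting sketch - MUSCOD-II's direct multiple shooting is far more robust - that minimises a Lagrange-type control-effort term for the two-link arm sketched earlier (forward_dynamics() is assumed from that sketch), subject to reaching a hypothetical target posture at rest at time T.

```python
import numpy as np
from scipy.optimize import minimize

T, N = 1.0, 20                       # time horizon and control intervals
dt = T / N
q_target = np.array([0.5, -0.3])     # hypothetical target posture (rad)

def rollout(u_flat):
    """Integrate the arm dynamics under piecewise-constant torques (explicit Euler)."""
    u = u_flat.reshape(N, 2)
    q, qd, cost = np.zeros(2), np.zeros(2), 0.0
    for k in range(N):
        qdd = forward_dynamics(q, qd, u[k])   # from the two-link sketch above
        q, qd = q + dt * qd, qd + dt * qdd
        cost += dt * np.sum(u[k] ** 2)        # Lagrange term: control effort
    return cost, q, qd

objective = lambda u: rollout(u)[0]
# terminal equality constraint: reach q_target at rest (a Mayer-style condition)
terminal = lambda u: np.concatenate([rollout(u)[1] - q_target, rollout(u)[2]])

result = minimize(objective, np.zeros(2 * N), method="SLSQP",
                  constraints={"type": "eq", "fun": terminal})
```

Swapping the Lagrange term for the angular-velocity extremes mentioned in the second pre-study (maximising or minimising Σ q̇²) is a one-line change, which is what makes cost functions such a convenient handle on painting style.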
In the first task, contemplators are presented with three randomly chosen paintings¹ and asked to arrange them on the screen according to their similarity (see Figure 3). If they want, they are free to add a commentary to indicate their thoughts while arranging the paintings. As a result, we obtain for every set of two paintings a measure of their similarity in comparison with any other set of two paintings². Using standard procedures from statistics, such as cluster analysis, we can determine which paintings are overall rated more "similar" than others.

Figure 3: Interface for web-based similarity ratings.

In the second task, people are asked to perform a standard sorting study, i.e. they are asked to combine similar paintings into groups and to give some information on why they formed specific groups. The results of this task are used to validate the information obtained by the previous one and, additionally, to gain more information about the attributes and traits people seem to use while grouping. To this end, the set of possible tags for the formed groups is limited and chosen by us. It includes very basic image characteristics like colour as well as more interesting characteristics like associated emotions.

Figure 4: Interface for web-based sorting studies.

¹ More precisely, the paintings are not chosen purely at random; the probability of each painting being presented is slightly corrected in order to obtain many different correlations even when participants complete only a few repetitions.
² Note that we do not use the absolute values of "similarity" but quotients of these, in order to avoid offset problems.

Motion capture experiments

In order to study the way real human artists move during action painting, we chose to do motion-capture studies with our collaborating artist. As a first approach, we used three inertia sensors to record dynamic data D_capture. For each of the three segments of the artist's arm (hand, lower arm, upper arm), we recorded accelerations, angular velocities and the rotation matrix³ using three Xsens MTw inertial motion trackers. The sensors were placed directly above the calculated center of mass of each arm segment. Figure 5 shows an example of the raw data output obtained from the sensors.

Figure 5: Recorded acceleration data for a 3-second motion.

³ Recording the Euler angles is not sufficient, due to potential singularities in the reconstruction process.

We asked the artist to create different paintings and to describe her creative ideas as well as her thoughts and emotions during the process in her own words. That way, we can correlate identified objective functions with specific emotions or creative ideas.

Robot painting experiments

For our first experiments, we created paintings with our robot platform. In order to compute the robot joint trajectories necessary to move along a desired end effector path, we use an optimal control based approach to solve the inverse kinematics problem. Using our first robotic platform, we created several paintings using different cost functions in the optimal control problem. Two of them - maximizing and minimizing the angular velocities in the robot joints - resulted in significantly different paintings. These paintings were used in the pre-study mentioned earlier.
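One plausible analysis path for the arrangement task described above - the paper specifies ratio-based similarities (footnote 2) and "standard procedures like cluster analysis" but not the exact pipeline - could look like the following sketch, which simplifies the within-trial ratios to per-trial normalised distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

n_paintings = 6                       # illustrative stimulus set size
sums = np.zeros((n_paintings, n_paintings))
counts = np.zeros((n_paintings, n_paintings))

def record_trial(screen_dist):
    """screen_dist: {(i, j): on-screen distance} for the 3 paintings of one trial.
    Normalising within the trial sidesteps per-participant offsets."""
    norm = max(screen_dist.values())
    for (i, j), d in screen_dist.items():
        sums[i, j] += d / norm; sums[j, i] += d / norm
        counts[i, j] += 1;      counts[j, i] += 1

# ... record_trial() is called once per completed participant trial ...

# average dissimilarity per pair (unseen pairs default to maximal dissimilarity)
dissim = np.divide(sums, counts, out=np.ones_like(sums), where=counts > 0)
np.fill_diagonal(dissim, 0.0)
clusters = fcluster(linkage(squareform(dissim), method="average"),
                    t=2, criterion="maxclust")
```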
Data Analysis

Motion reconstruction

To fit the recorded dynamic data D_capture to our 9-DOF model of a human arm, which is based on data from (De Leva 1996), we formulated an optimal control problem that generates the motion x(t) = [q(t), q̇(t)]ᵀ and the controls u(t) = τ(t) that best fit the captured data with respect to the model dynamics f:

(3)  min_{x,u}  ½ ||D_capture(t) − D_simulated(t)||₂²

subject to:
ẋ(t) = f(t, x(t), u(t), p)
g(x, p) = 0
h(x, p) ≥ 0

The constraints in this case are given by the limited angles of the human arm joints and the torque limitations of the arm muscles. The computed states and the fit quality of the acceleration data can be seen in Figure 6. Note that the approach of the angles towards the joint limits is plausible for this type of motion.

Figure 6: Computed trajectories for joint angles (left) and comparison of computed (lines) and measured (dots) accelerations (right).

In the next step, we will use the motion capture data obtained from experiments with our collaborating artist not only to reconstruct the motion, but also, via an inverse optimal control approach (as used successfully in a similar case in (Mombaur, Truong, and Laumond 2010)), to retrieve the underlying objective functions of these motions. To do so, we will use an approach developed by K. Hatz (Hatz, Schlöder, and Bock 2012). This process is illustrated in Figure 7.

Figure 7: Transfer of human motion objectives to a robot platform (schematic overview).

Conclusion and Outlook

We introduced a new way to analyze the creative process of action painting by investigating the dynamic motions of artists. We developed a mathematical model, which we used to successfully reconstruct an artist's action-painting motions from inertia measurements. We used state-of-the-art optimal control techniques to create new action-painting motions for a robotic platform and evaluated the resulting paintings. Even with "artificial" objective functions, we were able to create action paintings that are indistinguishable from human-made action paintings for a human contemplator. In the next step, we will use an inverse optimal control approach to go one step further, from reconstructing an artist's motions to identifying the underlying objective functions of motion dynamics. That way, we will be able to generate specific painting motions corresponding to specific intentions as formulated by the artist. Since several studies, e.g. (Haak et al. 2008), have shown that aesthetic experiences and judgments can - up to a certain degree - be explained by analyzing low-level image features, we chose to develop an image analysis software tool, based on OpenCV, that uses a variety of different filters and image processing tools related to aesthetic experience. Amongst other features, our tool analyzes a painting's power spectrum, different symmetries, colour, and fractal characteristics (Taylor, Micolich, and Jonas 1999). We will include the information obtained from our online perception studies in this tool and use it as feedback for the robot platform. That way, we will enable it to paint autonomously, with feedback only from an integrated camera monitoring the process. The presented approach of capturing the essence of dynamic motions using inverse optimal control theory is not limited to the investigation of action paintings, but can be used to analyze human motions in other art forms, like dance, or even in daily life, by analyzing human gestures or full-body motions.
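Two of the low-level features the outlook mentions for the image analysis tool - the power-spectrum slope and a box-counting fractal dimension (cf. Taylor, Micolich, and Jonas 1999) - can be sketched as follows. Binning and threshold choices here are illustrative assumptions, not the tool's actual parameters, and the input is assumed to be a greyscale image normalised to [0, 1].

```python
import numpy as np

def power_spectrum_slope(img):
    """Slope of the radially averaged log power spectrum of a 2-D array."""
    f = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    cy, cx = np.array(f.shape) // 2
    y, x = np.indices(f.shape)
    r = np.hypot(y - cy, x - cx).astype(int)
    radial = np.bincount(r.ravel(), f.ravel()) / np.maximum(np.bincount(r.ravel()), 1)
    k = np.arange(1, min(cy, cx))
    slope, _ = np.polyfit(np.log(k), np.log(radial[1:min(cy, cx)] + 1e-12), 1)
    return slope        # roughly -2 for natural-image-like statistics

def box_counting_dimension(img, threshold=0.5):
    """Box-counting estimate of the fractal dimension of the 'ink' pattern."""
    binary = img > threshold
    sizes = [2 ** i for i in range(1, int(np.log2(min(binary.shape))))]
    counts = []
    for s in sizes:     # count boxes of side s containing any ink
        h, w = binary.shape[0] // s * s, binary.shape[1] // s * s
        blocks = binary[:h, :w].reshape(h // s, s, w // s, s)
        counts.append(blocks.any(axis=(1, 3)).sum())
    d, _ = np.polyfit(np.log(1 / np.array(sizes)), np.log(np.array(counts) + 1), 1)
    return d            # Pollock drip paintings are reported near 1.3-1.7
```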
2013_32 !2013 Creative Machine Performance: Computational Creativity and Robotic Art

Petra Gemeinboeck, College of Fine Art, University of NSW, NSW 2021, Australia, petra@unsw.edu.au
Rob Saunders, Design Lab, University of Sydney, NSW 2006, Australia, rob.saunders@sydney.edu.au

Abstract. The invention of machine performers has a long tradition as a method of philosophically probing the nature of creativity. Robotic art practices in the 20th century have continued in this tradition, playfully engaging the public in questions of autonomy and agency. In this position paper, we explore the potential synergies between robotic art practice and computational creativity research through the development of robotic performances. This interdisciplinary approach permits the development of significantly new modes of interaction for robotic artworks, and potentially opens up computational models of creativity to rich social and cultural environments through interaction with audiences. We present our exploration of this potential with the development of Zwischenräume (In-between Spaces), an artwork that embeds curious robots into the walls of a gallery. The installation extends the traditional relationship between the audience and artwork such that visitors to the space become performers for the machine.

Introduction

This paper looks at potential synergies between the practice of robotic art and the study of computational creativity. Starting from the position that creativity and embodiment are critically linked, we argue that robotic art provides a rich experimental ground for applying models of creative agency within a public forum. From the robotic art perspective, a computational creativity approach expands the performative capacity of a robotic artwork by enhancing its potential to interact with its ‘Umwelt' (von Uexküll 1957). In the 18th century, the Industrial Age brought with it a fascination with mechanical performers: Jacques de Vaucanson's Flute Player automaton and Baron Wolfgang von Kempelen's infamous chess-playing Mechanical Turk clearly demonstrate a desire to create apparently creative automata. Through their work, both Vaucanson and von Kempelen engaged the public in philosophical questions about the nature of creativity, the possibilities of automation and, crucially, perfection. Moving from mechanical to robotic machine performers, artists have deployed robotics to create apparently living and behaving creatures for over 40 years. The two dominant motivations for this creative practice have been to question "our premises in conceiving, building, and employing these electronic creatures" (Kac 2001), and to develop enhanced forms of interaction between machine actors and humans "via open, non-determined modes" (Reichle 2009). The pioneering cybernetic work Senster by Edward Ihnatowicz, for example, exhibited life-like movements and was programmed to ‘shy away' from loud noises. In contrast to the aforementioned automata, Ihnatowicz did not aim to conceal the Senster's inner workings, and yet "the public's response was to treat it as if it were a wild animal" (Rieser 2002). Norman White's Helpless Robot (1987-96) was a public sculpture which asked for help to be moved and, when assisted, continued to make demands and increasingly abused its helpers (Kac 1997). Petit Mal by Simon Penny resembled a strange kind of bicycle and reacted to and pursued gallery visitors.
With this work Penny aimed to explore the aesthetics of machines and their interactive behaviour in real-world settings; Petit Mal was, in Penny's words, "an actor in social space" (Penny 2000). Ken Rinaldo's Autopoiesis consisted of 15 robotic sculptures and evolved collective behavior based on their capability to sense each other's and the audience's presence (Huhtamo 2004). The installation Fish-Bird by Mari Velonaki comprised two robotic actors in the form of wheelchairs whose movements and written notes created a sense of persona. The relationship between the robot characters and the audience evolved based on autonomous movement, coordinated by a central controller, and what appeared to be personal, "handwritten" messages, printed by the robots (Rye et al. 2005). Our fascination with producing artefacts that appear to be creative has created a rich history for researchers of computational creativity to draw upon. What we learn from these interdisciplinary artistic approaches is that, as performers, the artificial agents are embodied and situated in ways that can be socially accessed, shared and experienced by audiences. Likewise, embodied artificial agents gain access to shared social spaces with other creative agents, e.g., audience members. The ability of robotic performers to interact with the audience relies not only on the robot's behaviours and responsiveness but also on the embodiment and enactment of these behaviours. It can be argued that the performer is most successful if both embodiment and enactment reflect its perception of the world, that is, if it is capable of expressing and communicating its disposition. Looking at robotic art works that explore notions of autonomy and artificial creativity may thus offer starting points for thinking about social settings that involve humans interacting and collaborating with creative agents. Our exploration revolves around the authors' collaboration to develop the robotic artwork Zwischenräume (In-between Spaces), a machine-augmented environment, for which we developed a practice of embedding embodied curious agents into the walls of a gallery, turning them into a playground for open-ended exploration and transformation.

Zwischenräume

The installation Zwischenräume embeds autonomous robots into the architectural fabric of a gallery. The machine agents are encapsulated in the wall, sandwiched between the existing wall and a temporary wall that resembles it. At the beginning of an exhibition, the gallery space appears empty, presenting an apparently untouched, familiar space. From the start, however, the robots' movements and persistent knockings suggest comprehensive machinery at work inside the wall. Over the course of the exhibition, the wall increasingly breaks open, and configurations of cracks and hole patterns mark the robots' ongoing sculpting activity (Figure 1).

Figure 1: Zwischenräume: curious robots transform our familiar environment.

The work uses robotics as a medium for intervention: it is not the spectacle of the robots that we are interested in, but rather the spectacle of the transformation of their environment. The starting point for this interdisciplinary collaboration was our common interest in the open-ended potential of creative machines to autonomously act within the human environment. From the computational creativity researcher's point of view, the embodied nature of the agents allowed for situating and studying the creative process within a complex material context.
For the artist, this collaboration opened up the affective potential to materially intervene in our familiar environment, bringing about a strange force, seemingly with an agenda and beyond our control. Each machine agent is equipped with a motorised hammer, chisel or punch, and a camera, allowing it to interact and network with the other machines by re-sculpting its environment (Figure 2). The embodied agents are programmed to be curious, and as such are intrinsically motivated to explore the environment. Once they have created large openings in the wall, the robots may study the audience members as part of their environment. In the first version of this work, the robots used their hammer both to punch holes and to communicate amongst the collective. In a later version, we experimented with a more formal sculptural approach that used heuristic compositions of graffiti glyphs to perforate walls. Using the more stealthy movements of a chisel, the work responded to the specific urban setting of the gallery by adapting graffiti that covered the exterior of the building to become an inscription, pierced into the pristine interior walls of the gallery space (Figure 3). The final version of Zwischenräume used a punch to combine the force of the hammer and the precision of the chisel.

Figure 2: Robot gantries are attached to walls.

Figure 3: Inscription of adapted graffiti glyphs.

Similar to Jean Tinguely's kinetic sculptures (Hultén 1975), Zwischenräume's performance and what it produces may easily evoke a sense of dysfunctionality. As the machines' adaptive capability is driven by seemingly non-rational intentions rather than optimisation, the work, in some sense, subverts standard objectives for machine intelligence and notions of machine agency. Rather, it opens up the potential for imagining a machine that is ‘free', a machine that is creative; see (Hultén 1987).

Machine Creativity

This section focuses on the development of the first version of Zwischenräume as depicted in Figures 1 and 2. Each robotic unit consisted of a carriage, mounted on a vertical gantry, equipped with a camera mounted on an articulated arm, a motorised hammer, and a contact microphone. The control system for the robots combined machine vision to detect features from the camera, audio processing to detect the knocking of other robots, and computational models of intrinsic motivation based on unsupervised and reinforcement machine learning, to produce an adaptive, autonomous and self-directed agency. The robot's vision system was developed to construct multiple models of the scene in front of the camera, using colour histograms to differentiate contexts, blob detection to detect individual shapes, and frame differencing to detect motion. Motion detection was only used to direct the attention of the vision system towards areas of possible interest within the field of view. Face detection is also used to recognise the presence of people, directing the attention of the robots towards visitors. While limited, these perceptual abilities provide sufficient richness for the learning algorithms to build models of the environment and determine what is different enough to be interesting. Movements, shapes, sounds and colours are processed, learned and memorised, allowing each robotic agent to develop expectations of events in its surrounds.
The machine learning techniques used in Zwischenräume combine unsupervised and reinforcement learning techniques (Russell and Norvig 2003): a self-organising map (Kohonen 1984) is used to determine the similarity between images captured by the camera; Q-learning (Watkins 1989) is used to allow the robots to discover strategies for moving about the wall, using the hammer and positioning the camera. Separate models are constructed for colours and shapes in images. To determine the novelty of a context, sparse histograms are constructed from captured images based on only 32 colour bins with a high threshold, so only the most significant colours are represented and compared using a self-organising map. Blob detection in low-resolution (32×32 pixel) images, relative to a typical model image of the wall, is used to discover novel shapes, which are encoded in a self-organising map as binary vectors. In both cases, the difference between an input and the known prototypes in the self-organising map provides a measure of novelty (Saunders 2001). Reinforcement learning is used to learn the consequences of movements within the visual field of the camera. Error in prediction between learned models of consequences and observed results is used as a measure of surprise. The result is a system that is able to learn a small repertoire of skills and appreciate the novelty of their results, e.g., that knocking on wood does not produce dents. This ability is limited to the immediate consequences of actions and does not currently extend to sequences of actions. The goal of the learning system is to maximise an internally generated reward for capturing ‘interesting' images and to develop a policy for generating rewards through action. Interest is calculated based on a computational model that captures intuitive notions of novelty and surprise (Saunders 2001): ‘novelty' is defined as a difference between an image and all previous images taken by the robot, e.g., the discovery of significant new colours or shapes; and ‘surprise' is defined as the unexpectedness of an image within a known situation, e.g., relative to a learned landmark or after having taken an action with an expected outcome (Berlyne 1960). Learning plays a critical role in both the assessment of novelty and surprise. For novelty, the robots have to learn suitably general prototypes for the different types of images that they encounter. For surprise, the ‘situation' against which images are judged includes a learned model of the consequences of actions (Clancey 1997). Consequently, intrinsic motivation to learn directs both the robot's gaze and its actions, resulting in a feedback process that increases the complexity of the environment - through the robot's knocking - relative to the perceptual abilities of the agent. Sequences of knocking actions are developed, such that the robots develop a repertoire of actions that produce significant perceived changes in terms of colour, shapes and motion. In this way, the robots explore their creative potential in re-sculpting their environment. Figure 4 presents a collage of images taken by a single robot when it discovered something ‘interesting', illustrating how the evaluation of ‘interesting' evolved for this robot; it shows how the agent's interest is affected by: (a) the positioning of the camera, e.g., the discovery of lettering on the plasterboard wall; (b) the use of the hammer, e.g., the production of dents and holes; and (c) the interaction of visitors.
Figure 4: Robot captures, showing the evolution of interesting changes in the environment.
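As a compressed illustration of this interest model, the sketch below computes novelty as the distance to the best-matching prototype of a small self-organising map and surprise as the prediction error of a learned action model. The map size, learning rates and equal weighting of the two terms are our assumptions, not values from the work.

```python
# Novelty from a self-organising map, surprise from action-prediction error,
# and interest as their weighted combination (an internally generated reward).
import numpy as np

class TinySOM:
    def __init__(self, n_units=16, dim=32, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.random((n_units, dim))
        self.lr = lr

    def novelty(self, x):
        """Distance to the closest prototype: large for unfamiliar inputs."""
        d = np.linalg.norm(self.w - x, axis=1)
        bmu = d.argmin()
        self.w[bmu] += self.lr * (x - self.w[bmu])  # adapt toward the input
        return d[bmu]

class ActionModel:
    """Predicts the feature vector that follows each action; the prediction
    error serves as a measure of surprise."""
    def __init__(self, n_actions, dim, lr=0.2):
        self.pred = np.zeros((n_actions, dim))
        self.lr = lr

    def surprise(self, action, observed):
        err = np.linalg.norm(observed - self.pred[action])
        self.pred[action] += self.lr * (observed - self.pred[action])
        return err

def interest(novelty, surprise, w_n=0.5, w_s=0.5):
    return w_n * novelty + w_s * surprise
```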
Discussion The robots' creative process turns the wall into a playful environment for learning, similar to a sandpit; from the audience's point of view, the wall is turned into a performance stage. This opens up a scenario of encounter for studying the potential of computational creativity and the role of embodiment. Following Pickering (2005), we argue that creativity cannot be properly understood, or modelled, without an account of how it emerges from the encounter between the world and intrinsically active, exploratory and productively playful agents. Embodiment and Creativity The agents' embodiment provides opportunities to expand their behavioural range by taking advantage of properties of the physical environment that would be difficult or impossible to simulate computationally (Brooks 1990). In Zwischenräume the machines' creative agency is not predetermined but evolves based on what happens in the environment they examine and manipulate. As the agents' embodiment evolves based on their interaction with the environment, the robots' creative agency affects the very processes out of which it is emergent. This resonates with Barad's argument that ‘agency is a matter of intra-acting: it is an enactment, not something that someone or something has' (Barad 2007). It also evokes Maturana and Varela's notion of enaction, where the act of bringing about a world occurs through the ‘structural coupling' between the dynamical environment and the autonomous agents (Maturana and Varela 1987). While the machines perturb and eventually threaten the wall's structural integrity, they adapt to their changing environment, the destruction of the wall and how it changes their perception of the world outside. The connection to creativity is two-fold: firstly, the robots' intrinsic motivation to explore, discover and constantly produce novel changes to their environment demonstrates a simple form of the creative process itself, akin to the act of doodling, where the motivation is a reflective exploration of possibilities rather than purposeful communication with others. Secondly, audiences interpret the machines' interactions based on their own context, producing a number of possible meaningful relations and associations. The agents' embodiment and situatedness become a portal for entering the human world, creating meaning. The agents' enacted perception also provides a window on the agents' viewpoint, thus possibly changing the perspective of the audience. Furthermore, an enactive approach (Barad 2003; Clark 1998; Thompson 2005) opens up alternative ways of thinking about creative human-machine collaborations. It makes possible a re-thinking of human-machine creativity beyond the polarisation of human and non-human, one that promotes shared or distributed agency within the creative act. Audience Participation Autonomous, creative machine performances challenge the most common interaction paradigm of primarily reacting to what is sensed, often according to a pre-mapped narrative. Zwischenräume's curious agents proactively seek interaction, rather than purely responding to changes in the surrounds. Once the robots have opened up the wall, the appearance and behaviours of audience members are perceived by the system as changes in their environment and become an integral part of the agents' intrinsic motivation system. The agents' behaviours adapt based on their perception and evaluation of their environment, including the audience, as either interesting or boring.
A curious machine performer whose behaviors are motivated by what it perceives and expects can be thought of as an audience to the audience's performance. Thus, in Zwischenräume it is not only the robots that perform, but also the audience that provokes, entertains and rewards the machines' curiosity. This notion of audience participation expands common interaction paradigms in interactive art and media environments (Paul 2003). The robots do not only respond or adapt to the audience's presence and behaviours, but also have the capacity to perceive the audience with a curious disposition. By turning around the traditional relationship between audiences and machinic performers, the use of curious robotic performers permits a re-examination of the machine spectacle. Lazardig (2008) argues that spectacle, as "a performance aimed at an audience," was central to the conception of the machine in the 17th century as a means of projecting a perception of utility, allowing the machine to become "an object of admiration and therefore guaranteed to ‘function'". Kinetic sculptures and robotic artworks exploit and promote the power of the spectacle in their relationship with the audience. This is also the case in Zwischenräume; however, it is not only the machines that are the spectacle for the audience but also the audience that becomes an ‘object of curiosity' for the machines (Figure 5). Thus the relationship with a curious robot extends the notion of the spectacle, and, in a way, brings it full circle.
Figure 5: Gallery visitor captured by one of the robots' cameras as he performs for the robotic wall.
Concluding Remarks A significant aspect of Zwischenräume's specific embodiment is that it embeds the creative agents in our familiar (human) environment. This allowed us to direct both our, and the audience's, attention to the autonomous process and creative agency, rather than the spectacle of the machine. The integration of computational models of creativity into this artwork extended the range of open-ended, non-determined modes of interaction with the existing environment, as well as between the artwork and the audience. We argue that it is both the embodied nature of the agents and their autonomous creative capacity that allow for novel meaningful interactions and relationships between the artwork and the audience. The importance of embodiment for computational creativity can also be seen in the improvising robotic marimba player Shimon, which uses a physical gesture framework to enhance synchronised musical improvisation between human and non-human musicians (Hoffmann and Weinberg 2011). The robot player's movements not only produce sounds but also play a significant role in performing visually and communicatively with the other (human) band members as well as the audience. Embodying creative agents and embedding them in our everyday or public environment is often messier and more ambiguous than purely computational simulation. What we gain, however, is not only a new shared embodied space for audience experience but also a new experimentation space for shared (human and non-human) creativity. Acknowledgements This research has been supported by an Australian Research Council Grant, a New Work Grant from the Austrian Federal Ministry for Education, Arts and Culture, and a Faculty Research Grant from COFA (University of NSW).
2013_33 !2013 An Artificial Intelligence System to Mediate the Creation of Sound and Light Environments Claudio Benghi Northumbria University, Ellison Building, Newcastle upon Tyne, NE1 8ST, England claudio.benghi@northumbria.ac.uk Gloria Ronchi Aether & Hemera, Kingsland Studios, Priory Green, Newcastle upon Tyne, NE6 2DW, England hemera@aether-hemera.com Introduction This demonstration presents the IT elements of an art installation that exhibits intelligent reactive behaviours in response to participant input, employing Artificial Intelligence (AI) techniques to create unique aesthetic interactions. The audience is invited to speak into a set of microphones; the system captures all the sounds performed and uses them to seed an AI engine for creating a new soundscape in real time, on the basis of a custom music knowledge repository. The composition is played back to the users through surrounding speakers and accompanied by synchronised light events on an array of coloured LEDs. This artwork allows viewers to become active participants in creating multisensory computer-mediated experiences, with the aim of investigating the potential for creative forms of inter-authorship. Software Application The installation's software has been built as a custom event manager developed under the .Net framework that can respond to events from the users, timers, and the UI, cascading them through the required algorithms and libraries as a function of specified interaction settings; this solution allowed swift changes to the behaviour of the artwork in response to the observation of audience interaction patterns.
Figure 1: Scheme of the modular architecture of the system
Different portions of the data flow have been externalised to custom hardware to reduce the computational load on the controlling computer: a configurable number of real-time converter devices transform the sounds of the required number of microphones into MIDI messages and channel them to the event manager; a cascade of Arduino devices controls the custom multi-channel lighting controllers; and the sound output stage relies on MIDI standards. A substantial amount of work has been put into the optimisation of the UI console controlling the behaviour of the installation; this turned out to be crucial for the success of the project as it allowed us to make use of the important feedback gathered in the first implementation of this participatory artwork.
Figure 2: GUI of the controlling system
The work was first displayed as part of a public event over three weeks and allowed the co-generation of unpredictable soundscapes with varying levels of users' appreciation. The evaluation of any public co-creation environment is itself a challenging research area and our future work will investigate and evaluate methodologies to do so; further developments to the AI are also planned to include feedback from past visitors. More information about this project can be found at: http://www.aether-hemera.com/s/aib
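The event-manager pattern described above can be sketched as follows. The original is a .Net application; this Python rendition is only illustrative, and all names (EventManager, "midi_in") are invented.

```python
# Sketch of an event manager that routes events from sources (microphones,
# timers, UI) through a configurable chain of handlers. Each handler may
# transform the payload, e.g. MIDI note -> soundscape seed -> light event.
from collections import defaultdict

class EventManager:
    def __init__(self):
        self.handlers = defaultdict(list)   # event type -> handler chain

    def register(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def emit(self, event_type, payload):
        for handler in self.handlers[event_type]:
            payload = handler(payload)
            if payload is None:             # a handler may swallow the event
                break

manager = EventManager()
manager.register("midi_in", lambda note: {"seed": note})
manager.register("midi_in", lambda seed: print("to AI engine:", seed))
manager.emit("midi_in", 60)
```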
2013_34 !2013 Controlling Interactive Music Performance (CIM) Andrew Brown, Toby Gifford and Bradley Voltz Queensland Conservatorium of Music, Griffith University andrew.r.brown@griffith.edu.au, t.gifford@griffith.edu.au, b.voltz@griffith.edu.au Abstract Controlling Interactive Music (CIM) is an interactive music system for human-computer duets. Designed as a creativity-support system, it explores the metaphor of human-machine symbiosis, where the phenomenological experience of interacting with CIM has both a degree of instrumentality and a sense of partnership. Building on Pachet's (2006) notion of reflexivity, Young's (2009) explorations of conversational interaction protocols, and Whalley's (2012) experiments in networked human-computer music interaction, as well as our own previous work in interactive music systems (Gifford & Brown 2011), CIM applies an activity/relationality/prominence-based model of musical duet interaction. Evaluation of the system from both audience and performer perspectives yielded consensus views that interacting with CIM evokes a sense of agency, stimulates creativity, and is engaging. Description The CIM system is an interactive music system for use in human-machine creative partnerships. It is designed to sit at a mid-point of the autonomy spectrum, according to Rowe's instrument paradigm vs. player paradigm continuum. CIM accepts MIDI input from a human performer, and improvises musical accompaniment. CIM's behaviour is directed by our model of duet interaction, which utilises various conversational, contrapuntal and accompaniment metaphors to determine appropriate musical behaviour. An important facet of this duet model is the notion of turn-taking - where the system and the human swap roles as the musical initiator. To facilitate turn-taking, the system includes some mechanisms for detecting musical phrases, and their completion. This way the system can change roles at musically appropriate times. Our early implementation of this system simply listened for periods of silence as a cue that the human performer had finished a phrase. Whilst this method is efficient and robust, it limits duet interaction and leads to a discontinuous musical result. This behaviour, whilst imbuing CIM with a sense of autonomy and independence, detracts from ensemble unity and interrupts musical flow. To address this deficiency, we implemented some enchronic segmentation measures, allowing for inter-part elision. Inter-part elision is where phrase-end in one voice coincides with (or is anticipated by) phrase-start in a second voice. In order to allow for inter-part elision, opportunistic decision making, and other synchronous devices for enhancing musical flow, we have implemented some measures of musical closure as secondary segmentation indicators. Additionally these measures guide CIM's own output, facilitating the generation of coherent phrase structure. The evaluation procedure Our evaluation process involved six expert musicians, including staff and senior students at a university music school and professional musicians from the state orchestra, who performed with the system under various conditions. The setup of MIDI keyboard and computer used for these sessions is shown in Figure 5.
Figure 5: A musician playing with CIM
Participants first played a notated score (see Figure 6). Next they engaged in free play with the system, giving them an opportunity to explore the behaviour of the system. Finally, they performed a short improvised duet with the system. The interactive sessions were video recorded. Following the interactive session each performer completed a written questionnaire.
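A minimal sketch of the silence-based phrase-end cue described above is given below; the one-second gap threshold is an illustrative assumption rather than CIM's actual setting.

```python
# Baseline phrase segmentation: a phrase is deemed finished once no MIDI
# event has arrived for a fixed gap. Efficient and robust, but, as noted
# above, it yields discontinuous turn-taking, motivating closure measures.
import time

class SilenceSegmenter:
    def __init__(self, gap_seconds=1.0):
        self.gap = gap_seconds
        self.last_event = None

    def on_midi_event(self, t=None):
        """Call on every note-on/note-off from the human performer."""
        self.last_event = t if t is not None else time.time()

    def phrase_ended(self, now=None):
        if self.last_event is None:
            return False
        now = now if now is not None else time.time()
        return (now - self.last_event) >= self.gap
```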
Figure 1: A musician interacting with the CIM system
2013_35 !2013 Towards a Flowcharting System for Automated Process Invention Simon Colton and John Charnley Computational Creativity Group, Department of Computing, Goldsmiths, University of London www.doc.gold.ac.uk/ccg
Figure 1: User-defined flowchart for poetry generation.
Flowcharts Ironically, while automated programming has had a long and varied history in Artificial Intelligence research, automating the creative art of programming has rarely been studied within Computational Creativity research. In many senses, software writing software represents a very exciting potential avenue for research, as it addresses directly issues related to novelty, surprise, innovation at process level and the framing of activities. One reason for the lack of research in this area is the difficulty inherent in getting software to generate code. Therefore, it seems sensible to start investigating how software can innovate at the process level with an approach less ambitious than full programming, and we have chosen the classic approach to process design afforded by flowcharts. Our aim is to provide a system simple enough to be used by non-experts to craft generative flowcharts; indeed, simple enough for the software itself to create flowcharts which represent novel and, hopefully, interesting new processes. We are currently in the fourth iteration of development, having found various difficulties with three previous approaches, ranging from the flexibility and expressiveness of the flowcharts to the mismatching of inputs with outputs, the storage of data between runs, and the ability to handle programmatic constructs such as conditionals and loops. In our current approach, we represent a process as a script, onto which a flowchart can be grafted. We believe this offers the best balance of flexibility, expressiveness and usability, and will pave the way to the automatic generation of scripts in the next development stage. We have so far implemented the natural language processing flowchart nodes required to model aspects of a previous poetry generation approach and a previous concept formation approach. The Flow System In figure 1 we present a screenshot of the system, which is tentatively called Flow. The flowchart shown uses 18 subprocesses which, in overview, do the following: a negative-valence adjective is chosen, and used to retrieve tweets from Twitter; these are then filtered to remove various types, and pairs are matched by syllable count and rhyme; finally the lines are split where possible and combined via a template into poems of four stanzas; multiple poems are produced and the one with the most negative overall valency is saved. A stanza from a poem generated using ‘malevolent' is given in figure 2. Note in figure 1 that the node bordered in red (WordList Categoriser) contains the sub-process currently running, and the node bordered in grey (Twitter) has been clicked by the user, which brings up the parameters for that sub-process in the first black-bordered box and the output from it in the second black-bordered box. We see that the 332nd of 1024 tweets containing the word ‘cold' is on view. Note also that the user is able to put a thumb-pin into any node, which indicates that the previous output from that node should be used in the next run, rather than being calculated again.
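The thumb-pin behaviour suggests a simple caching discipline over chained nodes. The sketch below is a generic illustration of that idea, not Flow's implementation; the Node class and the example sub-processes are invented.

```python
# A linearised flowchart of sub-processes where a "pinned" node replays its
# previous output instead of recomputing it on the next run.
class Node:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.pinned, self.last_output = False, None

    def run(self, data):
        if self.pinned and self.last_output is not None:
            return self.last_output          # reuse output from previous run
        self.last_output = self.fn(data)
        return self.last_output

def run_chart(nodes, data=None):
    for node in nodes:
        data = node.run(data)
    return data

chart = [
    Node("ChooseAdjective", lambda _: "malevolent"),
    Node("RetrieveTweets", lambda adj: [f"so {adj} today", f"{adj} winds"]),
    Node("FilterAndPair", lambda tweets: [t for t in tweets if len(t) > 8]),
]
print(run_chart(chart))
```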
It's our ambition to build a community of open-source developers and users around the Flow approach, so that the system can mimic the capabilities of existing generative systems in various domains, but, more importantly, invent new processes in those domains. Moreover, we plan to install the system on various servers worldwide, constantly reacting in creative ways to new nodes which are uploaded by developers, and to new flowcharts developed by users with a variety of cultural backgrounds. We hope to show that, in addition to creating at artefact level, software can innovate at process level, test the value of new processes and intelligently frame how they work and what they produce.
Figure 2: A stanza from the poem On Being Malevolent.
2013_36 !2013 A Rogue Dream: Web-Driven Theme Generation for Games Michael Cook Computational Creativity Group Imperial College, London mtc06@doc.ic.ac.uk ABSTRACT A Rogue Dream is an experimental videogame developed in seven days for a roguelike development challenge. It uses techniques from computational creativity papers to attempt to theme a game dynamically using a source noun from the player, including generating images and theme information. The game is part of exploratory research into bridging the gap between generating rules-based content and theme content for videogames. 1. DOWNLOAD While A Rogue Dream is not available to download directly, its code can be found at: https://github.com/cutgarnetgames/roguedream Spritely, a tool used in A Rogue Dream, can also be downloaded from: https://github.com/gamesbyangelina/spritely 2. BACKGROUND Procedural content generation systems mostly focus on generating structural details of a game, or arranging pre-existing contextual information (such as choosing a noun from a list of pre-approved words). This is because the relationship between the mechanics of a game and its theme is hard to define and has not been approached from a computational perspective. For instance, in Super Mario eating a mushroom increases the player's power. We understand that food makes people stronger; therefore a mushroom is contextually appropriate. In order to procedurally replace that with another object, the system must understand the real-world concepts of food, strength, size and change. Most content generation systems for games are designed to understand games, not the real world. How can we overcome that? 3. A ROGUE DREAM In [1] Tony Veale proposes mining Google Autocomplete using leading phrases such as "why do <noun>s..." and using the autocompletions as a source of general knowledge or stereotypes. We refer to this as ‘cold reading the Internet', and use it extensively in A Rogue Dream. We also employ Spritely, a tool for automatically generating sprite-based artwork by mining the web for images.
Figure 1: A screenshot from A Rogue Dream. The input was ‘cow'; enemies were ‘red', resulting in a red shoe being the enemy sprite. Abilities included ‘mooing' and ‘giving milk'.
The game begins by asking the player to complete the sentence "Last night, I dreamt I was a...". The noun used to complete the sentence becomes a parameter for the search systems in A Rogue Dream, such as Spritely and the various text retrieval systems based on Veale's cold reading. These are subject to further filtering; queries matching "why do <noun>s hate..." are used to label enemies, for example. This work connects to other research currently being conducted by the author in direct code modification for content generation [?].
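The cold-reading step can be illustrated as follows. Fetching live autocompletions is deliberately left abstract (the sample list stands in for the autocomplete service); the function name and template handling are our assumptions.

```python
# Turn autocompletions of a leading phrase into ability/trait labels for a
# noun, in the spirit of Veale's "cold reading the Internet".
import re

def stereotype_predicates(noun, completions):
    """Extract the predicate after 'why do <noun>s ...' in each completion."""
    pattern = re.compile(rf"why do {re.escape(noun)}s?\s+(.*)", re.IGNORECASE)
    predicates = []
    for c in completions:
        m = pattern.match(c)
        if m:
            predicates.append(m.group(1).strip())
    return predicates

sample = ["why do cows moo", "why do cows give milk", "why do cats purr"]
print(stereotype_predicates("cow", sample))   # ['moo', 'give milk']
```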
We hope to combine these two research tracks in order to build technology that can understand and situate abstract game concepts in a real-world context, and provide labels and fiction that describe and illustrate the game world accurately and in a thematically appropriate way. 2013_37 !2013 A Puzzling Present: Code Modification for Game Mechanic Design Michael Cook and Simon Colton Computational Creativity Group Imperial College, London {mtc06,sgc}@doc.ic.ac.uk
Figure 1: A screenshot from A Puzzling Present.
ABSTRACT A Puzzling Present is an Android and Desktop game released in December 2012. The game mechanics (that is, the player's abilities) as well as the level designs were generated using Mechanic Miner, a procedural content generator that is capable of exploring, modifying and executing codebases to create game content. It is the first game developed using direct code modification as a means of procedural mechanic generation. 1. DOWNLOAD A Puzzling Present is available on Android and for all desktop operating systems, for free, here: http://www.gamesbyangelina.org/downloads/app.html The source code is also available on gamesbyangelina.org. 2. BACKGROUND Mechanic Miner was developed as part of PhD research into automating the game design process, through a piece of software called ANGELINA. ANGELINA's ability to develop small games autonomously, including theming the game's content using social and web media, was demonstrated at ICCC 2012 [1]. Mechanic Miner represents a large step forward for ANGELINA as the system becomes able to inspect and modify code directly, instead of using grammars or other intermediate representations. ANGELINA's research has always aimed to produce playable games for general release. Space Station Invaders was released in early 2012 as a commission for the New Scientist, and a series of newsgames were released to coincide with several conferences in mid-2012. A Puzzling Present was the largest release to date, garnering over 6000 downloads and entering the Android New Game charts in December, as well as receiving coverage on Ars Technica, The New Scientist, and Phys.org. 3. A PUZZLING PRESENT The game itself contains thirty levels split into three sets of ten. Each set of levels, or world, has a unique power available to the player, such as inverting gravity or becoming bouncy. These powers can be switched on and off, and must be used to complete each level. Each power was discovered by Mechanic Miner by iterative modification of code and simulation of gameplay to test the code modifications. For more information on the system, see [2]. Levels were designed using the same system: mechanics are tested against designed levels to evaluate whether the level is appropriate. This means the system is capable of designing novel levels with mechanics it has never seen before; there is no human intervention to add heuristics or evaluations for specific mechanics. We are currently working on integrating Mechanic Miner into the newsgame generation module of ANGELINA, so that the two systems can work together to collaboratively build larger games. This initial work on code modification has also opened up major questions about the relationship between code and meaning in videogames, which we plan to explore in future work.
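The iterative modify-and-simulate loop can be caricatured as below. This toy is nowhere near Mechanic Miner, which operates on real codebases; here a "modification" is just setting a player attribute, and the simulator is a stub.

```python
# Generate-and-test over candidate mechanics: propose a code-level change,
# simulate a level with it, keep mechanics that make the level completable.
import random

CANDIDATE_EDITS = [("gravity", -1.0), ("bounciness", 2.0), ("speed", 2.0)]

def simulate(level, player):
    # Stand-in for running the game engine: this toy level is solvable
    # only if the mechanic flips gravity.
    return level["needs"] == "invert_gravity" and player.get("gravity", 1.0) < 0

def mine_mechanic(level, trials=50):
    for _ in range(trials):
        attr, value = random.choice(CANDIDATE_EDITS)
        player = {attr: value}          # the "code modification"
        if simulate(level, player):
            return attr, value          # a mechanic that passes the test
    return None

print(mine_mechanic({"needs": "invert_gravity"}))
```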
2013_38 !2013 Demonstration: A meta-pianist serial music comproviser Roger T. Dean austraLYSIS, Sydney; and MARCS Institute, University of Western Sydney, Australia roger.dean@uws.edu.au Computational processes which produce meta-human as well as seemingly-human outputs are of interest. Such outputs may become apparently human as they become familiar. So I write algorithmic interfaces (often in Max/MSP/Jitter) for real-time performative generation of complex musical/visual features, to be part of compositions or improvisations. Here I demonstrate a musical system to generate serial 12-tone rows, their standard transforms, and then to assemble them into melodic sequences, or into two-part meta-pianistic performances. Serial rigour of pitch construction is maintained throughout. This means here that 12-note motives are made, each of which comprises all the pitches within an octave on the piano (an octave comprises a doubling of the frequency of the sound, and notes at the start and end of this sequence are given the same note name: C D E F G A B C, etc.). Then a generative system creates a rigorous set of transforms of the chosen note sequences. But as in serial composition at large, when these are disposed amongst multiple voices, and used to create harmonies (simultaneous notes) as well as melodies (successions of separated notes), the serial chronology is modified. Furthermore, the system allows asynchronous processing of several versions of the original series, or of several different series. A range of complexity can result, and to enhance this I also made a companion system which uses tonal major-scale melodies in a similar way. Here the original (Prime) version consists only of 12 notes taken from within an octave of the major scale (which includes only 7 rather than 12 pitches), thus permitting some repetitions. Chromatic inversion is used, so that, for example, the scale of C major ascending from C becomes the scale of Ab major descending from C, and major tonality with change of key centre is preserved. The performance patch within the system provided a default stochastic rhythmic, chordal and intensity control process, all of whose features are open to real-time control by the user. The patches are used for generating components of electroacoustic or notated composition, normally with equal-tempered or alternative tuning systems performed on a physical-synthesis virtual piano (PianoTeq), and also within live solo MultiPiano performances involving acoustic piano and electronics. The outputs are meta-human in at least two senses. First, as with many computer patches, the physical limitations of playing an instrument do not apply, and Xenakian performance complexities can be realised. Second, no human improviser could achieve this precision of pitch transformation; rather, we have evidence that they tend to take a simplified approach to atonality, usually focusing on controlling intervals of 1, 2, 6, and 11 semitones. The products of these patches are also in use in experiments on the psychology of expectation (collaboration with Freya Bailes, Marcus Pearce and Geraint Wiggins, UK).
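The serial machinery described here is compact to express in code. The sketch below generates a random 12-tone row and its standard transforms; it is a generic illustration of serial technique, not the patch itself.

```python
# Generate a 12-tone row and its standard transforms: prime (P),
# retrograde (R), inversion (I) and retrograde-inversion (RI),
# all computed modulo the octave.
import random

def random_row(seed=None):
    row = list(range(12))
    random.Random(seed).shuffle(row)
    return row

def transpose(row, n):
    return [(p + n) % 12 for p in row]

def inversion(row):
    return [(2 * row[0] - p) % 12 for p in row]   # invert about first pitch

def retrograde(row):
    return row[::-1]

prime = random_row(seed=1)
forms = {
    "P": prime,
    "R": retrograde(prime),
    "I": inversion(prime),
    "RI": retrograde(inversion(prime)),
}
for name, r in forms.items():
    print(name, r)
```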
2013_39 !2013 assimilate collaborative narrative construction Damian Hills Creativity and Cognition Studio University of Technology, Sydney Sydney, Australia Damian.Hills@uts.edu.au Abstract This demonstration presents the 'assimilate collaborative narrative construction' project, which aims for a holistic system design supporting the creative possibilities of collaborative narrative construction. Introduction This demonstration presents the 'assimilate collaborative narrative construction' project (Hills 2011), which aims for a holistic system design supporting the creative possibilities of collaborative narrative construction. By incorporating interface mechanics with a flexible model of narrative template representation, the system design emphasises how mental models and intentions are understood by participants, and represents its creative knowledge outcomes based on these metaphorical and conversational exchanges. Using a touch-table interface, participants collaboratively narrate and visualise narrative sequences using online media obtained through a keyword search, or by words obtained from narrative templates. The search results are styled into generative behaviours that visually self-organise while participants make aesthetic choices about the narrative outcomes and their associated behaviours. The playful interface supports collaboration through embedded mechanics that extend gestural actions commonly performed during casual conversations. By embedding metaphorical schemes associated with narrative comprehension, such as pointing, exchanging, enlarging or merging views, gestural action drives the experience and supports the conversational aspects associated with narrative exchange. System Architecture The system architecture models the narrative template events to allow a particular narrative perspective, globally or locally within the generated story world. This is done by modeling conversation relationships with the aim of self-organising and negotiating an agreement surrounding several themes. The system extends Conversation Theory (CT) (Pask 1976), a theory of learning and social interaction, which outlines a formal method of conversation as a sense-making network. Based on CT entailment meshes with an added fitness metric, this develops a negotiated agreement surrounding several interrelated themes, leading to eventual narrative coherence. 2013_4 !2013 Evolving Figurative Images Using Expression-Based Evolutionary Art João Correia and Penousal Machado CISUC, Department of Informatics Engineering, University of Coimbra, 3030 Coimbra, Portugal jncor@dei.uc.pt, machado@dei.uc.pt Juan Romero and Adrian Carballal Faculty of Computer Science, University of A Coruña, A Coruña, Spain jj@udc.es, adrian.carballal@udc.es Abstract The combination of a classifier system with an evolutionary image generation engine is explored. The framework is composed of an object detector and a general-purpose, expression-based, genetic programming engine. Several object detectors are instantiated to detect faces, lips, breasts and leaves. The experimental results show the ability of the system to evolve images that are classified as the corresponding objects. A subjective analysis also reveals the unexpected nature and artistic potential of the evolved images. Introduction Expression-based Evolutionary Art (EA) systems have, in theory, the potential to generate any image (Machado and Cardoso 2002; McCormack 2007). In practice, the evolved images depend on the representation scheme used. As a consequence, the results of expression-based EA systems tend to be abstract images. Although this does not represent a problem, there has been a desire to evolve figurative images by evolutionary means since the start of EA. An early example of such an attempt can be found in the work of Steven Rooke (World 1996).
McCormack (2005; 2007) identified the problem of finding a symbolic-expression that corresponds to a known "target" image as one of the open problems of EA. More exactly, the issue is not finding a symbolic-expression, since this can be done trivially as demonstrated by Machado and Cardoso (2002); the issue is finding a compact expression that provides a good approximation of the "target" image and that takes advantage of its structure. We address this open problem by generalizing it: instead of trying to match a target image we evolve individuals that match a given class of images (e.g. lips). The issue of evolving figurative images has been tackled by two main types of approach: (i) developing tailored EA systems which resort to representations that promote the discovery of figurative images, usually of a certain kind; (ii) using general-purpose EA systems and developing fitness assignment schemes that guide the system towards figurative images. In the scope of this paper we are interested in the second approach. Romero et al. (2003) suggest combining a general-purpose evolutionary art system with an image classifier trained to recognize faces, or other types of objects, to evolve images of human faces. Machado, Correia, and Romero (2012a) presented a system that allowed the evolution of images resembling human faces by combining a general-purpose, expression-based, EA system with an off-the-shelf face detector. The results showed that it was possible to guide evolution and evolve images evocative of human faces. Here, we demonstrate that other classes of object can be evolved, generalizing previous results. The autonomous evolution of figurative images using a general-purpose EC system has rarely been accomplished. As far as we know, evolving different types of figurative images using the same expression-based EC system and the same approach has never been accomplished (with the exception of user-guided systems). We show that this can be attained with off-the-shelf classifiers, which indicates that the approach is generalizable, and also with purpose-built ones, which indicates that it is relatively straightforward to customize it to specific needs. We chose a rather ad-hoc set of classifiers in an attempt to demonstrate the generality of the approach. The remainder of this paper is structured as follows: a brief overview of the related work is made in the next section; afterwards we describe the approach for the evolution of objects, describing the framework, the Genetic Programming (GP) engine, the object detection system, and fitness assignment; next we explain the experimental setup, the results attained and their analysis; and, finally, we draw overall conclusions and indicate future research. Related Work The use of Evolutionary Computation (EC) for the evolution of figurative images is not new. Baker (1993) focuses on the evolution of line drawings, using a GP approach. Johnston and Caldwell (1997) use a Genetic Algorithm (GA) to recombine portions of existing face images, in an attempt to build a criminal sketch artist. With similar goals, Frowd, Hancock, and Carson (2004) use a GA, Principal Components Analysis and eigenfaces to evolve human faces. The evolution of cartoon faces (Nishio et al. 1997) and cartoon face animations (Lewis 2007) through GAs has also been explored. Additionally, Lewis (2007) evolved human figures.
The previously mentioned approaches share two common aspects: the systems have been specifically designed for the evolution of a specific type of image; the user guides evolution by assigning fitness. The work of Baker (1993) is an exception: the system can evolve other types of line drawings; however, it is initialized with hand-built line drawings of human faces. These approaches contrast with the ones where general-purpose evolutionary art tools, which have not been designed for a particular type of imagery, are used to evolve figurative images. Although the images created by their systems are predominantly abstract, Steven Rooke (World 1996) and Machado and Romero (see, e.g., 2011), among others, have successfully evolved figurative images using expression-based GP systems and user-guided evolution. More recently, Secretan et al. (2011) created picbreeder, a user-guided collaborative evolutionary engine. Some of the images evolved by the users are figurative, resembling objects such as cars, butterflies and flowers. The evolution of figurative images using hardwired fitness functions has also been attempted. The works of Ventrella (2010) and DiPaola and Gabora (2009) are akin to a classical symbolic regression problem in the sense that a target image exists and the similarity between the evolved images and the target image is used to assign fitness. In addition to similarity, DiPaola and Gabora (2009) also consider expressiveness when assigning fitness. This approach results in images with artistic potential, which was the primary goal of these approaches, but that would hardly be classified as human faces. As far as we know, the difficulty of evolving a specific target image, using symbolic regression inspired approaches, is common to all "classical" expression-based GP systems. The concept of using a classifier system to assign fitness is also a researched topic: in the seminal work of Baluja, Pomerlau, and Todd (1994) an Artificial Neural Network trained to replicate aesthetic assessments is used; Saunders and Gero (2001) employ a Kohonen Self-Organizing network to determine novelty; Machado, Romero, and Manaris (2007) use a bootstrapping approach, relying on a neural network, to promote style changes among evolutionary runs; Norton, Darrell, and Ventura (2010) train Artificial Neural Networks to learn to associate low-level image features to synsets that function as image descriptors and use the networks to assign fitness. Overview of the Approach Figure 1 depicts an overview of the framework, which is composed of two main modules, an evolutionary engine and a classifier. The approach can be summarized as follows: 1. Random initialization of the population; 2. Rendering of the individuals, i.e., genotype-phenotype mapping; 3. Apply the classifier to each phenotype; 4. Use the results of the classification to assign fitness; this may require assessing internal values and intermediate results of the classification; 5. Select progenitors; apply genetic operators, create descendants; use the replacement operator to update the current population; 6. Repeat from 2 until some stopping criterion is met.
Figure 1: Overview of the system.
The framework was instantiated with a general-purpose GP-based image generation engine and with a Haar Cascade Classifier. To create a fitness function able to guide evolution it is necessary to convert the binary output of the detector to one that can provide a suitable fitness landscape.
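These six steps translate almost directly into a generic skeleton, shown below. Rendering, classification and the genetic operators are injected as placeholder functions; tournament selection is our assumption, as the selection scheme is not specified at this point.

```python
# Skeleton of the six-step evolutionary loop described above.
import random

def evolve(pop_size, generations, random_genotype, render, classify_fitness,
           crossover, mutate):
    population = [random_genotype() for _ in range(pop_size)]       # step 1
    for _ in range(generations):
        phenotypes = [render(g) for g in population]                # step 2
        fitness = [classify_fitness(p) for p in phenotypes]         # steps 3-4
        def tournament():                                           # step 5
            a, b = random.sample(range(pop_size), 2)
            return population[a] if fitness[a] >= fitness[b] else population[b]
        population = [mutate(crossover(tournament(), tournament()))
                      for _ in range(pop_size)]                     # step 6
    return population
```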
The conversion of the detector's binary output is attained by accessing internal results of the classification task that give an indication of the degree of certainty in the classification. In the following sections we explain the components of the framework, namely the evolutionary engine, the classifier and the fitness function. Genetic Programming Engine The EC engine used in these experiments is inspired by the works of Sims (1991). It is a general-purpose, expression-based, GP image generation engine that allows the evolution of populations of images. The genotypes are trees composed of a lexicon of functions and terminals. The function set is composed of simple functions such as arithmetic, trigonometric and logic operations. The terminal set is composed of two variables, x and y, and randomly initialized constants. The phenotypes are images that are rendered by evaluating the expression-trees for different values of x and y, which serve both as terminal values and image coordinates. In other words, to determine the value of the pixel at the (0,0) coordinates one assigns zero to x and y and evaluates the expression-tree (see figure 2). A thorough description of the GP engine can be found in (Machado and Cardoso 2002). Figure 3 displays typical imagery produced via interactive evolution using this EC system.
Figure 2: Representation scheme with examples of functions and the corresponding images.
Figure 3: Examples of images generated by the evolutionary engine using interactive evolution.
Object Detection For classification purposes we use Haar Cascade classifiers (Viola and Jones 2001). The classifier assumes the form of a cascade of small and simple classifiers that use a set of Haar features (Papageorgiou, Oren, and Poggio 1998) in combination with a variant of Adaboost (Freund and Schapire 1995), and is able to attain efficient classification. This classification approach was chosen due to its state-of-the-art relevance and its fast classification. Both code and executables are integrated in the OpenCV API (http://opencv.org/). The face detection process can be summarized as follows: 1. Define a window of size w (e.g. 20 × 20). 2. Define a scale factor s greater than 1. For instance, 1.2 means that the window will be enlarged by 20%. 3. Define W and H as the size of the input image. 4. From (0, 0) to (W, H) define a sub-window with a starting size of w for calculation. 5. For each sub-window apply the cascade classifier. The cascade has a group of stage classifiers, as represented in figure 4. Each stage is composed, at its lower level, of a group of Haar features (figure 5). Apply each feature of each corresponding stage to the sub-window. If the resulting value is lower than the stage threshold, the sub-window is classified as a non-object and the search terminates for that sub-window. If it is higher, continue to the next stage. If all cascade stages are passed, the sub-window is classified as containing an object. 6. Apply the scale factor s to the window size w and repeat until the window size exceeds the image in at least one dimension.
Figure 4: Cascade of classifiers with N stages, adapted from (Viola and Jones 2001).
Figure 5: The set of possible features, adapted from (Lienhart and Maydt 2002).
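The detection procedure above can be transcribed nearly line for line, as in the following sketch. The stage evaluation is abstracted into a stage_score callback, and the stride is an assumption; real implementations such as OpenCV's differ in many details.

```python
# Multi-scale sliding-window detection with a stage cascade, following the
# six steps above. stage_score(x, y, size, i) stands in for evaluating the
# Haar features of stage i on the given sub-window.
def detect(image_w, image_h, stage_thresholds, stage_score, w=20, s=1.2):
    detections = []
    size = w
    while size <= image_w and size <= image_h:            # steps 1-2, 6
        step = max(1, size // 4)                          # stride (assumed)
        for y in range(0, image_h - size + 1, step):      # step 4
            for x in range(0, image_w - size + 1, step):
                passed = True
                for i, threshold in enumerate(stage_thresholds):   # step 5
                    if stage_score(x, y, size, i) < threshold:
                        passed = False                    # rejected: stop early
                        break
                if passed:
                    detections.append((x, y, size))
        size = int(size * s)
    return detections
```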
Fitness Assignment The process of fitness assignment is crucial from an evolutionary point of view, and it therefore holds great importance for the success of the described system. The goal is to evolve images that the object detector classifies as an object of the positive class. However, the binary output of the detector is inappropriate to guide evolution. A binary function gives no information of how close an individual is to being a valid solution to the problem and, as such, the EA would be performing, essentially, a random search. It is necessary to extract additional information from the detection process in order to build a suitable fitness function. This is attained by accessing internal results of the classification task that give an indication of the degree of certainty in the classification. Based on results of past experiments (Machado, Correia, and Romero 2012a; 2012b) we employ the following fitness function:

fitness(x) = \sum_{i=1}^{nstages_x} stagedif_x(i) \times i + nstages_x \times 10   (1)

The underlying rationale is the following: images that go through several classification stages, coming closer to being classified as an object, have higher fitness than those rejected in early stages. Variables nstages_x and stagedif_x(i) are extracted from the object detection algorithm. Variable nstages_x holds the number of stages that image x has successfully passed. That is, an image that passes several stages is likely to be closer to being recognized as containing an object than one that passes fewer stages. In other words, passing several stages is a pre-condition to being classified as containing the object. Variable stagedif_x(i) holds the maximum difference between the threshold necessary to overcome stage i and the value attained by the image at the i-th stage. Images that are clearly above the thresholds are preferred over ones that are only slightly above them. Obviously, this fitness function is only one of several possible ones.
Table 1: Haar training parameters.
Number of stages: 30
Min true positive rate per stage: 99.9%
Max false positive rate per stage: 50%
Object width: 20, or 40 (breasts, leaf)
Object height: 20, or 40 (leaf)
Haar features: ALL
Number of splits: 1
Adaboost algorithm: GentleAdaboost
Experimentation Within the scope of this paper we intend to evolve the following objects: faces, lips, breasts and leaves. For the first two we use off-the-shelf classifiers that were already trained and used by other researchers in different lines of investigation (Lienhart and Maydt 2002; Lienhart, Kuranov, and Pisarevsky 2003; Santana et al. 2008). For the last two we created our own classifiers, by choosing suitable datasets and training the respective object classifier. In order to construct an object classifier we need to construct two datasets: (i) positive - examples of images that contain the object we want to detect; (ii) negative - images that do not contain the object. Furthermore, for the positive examples, we must identify the location of the object in the images (see figure 6) in order to build the ground truth file that will be used for training. For these experiments, the negative dataset was attained by picking images from a random search using image search engines, and from the Caltech-256 Object Category dataset (Griffin, Holub, and Perona 2007). Figure 7 depicts some of the images used as negative instances. In what concerns the positive datasets: the breast object detector was built by searching images on the web; the leaf dataset was obtained from the Caltech-256 Object Category dataset and from web searches. As previously mentioned, the face and lip detectors are off-the-shelf classifiers.
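Read literally, equation (1) can be computed from the cascade internals as follows; stage scores and thresholds are assumed to be available from the detector, which in practice requires access to its internal state.

```python
# Equation (1): sum the margin above each passed stage threshold, weighted
# by the stage index, plus 10 per stage passed.
def fitness(stage_scores, stage_thresholds):
    nstages, total = 0, 0.0
    for i, (score, threshold) in enumerate(zip(stage_scores,
                                               stage_thresholds), start=1):
        if score < threshold:
            break                              # rejected at stage i
        nstages += 1
        total += (score - threshold) * i       # stagedif_x(i) * i
    return total + nstages * 10

print(fitness([0.9, 0.8, 0.7], [0.5, 0.5, 0.6]))
```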
Besides choosing datasets we must also define the training parameters. Table 1 presents the parameters used for training the cascade classifiers. The success of the approach is related to the performance of the classifier itself. By defining a high number of stages we are creating several stages that the images must overcome to be considered a positive example. The high true positive rate ensures that almost every positive example is learned per stage. The max false positive rate creates some margin for error, allowing the training to achieve the minimum true positive rate per stage and a low false positive rate at the end of the cascade. Similar parameters were used and discussed in (Lienhart, Kuranov, and Pisarevsky 2003).
Figure 6: Examples of images used to train a cascade classifier for leaf detection. On the top row the original image, on the bottom row the cropped example used for training.
Once the classifiers are obtained, they are used to assign fitness in the course of the evolutionary runs in an attempt to find images that are recognized as faces, lips, breasts and leaves. We performed 30 independent evolutionary runs for each of these classes. In summary we have 4 classifiers, with 30 independent EC runs each, totaling 120 EC runs. The settings of the GP engine, presented in table 2, are similar to those used in previous experimentation in different problem domains. Since the classifiers used only deal with greyscale information, the GP engine was also limited to the generation of greyscale images. The population size used in these experiments is 100, while in previous experiments we used a population size of 50 (Machado, Correia, and Romero 2012a). This allows us to sample a larger portion of the search space, contributing to the discovery of images that fit the positive class. In all evolutionary runs the GP engine was able to evolve images classified as the respective objects. Similarly to the behavior reported by Machado, Correia, and Romero (2012a; 2012b), the GP engine was able to exploit weaknesses of the classifier; that is, the evolved images are classified as the object but, from a human perspective, they often fail to resemble the object. In figure 8 we present examples of such failures. As can be observed, it is hard to recognize breasts, faces, leaves or lips in the presented images. It is important to notice that these weaknesses are neither a byproduct of the fitness assignment scheme (as such, they cannot be solved by using a different fitness function) nor particular to the classifiers used. Although different classifiers have different weaknesses, we confirmed that several of the evolved images that do not resemble faces are also recognized as faces by commercially available and widely used classifiers.
Figure 7: Examples of images belonging to the negative dataset used for training the cascade classifiers.
Table 2: Parameters of the GP engine. See (Machado and Cardoso 2002) for a detailed description.
Population size: 100
Number of generations: 100
Crossover probability: 0.8 (per individual)
Mutation probability: 0.05 (per node)
Mutation operators: sub-tree swap, sub-tree replacement, node insertion, node deletion, node mutation
Initialization method: ramped half-and-half
Initial maximum depth: 5
Mutation max tree depth: 3
Function set: +, -, ×, /, min, max, abs, neg, warp, sign, sqrt, pow, mdist, sin, cos, if
Terminal set: x, y, random constants
These results have opened a series of possibilities, including the use of this approach to assess the robustness of object detection systems, and also the use of evolved images as part of the training set of these classifiers in order to overcome some of their shortcomings. Although we are already pursuing that line of research and promising results have been obtained (Machado, Correia, and Romero 2012b), it is beyond the scope of the current paper. When one builds a face detector, for instance, one is typically interested in building one that recognizes faces of all types, sizes, colors, sexes, in different lighting conditions, against clear and cluttered backgrounds, etc. Although the inclusion of all these examples may lead to a robust classifier that is able to detect all faces present in an image, it also means that this classifier will be prone to recognize faces even when only relatively few features are present.
Figure 8: Examples of evolved images identified as objects by the classifiers that do not resemble the corresponding objects from a human perspective. These images were recognized as breasts (a), faces (b), leaves (c) and lips (d).
In contrast, when building classifiers for the purpose described in this paper, one may select clear and iconic images as positive examples. Such classifiers would probably fail to identify a large portion of real-world images containing the object. However, they would be extremely selective and, as such, the evolutionary runs would tend to converge to images that clearly match the desired object. Thus, although this was not explored, building a selective classifier can significantly reduce the number of runs that converge to atypical images such as the ones depicted in figure 8. According to our subjective assessment, some runs were able to find images that actually resemble the object that we are trying to evolve. These add up to 6 runs for the face detector, 5 for the lip detector, 4 for the breast detector and 4 for the leaf detector. In figures 9, 10, 11 and 12 we show, according to our subjective assessment, some of the most interesting images evolved. These results allow us to state that, at least in some instances, the GP engine was able to create figurative images evocative of the objects that the object detector was designed to recognize as belonging to the positive class. By looking at the faces, figure 9, we can observe the presence of at least 3 facial features per image (such as eyes, lips, nose and head contour). The images from the first row have been identified by users as resembling Wolverine.
Figure 9: Examples of some of the most interesting images that have been evolved using face detection to assign fitness.
The ones of the second row, particularly the one on the left, have been identified as masks (more specifically, African masks). In what concerns the images from the last row, we believe that their resemblance to "ghost-like" cartoons is striking. In what concerns the images resulting from the runs where a lip detector was used to assign fitness, we consider that their resemblance to lips, caricatures of lips, or lip logos is self-evident. The iconic nature of the images from the last row is particularly appealing to us. The results obtained with the breast detector reveal images with well-defined or exaggerated features. We found little variety in these runs, with changes occurring mostly at the pixel intensity and contrast level. As previously mentioned, most of these runs resulted in unrecognizable images (see figure 8), which is surprising since the nature of the function set would lead us to believe that it should be relatively easy to evolve such images. Nevertheless, the successful runs present images that are clearly evocative of breasts. Finally, the images from the leaf detector vary in type and shape. They share, however, a common feature: they tend to be minimalist, resembling logos. In each of the images of the first row the detector identified two leaf shapes. On the others a single leaf shape was detected.
Figure 10: Examples of some of the most interesting images that have been evolved using a detector of lips to assign fitness.
Figure 11: Examples of some of the most interesting images that have been evolved using a detector of breasts to assign fitness.
Figure 12: Examples of some of the most interesting images that have been evolved using a detector of leaves to assign fitness.
In general, when the runs successfully evolve images that actually resemble the desired object, they tend to generate images that exaggerate the key features of the class. This is entirely consistent with the fitness assignment scheme, which values images that are recognized with a high degree of certainty. This constitutes a valuable side effect of the approach, since the evolution of caricatures and logos fits our intention to further explore these images from an artistic and design perspective. The convergence to iconic, exaggerated instances of the class may indicate the occurrence of the "Peak Shift Principle", but further testing is necessary to confirm this interpretation of the results. Conclusions The goal of this paper was to evolve different figurative images by evolutionary means, using a general-purpose expression-based GP image generation engine and object detectors. Using the framework presented by Machado, Correia, and Romero (2012a), several object detectors were used to evolve images that resemble faces, lips, breasts and leaves. The results from 30 independent runs per classifier show that it is possible to evolve images that are detected as the corresponding objects and that also resemble that object from a human perspective. The images tend to depict an exaggeration of the key features of the associated object, allowing the exploration of these images in design and artistic contexts. The paper makes 3 main contributions, addressing: (i) a well-known open problem in evolutionary art; (ii) the evolution of figurative images using a general-purpose expression-based EC system; (iii) the generalization of previous results. The open problem of finding a compact symbolic expression that matches a target image is addressed by generalization: instead of trying to match a target image we evolve individuals that match a given class. Previous results (see (Machado, Correia, and Romero 2012a)) concerned only the evolution of faces. Here we demonstrate that other classes of objects can be evolved. As far as we know, this is the first autonomous system that has proved able to evolve different types of figurative images. Furthermore, the experimental results show that this is attainable with off-the-shelf and purpose-built classifiers, demonstrating that the approach is both generalizable and customizable. Currently, we are performing additional tests with different object detectors in order to expand the types of imagery produced.
The next steps will comprise the following: combine, refine and explore the evolved images, using them in user-guided evolution and automatic fitness assignment schemes; combine multiple object detectors to help refine the evolved images (for instance, use a face detector first and an eye or a lip detector next); and use the evolved examples that expose shortcomings of the classifiers to refine the training set and boost the existing detectors.

Acknowledgements

This research is partially funded by: the Portuguese Foundation for Science and Technology in the scope of project SBIRC (PTDC/EIA-EIA/115667/2009) and of the iCIS project (CENTRO-07-ST24-FEDER-002003), which is co-financed by QREN, in the scope of the Mais Centro Program and European Union's FEDER; Xunta de Galicia Project XUGA-PGIDIT10TIC105008PR.

2013_40 !2013 Breeding on site Tatsuo Unemi Department of Information Systems Science Soka University Tangi-machi 1-236, Hachioji, Tokyo 192-8577 Japan unemi@iss.soka.ac.jp

Figure 1: System setup (two computers connected via Ethernet: SBArt4 and its controller on computer #1, the player on computer #2).

This is a live performance of improvisational production and playback of a type of evolutionary art using a breeding tool, SBArt4 version 3 (Unemi 2010). The performer breeds a variety of individual animations using SBArt4 on the machine in front of him or her, in the manner of interactive evolutionary computation, and sends the genotype of his/her favorite individual to SBArt4Player through a network connection. Figure 1 is a schematic illustration of the system setup. Each individual animation that reaches the remote machine is played back repeatedly, with synchronized sound effects, until another one arrives. Assisted by a mechanism of automated evolution based on computational aesthetic measures as the fitness function, it is relatively easy to produce interesting animations and sound effects efficiently on site (Unemi 2011). The player component includes functionality to composite another animation of feathery particles that reacts to the original image rendered by a genotype. Each particle moves guided by a force calculated from the HSB color value under the particle: the brightness is mapped to the strength, the hue is mapped to the orientation, and the saturation is mapped to the fluctuation. These additional effects provide another impression for viewers. The performance starts from a simple pattern selected from a randomly generated initial population, and then gradually shifts to complex patterns. The parameters of sound synthesis are fundamentally determined from statistical features of the frame image so that the sound fits the impression of the visuals, but some of them are also subject to real-time tuning. The performer may adjust several parameters, such as scale, tempo, rhythm, noise, and other modulation parameters (Unemi 2012), following his/her preference. Because the breeding process includes spontaneous transformation by mutation and combination, the animations shown in a performance are always different from those on any other occasion. This means each performance is a one-time event.

Figure 2: Live performance in Rome, December 2011.

2013_5 !2013 Fitness Functions for Ant Colony Paintings Penousal Machado and Hugo Amaro CISUC, Department of Informatics Engineering University of Coimbra 3030 Coimbra, Portugal machado@dei.uc.pt, hamaro@student.dei.uc.pt

Abstract

A creativity-support tool for the creation of non-photorealistic renderings of images is described.
It employs an evolutionary algorithm that evolves the parameters governing the behavior of ant species, and the paintings are produced by simulating the behavior of these artificial ants. The design of fitness functions, using both behavioral and image features, is discussed, emphasizing the rationale and intentions that guided the design. The analysis of the experimental results obtained using different fitness functions focuses on assessing whether they convey the intentions of the fitness function designer.

Introduction

Machado and Pereira (2012) presented a non-photorealistic rendering (NPR) algorithm inspired by ant colony approaches: the trails of artificial ants were used to produce a rendering of an original input image. One of the novel characteristics of this algorithm is the adoption of scalable vector graphics, which contrasts with the pixel-based approaches used in most ant painting algorithms and enables the creation of resolution-independent images. The trail of each ant is represented by a continuous line of varying width, contributing to the expressiveness of the NPRs. In spite of the potential of this generative approach, the number of parameters controlling the behavior of the ants, and their interdependencies, soon proved too large to allow tuning by hand. These attempts revealed that only a small subset of the creative possibilities allowed by the algorithm was being explored. To tackle this problem, Machado and Pereira (2012) presented a human-in-the-loop Genetic Algorithm (GA) to evolve the parameters, allowing users to guide the algorithm according to their preferences and avoiding the need to understand the intricacies of the algorithm. Thus, instead of being forced to perform low-level changes, the users of this creativity-support tool become breeders of species of ants that produce results they find valuable. The experimental results highlight the range of imagery that can be evolved by the system, showing its potential for the production of large-format artworks. This paper describes a further step in the automation of the space exploration process and a departure from low-level modification and assessment. The users become designers of fitness functions, which are used to guide evolution, leading to results that are consistent with the users' intentions. To this end, while the ants paint, statistics describing their behavior are gathered. Once each painting is completed, image features are calculated. These behavioral and image features are the basis for the creation of the fitness functions. Human-in-the-loop evolutionary art systems are often used as creativity-support tools and are thought to have potential for exploratory creativity. Allowing users to design fitness functions by specifying desired combinations of characteristics provides an additional level of abstraction, enabling them to focus on their intents and overcoming the user fatigue problem. Additionally, this approach opens the door to evaluating the system by comparing the intents of the user with the outcomes of the process. We begin with a short survey of related work. Next, in the third section, we describe the system, focusing on the behavior of the ants and on the evolutionary algorithm. In the fourth section we present experimental results, making a brief analysis. Finally, we draw some conclusions and discuss aspects to be addressed in future work.
State of the Art

In this section we survey related work, focusing on systems that use artificial ants for image generation purposes and on systems where evolutionary computation is employed for NPR purposes. Tzafestas (2000) presents a system where artificial ants pick up and deposit food, which is represented by paint, and studies the self-regulation properties and complexity of the system and the resulting images. Ramos and Almeida (2000) explore the use of ant systems for pattern recognition purposes. The artificial ants successfully detect the edges of the images, producing stylized renderings of the originals and smooth transitions between different images. The artistic potential of these approaches is explored in later works (Ramos 2002) and through his collaboration with the artist Leonel Moura, resulting in several robotic swarm drawings (Moura 2002). Urbano (2005; 2007; 2011) presents several multi-agent systems based on artificial ants. Aupetit et al. (2003) introduce an interactive GA for the creation of ant paintings. The algorithm evolves parameters of the rules that govern the behavior of the ants. The artificial ants deposit paint on the canvas as they move, thus producing a painting. In a later study, Monmarché et al. (2007) refine this approach, exploring different rendering modes. Greenfield (2005) presents an evolutionary approach to the production of ant paintings and explores the use of behavioral statistics of the artificial ants to automatically assign fitness. Later, Greenfield (2006) adopted a multiple-pheromone model where ants' movements and behaviors are influenced (attracted or repelled) by both an environmentally generated pheromone and an ant-generated pheromone. The use of evolutionary algorithms to create image filters and NPRs of source images has been explored by several researchers. Focusing on works with an artistic goal, we can mention the research of: Ross et al. (2006) and Neufeld et al. (2007), where Genetic Programming (GP), multi-objective optimization techniques, and an empirical model of aesthetics are used to automatically evolve image filters; Lewis (2004), who evolves live-video processing filters through interactive evolution; Machado et al. (2002), who use GP to evolve image coloring filters from a set of examples; Yip (2004), who employs GAs to evolve filters that produce images matching certain features of a target image; Collomosse (2006; 2007), who uses image salience metrics to determine the level of detail for portions of the image, and GAs to search for painterly renderings that match the desired salience maps; Hewgill and Ross (2003), who use GP to evolve procedural textures for 3D objects; and Machado and Graça (2008), who employ GP to evolve assemblages of 3D objects that are an artistic representation of an input image.

The Framework

The system is composed of two main modules: the evolutionary engine and the painting algorithm. A graphic user interface gives access to these modules (see Fig. 1).

Figure 1: Screenshot of the graphic user interface. Control panel on the left and current population of ant paintings on the right.

Each genotype of the GA population encodes the parameters of a species of ants. These parameters determine how that ant species reacts to the input image. Each painting is produced by simulating the behavior of ants of a given species while they travel across the canvas, leaving a trail of varying width and transparency. In the following sections we describe the framework.
First, we present the painting algorithm. Next, we describe the evolutionary component. Finally, we detail the behavioral and image features that are gathered.

The Painting Algorithm

Our ants live in the 2D world provided by the input image and they paint on a painting canvas that is initially empty (i.e., black). Both the living and the painting canvas have the same dimensions, and the ants move simultaneously on both.

Figure 2: On the left, an ant with five sensory vectors. In the middle, the living canvas of an ant species. On the right, its painting canvas.

The painting canvas is used exclusively for depositing ink and has no influence on the behavior of the ants. Each ant has a position, color, deposit transparency and energy; all the remaining parameters are shared by the entire species. If the energy of an ant is below a given threshold it dies; if it is above a given threshold it generates offspring. The luminance of an area of the living canvas represents the available energy, i.e. food, at that point. Therefore, ants may gain energy by traveling through bright areas. The energy consumed by an ant is removed from the living canvas, as will be explained later in detail. The ants' movement is determined by how they react to light. Each ant senses the environment by "looking" in several directions (see Fig. 2). We use 10 sensory vectors; each vector has a given direction relative to the current direction of the ant, and a length. The sensory organs return the luminance value of the area where each vector ends. To update the position of an ant, one performs a weighted sum: each sensory vector is divided by its norm, multiplied by the luminance of its end point and by the weight the ant gives to that sensor, and the results are summed. The result of this operation is multiplied by a scaling scalar that represents the ant's base speed. Subsequently, to represent inaccuracy of movement and sensory organs, the direction is perturbed by the addition of Perlin (1985) noise to its angle. The ant simulation algorithm is composed of the following steps:

1. Initialization: n ants are placed on the canvas at pre-established positions; each ant assumes the color of the area where it was placed; their energies and deposit transparencies are initialized using the species parameters;

2. For each ant:
(a) Update the ant's energy;
(b) Update the energy of the environment;
(c) Place ink on the painting canvas;
(d) If the ant's energy is below the death threshold, remove the ant from the colony;
(e) If the ant's energy is above the reproduction threshold, generate an offspring; the offspring assumes the color of the position where it was created and a percentage of the energy of the progenitor (which loses this energy); the offspring inherits the velocity of the parent, but a perturbation is added to the angular velocity by randomly choosing an angle between descvelmin and descvelmax (both values are species parameters); likewise, the deposit transparency is inherited from the progenitor, but a perturbation is included by adding a randomly chosen value between dtranspmin and dtranspmax;
(f) Update the ant's position;

3. Repeat from 2 until no living ants exist.
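As a concrete illustration of the sensing and movement rules above, the following minimal Python sketch restates the weighted-sum position update of step 2(f). It is our own illustrative reconstruction, not the authors' code: the data layout, the luminance lookup, and the use of Gaussian noise as a stand-in for Perlin noise are all assumptions.

    import math
    import random
    from dataclasses import dataclass

    @dataclass
    class Species:
        sensory_vectors: list   # [(dx, dy), ...] relative to the ant's heading
        sensory_weights: list   # one weight per sensory vector
        vel: float              # base speed (the scaling scalar)

    @dataclass
    class Ant:
        x: float
        y: float
        heading: float          # current direction, in radians

    def luminance(canvas, x, y):
        # Luminance of the living canvas at (x, y); canvas is a 2D list of 0..1 values.
        xi = min(max(int(x), 0), len(canvas[0]) - 1)
        yi = min(max(int(y), 0), len(canvas) - 1)
        return canvas[yi][xi]

    def update_position(ant, species, canvas, noise=lambda x, y: random.gauss(0, 0.1)):
        dx = dy = 0.0
        for (sx, sy), w in zip(species.sensory_vectors, species.sensory_weights):
            # Rotate each sensory vector into the ant's frame of reference.
            c, s = math.cos(ant.heading), math.sin(ant.heading)
            rx, ry = sx * c - sy * s, sx * s + sy * c
            lum = luminance(canvas, ant.x + rx, ant.y + ry)
            n = math.hypot(rx, ry) or 1.0
            # Sensory vector divided by its norm, weighted by end-point
            # luminance and by the ant's weight for this sensor.
            dx += (rx / n) * lum * w
            dy += (ry / n) * lum * w
        dx *= species.vel               # scale by the species' base speed
        dy *= species.vel
        # Perturb the direction angle to model sensory/motor inaccuracy
        # (the paper uses Perlin noise; Gaussian noise is a stand-in here).
        ant.heading = math.atan2(dy, dx) + noise(ant.x, ant.y)
        speed = math.hypot(dx, dy)
        ant.x += speed * math.cos(ant.heading)
        ant.y += speed * math.sin(ant.heading)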
Steps (b) and (c) require further explanation. The consumption of energy is attained by drawing on the living canvas a black circle of size equal to energy × consrate and of a given transparency (constrans). Ink is deposited on the painting canvas by drawing a circle of the ant's color (which is assigned when the ant is born) with a size given by energy × depositrate and a given transparency (deposittransp). Fig. 2 depicts the living and painting canvas of an ant species during the simulation process. It is important to note that the color of an ant is determined at birth. Thus, the ants may carry this color to areas of the canvas that possess different colors in the original image. A detailed description of the painting algorithm can be found in Machado and Pereira (2012).

Evolutionary Engine

As previously mentioned, we employ a GA to evolve the ant species' parameters. The genotypes are tuples of floating point numbers which encode the parameters of the ant species. The size of the genotype depends on the experimental settings. Table 1 presents an overview of the encoded parameters. We use a two-point crossover operator for recombination purposes and a Gaussian mutation operator. We employ tournament selection and an elitist strategy: the highest-ranked individual proceeds, unchanged, to the next population.

Table 1: Parameters encoded by the genotype

  Name              #     Comments
  gain              1     scaling for energy gains
  decay             1     scaling for energy decay
  consrate          1     scaling for size of circles drawn on the living canvas
  constrans         1     transparency of circles drawn on the living canvas
  depositrate       1     scaling for size of circles drawn on the painting canvas
  deposittransp     1     base transparency of circles drawn on the painting canvas
  dtranspmin        1     limits for perturbation of deposit transparency
  dtranspmax        1     when offspring are generated
  initialenergy     1     initial energy of the starting ants
  deaththreshold    1     death energy threshold
  birththreshold    1     generate-offspring energy threshold
  descvelmin        1     limits for perturbation of angular velocity
  descvelmax        1     when offspring are generated
  vel               1     base speed of the ants
  noisemin          1     limits for the Perlin noise
  noisemax          1     generator function
  initialpositions  2*n   initial coordinates of the n ants placed on the canvas
  sensoryvectors    2*m   direction and length of the m sensory vectors
  sensoryweights    m     weights of the m sensory vectors

The Features

During the simulation of each ant species, the following behavioral statistics are collected:

avg(ants) Average number of living ants;

coverage Proportion of the living canvas visited by the ants; an area is considered to be visited if at least one ant consumed resources from that area;

depositedink The total amount of "ink" deposited by the ants; this is calculated by multiplying the area of each circle drawn by the ants by the opacity (i.e. 1 − transparency) used to draw it.
avg(trail), std(trail) The average trail length and the standard deviation of the trail lengths, respectively;

avg(life), std(life) The average life span of the ants and its standard deviation, respectively;

avg(distance) The average Euclidean distance between the position where the ant was born and the one where it died;

avg(avg(width)), std(avg(width)) For each trail we calculate its average width; then we calculate the average width of all trails, avg(avg(width)), and the standard deviation of the averages, std(avg(width));

avg(std(width)), std(std(width)) For each trail we calculate the standard deviation of its width; then we calculate their average, avg(std(width)), and their standard deviation, std(std(width));

avg(avg(av)), std(avg(av)), avg(std(av)), std(std(av)) These statistics are analogous to the ones regarding trail width, but pertain to the angular velocity of the ants.

When the simulation of each ant species ends, we calculate the following image features:

complexity The image produced by the ants, I, is encoded in jpeg format, and its complexity is estimated using the following formula: complexity(I) = rmse(I, jpeg(I)) × s(jpeg(I)) / s(I), where rmse stands for the root mean square error, jpeg(I) is the image resulting from the jpeg compression of I, and s is the file size function;

fractdim, lac The fractal dimension of the ant painting estimated by the box-counting method, and its lacunarity value estimated by the Sliding Box method (Karperien 2012), respectively;

inv(rmse) The similarity between the ant painting and the original image, estimated as follows: inv(rmse) = 1 / (1 + rmse(I, O)), where I is the ant painting and O is the original image.

Experimental Results

The results presented in this section were obtained using the following experimental setup: population size = 25; tournament size = 5; crossover probability = 0.9; mutation probability = 0.1 (per gene); initial position of the ants: the image is divided into 3 × 3 rectangles of the same size and one ant is placed at the center of each of these rectangles; initial number of ants = 9; maximum number of ants = 250; maximum number of simulation steps = 1000. Thus, when the drawing stage starts, each ant species is represented by nine ants. However, these ants may generate offspring during simulation, increasing the number of ants on the canvas. Typically, interactive runs had 30 to 40 generations, although some were significantly longer. The runs conducted using explicit fitness functions lasted 50 generations. For each fitness function we conducted 10 independent runs.

User Guided Runs

Machado and Pereira (2012) describe and analyze results attained in the course of user-guided runs. In Fig. 3 we depict some of the individuals evolved in those runs, with the goal of giving a flavor of the different types of imagery that were evolved.

Figure 3: Examples from user guided runs.

Using Features Individually

To test the evolutionary algorithm we performed runs where each feature, with the exception of fractdim and lac, was used as the fitness function. Maximizing the values of fractal dimension and lacunarity would lead to results that we find uninteresting. Therefore, we established target values for these features by measuring the fractal dimension and lacunarity of one of our favorite ant paintings evolved in user-guided runs, 1.5 and 0.95, respectively; the maximum fitness is obtained when these values are reached. For these two features, fitness is assigned by the following formula: fitness = 1 / (1 + |targetvalue − featurevalue|).
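The image features above translate almost directly into code. The sketch below is an illustrative implementation, not the authors': it assumes Pillow and NumPy, an arbitrary JPEG quality setting, and uses the PNG-encoded size as a stand-in for s(I), which the paper leaves unspecified.

    import io
    import numpy as np
    from PIL import Image

    def rmse(a, b):
        # Root mean square error between two images as numeric arrays.
        return float(np.sqrt(np.mean((a.astype(float) - b.astype(float)) ** 2)))

    def complexity(img, quality=75):
        # complexity(I) = rmse(I, jpeg(I)) * s(jpeg(I)) / s(I).
        # The PNG-encoded size stands in for s(I); quality=75 is arbitrary.
        raw, jpg = io.BytesIO(), io.BytesIO()
        img.save(raw, format="PNG")
        img.save(jpg, format="JPEG", quality=quality)
        jpg.seek(0)
        recompressed = Image.open(jpg).convert(img.mode)
        err = rmse(np.asarray(img), np.asarray(recompressed))
        return err * jpg.getbuffer().nbytes / raw.getbuffer().nbytes

    def inv_rmse(painting, original):
        # inv(rmse) = 1 / (1 + rmse(I, O)).
        return 1.0 / (1.0 + rmse(np.asarray(painting), np.asarray(original)))

    def target_fitness(value, target):
        # Fitness for the targeted features (fractdim -> 1.5, lac -> 0.95):
        # 1 / (1 + |targetvalue - featurevalue|).
        return 1.0 / (1.0 + abs(target - value))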
In Fig. 4 we present the evolution of fitness across the evolutionary runs. To avoid clutter, we only present a subset of the considered fitness functions. In general, the evolutionary algorithm was able to find, in all runs and for all features, individuals with high fitness in relatively few generations. Unsurprisingly, and although it is subjective to say it, the runs tended to converge to ant paintings that, at least in our eyes, are inferior to the ones created in the course of interactive runs. Fig. 5 depicts the individuals that obtained the maximum fitness value for the corresponding image features. These individuals are representative of the imagery evolved in the corresponding runs. It is worth noticing that high complexity is obtained by evolving images with abrupt transitions from black to white. This results in high frequencies that make jpeg compression inefficient, thus resulting in high complexity estimates. The results attained with lacunarity yield paintings with "gaps" between lines, revealing the black background, which matches the texture of the image from which the target lacunarity value was collected.

Figure 4: Evolution of the maximum fitness (normalized fitness per generation for the avg(ants), avg(avg(width)), avg(distance), avg(std(av)), avg(life), avg(trail), coverage, deposited ink, fract_dim and inv(rmse) runs). The results are averages of 10 independent runs, normalized to allow the presentation of results obtained with distinct fitness functions in the same chart.

This contrasts with the results obtained using fractdim: while the algorithm was able to match the target fractal dimension value, the images produced are radically different from the image that supplied the target. The inv(rmse) runs revealed images that reproduce the original with some degree of fidelity, showing that this feature can promote similarity between the painting and the original. The results obtained using a single behavioral feature are uninteresting in the context of NPR. They tend to fall into two categories: either they constitute "poor" variations of the original or they are unrecognizable versions of it.

Combining Behavioral and Image Features

From the beginning it was clear that it would be necessary to combine several features to attain our goals. To make the fitness function design process easy to understand, and thus allow inexperienced users to design their own fitness functions, we decided that all fitness functions should assume the form of a weighted sum. Since different features have different ranges of values, it is necessary to normalize them, otherwise some features would outweigh the others. Additionally, angular velocity may be negative, so we consider absolute values. Considering these issues, normalization is attained by the following formula: norm(feature) = |feature / offlinemax(feature)|, where offlinemax returns the maximum value found, in the course of the runs described in the previous section, for the feature in question. This modification is not sufficient to prevent the evolutionary algorithm from focusing exclusively on a single feature. To minimize this problem, we adopt a logarithmic scale so that the evolutionary advantage decreases as the feature value becomes higher, promoting the discovery of individuals that use all features employed in the fitness function. This is accomplished as follows: lognorm(feature) = log(1 + norm(feature)).
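A minimal sketch of this normalization and of the weighted-sum fitness construction follows. The offline maxima are placeholders (the paper does not list them), and the targeted features fractdim and lac, which are scored against target values as described earlier, are omitted for simplicity.

    import math

    # Largest value observed for each feature during the single-feature
    # runs; the numbers here are placeholders, not values from the paper.
    OFFLINE_MAX = {"coverage": 1.0, "complexity": 50.0, "inv_rmse": 1.0, "lac": 3.0}

    def lognorm(name, value):
        # lognorm(feature) = log(1 + |feature / offlinemax(feature)|):
        # normalization plus logarithmic damping, so the advantage of
        # pushing a single feature ever higher keeps shrinking.
        return math.log(1.0 + abs(value / OFFLINE_MAX[name]))

    def make_fitness(weights):
        # Build a fitness function as a weighted sum of lognorm'ed features.
        def fitness(features):
            return sum(w * lognorm(n, features[n]) for n, w in weights.items())
        return fitness

    # f1 = coverage + complexity + lac; f2 = inv(rmse) - 0.5 * complexity.
    f1 = make_fitness({"coverage": 1.0, "complexity": 1.0, "lac": 1.0})
    f2 = make_fitness({"inv_rmse": 1.0, "complexity": -0.5})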
Figure 5: The individuals that obtained the maximum fitness value for: (a) complexity; (b) inv(rmse); (c) lac; (d) fractdim.

All the fitness functions that combine several features are weighted sums of the lognorm of each of the features used. However, for the sake of simplicity, we will only mention the feature names when writing their formulas. From here onwards, feature should be read as lognorm(feature). Next we describe several fitness functions that combine a variable number of features. The analysis of the experimental results of evolutionary art systems is subjective by nature. As such, rather than presenting measures of performance that would be meaningless when considering the goals of our system, we focus on describing the intentions behind the design of each fitness function, and make a subjective analysis of the results based on the comparison between the evolved paintings and our original design intentions.

f1: coverage + complexity + lac

The design of this fitness function was prompted by the results obtained in previous tests. The goal is to evolve ant paintings where the entire canvas is visited, with high complexity, and with a lacunarity value of 0.95. As can be observed in Fig. 6, the evolved paintings successfully match these criteria. By comparing them with the ones presented in Fig. 5, one can observe how lacunarity influences texture, complexity leads to high frequencies, and coverage promotes visiting most of the canvas.

f2: inv(rmse) − 0.5 × complexity

The rationale for this fitness function is obtaining a good approximation to the original image while keeping the complexity low. Thus, we wish to obtain a simplified version of the original. Preliminary tests indicated a tendency of the algorithm to focus exclusively on minimizing complexity, which was achieved by producing images that were entirely black. Since this sort of image exists in the initial populations of the runs, this is a case of premature convergence. To circumvent it, we decreased the weight given to complexity, which allowed the algorithm to escape this local optimum. Although the results are consistent with the design (see Fig. 7), they do not depict the degree of abstraction and simplification we intended. As such, they should be considered a failure, since they do not match our design intentions.

Figure 6: Two of the fittest images evolved using f1.

Figure 7: Two of the fittest images evolved using f2.

Figure 8: Two of the fittest images evolved using f3.

f3: avg(std(width)) + std(avg(width)) − avg(avg(width)) + inv(rmse)

Here we focus on the width of the lines being drawn, promoting the evolution of ant paintings with lines of highly varying width, avg(std(width)), heterogeneous widths among lines, std(avg(width)), and thin lines, −avg(avg(width)). To avoid radical deviations from the original drawing we also value inv(rmse). The experimental results (Fig. 8) depict these characteristics; however, to fully observe the intricacies of the ant paintings, a resolution higher than the space constraints of this paper allow would be required.

f4: avg(std(av)) + inv(rmse) + coverage
f5: avg(avg(av)) − avg(std(av)) + inv(rmse) + coverage
f6: −avg(avg(av)) + avg(std(av)) + inv(rmse) + coverage

When designing f4-f6 we focused on controlling line direction.
In f4 we use avg(std(av)) to promote the appearance of lines that often change direction. In f5 we use avg(avg(av)) − avg(std(av)) to encourage the appearance of circular motifs (high angular velocity and low variation of velocity). Finally, f6 is a refinement of f4, with −avg(avg(av)) preventing the appearance of circular patterns and valuing trails that curve in both directions, attaining an average angular velocity close to zero. In all cases, the addition of inv(rmse) and coverage serves the goal of evolving ant paintings with some similarity to the original that visit a large portion of the canvas. In Fig. 9 we present some of the outcomes of these experiments. As can be observed, the evolved images closely match our expectations and, as such, we consider these to be some of the most successful runs.

Figure 9: Two of the fittest images evolved using f4 (first row), f5 (second row) and f6 (third row).

Once the individuals are evolved, the ant species may be applied to different input images, hopefully resulting in ant paintings that share the characteristics that we value. This is one of the key aspects of the system: although finding a valuable ant species may be time-consuming, once it is found it can be applied with ease to other images, producing large-scale NPRs of them. In Fig. 10 we present ant paintings created by this method.

Figure 10: Results obtained by applying an individual from the f4 runs to different input images.

Conclusions

We presented a creativity-support tool that aids users by providing a wide variety of paintings which are arguably consistent with the intentions of the users, and which they would be unlikely to imagine on their own. While using this tool, the users become designers of fitness functions, which are built using a combination of behavioral and image features. We reported the results obtained, focusing on the comparison between the evolved ant paintings and the design intentions that led to the development of each fitness function. Overall, the results indicate that it is possible, to some extent, to convey design intentions through fitness functions, leading to the discovery of individuals that match these intentions. This allows the users to operate at a higher level of abstraction than in user-guided runs, circumventing the user-fatigue problem typically associated with interactive evolution. The analysis of the results also reveals the discovery of high-quality ant paintings that are radically different from the ones obtained through interactive evolution. Although the system serves the user's intents, different runs converge to different, and sometimes highly dissimilar, images. Each fitness function can be maximized in a multitude of ways, some of which are quite unexpected. As such, we argue that the system opens up a realm of possibilities consistent with the intents expressed by the user, often surprising him/her in the process. On the downside, as the f2 runs reveal, in some cases the design intentions are not fully conveyed by the evolved ant paintings. It is also worth mentioning that interactive runs allow opportunistic reasoning, which may enable the discovery of unexpected and highly valued ant paintings. The adoption of a semi-automatic fitness assignment scheme, such as the one presented by Machado et al. (2005), is one of the directions for further research. It also became obvious that we have only begun to scratch the surface of the vast number of possibilities provided by the design of fitness functions.
In the future, we will invite users who are not familiar with the system to design their own fitness functions, which will allow us to assess the difficulty of the task for regular users.

Acknowledgements

This research is partially funded by the Portuguese Foundation for Science and Technology in the scope of project SBIRC (PTDC/EIA-EIA/115667/2009) and of the iCIS project (CENTRO-07-ST24-FEDER-002003), which is co-financed by QREN, in the scope of the Mais Centro Program and European Union's FEDER.

2013_6 !2013 Adaptation of an Autonomous Creative Evolutionary System for Real-World Design Application Based on Creative Cognition Steve DiPaola, Graeme McCaig, Kristin Carlson, Sara Salevati and Nathan Sorenson School of Interactive Arts and Technology Simon Fraser University sdipaola@sfu.ca, gmccaig@sfu.ca, kca59@sfu.ca, sara_salevati@sfu.ca, nds6@sfu.ca

Abstract

This paper describes the conceptual and implementation shift from a creative research-based evolutionary system to a real-world evolutionary system for professional designers. The initial system, DarwinsGaze, is a Creative Genetic Programming system based on creative cognition theories. It generated artwork that tens of thousands of viewers perceived as human-created art during its successful run at peer-reviewed solo shows at noted museums and art galleries. In an effort to improve the system for use with real-world designers, and with multi-person creativity in mind, we began working with a noted design firm, exploring potential uses of our technology to support multivariant creative design iteration. This second-generation system, titled Evolver, provides designers with fast, unique creative options that expand beyond their habitual selections and that can be inserted into, or extracted from, the system process at any time for modular use at varying stages of the creative design process. We describe both systems and the design decisions made to adapt our research system, whose goal was to incorporate creativity automatically within its algorithms, into our second-generation system, which attempts to take elements of human creativity theories and populate them as tools back into the process. We report on our study with the design firm on the adapted system's effectiveness.

Introduction

Creativity is a complex set of cognitive processes theorized to involve, among other elements, attention shifts between associative and analytical focus (Gabora, 2010), novel goals (Luo and Knoblich, 2007), and situated actions and difficult definitions of evaluation (Christoff et al, 2011). Computational creative systems strive to model a variety of creativity's aspects using computer algorithms, from evolutionary ‘small-step' modifications to intelligent autonomous composition and ‘big-leap' innovation, in an effort to better understand and replicate the creative process (Boden, 2003). The focus by some researchers on replicating creativity in computational algorithms has been instrumental in learning more about human cognition (individual and collaborative) and how creative support tools might be used to enhance and augment creative individuals and teams. All these aspects continue to evolve our perceptions of creativity and its role in computation in the current technology-saturated world. Systems modeling creativity computationally have gained acceptance in the last two decades, situated mainly as artistic and research projects.
Several researchers in computational creativity have addressed questions around such computational modeling by outlining different dimensions of creativity and proposing schemas for evaluating the "level of creativity" of a given system, for example (Ritchie, 2007; Jennings, 2010; Colton, Pease and Charnley, 2011). While there is ongoing research and scholarly discourse about how a system is realized, how the results are generated, selected and adjusted, and how the process and product are evaluated, there is less research about direct applications of creative cognitive support systems in real-world situations. Now that more autonomous, generative creative systems have been developed, we are re-evaluating the role of the human collaborator(s) when designing a creative system for real-world applications in an iterative creative design process environment (Shneiderman, 2007). We explore creativity from theories of cognition that attempt to understand attentional shifts between associative and analytical focus. The existence of two stages of the creative process is consistent with the widely held view that there are two distinct forms of thought (Dartnell, 1993; Neisser, 1963; Piaget, 1926; Rips, 2001; Sloman, 1996). It has been proposed that creativity involves the ability to vary the degree of conceptual fluidity in response to the demands of any given phase of the creative process (Gabora, 2000; 2002a; 2002b; 2005). This dimension of variability in focus is referred to as contextual focus. Focused attention produces analytic thought, which is conducive to manipulating symbolic primitives and deducing laws of cause and effect, while defocused attention produces fluid or associative thought, which is conducive to analogy and to unearthing relationships of correlation. Thus, creativity is not just a matter of eliminating rules but of assimilating and then breaking free of them where warranted. This paper focuses first on the implementation and applicability of contextual focus in our research system, DarwinsGaze, developed to use an automatic fitness function. Second, we present our effort to adapt this successful but specific research system for more general use with real-world designers, and with multi-person creativity in mind. We worked with a noted design firm to examine potential uses of our technology for supporting multivariant creative design iteration. Our analysis of their process, combined with our knowledge of the cognitive aspects of creativity (gleaned from our early research), was used to completely rewrite the DarwinsGaze system into an interactive creativity support tool within a production pipeline. This second-generation system, Evolver, provides designers with fast, unique options that expand beyond their habitual selections and that can be inserted and extracted from the system process at any time for modular use at varying stages of the creative design process. The changes focused firstly on usability needs, but became more important when we saw opportunities for affecting the designer's shifts between contextual and analytical focus through the Evolver system. This process required evaluating the real-world iterative process of designers and testing various prototypes with designers from the firm Farmboy Fine Arts (FBFA) to see how they engaged with interactive creativity support. Lastly, we evaluated, with a user study, the effectiveness of this conversion process and how non-technical designers appreciated and used this Creative Evolutionary System.
We hope that our experience and evaluation can be a guide for other researchers adapting creative research systems into more robust and user-centric real-world production tools.

The DarwinsGaze System

The DarwinsGaze system (DiPaola and Gabora, 2007) is a Creative Evolutionary System (CES) (Bentley and Corne, 2002) (see Figure 1) based on a variant of Genetic Programming (GP). Unlike typical Genetic Programming systems, this system favors exploration over optimization, finding innovative or novel solutions over a preconceived notion of a specific optimal solution. It uses an automatic fitness function (albeit one specific to portrait painting), allowing it to function without human intervention between being launched and obtaining the final, often unanticipated and pleasing, set of results; in this specific and limited sense we refer to DarwinsGaze as "autonomous". The inspiration for this work is to directly explore to what extent computer algorithms can be creative on their own (Gabora and DiPaola, 2012). Related work has begun to use creative evolutionary systems with automatic fitness functions in design and music (Bentley and Corne, 2002), as well as in the building of a creative invention machine (Koza, 2003). A contribution of the DarwinsGaze work is to model, in software, newly theorized aspects of human creativity, especially in terms of fluid contextual focus (see Figure 2).

Figure 1. Source Darwin image with examples of evolved abstract portraits created using the DarwinsGaze autonomous creative system.

DarwinsGaze capitalizes on recent developments in GP by employing a form of GP called Cartesian Genetic Programming (CGP) (Miller and Thomson, 2000; Walker and Miller, 2005). CGP uses GP techniques (crossover, mutation, and survival), but differs in certain key respects. The program is represented by a directed graph of indexed nodes. Each node has a number of inputs and a function that gives an output based on the inputs. The genotype is a list of integers determining the connectivity and functionality of the nodes, which can be mutated and mated to create new directed graphs. CGP has several features that foster creativity, including: 1) its node-based structure facilitates the creation of visual mapping modules; 2) its structure can represent complex computational input/output connectivity, thus accommodating our sophisticated tone and temperature-based color space model, which enables designerly decision making; and, most importantly, 3) its component-based approach favors exploration over optimization by allowing different genotypes to map to the same phenotype. The last technique uses redundancy at the input, node, and functional levels, allowing the genotype to contain nodes that are not connected to the output nodes and so are not expressed in the phenotype. Having different genotypes (recipes) map to the same phenotype (output) provides CGP with greater neutrality (Yu and Miller, 2005). Our work is based on Ashmore and Miller's (2004) CGP application to evolve visual algorithms for enhanced image complexity or circular objects in an image. Most of their efforts involve initializing a population and then letting the user take over. Our initial prototype was based upon their approach, but expanded it with a more sophisticated similarity and creativity function, and revised their system for a portrait painter process.
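For readers unfamiliar with CGP, the following minimal sketch shows the genotype-as-graph idea and the neutrality that comes from unexpressed nodes. It is a generic illustration, not DarwinsGaze's implementation: the function set, node arities, and single-output decoding are all simplifying assumptions.

    import random

    # A toy function set; DarwinsGaze's actual set is not given in the text.
    FUNCTIONS = [lambda a, b: a + b,
                 lambda a, b: a * b,
                 lambda a, b: a - b,
                 lambda a, b: max(a, b)]

    def random_genotype(n_inputs, n_nodes):
        # Each node is (function index, input index a, input index b);
        # indices may point at program inputs or earlier nodes, giving a
        # feed-forward directed graph.
        genes = []
        for i in range(n_nodes):
            genes.append((random.randrange(len(FUNCTIONS)),
                          random.randrange(n_inputs + i),
                          random.randrange(n_inputs + i)))
        return genes

    def evaluate(genes, inputs):
        # Decode the graph. Nodes not on the path to the output are simply
        # never consulted: this is the redundancy/neutrality the paper
        # refers to, since mutating them leaves the phenotype unchanged.
        values = list(inputs)
        for f_idx, a, b in genes:
            values.append(FUNCTIONS[f_idx](values[a], values[b]))
        return values[-1]   # last node taken as the output (a simplification)

    def point_mutate(genes, n_inputs, rate=0.1):
        # Mutation rewires a node's connectivity or changes its function.
        out = []
        for i, (f, a, b) in enumerate(genes):
            if random.random() < rate:
                f = random.randrange(len(FUNCTIONS))
                a = random.randrange(n_inputs + i)
                b = random.randrange(n_inputs + i)
            out.append((f, a, b))
        return out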
Since the advent of photography, portrait painting has not just been about accurate reproduction, but also about using modern painterly goals to achieve a creative representation of the sitter. We have created a fitness function that mainly rewards accurate representation, but in certain situations it also rewards visual painterly aesthetics, using simple rules of art creation as well as a portrait knowledge space. Specifically, the painterly portion of our fitness function 1) weighs face versus background composition, 2) uses tonal similarity over exact color similarity, matched with a sophisticated artistic color space model which weighs warm-cool color temperature relationships based on analogous and complementary color harmony rules, and 3) employs unequal dominant and subdominant tone and color rules and other artistic rules based on a portrait painter knowledge domain (DiPaola and Gabora, 2007), as illustrated in Figure 2. We mostly weight heavily towards resemblance, which gives us a structured system but can, under the influence of functional triggers, allow for artistic creativity. The approach gives us novelty and innovation from within, or better said, in response to, a structured system, a trait of creative human individuals.

Figure 2. The DarwinsGaze fitness function mimics human creativity by moving between restrained focus (resemblance) and more unstructured associative focus (resemblance plus more ambiguous art rules of composition, tonality and color theory).

Generated portrait programs at the beginning of the run will look less like the sitter but might be highly desirable from an aesthetic point of view, since the function set has been built with painterly rules. Specifically, the fitness function in the DarwinsGaze system calculates four scores (resemblance and the three painterly rules) separately and fluidly combines them in different ways to mimic human creativity, moving between restrained focus (resemblance) and more unstructured associative focus (the three rules of composition, tonality and color theory). In its default state, the fitness function uses a ratio of 80% resemblance to 20% non-proportional scoring of our three painterly rules. Several functional triggers can alter this ratio in different ways. The system will also allow very high-scoring painterly-rule individuals to be accepted into the next population. When a plateau or local minimum is reached for a certain number of epochs, the fitness function ratio switches course: painterly rules are weighted higher than resemblance (on a sliding scale) and work in conjunction with redundancy at the input, node, and functional levels. Using this method, in the wider associative mode, high-resemblance individuals are always part of the mix, and when these individuals show a marked improvement in resemblance, a trigger is set to return to the more focused 80/20 resemblance ratio.
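The trigger mechanism described above can be summarized in a short sketch. This is a hypothetical reconstruction of the switching logic only: the plateau window, the associative-mode weights, and the improvement threshold are illustrative guesses, not values from the paper.

    class ContextualFocusFitness:
        # Switches between the focused 80/20 resemblance ratio and a wider
        # associative mode when progress plateaus, as described above.
        def __init__(self, plateau_epochs=10, eps=1e-3):
            self.analytic = True        # start in the focused 80/20 mode
            self.best = float("-inf")
            self.stale = 0
            self.plateau_epochs = plateau_epochs
            self.eps = eps

        def score(self, resemblance, composition, tonality, color):
            painterly = (composition + tonality + color) / 3.0
            if self.analytic:
                return 0.8 * resemblance + 0.2 * painterly
            # Associative mode: painterly rules outweigh resemblance
            # (the paper applies this on a sliding scale; fixed here).
            return 0.4 * resemblance + 0.6 * painterly

        def end_of_epoch(self, best_resemblance):
            # Functional triggers: a marked improvement in resemblance
            # returns to the focused mode; a plateau widens the focus.
            if best_resemblance > self.best + self.eps:
                self.best = best_resemblance
                self.stale = 0
                self.analytic = True
            else:
                self.stale += 1
                if self.stale >= self.plateau_epochs:
                    self.analytic = False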
For a CES used to create fine art paintings, the evaluation was based less on the process and more on the output. Could a closed process (with no human intervention once the evolutionary process was started) produce artwork that was judged as creative using the methods by which real human artists are judged? Example pieces from the output over 30 days were framed and submitted to galleries as a related set of work. Care was taken by the author to select representational images of the evolved unsupervised process; however, creative human bias obviously exists in the representational editing process. This is similar to how a curator chooses a subset of pieces from their artists, so it was deemed that it does not diminish the soft evaluation process. The framed artwork (darwinsgaze.com) was accepted and exhibited at six major galleries and museums, including the TenderPixel Gallery in London, the Emily Carr Gallery in Vancouver, and the Kings Art Centre at Cambridge University, as well as the MIT Museum and the High Museum in Atlanta, all either peer-reviewed, juried or commissioned shows from institutions that typically only accept human artwork. This gallery of abstract portraits of Darwin has been seen by tens of thousands of viewers, who have commented, with dated quotes in a gallery journal, that they see the artwork as an aesthetic body of work that ebbs and flows through creative ideas, even though the pieces were created solely by an evolutionary art computer program using contextual focus. Note that no attempt to create a formalized ‘creativity Turing Test' was made. Most of the thousands of casual viewers assumed they were looking at human-created art. The work was also selected for its aesthetic value to accompany an opinion piece in the journal Nature (Padian, 2008), and was given a strong critical review by the Harvard humanities critic Browne (2009). While these are subjective measures, they are standards in the art world. The fact that the computer program produced novel creative artifacts, both as single art pieces and as a gallery collection of pieces with interrelated themes, is compelling evidence that the process passed a type of informal creativity Turing test.

The Shift from Autonomous Creative System to Creative Support Tool: the Evolver System

To move forward from the DarwinsGaze system, we began to explore a real-world application of creativity in computation by leveraging concepts of contextual focus to integrate with a collaborative process. The opportunity arose to work with FBFA, an international art consultancy firm that designs site-specific art collections for the luxury hotel and corporate sectors, to develop software that could complement and provoke their current iterative design processes. The focus on visual design for hotel decor was an interesting perspective that enabled us to consider what we had achieved with visual creative design in prior work, and how we could engage with the designer's intuitive yet visual (and hence somewhat parameterized) creative process. In the effort to evaluate a CES within a visual design domain, we explored the use and adaptation of "Evolver". Evolver is a computational creative tool modified from the DarwinsGaze project structure. Evolver was created as a result of in-depth research and observations to support a specific design process at FBFA by automating some of the design tasks and restructuring the contextual search space. It provides a platform for brainstorming by generating various versions of original artwork provided by designers, through specific features such as controlling the color scheme or marrying different artworks together. It also offers some production capabilities by automating repetitive tasks, such as cropping for mass quantities of artworks, traditionally performed by designers in programs such as Adobe Photoshop. Evolver incorporates a user-friendly GUI (see Figure 3) paired with a flexible internal image representation format for ease of use by the designer.
The designer provides the seed material and selects preferred results while the system generates a population of artwork candidates and cross-breeds and mutates the candidates under user control to generate new design products. The designer may select and extract any resulting candidate piece at any stage of the process, for use in other areas or as generative fodder for later projects. System parameters of Evolver include shapes, colors, layers, patterns, symmetries and canvas dimensions.

Developing the Evolver System to Fit the Needs and Process of a Design Firm

FBFA takes design briefs from hotel interior designers and, using their extensive photo and graphic design database as source material, designs specific art and design objects in a multitude of materials (although typically wall hangings), often in unique sizes, shapes and multiples, to specifically fit the needs of the hotel (typically its large lobby and restaurants). They do this by employing a number of designers who, using digital systems like Adobe Illustrator, significantly rework a source design to refit the space, shape and material specifics. We began by demonstrating to them an interactive version of our DarwinsGaze system, mocked up on the darwinsgaze.com website and called ‘Evolve It', to show what a potentially fully interactive new system would look like. The designers' process for creating a successful prototype for the client was a multi-step, iterative and somewhat inefficient one, which relied on the designer's ‘feel' for the problem context and the potential solution contexts, and on their intuitive exploration and selection process. In this particular situation, designers would discuss a project with a client, then go to physical boxes or their digital database containing immense amounts of image material, find seed material that fits the feeling of the multiple contexts, and then manipulate it in Adobe Illustrator to better fit the design problem. The designer's manipulation adjusts size, scale, shape, multiples and color in layers by hand. This process is highly labor-intensive, and we felt it was most receptive to computational support because the designer had already defined the contextual focus for the problem through their own interpretation of the available options, constraints and aesthetic preferences (which had already been confirmed by the client engaging with this company). While the designers were reluctant to give up control of their intuitive, creative knowledge, they readily engaged with the Evolver system once they saw how CESs could support the restructuring of the designer's contextual space while also reducing the labor-intensive prior process. This shift freed up the designers' ability to creatively engage with the problem at hand. We strove to make the new system flexible to the different creative processes and paths that different designers might have.

Figure 3. The Evolver interface.

Evolver's cognitive aspect provides designers with a platform to externalize and visualize their ideas. Artwork generated through Evolver can be used for different purposes in different phases of the design process, from conceptual design through to presentation. During the early phase of conceptual design, free-hand, non-rigid sketching techniques have an important role in the formation of creative ideas, as designers externalize their ideas and interact with them spatially and visually (Suwa, Gero and Purcell, 1998).
Evolver supports flexibility of ideas in this phase by enabling designers to easily produce an extensive range of alternatives. The ambiguous nature of the multiple generations produced suits the uncertain and fuzzy nature of conceptual design, as designers discover, frame early ideas and brainstorm. The alternatives produced relieve cognitive load from the designer by separating them from the manual task of manipulating the design parameters, but do not separate them so far from the process that they cannot use their psychomotor and affective design knowledge. Evolver is structured to support the shift between contextual and analytical focus by restructuring the contextual space users are working in. Users can choose to relinquish a degree of control while broadening their focus, gaining the ability to be inspired or provoked by novel generations from the system. On the other hand, it is possible to guide successive evolutions in a more deliberate, analytical way, and the ability of Evolver to import/export individuals to and from a precisely editable format (SVG, Adobe Illustrator) allows tightly focused design directions to be pursued. At later stages in the design process, artwork generated through Evolver can be used as mockups for clients and prototyping, and also as a communication tool, for uses such as presentation at the very end of the design process. The work produced by Evolver can be incorporated directly into the tool-chain leading to a finished piece.

Evolver Genetic Encoding: Moving to a More "Linear" Scheme

One of the most far-reaching design decisions involved in the construction of an evolutionary system is the specification of the genetic encoding. A particular choice of encoding delineates the space of possible images and dramatically influences the manner in which images can change during the course of evolution. The genotype induces a metric on the space of potential images: certain choices of representation will cause certain styles or images to be genetically related and others to be genetically distant. The related images will appear with much higher probability, even if the distant images are technically possible to represent in the encoding system. For this reason, it is important that the genotype causes images that are aesthetically similar to be genetically related. Relevant aspects of the aesthetic merit of a work can then be successfully selected for and combined throughout the course of the evolutionary run. This property is referred to as gene linkage (Harik et al, 2006). We identified this property as especially important for an interactive creativity support tool, for designers who are used to exerting a high degree of creative control over their output, and in a scenario where a certain sense of "high-quality design" is to be maintained. A genetic encoding can either be low-level, representing extremely basic atomic elements such as pixels and color values, or high-level, representing more complex structures such as shapes and patterns. A common low-level encoding is to represent images as the composition of elemental mathematical functions (Bentley and Corne 2002). Though it is technically possible that any image can conceivably be represented as a composition of such functions, this encoding typically results in recognizable geometric patterns that readily signal the algorithmic nature of the process. A higher-level encoding can be seen in shape grammars that represent not individual pixels but aggregates of primitive shapes (Machado et al, 2010).
This approach can theoretically produce a much narrower range of images, but the images that are produced do not exhibit the highly mathematical nature of lower-level encodings. Compared to the CGP genetic structure of DarwinsGaze, Evolver uses a list-based, tree-structured encoding that draws some inspiration from CGP but operates on higher-level components in order to maximize gene linkage and user interpretability. We viewed this new genetic representation as broadly "linear" in the sense that the genotype could be decomposed into elements and recombined, leading to a corresponding effect in the phenotype of recombining visually identifiable elements. The genetic representation is based on a collection of "design elements" (DEs), which are objects that denote particular aspects of the image. For example, a major component of our image representation is the symbol: a shape that can be duplicated and positioned on the canvas according to position, rotation, and scaling parameters. DEs are defined in terms of atomic values and composite collections. The DE for a symbol, for example, is represented as a tuple consisting of two floats representing the x and y coordinates of the shape, a float representing the rotation, a float representing the scale, and an enumerable variable representing the particular shape graphic of the symbol. An image is then described by a list of these symbols. The genetic operations of mutation and crossover are derived from the structure of the DE definitions. Mutation is defined for the atomic values as a perturbation of the current value. Crossover is defined for the collection structures. The genotype is "strongly typed", so only genes of the same type can cross over. (For example, "position" may cross over with the "position" of another stamp's record, and "color" may cross over with "color"; however, "position" will never cross over with "color".) Figure 4 shows an example of Evolver system output.
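A minimal sketch of the symbol DE and its strongly typed operators follows, under stated assumptions: the field names, mutation magnitudes, and the uniform per-field recombination are illustrative; the text specifies only that mutation perturbs atomic values and that crossover respects gene types.

    import random
    from dataclasses import dataclass, replace
    from enum import Enum

    class Shape(Enum):
        # A placeholder shape library; real symbols come from parsed SVG art.
        LEAF = 0
        CIRCLE = 1
        STAR = 2

    @dataclass
    class Symbol:
        x: float          # position on the canvas
        y: float
        rotation: float   # degrees
        scale: float
        shape: Shape      # which shape graphic is stamped

    def mutate(sym, sigma=0.05):
        # Mutation of atomic values: perturb the current value.
        return replace(sym,
                       x=sym.x + random.gauss(0, sigma),
                       y=sym.y + random.gauss(0, sigma),
                       rotation=sym.rotation + random.gauss(0, 5.0),
                       scale=max(0.01, sym.scale + random.gauss(0, sigma)))

    def crossover(a, b):
        # Strongly typed crossover over two symbol lists: position only
        # swaps with position, rotation with rotation, and so on; a gene
        # never crosses over with a gene of a different type.
        child = []
        for sa, sb in zip(a, b):
            child.append(Symbol(
                x=random.choice((sa.x, sb.x)),
                y=random.choice((sa.y, sb.y)),
                rotation=random.choice((sa.rotation, sb.rotation)),
                scale=random.choice((sa.scale, sb.scale)),
                shape=random.choice((sa.shape, sb.shape))))
        return child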
Evolver User Interface: Optimizing Creative Support To make the power of this flexible encoding system available to designers, we constructed an automatic import tool that analyzed existing images and parsed their structure into DEs that formed initial seed populations for the interactive evolution. This approach served to bootstrap the evolutionary search with images that are known to demonstrate artistic merit. Source artwork is converted to the SVG vector image format, which is a tree-based description of the shapes and curves that comprise a vector-based image. The hierarchical grouping of art elements in the original work is preserved in the SVG format, and is used in determining which pieces are isolated to form symbol DEs. We also make use of heuristics that take into account the size of various potential groupings of art elements and any commonly duplicated patterns to identify candidates for extraction. The interactive evolution proceeds from a seed population constructed from these original parsed image elements. The user interface, by default, depicts a population of 8 pieces of generated art. These individuals can be selected to become the parents of the next generation, as is typical in interactive evolution. An added feature, which proved useful, was the ability to bookmark individuals, placing them in a separate collection outside the evolutionary run. This collection of bookmarked individuals allowed users to store any interesting images discovered during the run while proceeding to guide the evolution in a different direction. Figure 4. Example Evolver Output Image Evaluating Designers' Usage and Opinions of the Evolver System Some months after the end of the project, with Evolver still being used and available for real-world production at FBFA, we invited a small group of FBFA and associated designers to our labs, now under controlled study conditions. There we conducted a 45-minute questionnaire-based qualitative study that took place in two phases: it began with a uniform re-introduction and re-demonstration of Evolver and its functionalities, followed by a short session in which the designers had the opportunity to re-explore the tool and answer a series of nine structured interview questions that concentrated on the adaptation of Evolver within their current and future work practices. The specific questions in phase two were:
1. What is your first impression of 'Evolver'?
2. How and in which stage would you use this tool in your current practice?
3. How does this tool change your design process? Can you provide an existing scenario of your current practice and how you envision Evolver would change that?
4. Which features of this tool do you find most interesting? Why?
5. What features would you like to change and/or add in the future? Why?
6. How would you use this tool as part of the design thinking stage in your process?
7. How does it help with the conceptualization of ideas?
8. What do you think of the role of computational tools such as Evolver within the Visual Design domain?
9. Do you have any further comments/suggestions for the future of this research?
The full qualitative study discursive results are beyond the scope of this paper; however, we have included an exemplary set of these results, based on direct quotes from the designers and our assessment of the dominant themes in designer responses. Our main takeaways from this study were:
1. Designers saw Evolver as a creative partner that could suggest alternatives outside of normal human cognitive capacity: "[The] Human brain is sometimes limited, I find Evolver to have this unlimited capacity for creativity." (KK, Interview) "Evolver introduces me to design options I never thought of before, it enhances my design thinking and helps me to produce abstract out of the norm ideas." (LA, Interview)
2. Evolver also enhanced the human user's ability to enter a more intuitive or associative mode of thought by easing some of the effort in manually visualizing alternative design concepts: "Sketching stuff out on paper takes more energy and tweaking Evolver allows me to visualize easier, have a dialogue and collaborate with the design space." (RW, Interview)
3. Evolver could be used flexibly at different stages of the design process to support different tasks and modes of thought, including both generation and communication of ideas: "The best part about the Evolver is that you can stop it at any stage of generation, edit and feed it back to the engine, also it is mobile and you can take it to meetings with clients and easily communicate various ideas and establish a shared understanding. It provides a frame of reference: what is in your head now." (RW, Interview)
Comparison and Discussion We compare the details of the decisions made to shift from the autonomous DarwinsGaze system to the interactive Evolver system and describe their importance (see Table 1).
One of the first changes was to shift the genetic representation (the 'gene' structure). The DarwinsGaze system has genes which work together in a tree structure to evolve output as a bitmap of the whole piece. The Evolver system's genes were more linear and 'predictably recombinable' in order to minimize contextual focus within the system while prioritizing a variety of potentially successful solutions. DarwinsGaze used automatic fitness-function-based Cartesian Genetic Programming, while Evolver shifted to a simpler, interactive Genetic Algorithm in order to engage the designer in the system and support their intuitive decision-making process. In DarwinsGaze there is no control over pieces or layers, and no options for interactive involvement. The Evolver system has many layers and elements and is built on the standards-based vector language SVG. Using a design-shelf structure, the user has more subtle control, including feature navigation, text, symmetry and rotation. The user can either import many small SVG files as seed material or import a single large file, and the system will automatically separate and label the elements. With the user acting as the fitness function, the population size can be adjusted, and desired results can be 'bookmarked' and set aside for manual iteration or reinserted into the Evolver system's gene pool. So, for instance, work that designers create traditionally can be used as partial seed material, used fully at the start, output at any time from the system as raw inspiration to be reworked traditionally, or used as a final result. A careful effort was made to iteratively develop the graphical user interface based on feedback from the designers about how they think within a creative process, what metaphors they use, and which perspectives and skills they rely on given their backgrounds and experience. Finally, we integrated additional post-processing options (outside of the Genetic Algorithm) to give added novelty if needed, with effects such as kaleidoscope and multiple panels.
Table 1. Comparison between the DarwinsGaze and Evolver systems:
- Genes: DarwinsGaze, specific to image resemblance and art rules; Evolver, linear, strongly typed, focused on existing parameters.
- Search: DarwinsGaze, automatic CGP with a complex fitness function and functional triggering; Evolver, an interactive Genetic Algorithm with simple structured forms.
- Output: DarwinsGaze, a bitmap evolved as a whole; Evolver, SVG evolved as labeled layers.
- Material: DarwinsGaze operates autonomously with no import/export of material; Evolver can import/export labeled semantic material (HCI-based).
- Scope: DarwinsGaze is a research system with the specific goal of evolving towards the sitter images; Evolver communicates at any point of the process with traditional design tools, supporting a wide range of creative styles.
- Triggers: DarwinsGaze uses innovative, complex automatic functional triggers (analytical to associative and back); Evolver uses simpler user interaction (population size, bookmarks) to support human creative triggers.
- Integration: DarwinsGaze is one system covering the full process of creativity, with no external communication; Evolver is an integrated system built to work with other tools and processes, supporting creativity as an adaptive human process.
- Theory: DarwinsGaze is informed by creativity theory and simulates it internally in complex ways; Evolver is informed by creativity theory but uses it to support a real-world meta-system with humans.
The study of Evolver in use also made apparent an attitude shift of visual designers towards CESs, a shift that changes their role from sole creators to editors and collaborators.
The designers became more receptive to tools such as Evolver as they came to view them not as replacing designers or automating the creative process, but rather as promoting new ways of design thinking, assisting designers and taking their abilities to the next level by providing efficiency and encouraging more 'aha' moments. The visual designers in the study described Evolver as an "invisible teammate" with whom they could collaborate at any stage of their design process. Evolver became a center of dialogue among designers and helped them communicate their mental models and understanding of design situations to clients and other stakeholders. Conclusions Many significant research CES systems exist that are both innovative and useful. However, as the field matures, there will be an increasing need to make CESs production-worthy and able to work within a creative industry environment such as a digital design firm. To support others in this effort for production-targeted transformation, in this paper we described the shift from an autonomous fitness-function-based creative system, DarwinsGaze, to an interactive fitness-function-based creative support system, Evolver, for real-world design collaboration. DarwinsGaze operates using a complex automatic fitness function to model contextual focus as well as other aspects of human creativity simulated internally. In shifting to the Evolver project we found that the contextual focus perspective remained relevant, but now re-situated to overlay the collaborative process between designer and system. Four design principles developed on this basis were:
1) support analytic focus by providing tools tailored to the designer's specific needs and aesthetic preferences;
2) support associative or intuitive focus by relieving the designer's cognitive load, enabling a quick and serendipitous workflow when desired, and offering a large variety of parameterized options to utilize;
3) support a triggering of focus-shift between the designer and the system through options to 'bookmark' and save interesting pieces for later, as well as to move creative material from and to the system while retaining the work's semantic structure and editability; and
4) support a joint 'train of thought' between system and user by structuring a genotype representation compatible with human visual/cognitive intuition.
We found that the shift to a real-world design scenario required attention to the collaboration and creative processes of designers who value their experience-developed expertise. The system design had to act as both a support tool absorbing some of the cognitive load of the process and a flexible, interactive repository of potentially successful options. Future real-world design work can explore methods for adapting intelligent operations to the cognitive processes and constraints of the situations at hand, taking into account the expertise of collaborators. Acknowledgements This research was supported by the Natural Sciences and Engineering Research Council of Canada and Mitacs (Canada). We would like to thank the design firm Farmboy Fine Arts, Liane Gabora, Nahid Karimaghalou, Robb Lovell and Sang Mah for agreeing to work on the industrial/academic partnership part of the work.
2013_7 !2013 A Computational Model of Analogical Reasoning in Dementia Care Konstantinos Zachos and Neil Maiden Centre for Creativity in Professional Practice City University London Northampton Square, London EC1V 0HB, UK {k.zachos, N.A.M.Maiden}@city.ac.uk Abstract This paper reports a practical application of a computational model of analogical reasoning to a pressing social problem, which is to improve the care of older people with dementia. Underpinning the support for carers of people with dementia is a computational model of analogical reasoning that retrieves information about cases from analogical problem domains. The model implements structure-mapping theory adapted to match source and target domains expressed in unstructured natural language. The model is implemented as a computational service invoked by a mobile app used by carers during their care shifts. Dementia Care and Creativity Dementia is a condition related to ageing. After the age of 65 the proportion of people with dementia doubles for every 5 years of age, so that one fifth of people over the age of 85 are affected (Alzheimer's Society 2010). This equates to a current total of 750,000 people in the UK with dementia, a figure projected to double by 2051, when it is predicted to affect a third of the population either as a sufferer, relative or carer (Wimo and Prince 2010). Dementia care is often delivered in residential homes. In the UK, for example, two in three of all home residents have some form of dementia (e.g. Wimo and Prince 2010), and delivering the required care to them poses complex and diverse problems for carers that new software technologies have the potential to overcome. However, this potential is still to be tapped. The prevailing paradigm in dementia care is person-centered care. This paradigm seeks an individualized approach that recognizes the uniqueness of each resident and understands the world from the perspective of the person with dementia (Brooker 2007). It can offer an important role for creative problem solving that produces novel and useful outcomes (Sternberg 1999), i.e. care activities that both recognize a sense of uniqueness and are new to the care of the resident and/or carer. However, there is little explicit use of creative problem solving in dementia care, let alone with the benefits that technology can provide. Therefore, the objective of our research was to enable more creative problem solving in dementia care through new software technologies. This paper reports two computational services developed to support carers in managing challenging behaviors in person-centered dementia care: a computational analogical matching service that retrieves similar challenging behavior cases in less-constrained domains, and a second service that automatically generates creativity prompts based on the computed analogical mappings. Both are delivered to carers through a mobile software app. The next two sections summarize results from one pre-design study that motivates the role of analogical matching in managing challenging behavior in dementia care, then describe the two computational creativity services. A Pre-Design Study Creative problem solving is not new to care work. Osborn (1965) reported that creative problem solving courses were introduced in nursing and occupational therapy programs in the 1960s. Le Storti et al. (1999) developed a program that fostered the personal creative development of student nurses, challenging them to use creativity techniques to solve nursing problems.
This required a shift in nursing education from task- to role-orientation and established a higher level of nursing practice - a level that treated nurses as creative members of health care teams. There have been calls for creative approaches to be used in the care of people with dementia. Successful creative problem solving was recognized to counteract the negative and stressful effects that are a frequent outcome of caring for people with dementia (Help the Aged 2007). Several current dementia care learning initiatives can be considered creative in their approaches. These include the adoption of training courses in which care staff are put physically into residents' shoes, and exercises to encourage participants to experience life mentally through the eyes of someone with dementia (Brooker 2007). Caring for people with late-stage dementia is recognized to require more creative approaches, and a common theme is the need to deliver care specific to each individual's behavioral patterns and habits. To discover the types of dementia care problem more amenable to this model of creative problem solving, we observed care work at one UK residential home, and interviews with carers there revealed different roles for creative problem solving in dementia care. One of these roles was to reduce the instances of challenging behavior in residents. Challenging behavior is defined as "culturally abnormal behavior(s) of such an intensity, frequency or duration that the physical safety of the person or others is likely to be placed in serious jeopardy, or behavior which is likely to seriously limit use of, or result in the person being denied access to, ordinary community facilities" (Bromley and Emerson 1995). Examples include the refusal of food or medication, and verbal aggression. Interviews with carers revealed that creative problem solving has the potential to generate possible solutions that reduce instances of challenging behavior. For example, if a resident is uncooperative with carers when taking medication, one means to reduce the behavior might be to have a carer wear a doctor's coat when giving the medication. The means is creative because it can be useful, novel to the resident if not applied to him before, and novel to the care team who have not applied it before. Therefore, with carers in the pilot home, we explored the potential of different creativity techniques to reduce challenging behavior. During one half-day workshop with 6 carers we explored the effectiveness and potential of different creativity techniques to manage a fictional challenging behavior. During a three-stage process the carers were presented with the fictional resident and challenging behavior, generated ideas to reduce the behavior, then prepared to implement these ideas. They used different creativity techniques, presented to them as practical problem solving techniques, to reduce the fictional challenging behavior. The carers demonstrated the greatest potential and appetite for one exploratory creativity technique, called Other Worlds (Innovation Story 2002). During the workshop, the carers sought to generate ideas to reduce the challenging behavior in four different, less constrained domains: social life, research, word of mouth and different cultures. These ideas were then transferred to the care domain to explore their effectiveness in it. Other Worlds was judged to be the most effective as well as the most interesting to carers.
It created more ideas than any of the other techniques, and two of the ideas from the session were deemed sufficiently useful to implement in the pilot home immediately. Carers singled out the technique because, unlike the others, it purposefully transferred knowledge and ideas via similarity-based reasoning from sources outside of the immediate problem spaces - the resident, the residential home and the dementia care domain. The Carer App To implement Other Worlds in care work we decided to develop a mobile software app, called Carer, which carers can use during their work. In place of human facilitation, the software retrieves other worlds and guides carers to explore them, and in place of face-to-face communication, the software supports asynchronous communication between carers, who digitally share information about care ideas and practices via the software. The Carer app accesses a digital repository to retrieve natural language descriptions of cases of good care practice, stored in XML based on the structure of dementia care case studies reported by the Social Care Institute for Excellence (Owen and Meyer 2009), as well as challenging behavior cases in non-care domains such as teen parenting, student mentoring and prison life. Each case has two main parts of up to 150 words of prose each - the situation encountered and the care plan enhancement applied - and is attributed to one class of domain to which the case belongs. The current version of the repository contains 115 case descriptions. Figure 1. The Carer mobile app showing how carers describe challenging behaviors (on the left-hand side) and a detailed description of one of these cases (on the right-hand side) The Carer app automatically retrieves previous cases using different services in response to natural language entries typed and/or spoken by a carer into the app. One service supports case-based reasoning with literally similar cases based on information retrieval techniques, similar to strategies applied to people with chronic diseases (Houts et al. 1996). A second supports the Other Worlds technique more generally by automatically generating different domains, such as traveling or cooking, in which to generate care plan enhancements to a current situation without the constraints of the care domain (Innovation Company 2002). The user is encouraged to think about how to solve the aggression situation in a generated other world, such as the school playground; a simple flick of the screen will generate a different other world, such as parachuting from an aircraft. A third service automatically generates creativity prompts from retrieved case content. Lastly, the Carer app invokes AnTiQue, an analogical reasoning discovery service that matches the description of a challenging behavior situation to descriptions in the repository of challenging behavior cases in non-care domains. To do this, the service implements a computational analogical reasoning algorithm based on the Structure Mapping Theory (Gentner 1983; Falkenhainer et al. 1989) with natural language parsing techniques and a domain-independent verb lexicon called VerbNet (Kipper et al. 2000). A carer can then record new ideas resulting from creative thinking in audio form, then reflect on them by playing them back to change them, generate further ideas, compose them into a care plan and share the plan with other carers.
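Given this description, a case record in the repository might be held as follows. This is a sketch under our own naming assumptions; the actual repository stores XML following the Social Care Institute for Excellence case-study structure.

```python
from dataclasses import dataclass

@dataclass
class Case:
    """A good-practice case: two prose parts of up to ~150 words each,
    attributed to one class of domain."""
    name: str         # e.g. "Managing a disrespectful child"
    situation: str    # the situation encountered
    enhancement: str  # the care plan enhancement applied
    domain: str       # e.g. "childcare", "teen parenting", "prison life"
```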
Some of these features are depicted in Figure 1. The right-hand side of Figure 1 shows one retrieved analogical case description - Managing a disrespectful child - as it is presented to a carer using the app. The Carer app is described at length in Maiden (2012). The next section describes two of the computational creativity services - the analogical reasoning discovery service and the creativity prompt generation service. The Analogical Reasoning Discovery Service This service (called AnTiQue) matches a description of challenging behavior in dementia care to descriptions of challenging behavior problems and resolutions in other domains, for example good policing practices to manage disorderly revelers and good teaching practices to manage disruptive children. AnTiQue's design seeks to solve 2 research problems: (i) match incomplete and ambiguous natural language descriptions of challenging behaviour in dementia care and of challenging behaviour problems and resolutions in other domains that use different lexical terms; (ii) compute complex analogical matches between descriptions without a priori classification of the described domains. Analogical retrieval can increase the number of cases that are useful to care staff by retrieving descriptions of cases solved successfully in other domains. The problem and solution description of each case might have aspects that, through analogical reasoning, can trigger discovery of new ideas for the current challenging behaviour. For example, a description of good policing practice to manage disorderly revellers can provide analogical insights with which to manage challenging behaviour in dementia care. AnTiQue seeks to leverage these new sources of knowledge in dementia care. Analogical retrieval in AnTiQue uses a similarity model called the Structure Mapping Theory (SMT) (Gentner 1983), which seeks to transfer a network of related facts, rather than unrelated ones, from a source to a target domain. To enable structure-matching, AnTiQue transforms natural language queries and case descriptions into predicates that express propositional networks of nodes (objects) and edges (predicate values). Attributional predicates state properties of objects in the form PredicateValue(Object), such as asleep(resident) and absent(relative). Relational predicates express relations between objects as PredicateValue(Object1,Object2), such as abuse(resident,care-staff) and remain(resident,room). According to the SMT, a literal similarity is a comparison in which attributional and relational predicates can both be mapped from a source to a target. In contrast, an analogy is a comparison in which relational predicates but few or no attributional predicates can be mapped. Therefore AnTiQue retrieves cases with high match scores for relational predicates and low match scores for attributional predicates, for example a match with the predicate abuse(detainee,police-officer) but no match with the predicate drunk(detainee). [Figure 2. Internal structure of AnTiQue: the NLP Parser, Predicate Parser, Predicate Expansion, Predicate Matcher and Similarity components, drawing on WordNet, VerbNet, the SSParser/Stanford Parser and the Dependency Thesaurus.] Figure 2 depicts AnTiQue's 5 components.
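Before walking through the components, a minimal sketch may make the predicate representation and the retrieval criterion concrete. The code is our illustration, not AnTiQue's: in particular the penalty weight for attributional overlap is an assumption, and AnTiQue's actual scoring is more involved, as the following sections describe.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Predicate:
    """PredicateValue(Object1[, Object2]); obj2 is None for attributional
    predicates such as asleep(resident)."""
    value: str
    obj1: str
    obj2: Optional[str] = None

    @property
    def relational(self) -> bool:
        return self.obj2 is not None

def analogy_score(query: set[Predicate], case: set[Predicate]) -> float:
    """SMT-style analogy: reward mapped relational predicates and penalize
    mapped attributional ones (the 0.5 weight is illustrative)."""
    rel = sum(1 for p in query if p.relational and p in case)
    att = sum(1 for p in query if not p.relational and p in case)
    return rel - 0.5 * att
```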
When invoked, the service first divides the query and case problem description text into sentences; each sentence is then part-of-speech tagged, shallow parsed to identify sentence constituents, and chunked into noun phrases. It then applies 21 syntax structure rules and 7 lexical extraction heuristics to identify predicates and extract the lexical content of each sentence. Natural language sentences are represented as predicates in the form PredicateValue(Object1, Object2). The service then expands each query predicate with additional predicate values that have similar meaning according to verb classes found in VerbNet, to increase the likelihood of a match with predicates describing each case. For example, the predicate value abuse is in the same verb class as attack. The service then matches all expanded predicates to a similar set of predicates that describe the problem description of each case in the repository. This is achieved using XQuery text-searching functions to discover an initial set of cases that satisfy global search constraints. Finally it applies semantic and dependency-based similarity measures to refine the candidate case set. The service returns an ordered set of analogical cases ranked by match score with the query. The components use WordNet, VerbNet and the Dependency Thesaurus to compute attributional and relational similarities. WordNet is a lexical database inspired by psycholinguistic theories of human lexical memory (Miller 1993). Its word senses and definitions provide the data with which to disambiguate terms in queries and case problem descriptions. Its semantic relations link terms to other terms with similar meanings, with which to make queries more complete. For example, a query with the term car is expanded with other terms with similar meaning, such as automobile and vehicle, to increase matches with case descriptions. VerbNet (Kipper et al. 2000) is a domain-independent verb lexicon. It organizes terms into verb classes that refine Levin classes (Levin 1993) and adds sub-classes to achieve syntactic and semantic coherence among members of a verb class. AnTiQue uses it to expand query predicate values with different members of the same verb class. For example, queries with the verb abuse are expanded with other verbs with similar meaning, such as attack. The Dependency Thesaurus supports dependency-based word similarity matching to detect similar words in text corpora. Lin (1998) used a 64-million-word corpus to compute pair-wise similarities between all of the nouns, verbs, adjectives and adverbs in the corpus using a similarity measure. Given an input word, the Dependency Thesaurus can retrieve similar words and group them automatically into clusters. AnTiQue uses the Dependency Thesaurus to compute the relational similarity between 2 sets of predicates. In the remainder of this section we demonstrate the AnTiQue components using text from the following example challenging behaviour situation: A resident acts aggressively towards care staff and the resident verbally abuses other residents at breakfast. Suspect underlying insecurities to new people. Natural Language Processing This component prepares the structured natural language (NL) service query for predicate parsing and expansion. In the first step the text is split into sentences. In the second, a part-of-speech tagging process is applied that marks up the words in each sentence as corresponding to a particular lexical category (part-of-speech) using each word's definition and context.
In the third step the algorithm applies a NL processing technique called shallow parsing, which attempts to provide some machine understanding of the structure of a sentence without parsing it fully into a parse tree. The output is a division of the text's sentences into series of words that, together, constitute grammatical units. In our example, the tagged first clause, a resident acts aggressively towards care staff, is shown in Figure 3. Tags that follow a word after a forward slash (e.g. driver/NN) correspond to lexical categories including noun, verb, adjective and adverb. For example, the NN tag means "noun, singular or mass", DT means "determiner" and VBZ means "verb, present tense, 3rd person singular". Tags attached to each chunk (e.g. [The/DT driver/NN]NP) correspond to phrasal categories. For instance, the NP tag denotes a "noun phrase", VP a "verb phrase", S a "simple declarative clause", PP a "prepositional phrase" and ADVP an "adverb phrase". [A/DT resident/NN]NP [acts/VBZ]VP [aggressively/RB]ADVP [towards/]PP [care staff/NN]NP. Figure 3. The sentence a resident acts aggressively towards care staff after part-of-speech tagging and chunking The component then decomposes each sentence into its phrasal categories, used in the next component to identify predicates in each sentence structure. Predicate Parsing This component automatically identifies predicate structures within each annotated NL sentence based on syntax structure rules and lexical extraction heuristics. Syntax structure rules break down a pre-processed NL sentence into sequences of phrasal categories, where each sequence contains 2 or more phrasal categories. Lexical extraction heuristics are applied to each identified sequence of phrasal categories to extract the lexical content used to generate one or more predicates. Firstly the algorithm applies the 21 syntax structure rules. Each rule consists of a phrasal category sequence of the form Ri → [Bj], meaning that rule Ri consists of a phrasal category sequence B1, B2, ..., Bj. For example, the rule R4 → [NP, VP, S, VP, NP] reads: rule R4 consists of a NP followed by a VP, a S, a VP, and a NP, where NP, VP and S mean a noun phrase, a verb phrase and a simple declarative clause respectively. The method takes a phrasal category list as input and returns a list containing each discovered syntax structure rule and its starting point in the corresponding phrasal category list, e.g. {(R1,3), (R5,1)}. In our example, the input for the pre-processed sentence shown in Figure 3 corresponds to the list Input = (NP, VP, ADVP, PP, NP). Starting from the first list position the method recursively checks whether there exists a sequence within the phrasal category list that matches one of the syntax structure rules. The output after applying the algorithm to list Input is a list of only one matched syntax structure rule, i.e. Output = {(R2,1)}. Secondly the algorithm applies lexical extraction heuristics to a syntax-structure-rule-tagged sentence to extract content words for generating one or more predicates. For each identified syntax structure rule in a sentence the algorithm: (1) determines the position of both noun and verb phrases within the phrasal category sequence; (2) applies the heuristics to extract the content words (verbs and nouns) from each phrase category; (3) converts each verb and noun to its morphological root (e.g.
abusing to abuse); and (4) generates the corresponding predicate p in the form PredicateValue(Object1, Object2), where PredicateValue is the verb and Object1 and Object2 are the nouns. To illustrate this, the algorithm identified rule R2 for our example sentence in Figure 3. According to one heuristic, {(R2,1)} corresponds to the phrasal category sequence [NP, VP, ADVP, PP, NP]. The algorithm therefore determines the position of both noun and verb phrases within this sequence, i.e. noun phrases at {NP,1} and {NP,5} and a verb phrase at {VP,2}. Lexical extraction heuristics are applied to extract the content words from each phrase category, i.e. {NP,1} → resident, {NP,5} → care staff, {VP,2} → act. Returning to our example, the algorithm generates two predicates for the sentence a resident acts aggressively towards care staff and the resident verbally abuses other residents at breakfast, namely act(resident,care_staff) and abuse(resident,residents). Predicate Expansion Predicate expansion and matching are key to the service's effectiveness. In AnTiQue, queries are expanded using words with similar meaning. AnTiQue uses ontological information from VerbNet to extract semantically related verbs for the verbs in each predicate. VerbNet classes are organised to ensure syntactic and semantic coherence among members; for example, the verb abuse, as repeatedly treat a victim in a cruel way, is one of 24 members of the judgement class. Other members include attack, assault and insult, plus 20 other verbs; thus VerbNet provides 23 verbs as potential expansions for the verb abuse. Although classes group together verbs with similar argument structures, the meanings of the verbs are not necessarily synonymous. For instance, the degree of attributional similarity between abuse and reward is very low, whereas the similarity between abuse and assault is very high. The service constrains expansion to verb members that achieve a threshold on the degree of attributional similarity, computed with WordNet-based similarity measurements (Simpson and Dao 2005). Given 2 texts, T1 and T2, the measurement determines how similar the meanings of T1 and T2 are, scored between 0 and 1. For the verb abuse, the algorithm computes the degree of attributional similarity between abuse and each co-member of the judgement class. In our example, verbs such as attack, assault and insult, but not honour and doubt, are used to generate additional predicates in the expanded query. Predicate Matching Coarse-grained Matching The expanded query is fired at problem descriptions of cases in the repository as an XQuery. Prior to executing the XQuery we pre-process all problem descriptions of cases in the repository using the Natural Language Processing and Predicate Parsing components and store them locally. The XQuery includes functions to match each original and expanded predicate value to equivalent representations in candidate problem descriptions of cases. The service retrieves an initial set of matched cases.
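Pulling the parsing and rule-matching steps together, here is a runnable sketch of the mechanism. It is our illustration, not AnTiQue's parser: NLTK stands in for the shallow parser named above, the chunk grammar is a crude approximation, and only two of the 21 rules are encoded.

```python
import nltk  # assumes nltk.download('punkt') and ('averaged_perceptron_tagger')

# A toy chunker standing in for the paper's shallow parser; each line is
# one chunking stage that groups tagged words into a phrasal category.
CHUNKER = nltk.RegexpParser(r"""
    NP: {<DT>?<JJ>*<NN.*>+}
    VP: {<VB.*>+}
    ADVP: {<RB.*>+}
    PP: {<IN|TO>}
""")

def phrasal_categories(sentence: str) -> list:
    """Tag and chunk a sentence, returning its phrasal category list."""
    tree = CHUNKER.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    return [node.label() for node in tree if hasattr(node, "label")]

# Two of the 21 syntax structure rules, as phrasal-category sequences.
RULES = {"R2": ["NP", "VP", "ADVP", "PP", "NP"],
         "R4": ["NP", "VP", "S", "VP", "NP"]}

def match_rules(categories: list) -> list:
    """Return [(rule_id, start_position), ...], 1-indexed as in the
    paper's example output {(R2,1)}."""
    hits = []
    for rid, seq in RULES.items():
        for i in range(len(categories) - len(seq) + 1):
            if categories[i:i + len(seq)] == seq:
                hits.append((rid, i + 1))
    return hits

cats = phrasal_categories("A resident acts aggressively towards care staff")
print(cats)               # expected: ['NP', 'VP', 'ADVP', 'PP', 'NP']
print(match_rules(cats))  # expected: [('R2', 1)]
```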
Fine-grained Matching The Predicate Matcher applies semantic and dependency-based similarity measures to assess the quality of the candidate case set. It computes relational similarity between the query and each case retrieved during coarse-grained matching. To compute relational similarities that indicate analogical matches between case and query predicate arguments, the Predicate Matcher uses the Dependency Thesaurus to select cases that are relationally similar to mapped predicates in the query. In our example, the case Managing a disrespectful child, which describes a good childcare practice to manage a disrespectful child, is one candidate case retrieved during coarse-grained matching. Figure 4 shows the problem and solution description of the case.
Name: Managing a disrespectful child.
Problem: An intelligent 13-year-old boy voices opinions that are hurtful and embarrassing. The child refuses to consider the views of others and often makes discriminatory statements. The parents have removed his privileges and threatened to take him out of the school he loves. This approach has not worked. He now makes hurtful comments to his mother about her appearance. The child insults neighbours and guests at their home. He is rude and mimics their behaviour. The child shows no remorse for his actions. His mother is at the end of her tether.
Solution: The son needs very clear boundaries set. The parents are going to set clear rules on acceptable behaviour. They will state what they are not prepared to tolerate. They will highlight rude comments in a firm tone with the boy. He will receive an explanation as to why the comments are hurtful. Both parents will agree punishments for rule breaking that are realistic. They will work as a team and follow through on punishments. The son can then regain his privileges as rewards for consistent good behaviours.
Figure 4. A retrieved case describing a good childcare practice to manage a disrespectful child
The algorithm receives as inputs a pre-processed sentence list for the query and for the problem description of the case. It compares each predicate in the pre-processed query sentence list Pred(j)Query with each predicate in the pre-processed problem description sentence list Pred(k)Case to calculate the relevant match value, where Pred(j)Query = PredValQuery(Arg1Query, Arg2Query) and Pred(k)Case = PredValCase(Arg1Case, Arg2Case). The following conditions must all be met in order to accept a match between a predicate pair:
1. PredValCase exists in the list of expanded predicate values of PredValQuery;
2. Arg1Query and Arg1Case (or Arg2Query and Arg2Case respectively) are not the same;
3. Arg1Case (or Arg2Case) exists in the Dependency Thesaurus result set when using Arg1Query (or Arg2Query) as the query to the Thesaurus;
4. the resulting attributional similarity value from step 3 is below a specified threshold.
If all conditions are met, PredCase is added to the list of matched predicates for the current case. If not, the algorithm rejects PredCase and considers the next list item. AnTiQue queries the Dependency Thesaurus to retrieve a list of dependent terms. Terms are grouped automatically according to their dependency-based similarity degree. Firstly the algorithm checks whether the case predicate argument exists in this list. If so, it uses the semantic similarity component to further refine and assess the quality of the case predicate with regard to relational similarity. Using this 2-step process, AnTiQue returns an ordered set of analogical cases ranked by match score with the query.
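The four conditions translate directly into code. The sketch below is ours, not AnTiQue's: it stands in Simpson and Dao's (2005) measure with NLTK's WordNet path similarity, represents the expansion table and Dependency Thesaurus results as plain dictionaries, and the 0.4 default threshold is merely chosen to be consistent with the worked example that follows.

```python
from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')

def attr_similarity(a: str, b: str) -> float:
    """Stand-in attributional similarity: best WordNet path similarity
    over the two nouns' synsets, scored in [0, 1]."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(a, pos=wn.NOUN)
              for s2 in wn.synsets(b, pos=wn.NOUN)]
    return max(scores, default=0.0)

def accept_match(q, c, expansions, thesaurus, threshold=0.4):
    """The four acceptance conditions, for predicates q and c given as
    (value, arg1, arg2) triples."""
    qv, qa1, qa2 = q
    cv, ca1, ca2 = c
    return (cv in expansions.get(qv, set())            # 1. expanded-value match
            and qa1 != ca1 and qa2 != ca2              # 2. arguments differ
            and ca1 in thesaurus.get(qa1, set())       # 3. dependency-similar
            and ca2 in thesaurus.get(qa2, set())       #    arguments
            and attr_similarity(qa1, ca1) < threshold  # 4. low attributional
            and attr_similarity(qa2, ca2) < threshold) #    similarity
```

With expansions = {"abuse": {"attack", "assault", "insult"}} and thesaurus entries linking resident to child and residents to neighbours, this accepts exactly the kind of match walked through in the example below.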
In our example, consider Pred(j)Query = abuse(resident,residents), extracted from the sentence the resident verbally abuses other residents at breakfast, and Pred(k)Case = insult(child,neighbours), from the sentence The child insults neighbours and guests at their home, taken from the description of the Managing a disrespectful child good childcare practice case in Figure 4. In this example all conditions for an analogical match are met: the predicate values abuse and insult are semantically equivalent, whilst the object names resident and child, and residents and neighbours, are not the same. According to the Dependency Thesaurus, child is similar based on dependencies to resident, and neighbour is similar based on dependencies to resident. Finally, the attributional similarity value of resident and child is 0.33, and that of resident and neighbour is 0.25, both below the specified threshold. As a result the predicate insult(child,neighbours) is added to the list of matched predicates for the predicate abuse(resident,residents). At the end of each invocation, the service returns an ordered set of the descriptions of the highest-scoring cases for the app component to display to the care staff. The Creativity Trigger Generation Service Although care staff can generate new resolutions directly from retrieved case descriptions, formative usability testing with the app revealed that users were often overwhelmed by the volume of text describing each case and uncertain how to start idea generation. Therefore we developed an automated service that care staff can invoke to generate creativity triggers: it extracts content from the retrieved descriptions to conjecture new ideas that care staff can consider for the resident. Each trigger expresses a single idea that care staff can use to initiate creative thinking. The service uses the attributional predicates generated by the analogical matching discovery service to generate prompts that encourage analogical transfer of knowledge using the object-pair mappings identified in each predicate. Each prompt has the form "Think about a new idea based on the", followed by mapped subject and object names in the target domain. To illustrate, referring back to the Managing a disrespectful child good practice case retrieved from the childcare domain and shown in Figure 1, Figure 5 shows how the prompts are presented in the Carer mobile app, while Figure 6 lists all creativity prompts that the service generates for the analogical case. Figure 5. The Carer mobile app showing creativity prompts generated for the Managing a disrespectful child case
Think about a new idea based on the boundaries
Think about a new idea based on the clear rules
Think about a new idea based on the acceptable behaviour
Think about a new idea based on the rude comments
Think about a new idea based on the firm tone
Think about a new idea based on the explanation
Think about a new idea based on the comments
Think about a new idea based on the punishment
Think about a new idea based on the rule breaking
Think about a new idea based on the rewards
Think about a new idea based on the privileges
Think about a new idea based on the consistent good behaviour
Figure 6. Creativity prompts generated for the Managing a disrespectful child case
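The trigger template itself fits in a line of code. A sketch (in practice the noun phrases would come from the parsed solution text of the retrieved case, not a hand-written list):

```python
def creativity_prompts(noun_phrases: list[str]) -> list[str]:
    """One 'Think about a new idea based on the X' trigger per mapped
    noun phrase from the retrieved case."""
    return [f"Think about a new idea based on the {np}" for np in noun_phrases]

for prompt in creativity_prompts(["boundaries", "clear rules", "firm tone"]):
    print(prompt)
```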
Discovering Novel Ideas Our design of the Carer app builds on Kerne et al.'s (2008) notion of human-centered creative cognition, in which information gathering and idea discovery occur concurrently, and information search and idea generation reinforce each other. The computational model of analogical reasoning searches for and retrieves information from analogical domains, and the creativity trigger generation service manipulates this information to support more effective idea generation from information; however, the generation of new ideas remains a human cognitive activity undertaken by carers, supported by bespoke features implemented in the app. For example, a carer can audio-record a new idea at any time in response to retrieved analogical cases and/or presented creativity triggers by pressing the red button visible in Figure 1, then verbalizing and naming the idea. Recorded ideas can be selected and ordered to construct a new care enhancement plan that can be extended with more ideas and comments at any time. The carer can also play back the audio-recorded ideas and care enhancement plans to reflect on and learn about them, inspired by similar use of the audio channel in digitally supported creative brainstorming (van Dijk et al. 2011). Reflection about an idea is supported with guidance from the app to reflect on why the idea is needed, what the idea achieved, and how and when the idea should be implemented. Reflection about a care enhancement plan is more sophisticated. A carer can drag-and-drop ideas in and out of the plan and into different sequences within it. Then, during playback of the plan, the app concatenates the individual idea audio files and plays the plan as a single recording, allowing the carer to listen to and reflect on each version of the plan as a different narrated story. Moreover, s/he can reflect collaboratively with colleagues using the app to share the plan as e-mail attachments, thereby enabling asynchronous communication between carers. Formative Evaluation of the Carer App The Carer app was made available for evaluation over a prolonged period with carers in a residential home. At the start of the evaluation, 7 nurses and care staff in the residential home were given an iPod Touch for their individual use during their care work over a continuous 28-day period. All 7 carers received face-to-face training in how to use the device and both apps before the evaluation started. A half-day workshop was held at the residential home to allow them to experiment with all of both apps' features. The carers were also given training and practice with the 3 forms of the Other Worlds creativity technique, through practice and facilitation, to demonstrate how it can lead to idea generation. We deemed this training in the creativity technique an essential precondition for successful uptake of the app. Even though it lasted only 4 weeks, the reported evaluation of the Carer app in one residential home provided valuable data about the use of mobile computing and creativity techniques in dementia care. Figure 7 depicts the results:
Residential cases: 27
Analogical domain cases: 5
Ideas generated: 14
Enhancement plans generated: 10
Figure 7. Situations, ideas and care enhancement plans generated by care staff using the Carer app
The focus group revealed that the nurses and carers implemented at least one major change to the care of one resident based on ideas generated using the app. However, most of this success was not based on the analogical cases retrieved by the computational model.
Whilst carers using the app did use the analogical matching service, and the service did retrieve relevant cases from analogical domains such as childcare and student management, the carers were unable to map and transfer knowledge from each of these source domains to the current dementia-related challenging behavior. The log data recorded only 5 uses of the analogical reasoning service to retrieve descriptions of cases of challenging behaviors from non-care domains. Rather, the carers appeared to use the case-based reasoning service to retrieve descriptions of challenging behavior cases from the care domain - the log data recorded 28 uses of this service, and most of the 114 recorded uses of the creativity prompt generation service were generated from these same-domain dementia cases. The focus group revealed that the carers did not use retrieved non-care domain cases because they were unable to recognize analogical similarities between them and the challenging behavior situation. We identified two possible reasons for this. Firstly, AnTiQue implements an approach that approximates analogical retrieval, hence there is always the possibility of computing seemingly "wrong" associations and retrieving cases that do not have analogical similarities. Previous evaluations of AnTiQue's precision and recall (Zachos and Maiden 2008) revealed a recall score of 100% and a precision score of 66.6%, highlighting one potential limitation of computing attributional similarity using WordNet-based similarity measures. Secondly, the results suggest that carers will require more interactive support, based on results generated by the computational model, for cognitive analogical reasoning, consistent with previously reported empirical findings (e.g. Gick 1983). Examples of such increased interactive support include explicitly reporting each computed analogical mapping to the carer, using graphical depictions of structured knowledge to transfer from the source to the target domain, and more deliberate analogical support prompts, for example based on the form A is to B as C is to D. We are extending the Carer app with such features and look forward to reporting these extensions in the near future. Related Work Since the 1980s, the efforts of many Artificial Intelligence researchers and psychologists have contributed to an emerging agreement on many issues relating to analogical reasoning. In various ways and with differing emphases, all current computational analogical reasoning techniques use underlying structural information about the source and target domains to derive analogies. However, at the algorithmic level, they achieve the computation in many different ways (Keane et al. 1994). Based on the Structure Mapping Theory (SMT), Gentner constructed a computer model of the theory called the Structure Mapping Engine (SME) (Gentner 1989). The method assumes that both target and source situations are represented using a certain symbolic representation. The SME also only uses syntactic structures of the two situations as its main input knowledge; it has no knowledge of any kind of semantic similarity between the various descriptions and relations in the two situations. All processing is based on syntactic structural features of the two given representations. The application of analogical reasoning to software reuse is not new.
For example, Massonet and van Lamsweerde (1997) applied analogy-making techniques to complete partial requirements specifications using a rich, well-structured ontology combined with formal assertions. The method was based on query generalization for completing specifications. Because the approach relies on ontologies, the absence of effective ontologies and taxonomies would expose its weaknesses. Pisan (2000) tried to overcome this weakness by applying the SME to expand semi-formal specifications. The idea was to find mappings from specifications for problems similar to the one at hand and use the mappings to adapt an existing specification without requiring domain-specific knowledge. The research presented in this paper overcomes limitations of the above-mentioned approaches by using additional knowledge bases to extend the mapping process with semantic similarity measures. Conclusion and Future Work This paper reports a practical application of a computational model of analogical reasoning to a pressing social problem, which is to improve the care of older people with dementia. The result is a mobile app that is technically capable of accepting spoken and typed natural language input and of retrieving analogical domain cases that can be presented with creativity triggers to support analogical problem solving. The evaluation results reported revealed that our model of creative problem solving in dementia care did not describe all observed carer behavior, so we are currently repeating the rollout and evaluation of Carer in other residential homes to validate this finding. Carer is being extended with new creativity support features that include web images matched to generated creativity prompts, and more explicit support for analogical reuse of cases from non-dementia-care domains. We are extending the repository with new cases that are semantically closer to dementia care and whose analogical similarities are therefore easier to recognize. Acknowledgment The research reported in this paper is supported by the EU-funded MIRROR integrated project 257617, 2010-14. 2013_8 !2013 Transforming Exploratory Creativity with DeLeNoX Antonios Liapis1, Héctor P. Martínez2, Julian Togelius1 and Georgios N. Yannakakis2 1: Center for Computer Games Research, IT University of Copenhagen, Copenhagen, Denmark 2: Institute of Digital Games, University of Malta, Msida, Malta anli@itu.dk, hector.p.martinez@um.edu.mt, juto@itu.dk, georgios.yannakakis@um.edu.mt Abstract We introduce DeLeNoX (Deep Learning Novelty Explorer), a system that autonomously creates artifacts in constrained spaces according to its own evolving interestingness criterion. DeLeNoX proceeds in alternating phases of exploration and transformation. In the exploration phases, a version of novelty search augmented with constraint handling searches for maximally diverse artifacts using a given distance function. In the transformation phases, a deep learning autoencoder learns to compress the variation between the found artifacts into a lower-dimensional space. The newly trained encoder is then used as the basis for a new distance function, transforming the criteria for the next exploration phase. In the current paper, we apply DeLeNoX to the creation of spaceships suitable for use in two-dimensional arcade-style computer games, a representative problem in procedural content generation in games.
We also situate DeLeNoX in relation to the distinction between exploratory and transformational creativity, and in relation to Schmidhuber's theory of creativity through the drive for compression progress. Introduction Within computational creativity research, many systems have been designed that create artifacts automatically through search in a given space for predefined objectives, using evolutionary computation or some similar stochastic global search/optimization algorithm. Recently, the novelty search paradigm has aimed to abandon all objectives, and simply search the space for a set of artifacts that is as diverse as possible, i.e. for maximum novelty (Lehman and Stanley 2011). However, no search is without biases. Depending on the problem, the search space often contains constraints that limit and bias the exploration, while the mapping from genotype space (in which the algorithm searches) to phenotype space (in which novelty is calculated) is often indirect, introducing further biases. The result is a limited and biased novelty search, an incomplete exploration of the given space. But what if we could characterize the bias of the search process as it unfolds and counter it? If the way the space is searched is continuously transformed in response to detected bias, the resulting algorithm would more thoroughly search the space by cycling through or subsuming biases. In applications such as game content generation, it would be particularly useful to sample the highly constrained space of useful artifacts as thoroughly as possible in this way. In this paper, we present the Deep Learning Novelty Explorer (DeLeNoX) system, which is an attempt to do exactly this. DeLeNoX combines phases of exploration through constrained novelty search with phases of transformation through deep learning autoencoders. The target application domain is the generation of two-dimensional spaceships which can be used in space shooter games such as Galaga (Namco 1981). Automatically generating visually diverse spaceships which nevertheless fulfill constraints on believability addresses the "content creation" bottleneck of many game titles. The spaceships are generated by compositional pattern-producing networks (CPPNs) evolved via augmenting topologies (Stanley 2006). In the exploration phases, DeLeNoX finds the most diverse set of spaceships possible given a particular distance function. In the transformation phases, it characterizes the found artifacts by obtaining a low-dimensional representation of their differences. This is done via autoencoders, a novel technique for nonlinear principal component analysis (Bengio 2009). The features found by the autoencoder are orthogonal to the bias of the current CPPN complexity, ensuring that each exploratory phase has a different bias than the previous. These features are then used to derive a new distance function which drives the next exploration phase. By using constrained novelty search for features tailored to the current complexity, DeLeNoX can create content that is both useful (as it lies within constraints) and novel. We will discuss the technical details of DeLeNoX shortly, and show results indicating that a surprising variety of spaceships can be found given the highly constrained search space. But first we will discuss the system and the core idea in terms of exploratory and transformational creativity, and in the context of Schmidhuber's theory of creativity as an impulse to improve the compressibility of growing data.
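Read as pseudocode, the alternation just described is compact. The sketch below is our paraphrase of the published description, not the DeLeNoX implementation; all function names are assumed, and the Euclidean distance in the learned feature space is one plausible choice.

```python
def delenox(explore, train_autoencoder, initial_distance, phases=5):
    """Alternate exploration (constrained novelty search under the current
    distance function) and transformation (train an autoencoder on the
    found artifacts; its encoder defines the next phase's distance)."""
    distance = initial_distance
    archive = []
    for _ in range(phases):
        archive = explore(distance)           # maximally diverse artifacts
        encoder = train_autoencoder(archive)  # compress their variation
        def distance(a, b, enc=encoder):      # distance along the learned
            za, zb = enc(a), enc(b)           # low-dimensional features
            return sum((x - y) ** 2 for x, y in zip(za, zb)) ** 0.5
    return archive
```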
Between exploratory and transformational creativity A ubiquitous distinction in creativity theory is that between exploratory and transformational creativity. Perhaps the most well-known statement of this distinction is due to Boden (1990) and was later formalized by Wiggins (2006) and others. However, similar ideas seem to be present in almost every major discussion of creativity, such as "thinking outside the box" (De Bono 1970), "paradigm shifts" (Kuhn 1962) etc. [Figure 1: Exploration transformed with DeLeNoX: the flowchart includes the general principles of DeLeNoX (bold) and the methods of the presented case study (italics): exploration via feasible-infeasible novelty search with a fitness function, and transformation via a denoising autoencoder trained on the found artifacts.] The idea requires that creativity is conceptualized as some sort of search in a space of artifacts or ideas. In Boden's formulation, exploratory creativity refers to search within a given search space, and transformational creativity refers to changing the rules that bind the search so that other spaces can be searched. Exploratory creativity is often associated with the kind of pedestrian problem solving that ordinary people engage in every day, whereas transformational creativity is associated with major breakthroughs that redefine the way we see problems. Naturally, much effort has been devoted to thinking up ways of modeling and implementing transformational creativity in a computational framework. Exploratory creativity is often modeled "simply" as objective-driven search, e.g. using constraint satisfaction techniques or evolutionary algorithms (including interactive evolution). We see the distinction between exploratory and transformational creativity as quantitative rather than qualitative. In some cases, exploratory creativity is indeed limited by hard constraints that must be broken in order to transcend into unexplored regions of the search space (and thus achieve transformational creativity). In other cases, exploratory creativity is instead limited by biases in the search process. A painter might have a particular painting technique she defaults to, a writer a common set of plot devices he returns to, and an inventor might be accustomed to analyzing problems in a particular order. This means that some artifacts are in practice never found, even though finding them would not break any constraints: those artifacts are contained within the space delineated by the original constraints. Analogously, any search algorithm will over-explore some regions of the search space and in practice never explore other areas, because of particularities related to e.g. evaluation functions, variation operators or representation (cf. the discussion of search biases in machine learning (Mitchell 1997)). This means that some artifacts are never found in practice, even though the representation is capable of expressing them and there exists a way in which they could in principle be found. DeLeNoX and Transformed Exploration As mentioned above, the case study of this paper is two-dimensional spaceships. These are represented as images generated by Compositional Pattern-Producing Networks (CPPNs) with constraints on which shapes are viable spaceships. Exploration is done through a version of novelty search, which is a type of evolutionary algorithm that seeks to explore a search space as thoroughly as possible rather than maximizing an objective function. In order to do this, it needs a measure of difference between individuals.
The distance measure inherently privileges some regions of the search space over others, in particular when searching at the border of the feasible search space. Additionally, CPPNs with different topologies are likely to create specific patterns in generated spaceships, with more complex CPPNs typically creating more complex patterns. Therefore, in different stages of this evolutionary complexification process, different regions of the search space will be under-explored. Many artifacts that are expressible within the representation will thus most likely not be found; in other words, there are limitations to creativity because of search biases. In order to alleviate this problem and achieve a fuller coverage of space, we algorithmically characterize the biases from the search process and the representation. This is what the autoencoders do. These autoencoders are applied to a set of spaceships resulting from an initial exploration of the space. A trained autoencoder is a function from a complete spaceship (phenotype) to a relatively low-dimensional array of real values. We then use the output of this function to compute a new distance measure, which differs from previous ones in that it better captures typical patterns at the current representational power of the spaceship-generating CPPNs. Changing the distance function amounts to changing the exploration process of novelty search, as novelty search is now in effect searching along different dimensions (see Fig. 1). We have thus transformed exploratory creativity, not by changing or abandoning any constraints, but by adjusting the search bias. This can be seen as analogous to changing the painting technique of a painter, the analysis sequence of an inventor, or introducing new plot devices for a writer. All of the spaceships that are found by the new search process could in principle have been found by the previous processes, but were very unlikely to be.

Schmidhuber's theory of creativity

Schmidhuber (2006; 2007) advances an ambitious and influential theory of beauty, interestingness and creativity that arguably holds explanatory power at least under certain circumstances. Though the theory is couched in computational terms, it is meant to be applicable to humans and other animals as well as artificial agents. In Schmidhuber's theory, a beautiful pattern for a curious agent A is one that can successfully be compressed to a much smaller description length by that agent's compression algorithm. However, perfect beauty is not interesting; an agent gets bored by environments it can compress very well and cannot learn to compress better, and also by those it cannot compress at all. Interesting environments for A are those which A can compress to some extent but where there is potential to improve the compression ratio, or in other words potential for A to learn about this type of environment. This can be illustrated by tastes in reading: beginning readers like to read linguistically and thematically simple texts, but such texts are seen by advanced readers as "predictable" (i.e. compressible), and the curious advanced readers therefore seek out more complex texts. In Schmidhuber's framework, creative individuals such as artists and scientists are also seen as curious agents: they seek to pose themselves problems that are on the verge of what they can solve, learning as much as possible in the process.
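One common rendering of this intuition (our gloss on Schmidhuber's compression-progress formulation, not a formula from this paper) is to give the curious agent an intrinsic reward proportional to how much its compressor improves on the observations seen so far:

r(t) ∝ C(h(≤t), φ_{t−1}) − C(h(≤t), φ_t)

where C(h(≤t), φ) is the description length of the observation history h(≤t) under compressor parameters φ, and φ_{t−1}, φ_t are the parameters before and after the latest learning step. The reward is positive exactly when learning has made the data more compressible, which is the "compression progress" referred to above.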
It is interesting to note the close links between this idea and the theory of flow (Csikszentmihalyi 1996) but also theories of learning in children (Vygotsky et al. 1987) and game-players (Koster and Wright 2004). The DeLeNoX system fits very well into Schmidhuber's framework and can be seen as a novel implementation of a creative agent. The system proceeds in phases of exploration, carried out by novelty search which searches for interesting spaceships, and transformation, where autoencoders learn to compress the spaceships found in the previous exploration phase (see Fig. 1) into a lower-dimensional representation. In the exploration phases, "interesting" amounts to being far away from existing solutions according to the distance function defined by the autoencoder in the previous transformation phase. This corresponds to Schmidhuber's definition of interesting environments as those where the agent can learn (improve its compression for the new environment); the more distant the spaceships are, the more they force the autoencoder to change its compression algorithm (the weights of the network) in the next transformation phase. In the transformation phase, the learning in the autoencoder directly implements the improvement in capacity to compress recent environments ("compression progress") envisioned in Schmidhuber's theory. There are two differences between our model and Schmidhuber's model of creativity, however. In Schmidhuber's model, the agent stores all observations indefinitely and always retrains its compressor on the whole history of previous observations. As DeLeNoX resets its archive of created artifacts in every exploration phase, it is a rather forgetful creator. A memory could be implemented by keeping an archive of artifacts found by novelty search in all previous exploration phases, but this would incur a high and constantly increasing computational cost. It could however be argued that the dependence of each phase on the previous represents an implicit, decaying memory. The other difference to Schmidhuber's mechanism is that novelty search always looks for the solution/artifact that is most different from those that have been found so far, rather than the one predicted to improve learning the most. Assuming that the autoencoder compresses relatively better the more diverse the set of artifacts is, this difference vanishes; this assumption is likely to be true at least in the current application domain.

A case study of DeLeNoX: Spaceship Generation

This paper presents a case study of DeLeNoX for the creation of spaceship sprites, where exploration is performed via constrained novelty search which ensures a believable appearance, while transformation is performed via a denoising autoencoder which finds typical features in the spaceships' current representation (see Fig. 1). Search is performed via neuroevolution of augmenting topologies, which changes the representational power of the genotype and warrants the transformation of features which bias the search.

Figure 2: Fig. 2a shows a sample CPPN using the full range of pattern-producing activation functions available. Fig. 2b shows the process of spaceship generation: the coordinates 0 to xm, normalized as 0 to 1 (respectively), are used as input x of the CPPN. Two C values are used for each x, resulting in two points, top (t) and bottom (b), for each x. CPPN input x and output y are treated as the coordinates of t and b; if t has a higher y value than that of b then the column is empty, else the hull extends between t and b. The generated hull is reflected vertically along xm.
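Read procedurally, the caption above amounts to a few lines of code. The sketch below is our own illustration, assuming a cppn callable that maps (x, C) to a value in [0, 1], image coordinates in which y grows downward, and integer sprite dimensions W and H:

def generate_sprite(cppn, W, H):
    xm = (W + 1) // 2                     # middle column, ceil(W/2)
    sprite = [[0] * W for _ in range(H)]  # 0 = empty, 1 = hull (black pixel)
    for col in range(xm):
        x = col / max(xm - 1, 1)          # normalize column index to [0, 1]
        t = int(cppn(x, -0.5) * (H - 1))  # top point (C = -0.5)
        b = int(cppn(x, 0.5) * (H - 1))   # bottom point (C = 0.5)
        if t <= b:                        # hull only where t lies above b
            for row in range(t, b + 1):
                sprite[row][col] = 1
                sprite[row][W - 1 - col] = 1  # mirror the half-sprite along xm
    return sprite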
Domain Representation

Spaceships are stored as two-dimensional sprites; the spaceship's hull is shown as black pixels. Each spaceship is encoded by a Compositional Pattern-Producing Network (CPPN), which is able to create complex patterns via function composition (Stanley 2006). A CPPN is ideal for visual representation as it can be queried with arbitrary spatial granularity (infinite resolution); however, this study uses a fixed resolution for simplicity. Unlike standard artificial neural networks where all nodes have the same activation function, each CPPN node may have a different, pattern-producing function; six activation functions bound within [0, 1] are used in this study (see Fig. 2a). To generate a spaceship, the sprite is divided into a number of equidistant columns equal to the sprite's width (W) in pixels. In each column, two points are identified as top (t) and bottom (b); the spaceship extends from t to b, while no hull exists if t is below b (see Fig. 2b). The y coordinate of the top and bottom points is the output of the CPPN; its inputs are the point's x coordinate and a constant C which differentiates between t and b (with C = −0.5 and C = 0.5, respectively). Only half of the sprite's columns, including the middle column at xm = ⌈W/2⌉, are used to generate t and b; the remaining columns are derived by reflecting vertically along xm. A sufficiently expanded CPPN, as a superset of a multilayer perceptron, is theoretically capable of representing any function. This means that any image could in principle be produced by a CPPN. However, the interpretation of CPPN output we use here means that images are severely limited to those where each column contains at most one vertical black bar. Additionally, the particularities of the NEAT complexification process, of the activation functions used and of the distance function which drives evolution make the system heavily biased towards particular shapes. It is this latter bias that is characterized within the transformation phase.

Figure 3: The autoencoder architecture used for DeLeNoX, consisting of an input layer P of size M = H·⌈W/2⌉, a hidden layer Q of size N, and an output layer P′ of size M, with the encoder computing Q = f_w(P) and the decoder computing P′ = g_w(Q). The higher-level representation in q1, q2, ..., qN is used to calculate the difference between individuals for the purposes of novelty search.

Transformation Phase: Denoising Autoencoder

The core innovation of DeLeNoX is the integration of autoencoders (AEs) in the calculation of the novelty heuristic (described in the next section), which is used to explore the search space according to the current representational power of the encoding CPPNs. AEs (Hinton and Zemel) are non-linear models that transform an input space P into a new distributed representation Q by applying a deterministic parametrized function called the encoder, Q = f_w(P). This encoder, instantiated in this paper as a single layer of logistic neurons, is trained alongside a decoder (see Fig. 3) that maps the transformed representation back into the original one (P′ = g_w(Q)) with a small reconstruction error, i.e.
the original and corresponding decoded inputs are similar. By using a lower number of neurons than inputs, the AE is a method for the lossy compression of data; its most desirable feature, for the purposes of DeLeNoX, is that the compression is achieved by exploiting typical patterns observed in the training set. In order to increase the robustness of this compression, we employ denoising autoencoders (DAs), an AE variant that corrupts the inputs of the encoder during training while enforcing that the original uncorrupted data is reconstructed (Vincent et al. 2008). Forced to both maintain most of the information from the input and undo the effect of corruption, the DA must "capture the main variations in the data, i.e. on the manifold" (Vincent et al. 2008), which makes DAs far more powerful tools than linear models for principal component analysis. For the purposes of detecting the core visual features of the generated spaceships, DeLeNoX uses DAs to transform the spaceship's sprite to a low-dimensional array of real values, which correspond to the output of the encoder. Since spaceships are symmetrical along xm, the training set consists of the left half of every spaceship sprite (see Fig. 4d). The encoder has H·⌈W/2⌉ inputs (P), which are assigned a corrupted version of the spaceship's half-sprite; corruption is accomplished by randomly replacing pixels with 0, which is the same as randomly removing pixels from the spaceship (see Fig. 4e).

Figure 4: Sample spaceships of 49 by 49 pixels, used for demonstrating DeLeNoX. Fig. 4a is a feasible spaceship; Fig. 4b and 4c are infeasible, as they have disconnected pixels and insufficient size respectively. The autoencoder is trained to predict the left half of the spaceship in Fig. 4a (Fig. 4d) from a corrupted version of it (Fig. 4e).

The encoder has N neurons, corresponding to the number of high-level features captured; each feature qi is a function of the input P given by qi = sig(Wi·P + bi), where sig(x) is the sigmoid function and {Wi, bi} are the feature's learnable parameters (weight set and bias value, respectively). The output P′ of the decoder is an estimation of the uncorrupted half-sprite derived from Q = [q1, q2, ..., qN] via P′ = sig(W′·Q + B′); in this paper the DA uses tied weights and thus W′ is the transpose of W = [W1, W2, ..., WN]. The parameters {W, B, B′} are trained via backpropagation (Rumelhart 1995) according to the mean squared error between pixels in the uncorrupted half-sprite and those in the reconstructed sprite.

Exploration Phase: Constrained Novelty Search

The spaceships generated by DeLeNoX are expected to be useful for a computer game; spaceships must have a believable appearance and sufficient size to be visible. Specifically, spaceships must not have disconnected pixels and must occupy at least half of the sprite's height and width; see examples of infeasible spaceships in Fig. 4b and 4c. In order to optimize feasible spaceships towards novelty, content is evolved via a feasible-infeasible novelty search (FINS) (Liapis, Yannakakis, and Togelius 2013). FINS follows the paradigm of the feasible-infeasible two-population genetic algorithm (Kimbrough et al. 2008) by maintaining two separate populations: a feasible population of individuals satisfying all constraints and an infeasible population of individuals failing one or more constraints.
Each population selects individuals among its own members, but feasible offspring of infeasible parents are transferred to the feasible population and vice versa; this form of interbreeding increases the diversity of both populations. In FINS, the feasible population selects parents based on a novelty heuristic (ρ) while the infeasible population selects parents based on their proximity to the feasible border (f_inf), defined as:

f_inf = 1 − (1/3)·[max{0, 1 − 2w/W} + max{0, 1 − 2h/H} + As/A]

where w and h are the width and height of the spaceship in pixels; W and H are the width and height of the sprite in pixels; A is the total number of black pixels on the image and As the number of pixels on all disconnected segments. For the feasible population, the paradigm of novelty search is followed in order to explore the full spectrum of the CPPNs' representational power. The fitness score ρ(i) for a feasible individual i amounts to its average difference with the k closest feasible neighbors within the population or in an archive of past novel individuals (Lehman and Stanley 2011). In each generation, the l highest-scoring feasible individuals are inserted in an archive of novel individuals. In DeLeNoX, the difference used to calculate ρ is the Euclidean distance between the high-level features discovered by the denoising autoencoder; thus ρ(i) is calculated as:

ρ(i) = (1/k) Σ_{m=1}^{k} √( Σ_{n=1}^{N} [qn(i) − qn(μm)]² )

where μm is the m-th-nearest neighbor of i (in the population or the archive of novel individuals); N is the number of hidden nodes (features) of the autoencoder and qn(i) the value of feature n for spaceship i. As with the training process of the denoising autoencoder, the left half of spaceship i is used as input to qn(i), although the input is not corrupted. In both populations, evolution is carried out via neuroevolution of augmenting topologies (Stanley and Miikkulainen 2002) using only mutation; an individual in the population may be selected (via fitness-proportionate roulette wheel selection) more than once for mutation. Mutation may add a hidden node (5% chance), add a link (10% chance), change the activation function of an existing node (5% chance) or modify all links' weights by a small value.
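To make the two selection measures defined above concrete, here is a small Python rendering (a sketch following the stated definitions, not the authors' implementation; features(i) stands for the autoencoder's N outputs [q1(i), ..., qN(i)] for individual i, and neighbors is assumed to hold at least k feasible individuals from the population and archive):

import heapq, math

def f_inf(w, h, W, H, A, A_s):
    # Proximity to the feasible border: penalizes spaceships that are too
    # narrow or too short, and pixels lying on disconnected segments.
    return 1 - (max(0, 1 - 2 * w / W) + max(0, 1 - 2 * h / H) + A_s / A) / 3

def rho(i, neighbors, features, k=20):
    # Novelty heuristic: mean Euclidean distance, in autoencoder feature
    # space, to the k nearest feasible neighbors.
    dists = (math.dist(features(i), features(m)) for m in neighbors)
    return sum(heapq.nsmallest(k, dists)) / k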
Experimentation

DeLeNoX will be demonstrated with the iteratively transformed exploration of spaceships on sprites of 49 by 49 pixels. The experiment consists of a series of iterations, with each iteration divided into an exploration phase and a transformation phase. The exploration phase uses constrained novelty search to optimize a set of diverse spaceships, with "diversity" evaluated according to the features of the previous iteration; the transformation phase uses the set of spaceships optimized in the exploration phase to create new features which are better able to exploit the regularities of the current spaceship complexity. Each exploration phase creates a set of 1000 spaceships, which are generated from 100 independent runs of the FINS algorithm for 50 generations; the 10 fittest feasible individuals of each run are inserted into the set. Given the genetic operators used in the mutation scheme, each exploration phase augments the CPPN topology by roughly 5 nodes. While the first iteration starts with an initial population consisting of CPPNs with no hidden nodes, subsequent iterations start with an initial population of CPPNs of the same complexity as the final individuals of the previous iteration. The total population of each run is 200 individuals, and the parameters of novelty search are k = 20 and l = 5. Each evolutionary run maintains its own archive of novel individuals; no information regarding novelty is shared from previous iterations or across runs. Forgetting past visited areas of the search space is likely to hinder novelty search, but using a large archive of past individuals comes with a huge computational burden; given that the CPPN topology augments in each iteration, it is less likely that previous novel individuals will be re-discovered, which makes "forgetting" past breakthroughs an acceptable sacrifice. Each transformation phase trains a denoising autoencoder with a hidden layer of 64 nodes, thus creating 64 high-level features. The weights and biases for these features are trained on the 1000 spaceships created in the exploration phase. Training runs for 1000 epochs, trying to accurately predict the real half-sprite of the spaceship (see Fig. 4d) from a corrupted version of it (see Fig. 4e); corruption occurs by replacing any pixel with a white pixel (with 10% chance). We observe the progress of DeLeNoX for 6 iterations. For the first iteration, the features driving the exploration phase are trained on a set of 1000 spaceships created by randomly initialized CPPNs with no hidden nodes; these spaceships and features are identified as "initial". The impact of transformation is shown via a second experiment, where spaceships evolve for 6 iterations using the initial set of features trained from simple spaceships, with no transformation phases between iterations; this second experiment is named "static" (contrary to the proposed "transforming" method). The final spaceships generated in the exploration phase of each iteration are shown in Fig. 5 for the transforming run and in Fig. 6 for the static run. For the purposes of brevity, the figures show six samples selected based on their diversity (according to the features on which they were evolved); Figs. 5 and 6 therefore not only showcase the artifacts generated by DeLeNoX, but also demonstrate, through the sampling method, which shapes are identified as "different" by the features. In Fig. 5, the shifting representational power of CPPNs is obvious: CPPNs with no hidden nodes tend to create predominantly V-shaped spaceships, while larger networks create more curved shapes (such as in the 2nd iteration) and eventually lead to jagged edges or "spikes" in later iterations. While CPPNs can create more elaborate shapes with larger topologies, Fig. 5 includes simple shapes even in late iterations: such an example is the 6th iteration, where two of the sample spaceships seem simple. This is likely due to the lack of a "long-term memory", since there is no persistent archive of novel individuals across iterations. In terms of detected features, Fig. 8 displays a random sample of the 64 features trained in each transformation phase of the transforming run; the static run uses the "initial" features (see Fig. 8a) in every iteration. The shape of the spaceships directly affects the features' appearance: for instance, the simple V-shaped spaceships of the initial training set result in features which detect diagonal edges. The features become increasingly more complex, and thus difficult to identify, in later iterations: while in the 1st iteration straight edges are still prevalent, features in the 5th or 6th iterations detect circular or vertical areas. Comparing Fig. 6 with Fig.
5, we observe that despite the larger CPPN topologies of later iterations, spaceships evolved in the static run are much simpler than their respective ones in the transforming run. Exploration in the static run is always driven by the simple initial features (see Fig. 8a), showing how the features used in the fitness function ρ bias search. On the contrary, the transformation phase in each iteration counters this bias and re-aligns exploration towards more visually diverse artifacts.

Figure 5: Sample spaceships among the results of each iteration of exploration; such spaceships comprise the training set for detecting the next iteration's features (transforming run). The best and worst spaceships in terms of difference (using the previous iteration's features) are included, along with spaceships evenly distributed in terms of difference.

The diversity of spaceships and the quality of detected features can be gleaned from Fig. 7, in which features trained in different iterations of the transforming run generate distance metrics which evaluate the diversity of every iteration's training set, both for the transforming and for the static run. Diversity is measured as the Euclidean distance averaged over all spaceship pairs of the training set of an iteration. In the transforming run, the highest diversity score for a feature set is usually attained in the training set of the following iteration (e.g. the initial features score the highest diversity in the 1st iteration's spaceships). This is expected, since the features of the previous iteration are used in the distance function driving novelty search in the next iteration. This trend, however, does not hold in the last 3 iterations, possibly because patterns after the 3rd iteration become too complex for 64 features to capture, while the simpler patterns of earlier iterations are more in tune with what they can detect. It is surprising that features of later iterations, primarily those of the 3rd and 6th iteration, result in high diversity values in most training sets, even those of the static run which were driven by the much simpler initial features. It appears that features trained on the more complicated shapes of later iterations are more general (as they can detect patterns they have not actually seen, such as those in the static run) than features of the initial or 1st iteration, which primarily detect straight edges (see Fig. 8).

Figure 6: Sample spaceships (sorted by difference) among the results of each iteration of exploration driven by static features trained on the initial spaceship set (static run).

Figure 7: Diversity scores of the training sets at the end of each iteration's exploration phase, derived from the feature sets trained in the transformation phases of the transforming run. The training sets of the transforming run are evaluated on the left figure, and those of the static run on the right.

Discussion

This paper has presented DeLeNoX as a system which transforms exploration of the search space in order to counter the biases of the representation and the evolutionary process. While short, the included case study demonstrates the potential of DeLeNoX in several distinct but complementary ways.
The shifting representation of augmenting CPPNs benefits from the iterative transformations of the novelty heuristic which is used to evolve it, as demonstrated by early features which detect straight lines versus later features which focus on areas of interest. Using the early, simple features for evolving complex CPPNs is shown to hinder exploration, since the representational bias which caused those features to be prevalent has been countered by augmenting topologies. On the other hand, the iterative exploration guided by features tailored to the representation creates a more diverse training set for the autoencoder, resulting in an overall improvement in the features detected, as shown by the increased diversity scores of later features on the same data. This positive feedback loop, where the exploration phase benefits from the transformation phase, which in turn benefits from the improved divergent search of exploration, is the core argument for DeLeNoX.

Figure 8: A sample of the 64 trained features at the end of each iteration. The visualization displays the weights of each pixel of the input (i.e. the left half of the spaceship's sprite). Weights are normalized to black (lowest) and white (highest).

It should be noted, however, that for this case study DeLeNoX is not without its own biases, as the increasingly diverse training set eventually challenges the feature detector's ability to capture typical patterns in the last of the presented iterations; suggestions for countering such biases will be presented in this section. The case study presented in this paper is an example of exploration via high-level features derived by compressing information based on their statistical dependencies. The number of features chosen was arguably arbitrary; it allows for a decent compression (980 pixels to 64 real values) and measuring the Euclidean distance for novelty search is computationally manageable. At the same time, it is large enough to capture the most prevalent features among generated spaceships, at least in the first iterations, where spaceships and their encoding CPPNs are simple. As exploration becomes more thorough (enhanced both by the increased representational power of larger CPPNs and by more informed feature detectors), typical patterns become harder to find. It could be argued that as exploration results in increasingly more diverse content, the number of features should increase to counter the fewer dependencies in the training set; for the same reasons, the size of the training set should perhaps increase. Future experiments should evaluate the impact of the number of features and the size of the training set both on the accuracy of the autoencoder and on the progress of novelty search. Other experiments should explore the potential of adjusting these values dynamically on a per-iteration basis; adjustments can be made via a constant multiplier or according to the quality of generated artifacts. It should be pointed out that the presented case study uses a single autoencoder, which is able to discover simple features such as edges. These simple features are easy to present visually, and deriving the distance metric is straightforward based on the outputs of the autoencoder's hidden layer. For a simple testbed such as spaceship generation, features discovered by the single autoencoder suffice, especially in early iterations of novelty search.
However, the true potential of DeLeNoX will be shown via stacked autoencoders which allow for truly deep learning; the outputs from the upper layers of such a deep belief network (Bengio 2009) represent more "abstract" concepts than those of a single autoencoder. Using such robust features for deriving a novelty value is likely to address current limitations of the feature extractor in images generated by complex CPPNs, and can be applied to more complex problems. The case study presented in this paper is ideal for demonstrating DeLeNoX due to the evolutionary complexification of CPPNs; the indirect mapping between genotype and phenotype and the augmenting topologies both warrant the iterative transformation of the features which drive novelty search. A direct or static mapping would likely find the iterative transformation of the search process less useful, since its representational bias remains constant. However, any indirect mapping between genotype and phenotype, including neuroevolution, grammatical evolution or genetic programming, can be used for DeLeNoX.

Related Work

DeLeNoX is indirectly linked to the foci of a few studies in automatic content generation and evolutionary art. The creation of artifacts has been the primary focus of evolutionary art; however, the autonomy of art generation is often challenged by the use of interactive evolution driven by human preferences. In order to create closed systems, an art appreciation component is used to automatically evaluate generated artifacts. This artificial art critic (Machado et al. 2003) is often an artificial neural network pre-trained to simulate user ratings in a collection of generated content (Baluja, Pomerleau, and Jochem 1999) or between man-made and generated images (Machado et al. 2007). Image compression has also been used in the evaluation of generated artifacts (Machado et al. 2007). While DeLeNoX essentially uses an artificial neural network to learn features of the training set, it does not simulate human aesthetic criteria, as its training is unsupervised; moreover, the learned features are used to diversify the generated artifacts rather than converge them towards a specific art style or aesthetic. This same independence from human aesthetics, however, makes evaluating the results of DeLeNoX difficult. Finally, while the autoencoder compresses images to a much smaller size, this compression is tailored to the particularities of the training set, unlike generic compression methods such as JPEG used in NEvAr (Machado et al. 2007). Recent interest in dynamically extracting features targeting deviation from previously evolved content (Correia et al. 2013) has several similarities to DeLeNoX; the former approach, however, does not use novelty search (and thus exploration of the search space is limited), while features are extracted via supervised learning on a classification task between newly (and previously) generated artifacts and man-made art pieces. The potential of DeLeNoX is demonstrated using the generation of spaceship sprites as a testbed. Spaceship generation is representative of the larger problem of automatic game content creation, which has recently received considerable academic interest (Yannakakis 2012). Search-based techniques such as genetic algorithms are popular for optimizing many different properties of game content; for a full survey see (Togelius et al. 2011).
Procedurally generated spaceships have been optimized, via neuroevolution, for performance measures such as speed (Liapis, Yannakakis, and Togelius 2011a) or for predefined aesthetic measures such as symmetry (Liapis, Yannakakis, and Togelius 2012; 2011b). Similarly to the method described in this paper, these early attempts use CPPN-NEAT to generate a spaceship's hull. This paper, however, describes a spaceship via top and bottom points and uses a sprite-based representation, both of which are more likely to generate feasible content; additionally, the spaceship's thrusters and weapons are not considered.

Acknowledgments

The research is supported, in part, by the FP7 ICT project SIREN (project no: 258453) and by the FP7 ICT project C2Learn (project no: 318480).

2013_9 !2013 A Discussion on Serendipity in Creative Systems Alison Pease, Simon Colton, Ramin Ramezani, John Charnley and Kate Reed Computational Creativity Group, Department of Computing, Imperial College, London ccg.doc.ic.ac.uk

Abstract. We investigate serendipity, or happy, accidental discoveries, in CC, and propose computational concepts related to serendipity. These include a focus-shift; a breakdown of serendipitous discovery into prepared mind, serendipity trigger, bridge and result; and three dimensions of serendipity: chance, sagacity and value. We propose a definition and standards for computational serendipity and evaluate three creative systems with respect to our standards. We argue that this is an important notion in creativity and, if carefully developed and used with caution, could result in a valuable new discovery technique in CC.

Introduction and motivation

A serendipitous discovery is one in which chance plays a crucial role and which results in a surprising, and often unsought, useful finding. This may result in a new product, such as Viagra, which was found when researching a drug for angina; an idea, such as acid rain, which was found when investigating consequences of tree clearance; or an artefact, such as the Rosetta Stone, discovered when demolishing a wall in Egypt. In this paper we describe serendipitous discovery firstly in a human, and secondly in a computational context, and propose a series of associated computational concepts. We follow a modified version of Jordanous's evaluation guidelines for CC (Jordanous 2012), and consider three computational case studies in terms of our concepts and standards for serendipity. We finish by discussing whether serendipity in computers is either possible or desirable, and placing our ideas in the context of related work. Eminent scientists have emphasised the role of chance in scientific discoveries: for instance, in 1679 Robert Hooke claimed: "The greatest part of invention being but a lucky bitt (sic) of chance" (cited in (Van Andel 1994, p. 634)), and, in 1775, Joseph Priestley said: "That more is owing to what we call chance ... than to any proper design, or preconceived theory in the business" (cited in (Merton and Barber 2004, p. 162)). In 1854, Louis Pasteur made what Merton and Barber refer to as "one of the most famous remarks of all time on the role of chance" (Merton and Barber 2004, p. 162) in his opening speech as Dean of the new Faculté des Sciences at Lille: "Dans les champs de l'observation le hasard ne favorise que les esprits préparés" (cited in (Van Andel 1994, p. 634-635)) ("In the fields of observation, fortune favours prepared minds").
Contemporary writers on serendipity include the psychologists Nickerson: "serendipity is widely acknowledged to have played a significant role in many scientific discoveries" (Nickerson 1999, p. 409) and Simonton: "Serendipity is a truly general process for the origination of new ideas" (Simonton 1995, p. 469); scientific journalist Singh: "The history of science and technology is littered with serendipity" (Rond and Morley 2010, p. 66); and cognitive scientist and popular CC writer Boden, who remarks that "Chance is held to be a prime factor in many creative acts" (Boden 1990, p. 233). Equating serendipity with unexpected findings, Dunbar and Fugelsang used observational studies of scientists "in the wild" and brain imaging studies of scientific thinking to show that over half of scientists' findings are unexpected (Dunbar and Fugelsang 2005). The word serendipity was coined in 1754 by Horace Walpole, as describing a particular kind of discovery. He illustrated the concept by reference to a Persian folk tale, The Travels and Adventures of Three Princes of Serendip, in which the princes go travelling and together make various observations and Holmesian inferences: "They were always making discoveries, by accidents and sagacity, of things which they were not in quest of" (cited in (Merton and Barber 2004, p. 2)). One such example occurs when a camel driver asks if they have seen his lost camel, and they display such detailed knowledge of the camel that the driver accuses them of stealing it. They justify their knowledge based on their observations and abductive inferences. In the last 260 years (and the last 60 in particular), this notion of a happy, accidental discovery has gone from being an arcane word and concept to being part of commonplace language. Serendipity is a value-laden concept, and has been considered both to depreciate and enhance a scientist's achievement, leading to accounts in which the role of serendipity in a discovery is either under- or overrated. Despite this difficulty, there are numerous examples of serendipity in scientific discovery, some of which have been gathered into collections ((Roberts 1989) contains over 70 examples, (Rond and Morley 2010) contains examples in cosmology, astronomy, physics and other domains, and (Van Andel 1994) claims to have over 1000 (unpublished) examples). Examples from these sources include numerous medical discoveries, where a side effect was found to be more useful than the original goal; Kekulé's 1865 dream-inspired discovery of the structure of the benzene ring; the discovery, by a Quechua man with malaria who drank water that happened to be tainted by the bark of cinchona trees, that quinine (found in the bark) can cure malaria; Goodyear's discovery of vulcanised rubber, when trying to make a rubber resistant to temperature changes, after accidentally leaving a mixture of rubber and sulfur on a hot stove and finding that it charred, rather than melted; Penzias and Wilson's discovery of the echoes of the Big Bang, in which, while testing for the source of noise that a radio telescope was picking up, they eventually learned from a physicist that the noise was an echo of the Big Bang; and the Rosetta Stone, which was found by a soldier who was demolishing a wall in order to clear ground. We consider three examples below: 1. In 1928, while researching influenza, Fleming noticed an unusual clear patch in a petri dish of bacteria cultures.
Subsequent examination revealed that the lid of the petri dish had fallen off (thus invalidating the experiment) and mould had fallen into the dish, killing the bacteria - resulting in the discovery of penicillin. 2. In 1948, on returning home from a walk, de Mestral found cockleburs attached to his jacket. While trying to pick them off, he became interested in what made them stick so tightly, and started to think about uses for a system designed on similar principles - resulting in the discovery of Velcro. 3. In 1974, Fry was struggling to use pieces of paper to mark pages in his choir book, when he recalled a colleague's failed attempts to develop superglue. The colleague had accidentally made a glue so weak that two glued pieces of paper could be pulled apart - this resulted in the discovery of Post-it notes. We are fortunate in that the sociologist Merton and historian Barber have written a detailed account of the word "serendipity", tracing its meaning from its coinage in 1754 to 1954 (with an extended afterword on its usage from 1954 to 2004) (Merton and Barber 2004). This is a tremendous resource for those who require an algorithmic level of detail of a hard-to-grasp concept. By basing our computational interpretation on this book we can claim that we are using the word in the same way as it is used in common parlance. They highlight three things of particular interest: firstly, while Walpole was unambiguous that serendipity referred to an unsought finding, this criterion has dropped from dictionary definitions (only 5 out of 30 English language dictionaries from 1909 to 2000 explicitly say "not sought for" (Roberts 1989, pp. 246-249)); secondly, while serendipity originally described an event (a type of discovery), it has since been reconceptualised as a psychological attribute (of the discoverer); thirdly, they argue that the psychological perspective needs to be integrated with a sociological one.1

1 Serendipity is usually discussed within the context of discovery, rather than creativity: in this paper we assume an association between the two.

Serendipitous discovery in a computational context

We identify characteristics of serendipitous discovery and propose corresponding computational concepts. The focus-shift. Serendipitous discovery often (perhaps always) involves a shift of focus. In our examples we see focus-shifts in the context of an unsuccessful (but valid) experiment (Viagra); a mistake (leaving the lid off a petri dish, thus invalidating an experiment); previously discarded refuse (weak glue); an accident (letting rubber touch a hot stove); an object which was being removed (the Rosetta Stone); and something which was considered to be a nuisance (the noise in the Big Bang example, the burs on the jacket), unimportant (side effects in medical drugs), or irrelevant (a dream). In all of these cases there is a radical change in the discoverer's evaluation of what is interesting: we can think of this as a reclassification of signal-to-noise (literally, in Penzias and Wilson's case). There is not always a main focus: for instance, de Mestral was out walking when he came across the seeds of his discovery. In cases where there is a focus, this might be abandoned in favour of a more interesting or promising direction, or may be achieved alongside the shift in focus. In computational terms we could model a focus-shift by enabling a system to "change its mind", that is, to re-evaluate as interesting an object which it had previously judged to be uninteresting. Components.
We break down the components in serendipitous discovery as follows: Prepared Mind: This is the discoverer's previous experiences, background knowledge, store of unsolved problems, skills and current focus. It corresponds to the set of background knowledge, unsolved problems, current goal, and so on in a system. Serendipity Trigger: This is the part of the examples discussed which arises immediately prior to the discovery. Examples include a dream, a petri dish with a clear area, cockleburs attached to a jacket and discarded glue. It corresponds to the example or concept in a system which precedes the discovery. Bridge: The techniques which enable one to go from the trigger to the result. These include reasoning techniques such as abduction (Fleming used abductive inference to explain the surprising observation of the clear patch in the petri dish); analogical reasoning (de Mestral constructed a target domain from the source domain of burs hooked onto fabric); and conceptual-blending (Kekulé blended molecular structure with a vision of a snake biting its tail and invented the concept of the benzene ring). In AI, some reasoning techniques are more associated with creativity than others. For instance, analogical reasoning, conceptual-blending, genetic algorithms and automated theory formation techniques have featured heavily in CC publications. This is a good start for the reasoning techniques we identify here. Another key attribute is the ability to perform a focus-shift at an opportune time. Result: This is the discovery itself. This may be a new product (such as Velcro), artefact (such as the Rosetta Stone), process (vulcanisation of rubber), hypothesis (such as "penicillium kills staphylococcus bacteria"), use for an object (such as quinine), and so on. The discovery may be an example of a sought finding (classified by Roberts as pseudoserendipity (Roberts 1989, p. x)), in which case the solution arises from an unknown, unlikely, coincidental or unexpected source. Three dimensions of serendipity. 1. Chance: The serendipity trigger is unlikely, unexpected, unsought, accidental, random, surprising, coincidental, and arises independently of, and before, the result. The value of carefully controlled randomness in CC and AI systems is well established. For instance, GA systems, which are popular in CC, employ a user-defined mutation probability, usually set to around 5-10%. Introducing randomness into search has also proved profitable in other systems. Likewise, the role that surprise plays in CC is well explored. 2. Sagacity: This dimension describes the attributes, or skill, on the part of the discoverer (the bridge between the trigger and the result). In many of these examples others had been in the same position and not made the discovery. This skill involves an open mind (an ability to take advantage of the unpredictable); the ability to focus-shift; appropriate reasoning techniques; and the ability to recognise value in the discovery. 3. Value: The result must be happy, useful (evaluated externally). Measuring the value of a system's results is a well-known problem in CC, and can be evaluated independently of the programmer and system or (as is more common) by the programmer alone.
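For concreteness, the components and dimensions just proposed could be recorded in a system along the following lines (an illustrative Python sketch of our own devising, not an implemented part of any of the systems discussed):

from dataclasses import dataclass, field

@dataclass
class PreparedMind:
    background_knowledge: list = field(default_factory=list)
    unsolved_problems: list = field(default_factory=list)
    skills: list = field(default_factory=list)
    current_focus: object = None   # optional goal

@dataclass
class SerendipitousDiscovery:
    trigger: object        # the datum arising just before the discovery
    bridge: str            # e.g. "abduction", "analogy", "conceptual-blending"
    result: object         # product, artefact, process, hypothesis, use, ...
    chance: float = 0.0    # the three dimensions, each scored in [0, 1]:
    sagacity: float = 0.0  # skill needed to exploit the trigger
    value: float = 0.0     # usefulness, ideally externally evaluated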
A discovery does not have to score highly on each axis to be considered serendipitous. The chances of an unanticipated use being found for a drug under development may be quite high (i.e., the role that chance plays in such a discovery is low), and the sagacity needed to discern that quinine-infused water has cured malaria may be low. While the discoveries that Walpole describes were not always important, the examples given today (in (Roberts 1989; Rond and Morley 2010; Van Andel 1994)) describe valuable, often domain-changing, discoveries. Arguably, the discovery of penicillin is the most serendipitous of our examples, since two improbable events were involved: the combination of penicillium mould and staphylococcus bacteria, and the accident of the petri dish lid falling off; it took great skill to recognise the importance of the observation, and - having saved millions of lives - it is clearly of great value. Environmental factors. As Merton and Barber point out, serendipitous discovery is not achieved in isolation. The discoverer is operating in a messy world and engaged in a range of activities and experiences. We propose the following characteristics of discoverers' environments, and computational analogs: 1. Dynamic world: Data was presented in stages, not as a complete, consistent whole. This corresponds to streaming from live media such as the web. 2. Multiple contexts: Information from one context, or domain, was used in another. This is a common notion in analogical reasoning. 3. Multiple tasks: Discoverers were often involved in multiple tasks. This corresponds to threading, or distributed computing. 4. Multiple influences: All discoveries took place in a social context, and in some examples the "unexpected source" was another person. This corresponds to systems such as agent architectures, in which different software agents with different knowledge and goals interact. The three-step model of SPECS. Jordanous summarises her evaluation guidelines in three steps: to identify a definition of creativity, state evaluation standards, and apply the standards to your creative system (Jordanous 2012). Here we apply these steps to the notion of serendipity. Step 1: Identify a definition of serendipity that your system should satisfy to be considered serendipitous. We propose the following definition of computational serendipitous discovery: Computational serendipitous discovery occurs when a) within a system with a prepared mind, a previously uninteresting serendipity trigger arises partially due to chance, and is reclassified as interesting by the system; and b) when the system, by processing this re-evaluated trigger and background information together with abductive, analogical or conceptual-blending reasoning techniques, obtains a new result that is considered useful both by the system and by external sources. Step 2: Using Step 1, clearly state what standards you use to evaluate the serendipity of your system. With our definition in mind, we propose the following standards for computational serendipity: Evaluation standard 1: (i) The system has a prepared mind, consisting of previous experiences, background knowledge, a store of unsolved problems, skills and (optionally) a current focus or goal. (ii) The serendipity trigger arises partially as a result of chance factors such as randomness, independence of the end result, unexpectedness, or surprisingness.
Evaluation standard 2: The system: (i) uses reasoning techniques associated with serendipitous discovery: abduction, analogy, conceptual-blending; (ii) performs a focus-shift; (iii) evaluates its discovery as useful. Evaluation standard 3: As a consequence of the focus-shift, a result which is evaluated as useful by an external source is found. Step 3: Test your serendipitous system against the standards stated in Step 2 and report the results. In the following section we evaluate three systems against our standards.

Computational Case Studies

Armed with an analysis of serendipity in computational settings, we investigate here the value of these insights with respect to past, present and future creative systems. In particular, we describe and evaluate from a serendipity perspective: (a) an abductive reasoning system which has already been employed in a different context; (b) a series of experiments with the HR automated theory formation system aimed at promoting serendipitous discovery; and (c) a proposed extension to a framework for creative systems currently under development.

The GH system

Our first system models the sort of reasoning initially described by Walpole in the Princes of Serendip story. As described in (Ramezani and Colton 2010), Dynamic Investigation Problems (DIPs) are a type of hybrid AI problem specifically designed to model real-life situations where a guilty party has to be chosen from a number of suspects, with the decision depending on a changing (dynamic) set of facts and constraints about the current case and a changing set of case studies of a similar nature to the current case. Such situations occur in criminal or medical investigations, for instance, and the GH solver has been named after the fictional medical investigator Gregory House, although his namesake Sherlock Holmes would equally suffice. DIPs have been designed to be unsolvable either by machine learning of rules from the case studies or by solving the constraints as a Constraint Satisfaction Problem (CSP), hence requiring a hybrid learning and constraint solving approach. The GH system is given facts about a current investigation, in the form of predicates known to be true which relate various attributes of the guilty suspect but do not identify that suspect. The problems are noisy in that only some of these facts are pertinent to finding the guilty suspect and (optionally) some facts which are required are missing. GH is also given similar facts about a number of previous cases which are related in nature to the current case, with the facts given again in predicate form. The facts of the current case and those of the case studies are given in blocks at discrete time steps, and the software solves the partial problems at each time step. To find the solutions, the facts of the current case are interpreted as a CSP to be solved by the CLPFD solver in SICStus Prolog. Before it attempts to find a solution, GH maps the attributes of the previous cases onto those of the current case, and then uses association rule mining via the Weka machine learning package to find empirically true relationships between the attributes described in the facts. These relationships are selectively added to the CSP in order to find a more precise solution. The DIPs are set up so that the CSP without the extra constraints can be solved by multiple suspects, while - if the correct extra constraints are mined from the case studies - there is only one correct solution.
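As a toy illustration of this hybrid approach (our own sketch, not the GH implementation; case_constraints and mined_rules stand for predicates over suspects, as described above), the current case can be pictured as a filter over suspects that rules mined from past cases then tighten:

def solve_dip(suspects, case_constraints, mined_rules):
    # Keep suspects consistent with the facts of the current case (the CSP).
    candidates = [s for s in suspects
                  if all(holds(s) for holds in case_constraints)]
    # Selectively add constraints mined from past cases, skipping any rule
    # that would eliminate every remaining candidate.
    for rule in mined_rules:
        narrowed = [s for s in candidates if rule(s)]
        if narrowed:
            candidates = narrowed
    return candidates  # ideally narrowed down to a single guilty suspect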
Presenting further details of DIPs or the GH system is beyond the scope of this paper, but suffice to say, we performed a series of experiments to explore the nature of DIPs and the solutions that GH can find. For instance, when the DIPs have 4 pertinent constraints of arity five or less, and 100% of the constraints are available either in the current case or hidden in the case studies, GH has an error rate (i.e., choosing the wrong suspect) of 10%. When only 50% of the pertinent facts can be found, the error rate rises to 31%. Standard 1: (i) The system has a prepared mind consisting of past cases, background knowledge and an unsolved problem. (ii) The serendipity trigger corresponds to a new piece of data which means that a previous case is now relevant. Chance factors arise in which data the system receives, and in what order. Standard 2: (i) The system uses induction, abduction and constraint solving as reasoning techniques; its abductive procedures are of particular interest. (ii) Focus-shifts can occur if a previous case is re-evaluated by the system as relevant to the current case. (iii) The result is the diagnosis or identification of the guilty party, and is judged by the system to be correct. Standard 3: As a consequence of the previously irrelevant case being re-evaluated as relevant, the diagnosis is achieved. Value consists in external evaluations of whether the system has reached the correct solution. Additionally, the environmental factors are partially well represented: the system operates in a dynamic world, and we can see reasoning about different cases as operating in multiple contexts. However, it only solves one task at a time, and there are not currently multiple influences.

Experiments in model generation

The HR program (Colton 2002) is an automated theory formation system which, starting with background knowledge describing concepts and examples of those concepts, uses production rules iteratively to construct new concepts from old ones. It forms conjectures empirically which relate one or more concepts, and evaluates concepts and conjectures using a number of measures of interestingness, which in turn drive a best-first heuristic search whereby the most interesting old concepts are used to produce new concepts. For instance, the complexity of a concept is the number of production rule steps that were used in its production, and the complexity of a conjecture is the average of the complexity of the concepts it relates. When working in domains of pure mathematics for which axioms are given, HR can interface with the Davis-Putnam style model generator MACE and the resolution theorem prover Otter to attempt to disprove/prove empirical conjectures respectively. Working in domains of finite algebra, we started HR with only the axioms of the domain, and the background concepts required to express those axioms. In particular, HR was given no example algebras, and hence each algebra introduced to the theory came as a counterexample to a false conjecture the software made due to lack of data. In all sessions, we used modest time resources for MACE (5 secs) and Otter (3 secs). HR was enhanced so that whenever it found a counterexample to a new false conjecture, it tested to see whether that counterexample broke any previously unsolved open conjecture (i.e., one for which MACE could previously find no counterexample and Otter could find no proof). We found that such occurrences were very rare.
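The enhancement just described can be pictured in a few lines (a sketch of the idea only, not HR's actual code; disproves(example, conjecture) stands for an empirical check of the conjecture against the new algebra):

def on_counterexample(example, open_conjectures, disproves):
    # When MACE finds a counterexample to a new false conjecture, test it
    # against every previously open (unproved, undisproved) conjecture.
    settled = [c for c in open_conjectures if disproves(example, c)]
    for c in settled:
        open_conjectures.remove(c)  # serendipitous re-use of the example
    return settled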
In the three test domains of group theory (associativity, identity and inverse axioms), monoid theory (associativity, identity) and semigroup theory (associativity), when run in breadth-first mode, i.e., with no heuristic search, we never observed this behaviour during sessions with tens of thousands of production rule steps. This is because the search strategy means that usually the simplest concepts, and hence the simplest conjectures, were made early on during the session, and as it became increasingly harder to find counterexamples to the progressively more difficult false conjectures, it was never the case that a later conjecture was disproved with a counterexample that also disproved an earlier one. To attempt to encourage the re-use of counterexamples, we ran random search strategies, whereby the next concept to use in production rule steps was chosen randomly, subject to a complexity limit of 10. This strategy worked for monoids and semigroups, but not for group theory. As an example, in monoid theory, after 1532 steps, this conjecture:

∀ b, c, d (((b∗c = d ∧ b∗d = c ∧ d∗b = c ∧ c∗d = b ∧ (d ≠ id)) ↔ (b∗c = d ∧ d∗b = c ∧ c∗d = b ∧ (d ≠ id))))

was disproved by MACE finding a counterexample. The counterexample also broke this previous open conjecture:

∀ b, c, d (((b∗c = d ∧ c∗b = d ∧ c∗d = b ∧ (∃ e (e∗c = d ∧ e∗d = c))) ↔ (b∗c = d ∧ (∃ f (b∗c = f)) ∧ (∃ g (g∗c = b)) ∧ d∗b = c ∧ c∗d = b)))

This was the sole example we saw in 2000 theory formation steps in monoid theory. In semigroup theory, such events were more common: there were three times when a new counterexample was used to solve a single open conjecture, and on one occasion ten open conjectures were disproved by one counterexample. Standard 1: (i) In these experiments HR develops a prepared mind during the run. The background knowledge is the user-given concepts, the examples which have arisen during the run and all of the developed concepts and conjectures. The open conjectures constitute the store of unsolved problems; the skills are the production rules and other procedural mechanisms. At the point just before the serendipity trigger (the counterexample which arose in the context of the low-complexity conjecture), the current focus is to prove or disprove the current conjecture. (ii) While there is no randomness in the way that MACE generates the serendipity trigger, in the random runs there is randomness in the way that the conjecture which prompted the new example was generated. In addition, the example was generated independently of the end result. Standard 2: (i) The system did not use any of the three reasoning techniques. (ii) It did re-evaluate the previously unsolved conjecture, once it was solved, but this was not the reason that focus shifted. Standard 3: The result was the now-solved (previously open) conjecture. Apart from the fact that a theorem generally has higher status in mathematics than an open conjecture, we cannot claim that the solved conjectures were interesting. (None of them would appear in a textbook on group theory.) However, we can claim that, in this mode, if it was not for the example arising in a different context, the system would not have been able to solve the 18 open conjectures. We know this since it had already attempted to and failed within the time limits.

A flowcharting framework

Figure 1: A poetry generating flowchart.

In a project separate from our work on serendipity, we are building a flowcharting system to be used for Computational Creativity projects.
Each node in the flowcharts undertakes a particular task on data types such as text and images; the task can be generative or evaluative, or it can bring back data from websites or local databases. Without going into detail, the example flowchart in Figure 1 generates poems by compiling tweets mined from Twitter using a single adjective W as a search term, employing sentiment analysis and a rhyming dictionary along the way.

Figure 1: A poetry-generating flowchart.

The following is a stanza from a poem generated by the flowcharting system using this flowchart, where W was malevolent:

I hear the souls of the damned wailing in hell.
I feel a malevolent spectre hovering just behind me.
It must be his birthday.
Is God willing to prevent evil, but not able? Then he is not omnipotent.
Is he able, but not willing? Then he is malevolent.
It's only when his intelligence grows and he understands the laws of man that He becomes malevolent and violent.
I don't find it malevolent, I find it affectionate.
Geeks do weird things and that can be hilarious for different reasons.
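As a rough illustration of the node idea, a flowchart of this kind can be treated as a list of typed tasks threaded over shared data. The sketch below is our own, with invented node names standing in for the project's real nodes (the Twitter and sentiment steps are stubbed):

    from typing import Callable, List, Optional

    Node = Callable[[List[str]], List[str]]  # every node maps data to data

    def mine_tweets(adjective: str) -> Node:
        # data-gathering node; a real node would query the Twitter API
        def node(_: Optional[List[str]]) -> List[str]:
            return ["I don't find it %s, I find it affectionate." % adjective]
        return node

    def sentiment_filter(texts: List[str]) -> List[str]:
        # evaluative node; a real version would score lines and discard some
        return texts

    def stanza_builder(texts: List[str]) -> List[str]:
        # generative node: assemble the surviving lines into one stanza
        return ["\n".join(texts)]

    def run_flowchart(nodes: List[Node], data: Optional[List[str]] = None) -> List[str]:
        for node in nodes:
            data = node(data)
        return data

    poem = run_flowchart([mine_tweets("malevolent"), sentiment_filter, stanza_builder])

Because nodes only agree on the data types they exchange, generative, evaluative and data-gathering tasks can be recombined freely, which is what makes sharing nodes across sites plausible.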
One of the purposes of the flowchart project is to have a platform for the development of creative systems that the whole Computational Creativity community can contribute to and benefit from. Our aim is to have a number of people developing nodes locally at various sites worldwide, then uploading them for everyone to share in building their own flowcharts via a GUI. We are specifically aiming for a domain-independent framework, and to this end, our inspiring examples in building the system are the theory formation abilities of the HR system (Colton 2002), the painting abilities of The Painting Fool system (Colton 2013) and the poetry generation abilities described in (Colton, Goodwin, and Veale 2012). We currently have flowcharts which approximate the functioning of the original systems in all three cases. Another main purpose of the project is to explore ways in which the software can automatically construct flowcharts itself, so that it can innovate at the process level. It is beyond the scope of this paper to describe how this will be done in detail, but one fact is pertinent: if and when such automated construction is possible, we will situate a version of the software on a server, constantly generating, testing and evaluating the flowcharts it produces, and making the artefacts it produces available, along with framing information (Charnley, Pease, and Colton 2012) about the process and the product. As new nodes are developed, they will be automatically made available to the system, and flowcharts will immediately be formed which utilise the new node. The dynamic nature of this framework is clear: nodes will be accessing web services, so the data being used will be constantly changing; existing nodes will be updated and new nodes will be uploaded regularly; and new flowcharts will be created rapidly. In fact, we aim to increase this dynamic nature by having multiple such systems residing on various servers around the world, swapping nodes, flowcharts, outputs and meta-level information at regular intervals. We believe that this will increase the likelihood of chance encounters occurring, and we expect serendipity to follow. Moreover, the framework is not domain specific, and we will encourage the building of nodes which transfer information, say, from visual arts outputs to textual inputs, and vice versa.

Thus, the environmental factors are extremely well represented: the system operates in a dynamic world as it brings back data from websites or local databases, such as streaming from Twitter; the domain-independent aspect ensures that it can operate in multiple contexts (these will be concurrent, as in the example given, in which the contexts are theory formation, painting and poetry). At any time-point there will be multiple tasks being undertaken by the various nodes, and, by feeding into each other, these will provide multiple influences. We believe this will increase the likelihood of results/ideas/processes in one domain being serendipitously applied in another domain, hopefully with happy consequences.

Standard 1: (i) If we view the flowcharting system as a whole, then the prepared mind will be constructed via the nodes, consisting of the knowledge in the system at any time and the generative and evaluative procedures which the nodes are able to perform. Current goals will be the particular tasks that each node is involved in. (ii) The serendipity trigger for a particular node will arise via new information (for instance, from streaming sources such as Twitter) or sharing from other nodes. The sharing and updating could have a random element to it, but the main factor relating to chance will be that new information will arise in independent contexts, and thus will be independent of final results.

Standard 2: (i) As a platform for the development of creative systems that the whole CC community will contribute to and benefit from, the system as a whole will perform a variety of techniques, in particular those associated with creativity. Therefore, we expect that it will be able to perform abduction, analogy and conceptual blending. (ii) The task that each node undertakes can be evaluative, and, if the system can perform automated construction of the flowcharts itself, it will constantly be evaluating the flowcharts it produces. Thus, focus-shifts should be possible; (iii) likewise, nodes will evaluate their own results (the artefacts that they produce).

Standard 3: The artefacts produced, such as the poem above, will be evaluated by external sources to determine the success of the whole project.

Discussion

With respect to the dynamic investigation problem and the model generation experiments described above, we can say that the former is realistic but not particularly serendipitous, while the latter is more serendipitous, but more artificial: in fact, we had to willingly make the system less effective to encourage incidents onto which we might project the word serendipity. This raises the question of whether it is indeed possible to set up a computational situation within which such incidents genuinely occur. The flowchart system is the most promising in terms of making serendipitous discoveries. Of course, the evaluation standards themselves should be subject to evaluation, to make sure that they both reflect our intuitive notion of serendipity and are practical to apply to our CC systems. We assume that in CC we are aiming to develop software which can surprise us, generate culturally valuable artefacts, and produce a good story about how it constructed the artefacts. There is tension between systematicity and serendipity, and it may be the case that incorporating serendipity into a creative system inhibits its ability to produce the desired artefacts. We take seriously the concern that modelling serendipity in CC may be either impossible or undesirable.
One can argue that, given the role of chance in serendipity, it is impossible to program such discoveries; like have-a-go heroes, serendipity in our systems should be cherished but not encouraged. In response to such arguments, we have tried to characterise the sorts of environments which enhance the likelihood of making a chance discovery, and we have suggested computational analogues. Serendipity is not "mere chance": the axes of sagacity and useful results are equally important. That serendipity-facilitating skills can be taught to people is not a new argument; much of what scientists have written on serendipity is designed to teach others what skills are involved (see also (Lenox 1985)). Many (perhaps all) of the skills are standard skills of a scientist, and it may be argued that relevant machine learning techniques, such as anomaly detection and outlier analysis, already exist. We suggest that such techniques will be extremely useful, but probably not sufficient, for computational serendipitous discovery. One might also argue that the same characteristics which aid serendipity would also aid negative serendipity: a system which allowed itself to be derailed from the task at hand might not achieve as much as one which maintains focus. Negative serendipity can be defined in various ways. Van Andel defines it as when "a surprising fact or relation is seen but not (optimally) investigated by the discoverer", giving Columbus' lifelong belief that he had found a new route to Asia, rather than a new continent, as an example (Van Andel 1994, p. 369). We can also define it as a discovery which is prevented due to chance factors: this would be very hard to demonstrate, but relates to the "Person from Porlock" syndrome, where creative flow is broken by an unwanted interruption. As well as negative serendipity, one might argue that a reliance on serendipity contrasts with intelligence, and that a system which uses a random search may exhibit less intelligent behaviour than one which follows a well-developed heuristic search. Thus, in our HR experiment, enhancing its serendipity was a retrograde step for the system. We certainly would not advocate that all CC developers add serendipitous functionality to their existing software, as this might detract from other functionality. Despite this, we suggest that serendipity is a feature which can be both possible and useful to model in future creative systems. The examples of human serendipity all describe groundbreaking discoveries. In CC, we have learned that we must not aim to build systems which perform domain-changing acts of creativity before building systems which can perform everyday, mundane creativity (distinguished as "Big C" and "little c" creativity). Similarly, we must expect to model "little s" serendipity before we are able to model "Big S" serendipity. The dimension which this affects the most is the third one: we must not expect the discoveries to be rated too highly with our embryonic models of computational serendipity. A useful intermediate way of evaluating the results might be with respect to other, non-serendipitous, results.

Related work

Many of the aspects we have identified as inherent in serendipitous discovery are already widespread computational techniques, and there are large bodies of work which will be particularly relevant.
For instance, research into the role of problem reformulation in problem-solving, such as (Griffith, Nersessian, and Goel 2000), is relevant to the focus-shift aspect, in that reformulation can trigger new solutions and re-evaluations. Our notion of focus-shift differs from problem reformulation in that the focus may be on examples, artefacts, etc., rather than problems, and the result of a focus-shift is a re-evaluation rather than a re-representation. Problem-shift, where a problem evolves alongside possible solutions (see, for instance, (Helms and Goel 2012)), is also relevant. Wills and Kolodner (1994) have analysed the processes involved in serendipitous recognition of solutions to suspended design problems, where the solutions overcome both functional fixedness and fixation on standard solutions. They propose a computational model based on the hypothesis that recognition arises from interaction between the processes of problem evolution and assimilation of proposed ideas into memory. Their analysis fits into our sagacity dimension, as they elaborate the skills needed to recognise value in unexpected places, and in particular the ways in which the focus-shift can work. There is related work on chance. For instance, Campbell's model of creativity, "blind variation and selective retention" (described in (Simonton 1999)), in which he draws an analogy between biological evolution and creativity, seems particularly pertinent for serendipity, with its emphasis on "blind" (Campbell elaborates his use of the term and discusses other candidates, including: chance, random, aleatory, fortuitous, haphazard, unrestricted, and spontaneous). This corresponds to our notion of chance. Serendipity was formalised by Figueiredo and Campos in their paper `The Serendipity Equations' (Figueiredo and Campos 2001). This paper used logical equations to describe the subtle differences between some of the many forms of serendipity. In practice, none of the implemented examples rely on the computer to be the prepared mind: it is the user who is expected to have the `aha' moment and thus take the creative step. The computer is used to facilitate this by searching outside the normal search parameters to engineer potentially serendipitous (or at least pseudo-serendipitous) encounters. One example of this is `Max', created by Figueiredo and Campos (Campos and Figueiredo 2002). Here the user emails Max with a list of interests and Max finds a webpage that may be of interest to the user. Max expands the search parameters by using WordNet2 to generate synsets for words of interest. Max also has the ability to wander, taking information from the first set of results and using it to find further pages. Other search examples include searching for analogies (Donoghue and Crean 2002) and content (Iaquinta et al. 2008). These all use different strategies to provide new and potentially serendipitous information to the user (who must be the "prepared mind").
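The kind of query expansion Max performs can be approximated with an off-the-shelf WordNet interface; the sketch below uses NLTK (our own approximation, not Campos and Figueiredo's code, and it assumes the WordNet corpus has been downloaded):

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def expand_interests(words):
        # widen a user's interest list with synset lemmas, so that a search
        # can stray beyond the literal terms and invite chance encounters
        expanded = set(words)
        for word in words:
            for synset in wn.synsets(word):
                expanded.update(name.replace("_", " ") for name in synset.lemma_names())
        return sorted(expanded)

    print(expand_interests(["serendipity", "discovery"]))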
Further work and conclusions

The notion of serendipitous discovery is a popular and rather romantic one. Thus, when scientists or artists are framing their work for public consumption, they might tell a backstory about the role that serendipity played, which might enhance our perception of the value of the discovery or discoverer. In (Charnley, Pease, and Colton 2012), we outline the importance of producing framing information in CC. While the account of a discovery can be fictional (and thus could refer to a serendipity which did not happen), incorporating serendipity into discovery mechanisms could result in richer framing information. Challenging the idea that only humans can be creative is a problem familiar to CC researchers; in the case of serendipity the challenge may be even greater, since the notion of designing for serendipity can appear to be oxymoronic. Our message in this paper is that we should proceed with caution in this intriguing area.

2 http://wordnet.princeton.edu/

Acknowledgements

We would like to thank our three reviewers, who gave particularly thorough reviews and useful references. This research was funded by EPSRC grant EP/J004049.

2014_1 !2014 From Isolation to Involvement: Adapting Machine Creativity Software to Support Human-Computer Co-Creation

Anna Kantosalo, Jukka M. Toivanen, Ping Xiao, Hannu Toivonen
Department of Computer Science and Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland
anna.kantosalo@helsinki.fi, jukka.toivanen@cs.helsinki.fi, ping.xiao@helsinki.fi, hannu.toivonen@cs.helsinki.fi

Abstract

This paper investigates how to transform machine creativity systems into interactive tools that support human-computer co-creation. We use three case studies to identify common issues in this transformation, from the perspective of User-Centered Design. We also analyse the interactivity and creative behavior of the three platforms in terms of Wiggins' formalization of creativity as search. We arrive at the conclusion that adapting creative software to support human-computer co-creation requires redesigning some major aspects of the software, which guides our ongoing project of building an interactive poetry composition tool.

Introduction

Machine creativity and support for human creativity are two complementary goals of computational creativity research. The role of the machine in supporting human creativity has been classified by Lubart (2005) into four categories: computer as a management aid, computer as a communication enabler, computer as a creativity enhancer, and computer as a co-creator in the creative act. It is easy to see how advancements in machine creativity systems could support the role of the computer as a creativity enhancer, or even as a co-creator: a creative system in a certain domain, say poetry, could be used as a creative assistant for a human poet, producing draft poems that the poet could use as inspiration or raw material. This relationship could be taken even further to create a real partnership in which the computer and the user take turns writing and editing a jointly authored poem. Such co-creative systems have great potential for transforming the lives of professionals and laymen alike by increasing their creative potential. To aid the development of future co-creative systems and their integration into the everyday lives of people, it is important to gather and analyse knowledge on the design and use of existing co-creative systems. We use the term human-computer co-creation to refer to collaborative creativity where both the human and the computer take creative responsibility for the generation of a creative artefact. The term co-creation refers here to a social creativity process leading to the emergence and sharing of creative activities and meaning in a socio-technical environment (Fischer et al.
2005), but with the emphasis that the computer, instead of only providing the socio-technical environment, is also an active participant in the creative activities. This is similar to the definition of mixed-initiative co-creativity (MI-CC) by Yannakakis et al. (2014), who define it as the creation of artefacts through the interaction of a human and a computational initiative. They note that the two participants need not contribute to the same degree, and we do not demand symmetric contributions from human-computer co-creative systems either. The focus of this paper is on investigating the design processes for human-computer co-creation systems. More specifically, we investigate the transformation of machine creativity methods into co-creative ones, i.e., from batch methods to human-computer co-creation. Our goal is to shed light on the design process, key design decisions, and various issues in such transformation projects. We look at the process from two directions: a user-centered perspective and a computational creativity perspective based on Wiggins' (2006) model. We first give a brief introduction to user-centered design and a brief description of Wiggins' model of computational creativity. We then carry out an investigation of three systems described in the literature. We discuss the observations, and then reflect on our findings by comparing them to our ongoing work to produce interactive, educational poetry writing software for children.

User-Centered Design Perspective on Human-Computer Co-Creation

We are interested in methodologies and tools for supporting human-computer co-creation. The design of computer support for creativity has been studied both in the field of interaction design (e.g. Carroll and Latulipe (2009)) and in computational creativity (e.g. Yeap et al. (2010)). Interaction design, and especially user-centered design, can provide us with a well-defined design process and a selection of documented methods which have been demonstrated to be useful in designing real-life interactive software. Therefore we adopt user-centered design as the methodological framework for examining the work presented in this paper. User-centered design (UCD) can be considered as the active involvement of users for a clear understanding of user and task requirements, iterative design and evaluation, and a multi-disciplinary approach (Vredenburg et al. 2002).

Figure 1: The user-centered design process as specified in (ISO/IEC 2010).

UCD methods have been developed since the 1980s and are today generally considered to have improved product usefulness and usability (Vredenburg et al. 2002). UCD can also be viewed more broadly as a part of Interaction Design, an umbrella term covering multiple disciplines emphasising different design perspectives in and outside of Human-Computer Interaction (Rogers, Sharp, and Preece 2011, p. 9-11). The UCD process (ISO/IEC 2010) contains six steps (Figure 1): (1) Plan the human-centered design process, (2) Understand and specify the context of use, (3) Specify the user requirements, (4) Produce design solutions to meet user requirements, (5) Evaluate designs against requirements, and (6) Designed solution meets user requirements. Steps 2 to 5 form an iterative circle in which step 5 can be followed again by steps 2, 3, or 4 until the requirements have been satisfied. Methods in UCD vary in the level of user involvement, the resources needed and the type of data gathered, as well as in which part of the design process they are most commonly utilised.
Some of the methods have been developed specifically by human-computer interaction specialists, and some are used by other human-oriented fields, such as anthropology, as well. Usually each UCD team chooses methods suitable for the study of their users in the set context, according to their own resources and expertise. The most used methods include iterative design, usability evaluation and informal expert review (Vredenburg et al. 2002). Many more exist, and we encourage the interested reader to consult a handbook.

A Search Perspective on Creativity

From a computational creativity perspective, we can study creative behaviour supported by software in the light of Wiggins' formalization of creativity as search (Wiggins 2006). Wiggins' model attempts to clarify and formalize some concepts in Margaret Boden's (1992) descriptive hierarchy of creativity. This model represents creative systems with a septuple (U, L, [[·]], ⟨⟨·, ·, ·⟩⟩, R, T, E). Here the universe U refers to an abstract set of all possible artefacts, for instance poems. R refers to a set of rules, expressed in the language L, which defines a subset of the universe U, i.e. the conceptual space of the creative system in question. The traversal function T defines how search in the universe is performed, and the evaluation function E assigns a value to (some) elements of the universe. This formalization allows describing exploratory creativity as search (primarily) in the conceptual space defined by R via the traversal function T and evaluation function E, whereas transformational creativity may be achieved, e.g., by modifying the rules R defining the conceptual space. Wiggins' model provides one way to look at the co-creative process between the user and the computer and to study interaction in the process. For instance, issues arising from conflicts between the rules, evaluation functions, and traversal functions of the computer and the user can now be clearly described in Wiggins' formalism. The (transformative) actions the user and the computer take when such conflicts appear decide what the rules, evaluation function, and traversal function of the larger system consisting of both the computer and the user are. It has to be noted that many other theories, for instance the work by Csikszentmihalyi (1997), could be used as a viewpoint on co-creativity. However, we selected Wiggins' model for its rigorous nature and popularity in the field of computational creativity.
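To make the septuple concrete, the following is one possible Python reading of exploratory creativity as search, in which R is a membership test, T proposes successors and E scores artefacts. The greedy loop and the names are ours; Wiggins prescribes no particular implementation:

    from dataclasses import dataclass
    from typing import Any, Callable, Iterable

    @dataclass
    class CreativeSystem:
        in_space: Callable[[Any], bool]           # R: the conceptual space
        traverse: Callable[[Any], Iterable[Any]]  # T: how search moves
        evaluate: Callable[[Any], float]          # E: value of an artefact

    def explore(system: CreativeSystem, start: Any, steps: int = 10) -> Any:
        # exploratory creativity: follow T inside R, guided by E
        current = start
        for _ in range(steps):
            candidates = [a for a in system.traverse(current) if system.in_space(a)]
            if not candidates:
                break
            current = max(candidates, key=system.evaluate)
        return current

Transformational creativity would then correspond to rewriting in_space (the rules R) itself, rather than moving within the space it defines.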
Case Studies

In this section we review three case studies of interactive software supporting human-computer co-creation. We first describe the criteria used for selecting these systems and then proceed to give a brief overview of the systems. We then analyse these three systems in terms of design processes, user interactions and changes to the underlying machine creativity methods, which provides suggestions for developing future co-creative systems. Since there are few descriptions of the design processes of human-computer co-creative systems in the literature, we have used somewhat loose criteria to select software for this study:

1. The project utilises established methods of computational creativity.
2. The end result of the project is interactive with a human user.
3. Design decisions taken in the project are described.
4. Quantitative or qualitative feedback is available for the interactive software.

The above criteria emphasize projects drawing influences from both disciplines, computational creativity and human-computer interaction. Based on the criteria, we selected three systems: STANDUP (Ritchie et al. 2007; Waller et al. 2009), Scuddle (Carlson, Schiphorst, and Pasquier 2011), and Evolver (DiPaola et al. 2013). Our focus on the design process excludes some otherwise interesting examples of human-computer co-creative software, such as the Sentient Sketchbook (Liapis, Yannakakis, and Togelius 2013) and Tanagra (Smith, Whitehead, and Mateas 2011).

Overview of the selected systems

STANDUP is a pun-generating language playground developed for children with complex communication needs (CCN) (Ritchie et al. 2007; Waller et al. 2009). It is built on the basis of the JAPE system (Binsted 1996; Binsted and Ritchie 1997; 1994), which generates different classes of punning riddles using symbolic rules and a large, general-purpose lexicon. The evaluation of the system with its target users suggested some restrictions in the capacity of the program, but also an increased facility with words and apparent enjoyment on the part of its users (Waller et al. 2009). In addition, anecdotal evidence supported a positive effect on the communication of the children (Ritchie et al. 2007). Scuddle is a movement exploration tool for choreographers to use in the early stages of their choreographic creation process (Carlson, Schiphorst, and Pasquier 2011). It is based on a genetic algorithm used to generate diverse combinations of movements. The evaluation of the program yielded positive results: users found the movements presented by the program non-habitual and creative, and it prompted them to re-examine their own approaches to movement construction. Evolver is a tool designed to help interior designers explore design options based on initial design elements provided by the designers themselves (DiPaola et al. 2013). Its focus is on helping with the labor-intensive early stages of a design project and offering novel designs outside the capabilities of its users. It is based on the autonomous creative genetic programming system called DarwinsGaze (DiPaola and Gabora 2009). Evolver was well received by its target audience, who reported that it supported their creative processes, suggested novel alternatives, eased manual work, and enabled communication. Interestingly, some of the interior designers involved in the evaluation also considered the program a collaborative partner in design instead of a mere platform. All three systems show some established methods of computational creativity used as part of an interactive system. All systems have also been fairly successful tools in increasing the creative potential of their users: STANDUP made the creative process of joke invention more accessible to an audience restricted by communication ability, Scuddle prompted new lines of creative inquiry in its users, and Evolver was at best considered a creative partner.

Interaction

The level of user interaction is quite varied among the three cases. Of the three examples, Scuddle has the lowest level of interactivity. It provides users only with simple options for starting or continuing the evolutionary algorithm, re-starting the whole process, or viewing six results evaluated by the computer (Carlson, Schiphorst, and Pasquier 2011). Describing these interaction options in Wiggins' framework, the theoretical categorisations of dance movements and their value can be seen as the conceptual space of the creative system.
Traversal in the conceptual space is performed via a genetic algorithm, which the user can restart or continue after viewing the computer's pre-evaluated results. The user's role in the interaction lies more in the final evaluation of the artefacts than in the traversal of the options. STANDUP has a higher level of interactivity than Scuddle. It offers a dual mode of interaction: user control can be divided into (1) options for the end user, a child with CCN, and (2) options for his or her carers. The child can choose a specific word to be included in the joke, a topic for the joke, or a specific joke type to be generated. The carer can adjust the program to suit the child best by restricting joke types, adjusting the words used in jokes based on their familiarity, or banning offensive words (Ritchie et al. 2007). In Wiggins' terms, the STANDUP user participates in defining the rules R in addition to participating in the transition function T and the evaluation function E. On the other hand, the computer provides the general conceptual space by defining the classes of puns and the allowed vocabulary. These can be modified by the user, i.e., the user's set of rules for the conceptual space changes the respective set of rules of the computer. The traversal function of the computer is supervised by the user. The evaluation function of the computer makes sure that similar jokes have not been presented to the user before. The user makes the final evaluation and decides which of the jokes are saved. Evolver provides the highest level of user interaction. The user provides the evolutionary algorithm with seed material and can select candidates to be used for generating the next generation of candidates, as well as adjust the color scheme used (DiPaola et al. 2013). Viewed through Wiggins' framework, Evolver's interaction capabilities make the user's actions an integral part of the creative system: Evolver uses the seed material provided by the user to define the conceptual space. Traversal in this space is then performed via an evolutionary algorithm interactively with the user, so that the user decides the parents for the next generation. The evaluation function of the co-creative system is a combination of the fitness function of the computer system and the final evaluation by the user. Mapping the systems into Wiggins' model reveals that the human and the computer participating in the creative act can be viewed as one human-computer co-creative system. The mapping shows how both parties take responsibility for the generation of the creative artefact, although the roles of the computer and the human are different. These particular examples also seem to indicate that the more interactive the system, the more integral the part of the user is in the creative model.
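The contrast between these interaction levels can be made concrete with a sketch of user-in-the-loop evolution in the spirit of Evolver. This is our simplification with invented operators, not the system's code; choose_parents stands in for the user's selection in the GUI:

    import random

    def mutate(design):
        # placeholder variation operator: perturb one numeric design element
        child = list(design)
        i = random.randrange(len(child))
        child[i] += random.uniform(-1.0, 1.0)
        return child

    def interactive_evolve(seeds, choose_parents, generations=3, size=6):
        # the user, not a fitness function alone, decides which designs breed
        population = list(seeds)
        for _ in range(generations):
            parents = choose_parents(population)
            population = [mutate(random.choice(parents)) for _ in range(size)]
        return population

    # e.g. a 'user' who always keeps the first two candidates:
    result = interactive_evolve([[0.0, 1.0], [2.0, 3.0]], lambda pop: pop[:2])

In Scuddle's scheme, the analogue of choose_parents is fixed by the computer's own evaluation; in Evolver it is handed to the user, which is precisely what moves the user inside the creative system.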
Design processes

Carlson et al. (2011) started their design process for Scuddle by studying other computer-aided choreographic systems and used the theory of choreography to establish requirements for Scuddle. They then proceeded to construct a prototype, which was tested with seven choreographers in simulated work sessions between a choreographer and a dancer. As evaluation methods they chose participant observation and open-ended interviews. DiPaola et al. (2013) partnered with a design firm to develop Evolver. The design process started with establishing requirements by analysing the work processes of the employees of the partnering firm. The process continued with iterative prototyping and ended with a final evaluation conducted some months after the completion of the software. Waller et al. (2009) relied on experts for gathering requirements for STANDUP. They continued iterative prototyping with the experts and adults with CCN, and used typically developing children in testing graphics. The end product itself was evaluated with nine children with CCN during a ten-week period including pre- and post-testing for the evaluation of learning effects, a training period for the children, and finally a scenario-based observation of the users while using the software. The effects of the STANDUP software on the lives of the children beyond this period were studied with semi-structured interviews and questionnaires directed at parents and other adults closely involved with the children's learning progress. All of the sample projects seem to follow a similar pattern in their design process (Figure 2). Each project starts with a requirements-establishing stage and continues into prototype building. Two of the projects, Evolver and STANDUP, continued this process iteratively by testing the prototype multiple times and adjusting it accordingly, while only one evaluation was conducted for Scuddle. The last iteration of this cycle can be called the final evaluation, a stage in which the final version of the prototype is evaluated more rigorously, perhaps including an assessment of usefulness or impact on the users.

Figure 2: The design process of a co-creative tool described through the major design stages identified in the example projects.

When the process used in the studied cases is compared to the UCD process of Figure 1, we see that both processes share the stages of specifying requirements, producing solutions and evaluation. Both processes also have iterative properties, although the sample projects seem not to repeat the requirements-setting stage. The stages of planning the process and understanding and specifying the context are missing from the case-based description, but this may also be due to the result-oriented reporting style of the papers, which may omit seemingly obvious details. Waller et al. (2009) report specifically having followed the UCD approach in designing STANDUP, and DiPaola et al. (2013) included researchers with a background in human-computer interaction in the design of Evolver. Finally, the processes differ in one important regard: if we categorise the processes by their starting points, Scuddle is an example of applying a set of machine creativity methods directly in building interactive software, while Evolver and STANDUP are both examples of a process transforming existing autonomous creative systems into interactive products.

Changes to machine creativity methods

To enable a higher level of interaction, the two projects using existing computational creativity prototypes had to make major changes to the machine creativity methods. These changes fall into two rough categories: (1) changes made to facilitate interaction, and (2) changes made to enhance the technical properties to better suit real-time use. The distinction between these classes can also be viewed through Wiggins' model. The first type of change, driven by the goal of adding user interaction possibilities, increases the role of the user in Wiggins' model of the co-creative system, while the technical changes do not.
However, the technical changes may support the quality of user interaction, which makes their categorisation without Wiggins' model difficult. Ritchie et al. (2007) state that JAPE had multiple deficiencies which the STANDUP team had to account for by changing the system. The changes made to facilitate interaction in JAPE include keeping a record of the jokes offered to a user to avoid presenting overly similar ones, restricting the vocabulary to avoid obscene words and to focus on familiar ones, and adding possibilities to guide the search for jokes towards a topic or specific words. The technical changes involve adding better phonetic similarity measures and dropping some joke options to enhance the quality of jokes, as well as dropping some mechanisms to make the algorithm faster. The DarwinsGaze algorithm underwent major changes in order to better suit the needs of Evolver's target audience as well (DiPaola et al. 2013). There is not as clear a distinction between interaction-facilitating and technical changes on the surface, but viewed through Wiggins' model we see that giving the user control over the seed material, the selection of candidates for pairing, and the adjustment of the population all increase the user's role in the system. In addition, to emphasize gene linkage and user interpretability, the genetic algorithm was simplified by changing the gene structure to operate on a higher level of components called design elements. The team also changed the internal format of pictures from bitmap to SVG to support layers in the generation and to facilitate the import and export of pictures. Both of these modifications change the system in a way that can be seen in Wiggins' model. However, while the modifications increase the usability of the system, the user's role is not increased.

Building a Co-Creative Poetry Writer

We now move on to describe our ongoing project of developing an interactive poetry writing tool based on existing poetry generation software. We chose children in comprehensive education as our target user group, as they are learning to use language in creative ways and explore many of the same structures, such as rhyme and rhythm, that are addressed by the existing creative software. The following sections examine our process and compare it to the example cases.

Basis in Computational Creativity Methods

The machine creativity elements in the interactive system under construction are based on the poetry generation work by Toivanen et al. (2012). This approach uses corpus-based methods to find associated words around a given topic word and then writes poetry about the topic by using these words to substitute words in a given piece of text. Poetic devices like rhyme and alliteration can be further controlled by using constraint-programming methods (Toivanen, Jarvisalo, and Toivonen 2013). In addition to these approaches, the system includes methods which can provide poetic fragments in a certain meter (e.g. iambic pentameter) and containing certain words. These fragments have been automatically extracted from large masses of text, and different combinations of them, possibly modified with the word-substitution method, can be used as building blocks for poetry writing.

Design Process

After choosing school children as the target audience, we started establishing requirements by studying the users and the context. Restricted by time and targeting a very sensitive group of users, we decided, like Waller et al.
(2009), to rely on indirect input from children in our early design phases and to use their participation only in the evaluation. We recruited five enthusiastic grammar school teachers to help us. They kindly allowed us to observe their classes. Four of the teachers were teaching a group of approximately 70 second-grade students together. One teacher specialised in the Finnish language and literature, teaching multiple classes in the 7th-9th grades. We observed one full day of education in the second-grade classroom, as well as two ninth-grade lessons. We focused on observing interactions between the teachers and the pupils, as well as among pupils, and on how they worked on creative writing tasks on computers. After the lessons we conducted semi-structured interviews with the teachers in charge. We also sent the teachers an internet-based open-ended questionnaire on teaching materials. The observation revealed differences in the skills of children: younger children were generally still honing their basic writing skills, whereas older children were more focused on the subject matter. The younger children were also challenged by foreign-language user interfaces, but were quick to learn by trying things out and learning from their neighbours. The observation also showed, in real contexts, the behavior and language used by the children when communicating peer-to-peer or with the teachers. This experience gave us inspiration for selecting suitable interaction metaphors (connections to real-world situations or objects, which help in designing insightful interfaces), as well as for reducing the level of complexity in the user interface of our application. We expanded our observations with a literature study on educational software, which revealed more suitable interaction patterns and methods. The interviews and questionnaires showed that teachers saw technology as a means to motivate and aid the learner. Some teachers, especially those working with younger children and children with special education needs, expressed a need for quality software to aid the learning of writing. In general, teachers emphasised poetry writing's role as a creative activity. The interviews and observation indicated that the writing skills of children develop highly individually. Therefore our software needs to cater for writers capable of different levels of creative writing. We decided to develop a creative writing tool allowing for a varied level of computer assistance, to enable writers with different skillsets to try out poetry writing. We decided to use fridge magnets as a simple metaphor for the manipulation of text on screen. An interface for writing sentences using the magnet metaphor has previously been successfully developed by Kuhn et al. (2009). To test the design, we developed a paper prototype, which we evaluated with a specialist researching the use of information technology in education. Based on her feedback we simplified the interface further and revised some features for saving and exporting poetry. She also noted that more advanced writers would need more abstract topics for writing than those we offered in our paper prototype. We iterated the paper prototype development until both the specialist and we ourselves were confident in building a working prototype. At the moment of writing this, we are completing the prototype implementation. Next, we will evaluate the prototype in two ways: (1) scenario-based evaluations with pairs of children in a laboratory setting, and (2) testing in a classroom.
The former is designed to catch the troubles children might have with the tool, and in the latter we want to see how teachers manage a learning setting using the software. The early decisions made about methodology and user involvement can be interpreted as the planning phase of our project, viewed through the UCD process. The observation, interviews, questionnaires and the literature study conclude the second and third phases of the UCD process, or they can be interpreted as the first stage of the general process seen in the examples. The paper prototyping shows some of the iterative prototyping of the general process, or one iterative cycle of the UCD process returning from phase five to phase three. Finally, the planned evaluation fits the general process lifted from the examples very well, while also following the lines of the UCD process. However, there are some challenges to following the UCD process to the letter: we found it challenging to communicate the restrictions of the computational approach to our users for ideation. Similarly, we found that it is difficult to create extensive paper prototypes for testing with users in iterative prototyping. This is mainly because the use cases by definition involve creative input from the user, and it is hard to imitate quick responses to creative inputs. This reduces the feasibility of including users in the early stages of design.

Interaction

In a typical use case, the user can give the computer inspirational keywords, around which the computer generates a few lines of draft poetry, which the user can then start to modify and extend. This should help the user past the blank-page stage. The user may additionally ask for more lines, or just new words for a specific place in the poem, to help find suitable rhyming pairs, for example. Any new fragments of text produced by the system adapt automatically to the modifications and additions made by the user. To enable more symmetric human-computer co-creation, we are also experimenting with different ways to show editing suggestions to the user. From the perspective of Wiggins' model, the user and the software share the same universe U and language L, and they produce a poem together by editing it in turns. Traversal in the conceptual space can thus be performed both by the computer (e.g. providing a line of poetry or proposals for rhyming words) and by the user (e.g. adding more text or changing existing words). They both aim to satisfy (or modify and satisfy) their own rules R and evaluation function E. This shows that our system can also be interpreted as a co-creative system, with the user and the computer both sharing responsibility for the creative artefact.

Changes to the machine creativity methods

The methods of Toivanen et al. (2012; 2013) were designed to compose poetry autonomously, and certain changes were needed to make them work in an interactive system. The interactive poetry writing process supports turns of word substitution and moving by the user and the computer. The grammar template needs to be updated when the user moves words around. The user may ask for suggestions for certain words, and here the constraint-based methods need to be modified so that they can provide, for instance, suggestions for rhyming or alliterating words which satisfy some additional constraints, like having a certain part of speech and grammatical case. Finally, the computer also needs to be able to update its vocabulary and keep a record of the changes made by the user.
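A toy rendering of this turn-taking substitution is given below. It is our own sketch, not the actual system: the association model is stubbed, and `locked` marks the positions the user has edited, which later turns must respect:

    import random

    def associated_words(topic):
        # stub for the corpus-based association model of Toivanen et al. (2012)
        return {"moon": ["silver", "night", "tide"]}.get(topic, [topic])

    def substitute(line, slots, topic, locked):
        # replace substitutable positions with topic-associated words,
        # leaving user-edited positions untouched
        words = line.split()
        for i in slots:
            if i not in locked:
                words[i] = random.choice(associated_words(topic))
        return " ".join(words)

    draft = substitute("the pale garden under rain", slots=[1, 4], topic="moon", locked=set())
    # after the user edits position 1, lock it so the computer's next turn respects it:
    revised = substitute(draft, slots=[1, 4], topic="moon", locked={1})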
In Wiggins' terms, the rules R and the transition function T are defined in collaboration by the user and the computer, as they both can change the contents of the poem. On the other hand, the evaluation is mainly done by the user, as the computer evaluates only some things, such as metric structure, and the final evaluation is always done by the user.

Conclusions

We have looked at the re-design of machine creativity methods into interactive human-computer co-creation tools. Based on the small sample of design processes that we studied, UCD methods seem to be common in creating interactive software on the basis of machine creativity methods. All of the cases we studied follow a similar process that can be viewed as an instance of the UCD process. However, the principles of user involvement, iterative design and a multidisciplinary approach are fulfilled to different extents in each project. Computational creativity methods also set some boundaries for the software to be designed. However, as two of the case studies and our own project show, the methods can be re-negotiated for interactivity, transformationally changing the boundaries of interaction. This in turn can permit new designs, making the re-negotiation process iterative as well. When characterised in Wiggins' framework, the observations are that for a high level of interactivity the re-negotiation of the methods must include interaction-facilitating changes, which give the user a larger role in the system, and that only usability factors can be enhanced without expanding the role of the user in the re-negotiation. However, our sample is small, and the search for other ways to increase interactivity demands further research. The re-negotiation of computational creativity methods and the role of the user in them is an important part of defining the nature of creative interaction in the software. The design choices taken in the re-negotiation further define the extent to which we can achieve human-computer co-creation. These design choices may include questions such as whether the interactions are always human-initiated, or whether the computer may also spontaneously offer new creative perspectives; whether the interaction is done by exchanging creative artefacts, is instruction-oriented, or is carried out in a more conversational manner, creating a socio-technical environment resembling that of human-human co-creation. UCD is focused on the human user. However, if we want to create more balanced human-computer co-creation, we may also need to account for the input the computer needs from the user to be able to participate in the process more extensively. Thus, it might be useful to look into collaborative creativity tools and remote presence to see if the computer can take a role similar to that of another human being as a creative collaborator. The roles of the user and the computer in co-creation should also be connected to the roles considered by Maher (2012). Finally, interesting insight into human-computer co-creation could be gained by using Wiggins' framework to characterise interactions and their effects. Assume that the human and computer agents both apply their own traversal functions T on a shared (partial) artefact, based on their own rules R and evaluations E. This can result, for instance, (1) in immediate synergy, such as reaching good areas in the search space that neither one can reach alone (increasing generative inspiration), (2) in pressure/possibility for transformational creativity (e.g.
productive aberration), as well as (3) in conflicts where one agent takes the search into an area where the other is not able to operate in a meaningful way (generative uninspiration). An analysis of such cases could provide guidance on issues that one should be able to deal with in human-computer co-creation.

Acknowledgments

This work has been supported by the Algorithmic Data Analysis (Algodan) Centre of Excellence of the Academy of Finland and the Helsinki Doctoral Program in Computer Science (HECSE).

2014_10 !2014 Poetry generation system with an emotional personality

Joanna Misztal1 and Bipin Indurkhya2
1 Faculty of Mathematics and Computer Science, Jagiellonian University, Krakow, Poland
2 Computer Science Department, AGH University of Science and Technology, Krakow, Poland

Abstract

We introduce a multiagent blackboard system for poetry generation with a special focus on emotional modelling. The emotional content is extracted from text, particularly blog posts, and is used as inspiration for generating poems. Our main objective is to create a system with an empathic emotional personality that would change its mood according to the affective content of the text, and express its feelings in the form of a poem. We describe here the system structure, including experts with distinct roles in the process, and explain how they cooperate within the blackboard model by presenting an illustrative example of the generation process. The system is evaluated with respect to both the final outputs and the generation process. This computational creativity tool can be extended by incorporating new experts into the blackboard model, and used as an artistic enrichment of blogs.

Introduction

Not until a machine can write a sonnet or compose a concerto because of thoughts and emotions felt, and not by the chance fall of symbols, could we agree that machine equals brain. (Lister 1949)

This expresses one of the strongest requirements for AI quoted by (Turing 1950). It takes the view that the process of expressing feelings by means of artistic artifacts is a hallmark of human capability. Such requirements have created a challenging task for AI: how to design a computer program that could write a sonnet inspired by its thoughts and emotions. In recent years, various poetry-generating systems have been developed, discussed in more detail below, some of which focus only on producing entertaining artifacts, while others simulate the creative process and incorporate affective computing techniques. However, most of them do not model a sense of self capable of expressing its own feelings. The main goal of this project is to take up this challenge and to create a system with an emotional personality. Specifically, we plan to create an empathic system that changes its mood according to the emotions evoked by reading a given text, and expresses them in the form of a poem. Affective empathy has been defined in the psychological literature as the observer's emotional response to the affective state of others (Davis 1983). Similarly, we propose the term computational empathy to mean the recognition and interpretation of the emotions of another person by a computer system. Our work introduces a system with a complex emotional model that attempts to understand affects in human artifacts, and expresses those feelings in the form of a poem. The design includes an optimism rate, which is an individual feature of the system influencing its perception of the environment (the text).
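One way to picture the role of the optimism rate is the following toy update rule; the formula is entirely our own illustration (the system's actual computation may differ), but it shows how the same text can be read differently by differently disposed systems:

    def update_mood(positive, negative, optimism):
        # blend extracted sentiment scores (e.g. on 1-5 scales) into a mood
        # in [-1, 1], weighting positive evidence by an optimism rate in [0, 1]
        weighted = optimism * positive - (1.0 - optimism) * negative
        return max(-1.0, min(1.0, weighted / 5.0))

    print(update_mood(positive=2, negative=4, optimism=0.7))  # mildly positive reading
    print(update_mood(positive=2, negative=4, optimism=0.2))  # same text, gloomier system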
This paper is organized as follows. The Background section presents existing approaches to sentiment analysis and emotional modeling. It also presents the blackboard idea and other poetry-generation systems. The Overview section explains the general idea of the system. The poetry-generation process in our approach is implemented on a blackboard model, which is described in the System Architecture subsection. In this approach, the poetry is composed by a group of experts, each of whom has some specific knowledge about the poetry-generation process, and all of them share a global workspace called the blackboard. The details of the poetry-generation algorithm are presented in the Poetry Generation Algorithm section and explained with an illustrative example. The system takes the inspiration for its creativity from the text provided by the user. Key phrases are extracted from the text to determine the theme of the poem, and also to set its sentiment. The key phrase that is found to be the most inspiring by the experts is used as the title and main theme of the poem. The experts then perform their tasks: word-generating experts produce words related to the topic based on their knowledge. Some of them use lexical resources such as a synonym dictionary or word collocations. There is also one expert incorporating a model of emotional intelligence, which defines the mood evoked by the given text and generates words describing this sentiment. The poem-making experts choose words from the pool and try to arrange them into phrases. Each of them uses its own context-free grammar to construct phrases. Some poem-making experts use poetic tropes like metaphors or epithets to enrich the style. The evaluating experts select the best phrases according to some constraints, considering the stylistic form. The control component tries to regulate the poem composition by maximizing its diversity and choosing the experts that were least frequently used before. Some illustrative results are presented in the Examples section. The Evaluation section contains a summary of the system's performance in the context of the proposed algorithm and the evaluation of the final outputs. The current version of the system includes some basic types of experts; however, the blackboard architecture allows the system to be extended by adding new experts. Possible improvements, proposals for new experts, and possible applications of the program are mentioned in the Conclusions section.

Background

Sentiment analysis and affective lexical resources

The goal of text sentiment analysis is to extract affective information, or the writer's attitude, from a source text. Basically, sentiments may be considered within a polarity classification (positive, negative or neutral). However, this method does not provide a detailed understanding of the author's emotional state, and another approach is needed. Computational methods for sentiment analysis are usually based either on machine learning techniques, such as naive Bayes classifiers trained on labeled datasets, or on lists of words associated with emotional values (positive-negative evaluations or sentiment score values). In our research we use the ANEW database, consisting of nearly 2500 words rated in terms of pleasure, arousal, and dominance (Bradley and Lang 2010), for text arousal calculation.
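For instance, a text's arousal can be estimated as the mean ANEW arousal of the rated words it contains. The sketch below uses an invented three-word excerpt of the lexicon (ANEW arousal ratings are on a 1-9 scale):

    # tiny invented excerpt of an ANEW-style lexicon: word -> arousal rating
    ANEW_AROUSAL = {"storm": 6.0, "calm": 2.4, "fury": 7.8}

    def text_arousal(text, lexicon=ANEW_AROUSAL):
        # mean arousal over the rated words in the text; None if none are rated
        words = (w.strip(".,!?;:").lower() for w in text.split())
        rated = [lexicon[w] for w in words if w in lexicon]
        return sum(rated) / len(rated) if rated else None

    print(text_arousal("The storm brought fury, then calm"))  # -> 5.4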
To extract the sentiment evaluation, we use the SentiStrength (Thelwall et al. 2010) sentiment analysis tool. It estimates the negative and positive sentiment values in short informal texts (rating both positive and negative scores on a 1-5 scale), considering common and slang words, emoticons and idioms. The basis of the algorithm is a sentiment word-strength list containing terms with a 2-5 scale of positive or negative evaluation. The initial, manually prepared word-sentiment list has been optimized by a training algorithm to minimize the classification error on some training texts. The system also includes a spelling correction algorithm and a booster word list with terms that can increase or decrease other words' scores (such as very, extremely), as well as a negating word list with terms which may invert an emotion value (not, never). Additionally, the algorithm uses a list of emoticons commonly used in social web texts, and considers some other stylistic parameters such as questioning and repeated letters. In our approach, we also use the WordNet-Affect lexical resource (Strapparava and Valitutti 2004) to build a hierarchy of words describing emotional states, which are used later to generate the affective content of poems. The lexicon contains WordNet hyponyms of an emotion word, which are a subset of synsets suitable for representing affective concepts correlated with affective words. For example, for the emotional word compassion, we can derive a correlated set of words describing this state: forgive, merciful, excusable, affectionate, commiserate, tender.

Emotional modeling

As mentioned in (Cambria, Livingstone, and Hussain 2012), research on human emotions dates back to ancient times. One of the first categorizations of emotional states was made by Cicero, who separated them into the four categories of fear, pain, lust and pleasure. Later studies on this topic were developed by Darwin (19th century), Ekman (who defined the six basic emotions of happiness, sadness, fear, anger, disgust and surprise in the 1970s) and many others. One approach towards emotional modeling that has been commonly used by scientists since the 20th century is the dimensional model, where particular emotions are represented as coordinates in a multi-dimensional space. One of the first examples is the circumplex model (Figure 1) presented in (Russel 1980). In this model, the horizontal (...) dimension is pleasure-displeasure and the vertical is arousal-sleep (Russel 1980). In Whissel's model (Whissel 1989), the 2D spatial coordinates are evaluation (positive-negative) and activation (passive-active); the author places words from her Dictionary of Affect in Language in this space. Another example of such a model is Plutchik's wheel of emotions (Plutchik 2001), consisting of 8 basic and 8 composed emotions placed in a circle, where the similarity of emotions is represented by the radial dimension.

Figure 1: 2D circumplex model of emotions adapted from (Russel 1980).

The dimensional models are a promising tool for computational modeling of emotions, as they provide a simple way to measure, define and compare affective states. They are used in AI systems to simulate emotional personality, as presented in (van der Heide and Trivino 2010; Kirke and Miranda 2013). However, they have some significant limitations, as they are based mostly on the verbal representation of affects. As mentioned in (Cambria, Livingstone, and Hussain 2012), they do not allow defining more complex emotions, and they do not consider the situation of several emotions being experienced at the same moment.
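Despite these limitations, a dimensional model is straightforward to operationalise. For example, a (valence, arousal) point can be snapped to the nearest labelled emotion on the plane; the anchor coordinates below are invented for illustration and are not taken from Russel:

    import math  # math.dist requires Python 3.8+

    # invented anchor points in (valence, arousal) space, each axis in [-1, 1]
    ANCHORS = {"happy": (0.8, 0.5), "angry": (-0.7, 0.7),
               "sad": (-0.7, -0.5), "calm": (0.6, -0.6)}

    def nearest_emotion(valence, arousal, anchors=ANCHORS):
        # label an affective state by its nearest anchor on the 2D plane
        return min(anchors, key=lambda e: math.dist((valence, arousal), anchors[e]))

    print(nearest_emotion(0.5, -0.4))  # -> 'calm'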
Blackboard architecture

According to the Global Workspace Theory (Baars 1997; 2003), brain function may be illustrated by a theater metaphor in which: "Consciousness (...) resembles a bright spot on the stage of immediate memory, directed there by a spotlight of attention, under executive guidance. The rest of the theater is dark and unconscious." (Baars 2003) Thus, outside the conscious spotlight, the actions are performed by a large number of autonomous specialized modules (the actors). The blackboard architecture is a model that fulfills the assumptions of the Global Workspace Theory of mind, and therefore has the potential to be used in simulating cognitive processes such as creativity. The model may be visualized by another metaphor (Corkill 1991): a group of independent experts with diverse knowledge who share a common workspace (the blackboard). They work on the solution together, and each of them tries to add some contribution to the blackboard until the problem is solved. The blackboard model is an appropriate solution for problems that require many diverse sources of knowledge, or for ill-defined, complex problems. It allows a wide range of different experts: they may be realized as diverse computational models, since their internal representations are invisible at the top level. The idea of using experts to represent knowledge has previously been used to simulate cognitive tasks. For example, in the Word Expert Parser (Small 1979), experts cooperate to provide a better understanding of text during the conceptual analysis of natural language.

Poetry-generation systems

Since making a system that produces aesthetically pleasing poems from predefined templates is not an especially difficult task, various poetry-generation programs work in this way. An elaborate example is Kurzweil's Cybernetic Poet (Kurzweil 1992), which builds a language model from a set of poems input by the user and composes new ones in the same style. The really challenging task, however, is to make a program that produces poems in an intentional way. Gervás (2010) notes that the simulation of human creativity may be significantly different from the original process of creativity itself. Accordingly, there exist various approaches to computer poetry generation. The McGONAGALL system (Manurung, Ritchie, and Thompson 2012) uses evolutionary algorithms to make a poem that fulfills constraints on grammaticality, meaningfulness, and poeticness. ASPERA (Gervás 2001) generates poems with a forward-reasoning system. Toivanen et al. (2012) present a system that creates novelty by substituting words in existing Finnish poetry. In subsequent work, Toivanen, Jarvisalo, and Toivonen (2013) introduce a constraint programming technique for poetry generation. There are also several projects that incorporate emotional affect in the creation process. Colton, Goodwin, and Veale (2012) present a corpus-based poetry generator that creates poems according to the day's mood, estimated from the news of the day. However, the mood is only defined as good or bad, without any further refinement of the emotional state. The Stereotrope system (Veale 2013) generates emotional and witty metaphors for a given topic based on corpus analysis. Another interesting approach is MASTER (Kirke and Miranda 2013), a tool for computer-aided poetry generation in which a society of agents in various emotional states influence each other's moods with their pieces of poetry. The final poem is a result of social learning.
The poems produced by the system are not meaningful in the usual sense, but they consist of repeated words and sounds that create poeticity. Among the above-mentioned systems, we can distinguish two different approaches to modeling the system's personality. In the first, the system's behavior is determined by predefined parameters (e.g. in MASTER, agents have initial moods and words). The alternative is to adapt the emotional state to environmental factors. This approach is taken by Colton, Goodwin, and Veale (2012), where the mood of the day is calculated from the sentiment values in daily news. The Cybernetic Poet also builds a data-driven model, but it does not exhibit any creative or emotional behavior: the system can only replicate the style of existing poetry. In our system, we combine both approaches: the emotional state is acquired from the affective information extracted from the blog text, but it also depends on the individual features of the system, namely the model of emotions and its optimism rate, which give the system an individual personality. Hence the external factors are used only as an inspiration for the theme and a stimulus for the affective state. Our approach may also be compared to MASTER, which is likewise a multi-agent model for poetry generation with emotions. In MASTER (Kirke and Miranda 2013), the agents interact by reciting their own pieces of poetry to each other; thus, in contrast to our model, they do not share any global knowledge. The mood-defining factor for MASTER's agents is the poetry produced by the agents themselves, so its method for calculating the emotional state differs from ours, where sentiments are extracted from web text. Moreover, all of the agents in (Kirke and Miranda 2013) have the same structure, while in the blackboard model they represent diverse computational units with distinct knowledge sources and roles. Our approach may also be considered similar to the idea of specialized families of experts cooperating during the poetry-generation process, incorporated in the later version of WASP (Gervás 2010). Groups of experts work there as a cooperative society of readers/critics/editors/writers. However, WASP does not incorporate the blackboard model directly.

Evaluation approaches

The evaluation of any creative system is a nontrivial problem. As the task is not only to generate a satisfying output but also to imitate the creation process, the evaluation needs to consider both aspects. The most obvious way to evaluate the output is a kind of Turing test (Turing 1950) for poetry, as in (Kurzweil 1992). In such a test, computer-generated poems mixed with human-authored poetry are presented to human judges, and the score is based on how many poems composed by the system were classified by the judges as human-authored. However, this domain-specific Turing test does not evaluate the creation process. Another approach, taken in the FACE descriptive model (Colton, Charnley, and Pease 2011), is based on evaluating the generative act performed by the system and its impact. FACE introduces a set of parameters evaluating the creativity of the program, and considers not only the artifacts produced by the system but also the process of generation, which is essential for creativity evaluation.
A creative act that satisfies all FACE criteria is denoted by a tuple <F, A, C, E>, where the concept C means the system taking input and producing outputs denoted by the expressions E, the aesthetic measure A is a fitness function evaluating the (concept, expression) pairs with real-number values, and the framing information F is a linguistic commentary explaining the context or motivation of the outputs.

Overview

The system structure is based on the blackboard model. It consists of a group of experts that represent diverse sources of knowledge, the common blackboard workspace, and a control component that regulates the process by choosing which of the competing experts will contribute to the final solution. The modules are described in the System architecture subsection. At the beginning of the poetry-generation process, the input text is placed on the blackboard and the agents start to work on it. Each agent has a special role and knowledge, and it waits until it finds something on the common workspace that it can use to perform its task. When something interesting appears, the agent processes the information using its individual knowledge and adds a new partial solution to the blackboard. The control module decides which agent's contribution should be used for the final poem. The algorithm is explained in more detail in the Poetry generation algorithm subsection, along with an illustrative example of the generation process.

System architecture

The system architecture is presented in Figure 2.

Figure 2: Blackboard architecture used in the system. A group of experts that represent diverse sources of knowledge works on the common blackboard workspace. The control component regulates the process by choosing one of the competing experts that will contribute to the final solution.

The main modules of the system are described below.

Blackboard A common workspace holding partial solutions and other information about the problem, shared by the experts. In our system, it consists of:

Text The input text, used as the inspiration for the poem. The experts analyze it to define the poem's main theme and sentimental content.

Constraints The initial constraints and information about the poem. In the example, we use constraints on the number of lines, the number of syllables in each line, and grammar constraints on tense and person to ensure the grammatical consistency of the poem. These constraints are set manually at the beginning of the process or chosen randomly by the system.

Key phrases The most frequent noun phrases retrieved from the text by one of the experts. Each phrase has an inspiration value defined as W * Cat, where W is the number of words that the experts can generate from the phrase and Cat is the number of non-empty categories of those words (the categories are explained under Pool of ideas).

Topic The main theme of the poem, selected from the key phrases as the phrase with the highest inspiration score. If several phrases share the highest value, one is selected at random. Once the topic is set, the experts start to produce artifacts associated with it.

Emotion The emotional state for the poem, defined by one of the experts by analyzing the sentiments of the sentences in the text that contain the topic phrase.

Pool of ideas The part of the blackboard that is used as a workspace for the experts. It contains all words and partial solutions produced by the experts. It is also a source of inspiration, as some experts use artifacts generated by others to produce new ones. The expressions in the pool are divided into categories based on their grammatical form and meaning. The main categories are:

Nouns: the nouns from the topic phrase and their synonyms.
Adjectives: the adjectives from the topic phrase and their synonyms.
Epithets: for each noun from the topic phrase, the adjectives that most frequently precede it.
Verbs: for each noun from the topic phrase, the verbs that most frequently follow it.
Comparisons: for each adjective from the topic phrase, the nouns that most frequently follow it.
Hypernyms: the hypernyms of each noun from the topic phrase.
Antonyms: the antonyms of each noun and adjective from the topic phrase.
Emotional words: words describing the emotional state defined for the poem.
Phrases: expressions generated by the experts, the candidates for the next line of the poem.
Poem draft: the current version of the poem, consisting of lines. Each line is selected from the candidate phrases by the evaluation experts.
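To make these structures concrete, the following sketch shows one plausible way to represent the blackboard and the W * Cat inspiration score in Python. The paper does not publish its implementation, so all names here are ours and purely illustrative.

```python
# Illustrative sketch of the blackboard and the W * Cat inspiration score;
# class and function names are assumptions, not the authors' code.

CATEGORIES = ["nouns", "adjectives", "epithets", "verbs", "comparisons",
              "hypernyms", "antonyms", "emotional_words", "phrases"]

class Blackboard:
    def __init__(self, text, constraints):
        self.text = text                    # input text (the inspiration)
        self.constraints = constraints      # e.g. lines, syllables, tense
        self.key_phrases = {}               # phrase -> inspiration value
        self.topic = None
        self.emotion = None
        self.pool = {cat: {} for cat in CATEGORIES}   # pool of ideas
        self.poem_draft = []                # lines selected so far

def inspiration(words_by_category):
    """W * Cat: W = total number of words the experts can generate from a
    key phrase, Cat = number of non-empty categories they fall into."""
    w = sum(len(ws) for ws in words_by_category.values())
    cat = sum(1 for ws in words_by_category.values() if ws)
    return w * cat
```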
Model of emotions A 2-dimensional model in which each emotional state is represented by coordinates in (valence, arousal) space. The emotions used in the model are the WordNet hyponyms of the word emotion used in the WordNet-Affect hierarchy of emotional categories. The (valence, arousal) coordinates of the emotional labels in the model have been retrieved from the ANEW database. The choice of emotional categories is based on the lexical resources that we use; the model could be improved by rearranging the categories or their spatial coordinates, or by using other, more complex models of emotions, as mentioned in the Background section.

Experts Independent modules that have access to the common blackboard. They are triggered by events on the blackboard: when they find something that they can use, they try to add new information to the blackboard. Each expert has individual knowledge, and the experts have diverse roles in the system.

Analyzing experts Experts that retrieve information from the initial text and add their data to the blackboard.

Keywords expert Extracts the most frequent noun phrases from the text and adds them to the key phrases section of the blackboard.

Emotion expert Defines the emotional state for the poem and sets the emotion on the blackboard. As the whole text may be long and its emotional attitude may vary, the sentiments are considered only for the sentences containing the topic of the poem. Sentiments are calculated in terms of valence (positive/negative evaluation of pleasure, scaled from -5 to 5) and arousal (passive/active, scaled from -5 to 5). The valence of the text is calculated using the SentiStrength tool, which estimates the negative and positive sentiment strength of sentences based on its Emotion Lookup Table. However, as we want our system to represent an independent emotional intelligence, it should perceive the affect of the text in a more subjective way. Therefore, we introduce the optimism rate, a parameter set at the beginning of the algorithm (or chosen randomly) that biases the valence result so that the perception of the text may be more optimistic or pessimistic. The final valence estimated by the program is given by:

$V = \lambda_{opt} \sum_{s \in Text} Sent_{pos}(s) + (2 - \lambda_{opt}) \sum_{s \in Text} Sent_{neg}(s)$ (1)

where $\lambda_{opt}$ is the optimism rate of the system (between 0.7 and 1.3), and $Sent_{pos}(s)$ and $Sent_{neg}(s)$ are the positive and negative sentiment scores of sentence s, summed over all sentences in the text.
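A minimal sketch of equation (1), assuming SentiStrength-style per-sentence scores (positive in +1..+5, negative in -1..-5); the symbol $\lambda_{opt}$ above and all identifiers below are our own naming, since the paper does not give the implementation.

```python
def valence(sentences, opt=1.0):
    """Equation (1): opt is the optimism rate in [0.7, 1.3]; opt > 1
    amplifies the positive sum, opt < 1 amplifies the negative one."""
    assert 0.7 <= opt <= 1.3
    pos = sum(s["pos"] for s in sentences)   # positive scores, >= 0
    neg = sum(s["neg"] for s in sentences)   # negative scores, <= 0
    return opt * pos + (2 - opt) * neg

# A mildly pessimistic system (opt = 0.84) reading two sentences:
sents = [{"pos": 2, "neg": -4}, {"pos": 1, "neg": -2}]
print(valence(sents, opt=0.84))  # 0.84*3 + 1.16*(-6) = -4.44
```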
The arousal value is calculated using ANEW: the algorithm averages the ANEW arousal values of the words in the text. The basic formula for the arousal calculation is:

$A = \big(\sum_{w \in Text} A_{ANEW}(w)\big) / length(Text)$ (2)

where $A_{ANEW}(w)$ is the arousal value of word w retrieved from the ANEW database. However, the sentiment of a text may be expressed not only by its words but also by other features, much as emotions are expressed by voice intonation in a spoken message. For example, the text "That's great..." can be perceived as less arousing than the same words written differently: "That's GREAT!!!". Hence, the arousal calculation uses a punctuation-sensitive algorithm: some punctuation marks in the text increase the arousal value, while others decrease it. The calculated arousal score may be modified according to the rules:

$f(A) = \begin{cases} A - 1 & \text{if "..." in text} \\ A + 1 & \text{if "!" or a word in capitals in text} \\ A + 2 & \text{if "!!!" in text} \end{cases}$ (3)

where A is the text arousal. Once the valence and arousal of the text are calculated, the emotional state is defined as follows:

$emotion = \arg\min_{x \in S} d((v_t, a_t), (v_x, a_x))$ (4)

where emotion is the current emotional state, S is the set of all emotional states in the model of emotions, $v_t$ and $a_t$ are the valence and arousal of the text, $v_x$ and $a_x$ are the valence and arousal of emotional state x, and d is the Euclidean distance.
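Equations (2)-(4) can be sketched as follows. The toy emotion coordinates and the way the punctuation rules combine when a text contains several cues are our assumptions; the paper lists the rules but not their interaction.

```python
import math

def arousal(words, anew, text):
    """Equations (2)-(3): mean ANEW arousal over the words, adjusted by
    punctuation and capitalisation cues. anew maps word -> arousal."""
    a = sum(anew.get(w.lower(), 0.0) for w in words) / len(words)
    if "..." in text:
        a -= 1
    if "!!!" in text:
        a += 2
    elif "!" in text or any(w.isupper() and len(w) > 1 for w in words):
        a += 1
    return a

def nearest_emotion(v_t, a_t, model):
    """Equation (4): the emotional state whose (valence, arousal) point
    is closest, in Euclidean distance, to that of the text."""
    return min(model, key=lambda x: math.dist((v_t, a_t), model[x]))

model = {"despair": (-1.0, 2.1), "calmness": (1.5, -2.0)}  # toy coordinates
print(nearest_emotion(-0.94, 2.0, model))  # -> despair
```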
Words-generating experts Experts that have lexical knowledge. They generate words associated with the topic and add them to the pool of ideas sections.

WordNet expert Generates synonyms, hypernyms, and antonyms for nouns and adjectives based on the WordNet lexical resource (Miller 1995). Adds to the nouns, adjectives, hypernyms, and antonyms sections of the pool.

Collocation expert Generates words that are frequently used together with the given nouns and adjectives, retrieving the information from a bigram model of texts from the Brown Corpus. Adds adjectives that describe nouns to the epithets section, verbs that follow nouns to the verbs section, and nouns that follow adjectives to the comparisons section of the pool.

Emotional-Words expert Generates words that describe the emotional state defined for the poem. The affective words are derived from WordNet-Affect as the hyponyms of the given category name. For instance, if the emotional state is defined as calmness, the generated set of words would contain peace, calm, tranquilly, easiness, cool, still.

Poem-making experts Experts that compete to produce new lines for the poem. They use partial solutions generated by other experts in the pool of ideas to produce new phrases. Their outputs are added to the phrases section of the pool and are evaluated by the selection experts; these phrases may also be extended by other experts. Poem-making experts are triggered when they find something on the blackboard that they can use for their phrases. Each can generate a number of phrases proportional to its importance factor, which is set manually at the beginning of the algorithm. Some of these experts compose stylistic forms typical of poetry.

Grammar experts Experts that use context-free grammar rules to produce phrases.

Apostrophe expert Generates apostrophes with a noun, its description, and its hypernym. For example: O life the heavenly being.

Comparison expert Generates comparisons for adjectives using the nouns most frequently described by them. For example: As deep as a transformation.

Epithet expert Generates expressions with a noun and its epithets or emotional adjectives. For example: marvelous sophisticated fashion.

Metaphor expert Generates metaphors by comparing a person to an object. For example: You were like the downtalking style.

Oxymoron expert Composes phrases with antonymous words. For example: good and bad.

Rhetorical expert Composes rhetorical questions about a noun, or a noun and its epithets. For example: why was the style so peculiar?

Sentence expert Generates sentences according to its grammar rules, using all the word categories, including the emotion-describing words. For example: She loved the peaceable new york.

Recycling experts Experts that generate new phrases by transforming phrases generated by other experts.

Exclamation expert Generates a new phrase by adding an exclamation mark to a phrase from the pool.

Overflow expert Generates a new phrase by breaking a phrase from the pool into two lines.

Repetition expert Generates a new phrase by repeating a phrase from the pool.

Selection experts Experts that select the best solutions according to the given constraints and heuristics.

Inspiration expert Selects the topic for the poem from the set of key phrases according to the formula:

$Topic = \arg\max_{x \in Keyphrases} W_x \cdot Cat_x$ (5)

where $W_x$ is the number of words that the experts can generate from phrase x, and $Cat_x$ is the number of non-empty categories to which these words belong.

Syllables expert Selects the phrases whose syllable count is closest to the target number of syllables for the current line of the poem:

$Line_i = \arg\min_{x \in Phrases} |S_x - S_t[i]|$ (6)

where i is the current line number, $S_x$ is the number of syllables in phrase x, and $S_t[i]$ is the target number of syllables for line i. Syllables are counted using the CMU Pronouncing Dictionary, combined with a syllable-estimating algorithm for words that are not in the dictionary.

Control component The unit responsible for setting the initial constraints for the poem, setting the experts' probabilities, and deciding whose contribution should be used for the current line of the poem. In the current version of the system, the constraints are set on the number of lines, the number of syllables in each line, and the grammatical form and tense. The stylistic constraints are selected at random from a set of templates. The experts' importance factors are chosen manually and are used during the generation process: an expert produces a number of phrases proportional to its importance factor. The control module also tries to maximize the diversity of the poem by giving preference to artifacts generated by experts that have contributed less frequently before. For instance, if the poem so far consists of two lines generated by the grammar expert and one by the apostrophe expert, and for the fourth line the grammar expert competes with the oxymoron expert, the control component will give preference to the oxymoron expert.
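Formulas (5) and (6) and the control module's diversity rule reduce to simple argmax/argmin computations; a sketch, under the assumption that ties on the syllable criterion are broken by expert frequency, as in the worked example below. All names are ours.

```python
from collections import Counter

def select_topic(key_phrases, inspiration):
    """Equation (5): the key phrase with the highest inspiration score."""
    return max(key_phrases, key=inspiration)

def select_line(candidates, target, syllables):
    """Equation (6): keep the (expert, phrase) pairs whose syllable count
    is closest to the target for the current line; ties are kept."""
    best = min(abs(syllables(p) - target) for _, p in candidates)
    return [(e, p) for e, p in candidates
            if abs(syllables(p) - target) == best]

def control_pick(tied, history):
    """Diversity rule: among tied candidates, prefer the expert that has
    contributed the fewest lines so far."""
    used = Counter(history)
    return min(tied, key=lambda ep: used[ep[0]])

# Line 4 of the example below: both candidates have 8 syllables, and the
# Sentence Expert (0 earlier lines) beats the Epithet Expert (1 line).
cands = [("Epithet Expert", "happy corporate existence"),
         ("Sentence Expert", "she sees the pessimistic world")]
tied = select_line(cands, 8, lambda p: 8)   # syllable counter stubbed out
print(control_pick(tied, ["Rhetorical Expert", "Apostrophe Expert",
                          "Epithet Expert"]))
```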
Poetry generation algorithm

We present below the generation process along with an illustrative example. The algorithm can be divided into the following phases.

Modules initialization The blackboard is initialized with the text input by the user. The form of the poem is selected from a set of templates, and grammar constraints are defined for stylistic consistency.

Text: When someone leaves you, apart from missing them, apart from the fact that the whole little world you've created together collapses, and that everything you see or do reminds you of them, the worst is the thought that they tried you out and, in the end, the whole sum of parts adds up to you got stamped REJECT by the one you love. How can you not be left with the personal confidence of a passed-over British Rail sandwich?1

1 http://www.jaceandjenelle.com/my-personal-blog.php

Constraints: Number of syllables in lines: (line 1: 8; line 2: 8; line 3: 8; line 4: 8). Grammar form: Person: she; Tense: present.

The poem-making experts are initialized with individual importance factors varying from 1 to 5, determining how many phrases they can generate in each turn. The default values presented below may be modified manually.

Poem-making experts' importance factors: Apostrophe expert: 2, Comparison expert: 3, Epithet expert: 5, Metaphor expert: 2, Oxymoron expert: 2, Rhetorical expert: 3, Sentence expert: 5, Exclamation expert: 1, Overflow expert: 1, Repetition expert: 1.

The Emotional expert is initialized with a random optimism factor between 0.7 and 1.3; a higher value means a more optimistic attitude. Optimism factor: 0.84.

Topic selection The topic is chosen as the most inspiring key phrase from the text. To define it, all key phrases are first retrieved and evaluated with the inspiration score. The Keywords expert extracts the key phrases as the most frequent phrases consisting of a noun and descriptive adjectives.

Key phrases: [someone, end, whole little world, whole sum, british rail sandwich, parts, personal confidence, fact]

The words-generating experts estimate how many words they can produce from each key phrase. The inspiration of each phrase is calculated according to formula (5), and the Inspiration expert selects the most inspiring phrase as the topic.

Inspirations: whole little world: 6920, personal confidence: 3920, whole sum: 3880, someone: 2324, parts: 1918, fact: 1512, end: 910.

Poem topic: Whole little world

The Emotional expert defines the emotional state for the poem. The sentiments are retrieved from the sentences containing the topic phrase. The expert calculates the valence and arousal according to (1), (2), and (3); the emotional state is then defined as in (4).

Sentences containing the topic phrase: When someone leaves you, apart from missing them, apart from the fact that the whole little world you've created together collapses (...).

Valence: -0.94; Arousal: 2.0; Emotional state: despair.

Words generation Once the topic and the emotional state for the poem are defined, the words-generating experts start to produce their ideas. They store their artifacts under the appropriate categories in the pool of ideas section of the blackboard.

Pool of ideas:
Nouns: [macrocosm, existence, universe, cosmos, world, creation]
Adjectives: [whole, little, small]
Verbs: existence: [loses, reflects, becomes, fails, is, belongs], world: [centered, admired], universe: [is, had, are, was], creation: [is, does, prevents]
Epithets: world: [little, contemporary, real, previous], existence: [happy, celestial, historical], universe: [interdependent, entire], creation: [own, inventive, artistic]
Comparisons: whole: [lines, block, incident, country]
Hypernyms: existence: [state], world: [natural object], creation: [activity]
Antonyms: whole: [fractional], little: [big]
Emotional words: [pessimistic, cynical, resignation, discourage, hopeless]

Phrases generation As the words start appearing in the pool of ideas, the poem-making experts start to produce phrases for new lines according to the grammar constraints. They add their artifacts to the phrases section.
Phrases:
Epithet Expert: corporate existence, great world
Apostrophe Expert: oh world the little natural object
Sentence Expert: the creation prevents abjectly, she likes the hopeless, she loves the pessimistic cosmos
Comparison Expert: as whole as a story, whole like a convocation
Metaphor Expert: she is like the human existence
Exclamation Expert: as whole as a story!
Rhetorical Expert: why is the existence so nonfunctional?
Oxymoron Expert: whole but fractional

Line phrase selection When all experts finish their generation, the phrases that best fulfill the line constraints are selected by the selection experts. Then the control module makes the final selection, judging by the experts' frequencies in former lines. The same algorithm is repeated for each line of the poem.

Generating line 4. Target syllable count: 8.

Poem so far:
line 1: what is the jewish cosmos? (Rhetorical Expert)
line 2: o existence the daily state (Apostrophe Expert)
line 3: perceptual physical world (Epithet Expert)

Syllables expert's best phrase candidates: happy corporate existence (Epithet Expert): 8, she sees the pessimistic world (Sentence Expert): 8.

Control module, selecting the experts less active in former lines: Epithet Expert: 1 line, Sentence Expert: 0 lines.

Line phrase selection: she sees the pessimistic world (Sentence Expert)

Examples

Below we present some example outputs of the system, inspired by three input texts. We include some remarks on the interpretation of the produced poems, which are further analyzed in the Evaluation section.

Compassionate poem about the life Inspired by the text: With the holiday craziness yesterday, and having to work, i didn't get to finish posting all of my thankfulness pictures. So you might see them pop up over the next few days. this morning i am thankful for the adult men in my life. My dad and mr P. i am fortunate to have both of them in my life to encourage me, support me, take care of me, and love the kids with all of their hearts.2

2 http://storyofmylifetheblog.blogspot.com.es/

Topic: Life, Emotion: compassion

Poem:
O life the personal beingness
You are like the simple life!
Musical sacrificial life
You are like the general life
You see the excusable life
Emotional musical life
O life the heavenly being

Remarks: The topic Life provided a wide range of epithets associated with the main phrase. The produced output presents a large lexical diversity of adjectives describing life, which creates the poetic style. The apostrophes are used in the first and last lines of the poem, giving it a closed form. This effect was accidental; however, ordering the experts in this way could be an interesting improvement. The emotional state is expressed only by the adjective excusable, as the numerous other adjectives dominated the emotional words.

Angry poem about the end Inspired by the text: I remember being endlessly entertained by the adventures of my toys! Some days they died repeated, violent deaths, other days they traveled to space or discussed my swim lessons and how I absolutely should be allowed in the deep end of the pool, especially since I was such a talented doggy-paddler.3

Topic: Deep end, Emotion: anger

Poem:
I knew the undisrupted end
I was like the various end
As deep as a transformation
O end the left extremity
Objective undisrupted end
I hated the choleric end
O end the dead extremity

Remarks: The emotional state for the poem is anger, which may correspond to some negative expressions in the text (died, violent deaths, deep end).
The mood is expressed in the poem by the words choleric and hated.

Fearful poem about the way Inspired by the text: Lately everyone has been wondering Is Jenelle and Gary going to get back together?! NO! He is living his life and I'm living mine. We are both happy with our lives the way they are at the moment, I know for me at least I'm EXTREMELY happy. Gary might of been tweeting things because he might of been jealous in a way that I was dating Courtland but he agrees to stop1 today.

3 http://hyperboleandahalf.blogspot.com

Topic: Way, Emotion: fear

Poem:
O mode the symbolic property
Quickest moderate way
She was like the mode
She seemed hysterical because the way left

Remarks: We can observe here that the system does not do well with ambiguous words. The way is interpreted variously as property or mode, but the algorithm does not consider the context of the phrase in the text. However, poetry may allow less strict interpretations of meaning, as ambiguity can be used as an intentional poetic operation.

Evaluation

The evaluation of a creative system is a difficult and ill-defined problem. As the goal is not only to generate a satisfying output but also to imitate the creation process, the evaluation needs to consider both aspects.

Output evaluation As the human interpretation of poetic artifacts is a subjective process, we claim that Turing tests are not a reliable way to evaluate poetry. However, the system requires some kind of evaluation for its outputs. Hence, following (Manurung, Ritchie, and Thompson 2012), we assume that generated texts need to meet the constraints of grammaticality, meaningfulness, and poeticness to be considered valuable poetic artifacts. Below we evaluate our outputs along these dimensions.

Grammar The consistency of the grammatical form is controlled by the constraints on person and tense. The use of context-free grammars as the knowledge of the poem-making experts provides the poem with a proper grammatical structure. As can be observed in the Examples section, the outputs generally exhibit proper grammar. Some minor mistakes are caused by the mis-classification of ambiguous words. This problem could be solved by improving the text-analyzing phase so that key phrases are analyzed in the context in which they are used.

Meaning The meaning of the poem is derived from the lexical (WordNet) and statistical (Brown Corpus analysis) associations of the words in the topic phrase. Poems contain synonyms, hypernyms, and antonyms, as well as words that are most commonly used together with the main phrase. This combination results in a higher diversification of the produced poems. Choosing the most inspiring phrase as the topic yields more possibilities for producing varied and meaningful poems, and the use of phrases describing the emotional state gives the impression of intentionality in the produced compositions. However, as observed in the last example of the Examples section, the algorithm lacks handling of ambiguous phrases. Thus, the interpretation may differ from the meaning of the phrase in the initial text and may not be consistent throughout the poem. This problem could be resolved by analyzing the context of words in the text but, as mentioned above, in poetry the ambiguity may sometimes be perceived as an intentional operation.

Poeticness The poetic form of the generated poems is created by two main factors: the experts using poetic forms in their phrases, and the stylistic constraints on lines.
As can be observed in the presented outputs, the poetic forms used by the experts, such as epithets and apostrophes, make an important contribution to the overall perception of the poetic composition. The stylistic constraints in the current version consider only the number of syllables in each line and are used for selecting the best candidates for lines. This approach does not allow more elaborate poetic operations, such as the use of rhyme or rhythm. However, this could easily be improved by adding new selection experts to the blackboard architecture: each expert would use some heuristic to evaluate the competing phrases, and the final selection would respect all criteria. Another important aspect influencing the poetic character of the outputs is the use of emotionally rich words that evoke imagery and are typical of poetic expression.

Output evaluation summary As presented above, the products of the system meet the triple constraints of grammar, meaning, and poeticness to some extent. Further improvements of these factors should include context-based analysis of words and the introduction of more stylistic constraints on the poetic form.

Model evaluation As the main focus of computational creativity systems is to produce their outputs in an intentional way, the generation process should be an important concern of the evaluation. We propose evaluating our system using the FACE model (Colton, Charnley, and Pease 2011), which is aimed at evaluating creative acts performed by a computer. The details of the model are presented in the Background section. We present below how our system architecture corresponds to its criteria.

Concept and concept expression In our case, the concept is the blackboard architecture with the set of experts cooperating to compose the poem. The motivation for using the blackboard architecture, as presented in the Background section, is the Global Workspace theory, which compares brain function to a group of independent modules sharing a public workspace. The program takes a text as input and produces concept expressions in the form of poems. The outputs are evaluated in the Output evaluation subsection. In this approach we could also consider each expert as an independent concept producing its own expressions as partial solutions to the problem.

Aesthetic measure The aesthetic measures in the system are the heuristic functions evaluating the candidates for new lines of the poem. Each expert (concept) and phrase (expression) pair is evaluated with respect to the stylistic constraints (6) and the expert's previous frequency; the result is a real number. Another measure is used for topic selection: each key phrase is evaluated according to its inspiration value, as in (5).

Framing information The framing information in the system is found only in the name of the emotional state defined according to the model of emotions (4). This output provides some information about the context of the poem.

FACE evaluation summary As presented above, the generation process performs generative acts of the form <F^g, A^g, C^g, E^g>. The F^g is provided by the description of the emotional state only, which may not be sufficient to satisfy the framing information criterion.

Conclusions

We have proposed a system that is capable of expressing its own feelings in the form of a poem. The emotional state is generated by an empathic perception of the text, and the mood is modulated by the optimism rate given to the character.
The blackboard architecture used in the system provides an effective way to model creativity: it is easily extensible with new linguistic resources and stylistic constraints. It could even incorporate experts representing other existing poetry-generation systems, such as Stereotrope for generating metaphors. Moreover, the blackboard model is a computational representation of the Global Workspace theory of mind, which makes it a promising tool for simulating cognitive processes. The poems produced by the system generally satisfy the triple constraints of grammar, meaningfulness, and poeticness; however, in future work, more attention should be paid to the context of the analyzed words. According to the FACE evaluation, our system performs creative acts of the form <F^g, A^g, C^g, E^g>. The aesthetic measure could be improved by defining more stylistic constraints for the poem. The approach presented here can also be applied to generating poetry based on blogs.

2014_11 !2014

Pemuisi: a constraint satisfaction-based generator of topical Indonesian poetry

Fam Rashel1 and Ruli Manurung2
Faculty of Computer Science, Universitas Indonesia, Depok 16424, Indonesia
1fam.rashel@ui.ac.id, 2maruli@cs.ui.ac.id

Abstract. Pemuisi is a poetry generation system that generates topical poems in Indonesian using a constraint satisfaction approach. It scans popular news websites for articles and extracts relevant keywords that are combined with various language resources, such as templates and other slot fillers, into lines of poetry. It then composes poems from these lines by satisfying a set of given constraints. A Turing Test-style evaluation and a detailed evaluation of three different configurations of the system were conducted through an online questionnaire with 180 respondents. The results showed that under the best scenario, 57% of the respondents thought that the generated poems were authored by humans, and that poems generated using the full set of constraints consistently measured better on all aspects than those generated using the other two configurations. The system is now available online as a web application.

Introduction

Poetry is a form of literature with an emphasis on aesthetic aspects such as alliteration, repetition, rhyme, and rhythm, which distinguishes it from other literary forms. In poetry, the specifically chosen wording is infused with much more meaning and expressiveness, hence the difficulty of translating poetry compared to translating prose. Poetry generators are systems capable of automatically generating poetry given certain restrictions and contexts. Gervás (2002) presents an overall evaluation of various poetry generators. Other notable works include Manurung (2003), Colton et al. (2012), and Toivanen et al. (2013). Colton et al. (2012) propose an architecture for poetry generation that is able to generate poetry along with a commentary on the various decisions it made in constructing the poem. Toivanen et al. (2013) show how constraint logic programming can be used to generate poems that satisfy various poetic and linguistic constraints. Our system, Pemuisi (a rather archaic Indonesian word meaning poet), combines the architecture and approach proposed by Colton, particularly the fact that generated poems are based on current news articles, with the constraint satisfaction-based approach of Toivanen, and generates poems using a combination of handcrafted and automatically extracted Indonesian language resources.
The main contribution of this work, aside from the combination of these approaches and the adaptation to the Indonesian language, is the user evaluation that was conducted, as neither Colton et al. (2012) nor Toivanen et al. (2013) present a user evaluation. In the Background section below, relevant previous work is presented, especially the generator described in Colton et al. (2012). The Language Resources section introduces the various language resources required by our system. Pemuisi utilizes two kinds of language resources: templates and slot fillers. Slot fillers are divided into poetic words and keywords. Each of these language resources plays its own role in satisfying poetic properties. In the Constraint Satisfaction Poetry Generation section, we present our constraint satisfaction approach to poetry generation. Poetic features such as the number of lines, syllable counts, and rhymes are defined as a set of constraints, and the system tries to satisfy these constraints while composing the poem. The Experiments and Evaluation section details the various experiments we conducted; the outputs were evaluated through an online questionnaire with 180 respondents, and the results were analyzed based on several criteria, such as the structure, topic, and message of the poem. Finally, we briefly discuss our implementation of a live web application that continuously monitors popular news websites for articles and produces corresponding poems.

Background

Manurung (2003) claims that poetry must satisfy the three properties of meaningfulness, grammaticality, and poeticness. The property of meaningfulness states that a text should aim to convey a message or concept that has meaning when readers try to interpret the text. This property could be a common element of any text, not just poetry. The property of grammaticality states that a poem must comply with the linguistic rules defined by a given grammar and lexicon. This property is also one of the most common requirements that must be met by any natural language generation (NLG) system. The last property is poeticness, which states that poetry must exhibit strong characteristics of poetic elements, e.g. phonetic features such as metre and rhyme. This is the key property distinguishing poetry from other texts. Such requirements imply that it is insufficient for poetry generation systems to simply produce random words. Colton et al. (2012) state that the first poetry generator to be developed was most probably the Stochastische Texte system developed by Lutz, which utilizes a small lexicon consisting of sixteen subjects and predicates from Kafka's Das Schloß. The system randomly chooses words from Kafka's works and fits them into a previously defined grammatical template. Other poetry generators can be grouped into several categories. Gervás (2002), who provides a taxonomy of poetry generation systems based on the approaches and techniques used, identifies at least four different categories, namely (i) template-based systems, (ii) generate-and-test systems, (iii) evolutionary systems, and (iv) case-based reasoning systems. Another perspective from Colton et al. (2012) is that most existing poetry generation systems behave more as assistants, with varying degrees of automation, for the human user, who provides the majority of the resulting context of the poem. Departing from this view, they propose a fully autonomous computer poet, which we refer to as Full-FACE.
Full-FACE is a corpus-based poetry generator that utilizes various resources such as lexical databases, a simile corpus, news articles, a pronouncing dictionary, and a sentiment dictionary. Given these resources, the system is able to generate poetry independently, to the extent of deciding its own form of poetry, such as the number of lines, the rhyme structure, the message, and the theme of the poem. Overall, this system consists of several stages. The first is retrieval, where the various resources needed to produce poetry are gathered, i.e. the Jigsaw Bard simile corpus, a set of constraints, and a collection of keyphrases from Guardian news articles that will form the topic of the poetry. Then comes the multiplication stage, where the aforementioned resources are permuted to obtain variations, so that the resulting poetry is more expressive. For example, the simile corpus yields similes in the form of triples containing information about the simile, e.g. a tuple representing the simile 'as happy as a child's life'. Multiplication is done by applying three kinds of substitution methods: using the DISCO corpus, the simile corpus, or WordNet to find similar words. During the combination stage, Full-FACE produces lines of poetry by combining the simile corpus, the simile multiplication results, and the article keyphrases. This combination is done by following a certain template. For example, the keyphrase 'excess baggage', which matches the simile 'the emotional baggage of a divorce', can be combined into the poem line 'Oh divorce! So much emotional excess baggage' in accordance with the specifications of the template. Finally, in the last stage, called instantiation, the results of the previous processes are collated in accordance with the user-given constraints or an existing template. A fully autonomous computer poet is established by handing over high-level control to the system itself, which is done through context generation alongside commentary generation. Context generation is the process by which the context, topic, and templates that structure the poem, such as the number of lines and the rhyme pattern, are determined by the system. To convey this context, commentary generation produces a commentary on the poem that was made. In general, the commentary describes the state of the heart/emotions at the time of making the poem, a summary of the reference article, and the process of writing the poem.

Language Resources

Our system requires at least two types of resources: templates and slot fillers. These resources are the necessary pieces for the system to make poetry. To prepare them, we go through several processes, each explained below.

Templates

A template is a ready-made sentence (canned text) that has one or more slots to be filled by certain words. Each slot is associated with a part-of-speech tag, such as noun, verb, adjective, or pronoun. Templates are used to fulfill the grammaticality property of a poem. Firstly, we applied an Indonesian part-of-speech tagger to a corpus consisting of 213 poems written by famous Indonesian poets. Template extraction is then performed by removing words that have specific part-of-speech tags, i.e. nouns, verbs, adjectives, and pronouns. The positions of these removed words become slots to be filled later. A slot is also associated with a part-of-speech tag indicating which words may fill it.
For example, consider the following sentence:

Aku mencintai kamu dengan sepenuh hati
I love you with full heart
(I love you with all of my heart.)

Each word is initially tagged with its part of speech. Subsequently, we remove all words tagged as PR (pronoun) and NN (noun) to obtain the following template:

<pr> mencintai <pr> dengan sepenuh <nn>
? love ? with full ?
(? love ? with all of (my/your/their) ?.)

After extracting such templates, the feasibility and appropriateness of a template is evaluated by considering the semantic specificity embedded in it. This consideration is important to prevent providing too much context to the system, and to avoid the risk of plagiarizing an existing line of poetry. Furthermore, with this evaluation we can determine the limits of human intervention concerning the poetic knowledge provided to the system. To illustrate, consider the following two templates (note: VBI indicates an intransitive verb):

ada yang <vbi>, ada yang <vbi>
some that ?, some that ?
(some are ?, some are ?)

ya, <pr> tahu mereka masih menggunakan <nn>
yes, ? know they still use ?
(yes, ? know that they still use ?)

From these two examples we can see that the latter template already carries a fairly specific semantic message. We believe such templates should be avoided. The former template is much more general and does not overconstrain the semantics; such are the desired templates for our knowledge base. Using this consideration, we manually identified 22 templates to be used in our experiments. Theoretically, it would be possible to automate this process by computing the ratio of open-class words remaining in the template, as opposed to function words (closed-class words). The selected templates, along with illustrative English translations, are presented in Table 1. Note that due to grammatical differences the translations may not be well-formed, but they are intended to illustrate the level of generality of the templates. In particular, note that almost all of the canned text contained within the templates consists of function words.

Templates manually selected to be used | Illustrative translations of the templates
1. (slots only) | 1. (slots only)
2. (slots only) | 2. (slots only)
3. (slots only) | 3. (slots only)
4. (slots only) | 4. (slots only)
5. <pr> <vbi> <pr> <vbi> | 5. ? ? ? ?
6. dari <...> ke <...> | 6. from <...> to <...>
7. adalah <...> | 7. there is <...>
8. <...> tapi <...> | 8. <...> but <...>
9. <...> dan <...> | 9. <...> and <...>
10. <...> ini hanyalah <...> | 10. <...> is just <...>
11. <nn> dan <nn> bisa dibawa <vbi> | 11. <...> and <...> can be brought <...>
12. bersama <...> | 12. with <...>
13. <...> adalah untuk <...> | 13. <...> is for <...>
14. dengan penuh <...> dalam <...> | 14. with full <...> in <...>
15. tak ada lagi <...> dan <...> | 15. no more <...> and <...>
16. adakah <...> padaku atau <...> | 16. is there <...> with me or <...>
17. ada yang <...> ada yang <...> | 17. some are <...> some are <...>
18. mengapa <...> | 18. why <...>
19. oh <...> begitu <...> | 19. oh <...> is so <...>
20. terlalu <...> bagi <...> | 20. too <...> for <...>
21. <...> menjadi <...> | 21. <...> becomes <...>
22. apa itu <...> | 22. what is <...>
Table 1. List of templates along with illustrative translations; slots are marked <...> (or with a POS tag where known)

Additionally, other information that must be provided along with each template is the number of lexical slots available and the number of syllables in the canned text of the template. This information is required for the selection process, for example to count the number of syllables and keywords. Figure 1 provides an example of how templates are represented in our system.

TEMPLATE: SYLLABLE COUNT, SLOT COUNT
[<nn>, dan, <nn>, bisa, dibawa, <vbi>]: 6, 3
[<pr>, <vbi>, <pr>, <vbi>]: 0, 4
Figure 1. Two examples of templates

Figure 1 shows two templates (#11 and #5 from Table 1). The first template contains 6 syllables within its canned text (dan, bi, sa, di, ba, wa) and has 3 lexical slots (2 nouns and an intransitive verb). The second template has 0 syllables within its canned text and has 4 lexical slots (2 pronouns and 2 intransitive verbs).
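A template as in Figure 1 amounts to canned tokens plus an ordered list of slot POS tags and a precomputed syllable count; one possible Python representation (our naming, with the slot tags of templates #11 and #5 taken from the slot counts just quoted):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Template:
    """A template as in Figure 1: canned-text tokens interleaved with
    POS-tagged slot markers, plus precomputed counts."""
    tokens: List[str]    # canned words and slot markers such as "<nn>"
    syllables: int       # syllables in the canned text only
    slots: List[str]     # POS tags of the open slots, in order

# Templates #11 and #5 from Table 1.
t11 = Template(["<nn>", "dan", "<nn>", "bisa", "dibawa", "<vbi>"],
               syllables=6, slots=["nn", "nn", "vbi"])
t5 = Template(["<pr>", "<vbi>", "<pr>", "<vbi>"],
              syllables=0, slots=["pr", "vbi", "pr", "vbi"])
```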
Slot fillers

Slot fillers are simply words used to fill the slots contained in the templates. They must also be associated with a part-of-speech tag and other information that is needed in the selection process. Slot fillers can be divided into two types: keywords and poetic words.

Keywords are slot fillers that determine the theme of the constructed poem. These words are expected to give the poem a sense of meaningfulness, so that its readers will perceive some message being conveyed. At the beginning of the poetry generation process, we crawl popular Indonesian news websites such as kompas.com and detik.com. This is motivated by Full-FACE, which crawls the Guardian news website to determine the theme of the poem. An article is selected based on given criteria, such as most recent, most commented on, or most read. After selecting an article, keyword extraction is performed using simple unigram statistics with stopword removal. The words with the highest frequency of occurrence become the keyword candidates. An expanded collection of keywords is then constructed by identifying words that frequently occur together with the extracted words, using the Wortschatz-Leipzig Corpora Collection (Quasthoff et al., 2006). Other information that must be associated with each keyword is its pronunciation and syllable count. This information is used in the selection process, for example for the computation of rhyme and of the number of syllables in a line. Figure 2 shows an example of how keywords are represented in our system.

WORD: POS, PRONOUNCE, SYLL.COUNT, FLAG
senja: nn, [s, eu, n, j, aa], 2, keyword
Figure 2. Example of keyword representation

In this example, the keyword senja has a part-of-speech value of NN (noun), pronunciation (s, eu, n, j, aa), 2 syllables (sen and ja), and a keyword flag indicating that senja is one of the keywords of the article. For the experiments that we conducted, we selected 3 news articles and extracted a total of 247 keywords: 88 from the 1st article, 72 from the 2nd article, and 87 from the 3rd article.

Poetic words are obtained from the same corpus of poetry used for template extraction. They are designed to help the generated poem satisfy the property of poeticness. Unlike the other constraints, which focus on structure, this property concerns the selection of words that add to the aesthetics of the poem. The frequency of appearance of every word in the corpus is computed, stopwords are removed, and the fifty most frequent words are selected. Finally, we apply an Indonesian POS tagger to obtain their part-of-speech tags. Poetic words tend to convey a more general concept, as opposed to the specific keywords derived from a news article; furthermore, they tend to be more archaic in nature. The technical representation of poetic words is similar to that of keywords, as they must also be associated with a pronunciation and a syllable count. Figure 3 shows an example of how poetic words are represented in our system.

WORD: POS, PRONOUNCE, SYLL.COUNT, FLAG
kalbu: nn, [k, aa, l, b, oo], 2, filler
Figure 3. Example of poetic word representation

The poetic word kalbu has a part-of-speech value of NN (noun), pronunciation (k, aa, l, b, oo), 2 syllables (kal and bu), and a filler flag indicating that kalbu is a poetic word.
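The keyword-extraction step described above is simple enough to sketch directly; the stopword handling, top-n cutoff, and names are our illustrative choices (the co-occurrence expansion via the Wortschatz-Leipzig collection is omitted).

```python
from collections import Counter

def extract_keywords(article_words, stopwords, top_n=50):
    """Unigram statistics with stopword removal: the most frequent
    remaining words become the keyword candidates."""
    counts = Counter(w.lower() for w in article_words
                     if w.isalpha() and w.lower() not in stopwords)
    return [w for w, _ in counts.most_common(top_n)]
```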
Constraint Satisfaction Poetry Generation

Our system adapts the approach proposed by Colton et al. (2012). The system creates poetry from the collection of templates combined with a particular set of words. The result of combining templates with keywords and poetic words is the set of lines that are collated to construct the poem. Overall, the system is implemented in three stages: retrieval, combination, and selection. It differs from Full-FACE in the following ways. Firstly, Pemuisi is a much more knowledge-poor system, as there are far fewer lexical resources available for Indonesian than for English, in particular the Jigsaw Bard resource, which appears to make a major contribution to the poeticness and coherence of the poems generated by Full-FACE. Secondly, following Toivanen et al. (2013) (and, to a lesser degree, Manurung (2003)), it explicitly treats the generation process as a constraint satisfaction problem, which affords a declarative formulation of the generation process and the use of efficient off-the-shelf constraint solvers. Currently, Pemuisi is implemented as a logic program in Prolog: all lexical resources are encoded as factual assertions in the Prolog database, and the poetic constraints are implemented as clauses with subgoals that must be satisfied. Lastly, Pemuisi does not attempt the handing over of high-level control that is implemented in Full-FACE, which is equipped with various definitions of aesthetics.

Retrieval

During this stage, a simple retrieval is performed by taking the relevant resources described previously from the knowledge base. Given an input news article, the system populates the Prolog database with all relevant keywords, poetic words, and appropriate templates. The retrieval process can be set to randomly reorder the sequence of factual assertions, so that the systematic Prolog depth-first search can yield novel results on repeated runs. Figure 4 shows an example of the output of this stage.

TEMPLATE:
[<nn>, dan, <nn>, bisa, dibawa, <vbi>]: 6, 3
[<pr>, <vbi>, <pr>, <vbi>]: 0, 4
SLOT FILLER:
aku: pr, [aa, k, oo], 2, filler
kau: pr, [k, aa, oo], 2, filler
senja: nn, [s, eu, n, j, aa], 2, keyword
kalbu: nn, [k, aa, l, b, oo], 2, filler
bayang: nn, [b, aa, y, aa, ng], 2, keyword
pergi: vbi, [p, eu, r, g, ee], 2, filler
kembali: vbi, [k, eu, m, b, aa, l, ee], 3, filler
menunggu: vbi, [m, eu, n, oo, ng, g, oo], 3, keyword
Figure 4. Example output of retrieval stage

Combination

After collecting all the necessary resources, the system can start building the poem from its simplest unit, the poetry line. The combination process produces a poetry line by merging a template with slot fillers while obeying certain rules: each slot in the template must be filled with precisely one slot filler, and a slot can only be filled by a slot filler with a corresponding part-of-speech tag. For example, a slot with a POS tag of NN (noun) can only be filled by a keyword or poetic word with a POS tag of NN. The system will exhaustively consider all possible valid combinations of templates and slot fillers. Consider the following example. Suppose that the resources obtained from the retrieval stage are as in Figure 4, which means the system must now combine 2 templates with 8 slot fillers: 2 pronoun (PR) fillers; 3 noun (NN) fillers, 2 of which are keywords; and 3 intransitive-verb (VBI) fillers, 1 of which is a keyword. Instantiating all slot combinations with the corresponding slot fillers, the number of poetry lines that can be generated from the above resources can be calculated directly: 3 x 3 x 3 = 27 for the first template and 2 x 3 x 2 x 3 = 36 for the second.
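A sketch of the combination stage (Pemuisi performs this search in Prolog; the Python below is our illustrative reformulation). Allowing a filler to be reused within a line is an assumption, but it is the reading that reproduces the 63 lines counted here.

```python
from itertools import product

def combine(tokens, slots, fillers_by_pos):
    """Yield every line from one template: each POS-tagged slot takes any
    filler with a matching tag."""
    for combo in product(*(fillers_by_pos[pos] for pos in slots)):
        it = iter(combo)
        yield " ".join(next(it) if t.startswith("<") else t for t in tokens)

fillers = {"pr": ["aku", "kau"],
           "nn": ["senja", "kalbu", "bayang"],
           "vbi": ["pergi", "kembali", "menunggu"]}
t11 = (["<nn>", "dan", "<nn>", "bisa", "dibawa", "<vbi>"], ["nn", "nn", "vbi"])
t5 = (["<pr>", "<vbi>", "<pr>", "<vbi>"], ["pr", "vbi", "pr", "vbi"])
lines = list(combine(*t11, fillers)) + list(combine(*t5, fillers))
print(len(lines))  # 27 + 36 = 63
```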
Thus 63 valid combinations of poetry lines can be obtained from the combination of templates and corresponding slot fillers.

Selection

After the combination stage, the system has a large collection of poem lines that are ready to be built into the larger unit, i.e. the poem itself. This stage combines the lines obtained as results of the combination stage. The resulting poem must satisfy the elements of poetry, such as the number of syllables, rhyme, rhythm, and number of lines. These poetic elements are defined as constraints. The constraints used include:

1) Number of lines: states the number of poetry lines. As explained in the combination stage, a single line is the result of combining a template with one or more slot fillers.
2) Rhyme: states the rules of rhyme between the lines of the poem.
3) Number of words: states the number of words contained in a single line of poetry. The number of words can be specified differently for each line.
4) Number of syllables: states the number of syllables contained in a single line of poetry. The number of syllables can be specified differently for each line.
5) Number of keywords relative to the number of slots: states the proportion of slots in the whole poem that are filled with keywords. To be more intuitive, this constraint is expressed as a percentage; it can be used to control how strongly the content of the poem focuses on a topic.

The above set of constraints must be met when choosing combinations of the lines resulting from the previous stage; this is the essential point of the constraint satisfaction approach, as also seen in Toivanen et al. (2013). The previous example yielded 63 lines of poetry that can be combined into poems. For instance, assume the following constraints:

1) The poem consists of 2 lines.
2) Lines 1 and 2 share the same end-of-line rhyme.
3) Line 1 consists of 6 words with a total of 12 syllables.
4) Line 2 consists of 4 words with a total of 10 syllables.
5) 40% of all slots must be filled with content keywords.

If we only consider the first constraint, it can be calculated that there are 63^2 = 3,969 candidate poems that could be generated. But as the subsequent constraints are applied, fewer and fewer combinations of lines are able to meet them all. There are at least three cases that may occur after the selection process: (i) the system does not produce a single poem at all, (ii) it produces exactly one poem, or (iii) it generates more than one poem. If no poem is produced, there is no combination that satisfies the constraints that have been defined; in this case, the constraints are gradually relaxed and the selection stage is repeated until a poem can eventually be produced. When loosening constraints, the constraint with the lowest precedence is chosen to be relaxed first, and the process is repeated until the system is capable of producing a poem that satisfies the remaining constraints. If the system is able to produce one or more poems, it randomly selects one as its eventual output; another alternative is to provide all the poems as output.
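In imperative form, the selection stage with constraint relaxation might look as follows (Pemuisi itself encodes it declaratively in Prolog); the predicates and the brute-force search over line tuples are illustrative stand-ins.

```python
from itertools import permutations

def select_poem(lines, constraints, n_lines):
    """Try every combination of lines against a precedence-ordered list of
    constraint predicates; if nothing satisfies all of them, drop the
    lowest-precedence constraint (the last one) and search again."""
    active = list(constraints)
    while True:
        for poem in permutations(lines, n_lines):
            if all(check(poem) for check in active):
                return poem
        if not active:
            return None
        active.pop()  # relax the constraint with the lowest precedence

# Toy usage: 2-line poems whose lines must end with the same letter (a
# stand-in for rhyme), with the longer line first.
same_rhyme = lambda p: p[0][-1] == p[1][-1]
longer_first = lambda p: len(p[0]) >= len(p[1])
print(select_poem(["senja menunggu", "kalbu pergi", "bayang menunggu"],
                  [same_rhyme, longer_first], 2))
```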
The purpose of providing six alternative structures is to make the poetry generated by the system more diverse.

An Illustrative Example
In this section we provide an example of the output of Pemuisi. It was run to construct a poem based on an article from an Indonesian news portal, kompas.com, about Sir Alex Ferguson's retirement in 2013 as Manchester United head coach. We set Pemuisi to compose a poem under the full constraint configuration and then randomly took 3 stanzas. Figure 5 shows the poem made by Pemuisi; the corresponding constraints which served as Pemuisi's reference while generating this poem can be seen in Figure 6. Comparing Figures 5 and 6, we can see that all the constraints were satisfied by the resulting poem.

fergie pergi (fergie is gone)
ferguson pensiun, ferguson berhenti (ferguson retired, ferguson stopped)
adakah masa padaku atau juri (is there time with me or jury)
fergie berhenti (fergie stopped)

fergie pensiun sendirian (fergie retired alone)
dengan penuh merah dalam perjuangan (with full red in struggle)
tak ada lagi akrab dan perjalanan (no more friendship and trips)
fergie pensiun sendirian (fergie retired alone)
dengan penuh biru dalam kesedihan (with full blue in sadness)
tak ada lagi akrab dan pertandingan (no more friendship and matches)

ferguson, ini hanyalah kompetisi (ferguson, this is just a competition)
usia dan keputusan bisa dibawa pensiun (age and decisions can be brought in retirement)
fergie, ini hanyalah tradisi (fergie, this is just a tradition)
pemain dan manajemen bisa dibawa pensiun (players and management can be brought in retirement)

Figure 5. Illustrative output of Pemuisi

Experiments and Evaluation
We conducted experiments using several constraint configurations through an online web-based questionnaire to gauge respondents' opinions about the poetry generated by the system. Information about the experiment was distributed through various mailing lists and social media channels (e.g. Facebook, Twitter), targeting native Indonesian speakers including public groups, academic communities, and poetry appreciation communities, in order to obtain a balanced and valid distribution of respondents, ranging from laymen's appreciation of poetry to communities that specifically discuss poetry appreciation. At the end of the data collection, we had obtained 180 respondents.

Constraint configurations
Three constraint configurations were applied. In the first configuration, the full set of poetic constraints is applied, and a ratio of 50% of the open slots must be filled by content keywords.
The second configuration is similar to the first, but in this case all of the open slots must be filled by content keywords. Finally, the loose constraint configuration is one where the system is more or less left unguided to generate poems, the only constraints being the use of templates, part-of-speech tags, and the number of lines to be generated; poetic features such as syllable counts, rhymes, and content keyword ratios are ignored. Obviously, respondents were not made aware of the distinction between these three configurations, and were simply asked to rate the perceived quality of the generated poems regardless of the configuration of the generator.

Stanza 1: Number of lines: 4
  Line 1: 2 words, 4 syllables
  Line 2: 4 words, 12 syllables
  Line 3: 5 words, 12 syllables
  Line 4: 2 words, 5 syllables
  Lines 1, 2, 3, and 4 rhyme with each other
  Keywords composition: 100%
Stanza 2: Number of lines: 6
  Line 1: 3 words, 10 syllables
  Line 2: 5 words, 12 syllables
  Line 3: 6 words, 12 syllables
  Line 4: 3 words, 12 syllables
  Line 5: 5 words, 12 syllables
  Line 6: 6 words, 12 syllables
  Lines 1, 2, 3, 4, 5, and 6 rhyme with each other
  Keywords composition: 100%
Stanza 3: Number of lines: 4
  Line 1: 4 words, 12 syllables
  Line 2: 6 words, 16 syllables
  Line 3: 4 words, 10 syllables
  Line 4: 6 words, 16 syllables
  Lines 1 and 3 rhyme with each other; lines 2 and 4 rhyme with each other
  Keywords composition: 100%
Figure 6. Constraint configuration used for the poem in Figure 5

Turing Test
Before conducting the main experiment to see how respondents evaluated the computer-generated poems in terms of various aspects, we first conducted a simple Turing Test-like experiment to determine how well the system is able to imitate human behavior, in this case writing poetry. For this experiment, we selected snippets from four poems created by famous Indonesian poets (such as Chairil Anwar, Sutardji Calzoum Bachri, and WS Rendra), four poems generated by the system with the full constraint configuration, and four poems generated by the system with the loose constraint configuration. For this Turing Test, the system only used poetic words as slot fillers, so that the poetry does not specifically discuss a particular topic. These poems were randomized in the questionnaire, and respondents were asked to annotate each poem by guessing whether it was written by a human or by the system. Figure 7 shows some poem examples from the Turing Test section. The questionnaire results for the Turing Test are shown in Table 2.

                  Human      Full          Loose
                  authored   constraints   constraints
Judged human      74%        57%           35%
Judged machine    26%        43%           65%
Table 2. Results for Turing Test experiment

74% of the judgments of human-authored poems were correct, while 26% were erroneous (i.e. the poems were deemed machine-authored). As for the poems generated with the full set of constraints, 57% of the judgments were erroneous, i.e. they were deemed human-authored; for the poems generated with the loose constraints, respondents falsely identified them as human-authored in only 35% of the cases.
Human authored (Hilang [Lost], by Sutardji Calzoum Bachri):
  batu kehilangan diam (A stone loses silence)
  jam kehilangan waktu (A clock loses time)
  pisau kehilangan tikam (A knife loses stab)
  mulut kehilangan lagu (A mouth loses song)
Full constraint:
  tak ada lagi pilu dan rindu (no more pain and yearning)
  dari rindu ke mentari (from yearning to the sun)
  ada yang terdiam (some lay silent)
  ada yang menunggu (some lay in waiting)
Loose constraint:
  cinta kau adalah sakit untuk kau (your love is pain for you)
  aku melayang, aku melayang (I fly, I fly)
  cinta kau adalah sakit untuk kau (your love is pain for you)
Figure 7. Poem examples for Turing Test

Main experiment
For the main experiment, the three constraint configurations were each applied to three different news articles, resulting in 9 different poems being assessed. The poems were randomly obtained from the system output. In this section of the experiment, we aim to analyze how the poems generated by the system under different configurations were appraised by respondents. The questionnaire randomly presents one of the three chosen news articles along with the three poems produced from that article under the previously discussed constraint configurations. Each poem is the result of concatenating three stanzas that were generated and selected randomly. Respondents were asked to assess the poems based on the following criteria:
1) Structure: a criterion to evaluate the overall structure of the poem, i.e. whether or not it fulfilled the respondent's subjective expectations of what constitutes a poem.
2) Diction: a criterion to evaluate the choice of words used in the generated poetry.
3) Grammar: a criterion to evaluate how good the grammar of the poem is.
4) Unity: a criterion to evaluate the unity between the form and content of the poetry produced.
5) Message/theme: a criterion to evaluate the suitability of the poetry content to the reference article.
6) Expressiveness: a criterion to evaluate the level of expression of the resulting poem.
An overview of the data analysis results of the questionnaire can be seen in Figure 8 (the blue bar represents the 50% keywords-full constraint configuration, the red bar the 100% keywords-full constraint configuration, and the green bar the loose constraint configuration). Each respondent's assessment is transformed to a numeric scale with a range of 0-3 and then accumulated for the six criteria mentioned previously. From the overview we can see that, in general, the 50% keywords-full constraint and 100% keywords-full constraint configurations perform better than the loose constraint configuration on every criterion.

Figure 8. Overview of main experiment results

As can be seen from Figure 8, poems made with the 50% keywords-full constraint and 100% keywords-full constraint configurations have better structure than those made with the loose constraint configuration. Structure is evaluated from the number of lines, the number of syllables, and the rhyme in the poem. We could predict this result, as the full constraint configuration imposes strict rules on the system when composing poems that the loose constraint configuration does not have to obey. The same phenomenon was seen in the unity and message aspects: poems made with the 50% keywords-full constraint and 100% keywords-full constraint configurations seem to convey a message and stay on a specific theme/topic, unlike those made with the loose constraint configuration.
Given the way the keywords are selected, the system is expected to perform well at discussing a specific theme: the keyword ratio constrains the poems to remain on topic, while the loose constraint configuration does not. However, it is important to remember that Pemuisi does not deliberately convey a particular semantic message, as it simply constructs lines of poetry by randomly filling slots (subject to constraints). Thus, we claim that Pemuisi composes poems that can be said to be related to the article rather than faithful to the article. Tables 3 and 4 show the detailed questionnaire responses for the topic and message aspects. 50%-FC stands for 50% keywords-full constraint, 100%-FC for 100% keywords-full constraint, and LC for loose constraint.

        50%-FC        100%-FC       LC
        Topic   Msg   Topic   Msg   Topic   Msg
TA      29%     10%   11%     6%    5%      2%
A       59%     61%   76%     61%   54%     49%
D       10%     25%   11%     31%   34%     42%
TD      1%      4%    2%      3%    7%      8%
TA: Totally Agree; A: Agree; D: Disagree; TD: Totally Disagree
Table 3. The existence of topic and message

        50%-FC        100%-FC       LC
        Topic   Msg   Topic   Msg   Topic   Msg
TA      24%     4%    11%     7%    5%      2%
A       64%     66%   68%     61%   46%     41%
D       12%     28%   20%     31%   42%     49%
TD      1%      2%    1%      1%    7%      8%
TA: Totally Agree; A: Agree; D: Disagree; TD: Totally Disagree
Table 4. The relation of topic and message with the article

The unity between form and content is better under the 50% keywords-full constraint and 100% keywords-full constraint configurations than under the loose constraint configuration; this aspect reflects the unity of poem structure and content. The 50% keywords-full constraint and 100% keywords-full constraint configurations also have a slight lead in the expressiveness aspect. This could be due to the mix of poetic words and keywords regulated by the keyword ratio: while keeping the poem on topic, we allow the system to remain expressive by using poetic words. Finally, an almost tied result is seen in the diction and grammar aspects, with the 50% keywords-full constraint and 100% keywords-full constraint configurations both yielding slightly better results than the loose constraint configuration. We can infer that this is because every configuration uses the same template set, which already guarantees grammaticality.

Pemuisi: Up-to-date Poem Feed
We have developed a web application as a showcase to publish Pemuisi poems at http://budaya.cs.ui.ac.id/pemuisi. The core generation system runs as a background process of the site and is scheduled at noon every day to crawl various news portals. In order to keep Pemuisi up to date with the world situation, it finds a recently published article by looking into the news portals' RSS feeds. The entire preprocessing work is automated. Pemuisi composes a poem consisting of 3-4 stanzas about the chosen news article. As Pemuisi produces all poem combinations that satisfy the given set of constraints, we require fast processing and a relevant poem. We provide seven sets of constraints which represent various kinds of traditional Indonesian poem structure, along with 22 templates and 50 poetic words as static language resources; these constraints and language resources can be extended at any time. The Pemuisi web application also randomly shuffles the order of all the language resources and sets of constraints before generation commences, in order to raise the diversity of the output. The poem produced by Pemuisi is then published to the site page, and the first line of the poem is tweeted by the Pemuisi Twitter account (@pemuisi) along with a link to the page.
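As a rough sketch of this daily crawling step (the feed URL and the compose/publish helpers are hypothetical placeholders, and feedparser is a third-party library, not part of Pemuisi's published code):

import feedparser  # third-party RSS/Atom parser (pip install feedparser)

# Hypothetical feed list; the real system crawls Indonesian news portals.
FEEDS = ["http://news-portal.example/rss"]

def daily_run():
    """Pick a recent article from each feed and publish a poem for it."""
    for url in FEEDS:
        feed = feedparser.parse(url)
        if feed.entries:
            article = feed.entries[0]          # assume newest entry first
            poem = compose_poem(article.link)  # hypothetical pipeline entry point
            publish(poem, article.title)       # hypothetical: site page + tweet

def compose_poem(article_url):
    raise NotImplementedError  # retrieval, combination, and selection stages

def publish(poem, title):
    raise NotImplementedError  # post to the site and tweet the first line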
On the site page, which is connected with Twitter and Facebook, viewers can comment and share their thoughts about the poem on social media.

Conclusions and Future Work
We have developed an automatic poetry generation system that is capable of automatically generating poems in Indonesian based on specific context restrictions defined by existing constraints and reference news articles. The system combines the general architecture of the Full-FACE system introduced in Colton et al. (2012), particularly the aspect that generated poems are based on current news articles, with the explicit treatment of the generation process as a constraint satisfaction problem as in Toivanen et al. (2013) (and to a lesser degree, Manurung (2003)), which affords a declarative formulation of the generation process and the use of efficient off-the-shelf constraint solvers (although our current system uses Prolog, we plan to use purpose-built constraint solvers such as ECLiPSe, http://eclipseclp.org). The main contribution of this work, aside from this combined approach and the adaptation to Indonesian, is the user evaluation that was conducted, as neither Colton et al. (2012) nor Toivanen et al. (2013) present a user evaluation. Lastly, Pemuisi is in effect a much more knowledge-poor system than Full-FACE, as there are far fewer lexical resources available for Indonesian than there are for English, in particular the Jigsaw Bard resource that appears to make a major contribution to the poeticness and coherence of the generated poems. From the experimental results, it was found that when all the implemented constraints are applied, the system is able to produce poetry that is deemed more similar to human-authored poetry than the poetry generated under the loosely-constrained configuration. Such poems were also deemed to have better structure, to focus more on a topic, and to convey the message of the reference article better. Many aspects of the system are still rudimentary, and there are still many opportunities for improvement, such as expanding the types of constraints that can be handled, developing a better user interface, and improving the language resources. A careful qualitative evaluation by poets and other poetry experts would be valuable in order to gain feedback about the output of the system. With the developed web application, viewers can leave comments about the generated poems, providing a channel for collecting information for a deeper analysis of human perception of the generated poems.

2014_12 !2014 Musical Motif Discovery in Non-musical Media Daniel Johnson and Dan Ventura Computer Science Department Brigham Young University Provo, UT 84602 USA daniel.johnson@byu.edu ventura@cs.byu.edu

Abstract. Many music composition algorithms attempt to compose music in a particular style. The resulting music is often impressive and indistinguishable from the style of the training data, but it tends to lack significant innovation. In an effort to increase innovation in the selection of pitches and rhythms, we present a system that discovers musical motifs by coupling machine learning techniques with an inspirational component. Unlike many generative models, the inspirational component allows the composition process to originate outside of what is learned from the training data. Candidate motifs are extracted from non-musical media such as images and audio. Machine learning algorithms select and return the motifs that most resemble the training data.
This process is validated by running it on actual music scores and testing how closely the discovered motifs match the expected motifs. We examine the information content of the discovered motifs by comparing the entropy of the discovered motifs, candidate motifs, and training data. We measure innovation by comparing the probability of the training data and the probability of the discovered motifs given the model.

Introduction
Computational music composition is still in its infancy, and while numerous achievements have already been made, many humans still compose better than computers. Current computational approaches tend to favor one of two compositional goals. The first goal is to produce music that mimics the style of the training data. Approaches with this goal tend to 1) learn a model from a set of training examples and 2) probabilistically generate new music based on the learned model. These approaches effectively produce artefacts that mimic classical music literature, but little thought is directed toward expansion and transformation of the music domain. For example, David Cope (1996) and Dubnov et al. (2003) seek to mimic the style of other composers in their systems. The second goal is to produce music that is radically innovative. These approaches utilize devices such as genetic algorithms (Burton and Vladimirova 1999; Biles 1994) and swarms (Blackwell 2003). While these approaches can theoretically expand the music domain, they often have little grounding in a training data set, and their output often receives little acclaim from either music scholars or average listeners. A large portion of work serves one of these two goals, but not both.

While many computational compositions lack either innovation or grounding, great human composers from the period of common practice and the early 20th century composed with both goals in mind. For instance, Beethoven's music pushes classical boundaries into the beginnings of romanticism. The operas of Wagner bridge the gap between tonality and atonality. Schoenberg's twelve-tone music pushes atonality to a theoretical maximum. Great composers of this period produce highly creative work by extending the boundaries of the musical domain without completely abandoning the common ground of music literature. We must note that some contemporary composers strive to completely reject musico-historical precedent. While this is an admirable cause, we do not share this endeavor. Instead, we seek to compose music that innovates and extends the music of the period of common practice and the early 20th century.

Where do great composers seek inspiration in order to expand these boundaries in a musical way? They find inspiration from many non-musical realms such as nature, religion, relationships, art, and literature. Olivier Messiaen's compositions mimic birdsong and have roots in theology (Bruhn 1997). Claude Debussy is inspired by nature, which becomes apparent by scanning the titles of his pieces, such as La mer [The Ocean], Jardins sous la pluie [Gardens in the Rain], and Les parfums de la nuit [The Scents of the Night]. Debussy's Prélude à l'après-midi d'un faune [Prelude to the Afternoon of a Faun] is a direct response to Stéphane Mallarmé's poem, L'après-midi d'un faune [The Afternoon of a Faun]. Franz Liszt's programme music attempts to tell a story that usually has little to do with music. Many pop musicians are clearly inspired by relationships and social interactions.
While it is essential for a composer to be familiar with music literature, it is apparent that inspiration extends to non-musical sources. We present a computational composition method that serves both of the aforementioned goals rather than only one of them. This method couples machine learning (ML) techniques with an inspirational component, modifying and extending an algorithm introduced by Smith et al. (2012). The ML component maintains grounding in music literature and harnesses innovation by employing the strengths of generative models. It embraces the compositional approach found in the period of common practice and the early 20th century. The inspirational component introduces non-musical ideas and enables innovation beyond the musical training data. The combination of the ML component and the inspirational component allows us to serve both compositional goals.

Media Inspiration
Just as humans often rely on inspiration for their creative work, our motif discovery system relies on non-musical audio files for inspiration. Non-musical audio is a natural starting place for musical inspiration because audio and music both exist in the sound medium. We also generalize one step further by allowing our system to be inspired by other forms of media, specifically images. A human might look at a painting, understand its meaning, and compose a piece of music based on the way he feels about it. He might also feel inspired to compose a piece of music shortly after attending a speech, listening to a bird chirp, watching a movie, or reading poetry. Since computer technology has not yet matched the full capacity of humans in understanding events in the world, we begin with unsophisticated means for extracting musical inspiration from media (our precise methods are described in a later section).

Musical Motifs
We focus on the composition of motifs, the atomic level of musical structure. We use White's definition of a motif, 'the smallest structural unit possessing thematic identity' (1976). There are two reasons for focusing on the motif. First, it is the simplest element for modeling musical structure, and we agree with Cardoso et al. (2009) that success is more likely to be achieved when we start small. Second, it is a natural starting place to achieve global structure based on variations and manipulations of the same motif throughout a composition. Since it is beyond the scope of this research to build a full composition system, we present a motif composer that performs the first compositional step. The motif composer trains an ML model on music files, discovers candidate motifs in non-musical media, and returns the motifs that are most probable according to the ML model built from the training music files. It will be left to future work to combine these motifs into a full composition.

Related Work
A variety of machine learning models have been applied to music composition. Many of these models successfully reproduce credible music in a genre, while others produce music that is radically innovative. Since the innovative component of our algorithm is vastly different from the innovative components of other algorithms, we only review the composition algorithms that effectively mimic musical style. Cope extracts musical signatures, or common patterns, from the works of a composer. These signatures are recombined into a new composition in the same style (1996).
This process effectively replicates the styles of composers, but its novelty is limited to the recombination of already existing signatures. Aside from Cope's work, the remaining relevant literature is divisible into two categories: Markov models and neural networks.

Markov Models
Markov models are perhaps the most obvious choice for representing and generating sequential data such as melodies. The Markov assumption allows inference and learning to be performed simply and quickly on large data sets. However, first-order Markov processes do not store enough information to represent longer musical contexts, while high-order Markov processes require intractable space and time. This issue necessitates a variable order Markov model (VMM), in which variable-length contexts are stored. Dubnov et al. (2003) implement a VMM for modeling music using a prediction suffix tree (PST). A longer context is only stored in the PST when 1) it appears frequently in the data and 2) it differs by a significant factor from similar shorter contexts. This allows the model to remain tractable without losing significant longer contextual dependencies. Begleiter et al. (2004) compare results for several VMMs, including the PST. Their experiments show that Context Tree Weighting (CTW) minimizes log-loss on music prediction tasks better than the PST (and all other VMMs in their experiment). Spiliopoulou and Storkey (2012) propose the Variable-gram Topic model for modeling melodies, which employs a Dirichlet-VMM and is also shown to improve upon other VMMs. Variable order Markov models are not the only extensions explored. Lavrenko and Pickens (2003) apply Markov random fields to polyphonic music; in these models, next-note prediction accuracies improve when compared to a traditional high-order Markov chain. Weiland et al. (2005) apply hierarchical hidden Markov models (HHMMs) in order to capture long-term dependencies in music, using HHMMs to model pitch and rhythm separately. Markov models generate impressive results, but the emissions rely entirely on the training data and a stochastic component. This results in a probabilistic walk through the training space without introducing any actual novelty or inspiration beyond perturbation of the training data.

Neural Networks
Recurrent neural networks (RNNs) are also effective for learning musical structure. However, similar to Markov models, RNNs still struggle to represent long-term dependencies and global structure due to the vanishing gradient problem (Hochreiter et al. 2001). Eck and Schmidhuber (2002; 2008) address the vanishing gradient problem for music composition by applying long short-term memory (LSTM). Chords and melodies are learned using this approach, and realistic jazz music is produced. Smith and Garnett (2012) explore different approaches for modeling long-term structure using hierarchical adaptive resonance theory neural networks; using three hierarchical levels, they demonstrate success in capturing medium-level musical structures. Like Markov models, neural networks can effectively capture both long-term and short-term statistical regularities in music. This allows for music composition in any genre given sufficient training data. However, few (if any) researchers have incorporated inspiration in neural network composition prior to Smith et al. (2012). Thus, we propose a novel technique to address this deficiency.
Traditional ML methods can be coupled with sources of inspiration in order to discover novel motifs that originate outside of the training space. ML models can judge the quality of potential motifs according to learned rules.

Methodology
An ML algorithm is employed to learn a model from a set of music themes. Pitch detection is performed on a non-musical audio file, and a list of candidate motifs is saved. For our purposes, semantic content in the audio files is ignored. The candidate motifs that are most probable according to the ML model are returned. This process is tested using different ML model classes over various audio input files. A high-level system pipeline is shown graphically in Figure 1. In order to generalize the concept of motif discovery from non-musical media, we also extend our algorithm to accept images as inputs. With images, we replace pitch detection with edge detection, and we iterate using a spiral pattern through the image in order to collect notes. This process is further explained in its own subsection.

The training data for this experiment are 9824 monophonic MIDI themes retrieved from The Electronic Dictionary of Musical Themes (http://www.multimedialibrary.com/barlow/all_barlow.asp). The training data consist of themes rather than motifs. We make this decision due to the absence of a good motif data set; an assumption is made that a motif follows the same general rules as a theme, except that it is shorter. In order to better learn statistical regularities from the data set, themes are discarded if they contain at least one pitch interval greater than a major ninth. This results in a final training data set with 9383 musical themes. Themes and motifs are represented using the Phrase class from the jMusic library (http://explodingart.com/jmusic), whose core functionality we also use for reading, writing, and manipulating musical structures.

Machine Learning Models
A total of six ML model classes are tested: four VMMs, an LSTM RNN, and an HMM. These model classes are chosen because they are general, they represent a variety of approaches, and their performance on music data has already been shown to be successful. The four VMMs are Prediction by Partial Match (PPM), Context Tree Weighting (CTW), Probabilistic Suffix Trees (PST), and an improved Lempel-Ziv algorithm named LZ-MS. Begleiter et al. provide an implementation for each of these VMMs (http://www.cs.technion.ac.il/~ronbeg/vmm/code_index.html), an LSTM implementation found on GitHub is used (https://github.com/evolvingstuff/SimpleLSTM), and the HMM implementation is found in the Jahmm library (http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/). Each of the learned ML models is used on both pitches and rhythms separately. Each model contains 128 possible pitches (0-127) and 32 possible note durations (32nd-note multiples up to a whole note). The set of inputs in the RNNs represents which note is played, and the set of outputs represents the next note in the sequence to be played. The RNNs train for a fixed number of iterations before halting. The HMMs are trained using the Baum-Welch algorithm for a fixed number of iterations. The VMMs are trained according to the algorithms presented by Begleiter et al. (2004).
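In code, this dual representation is simply two parallel symbol streams per theme; a minimal sketch with an assumed (pitch, duration) note encoding (not the paper's jMusic structures):

# Assumed encoding: (MIDI pitch 0-127, duration as a 32nd-note multiple 1-32).
theme = [(60, 8), (62, 4), (64, 4), (60, 16)]

pitch_stream = [pitch for pitch, duration in theme]       # trains the pitch model
rhythm_stream = [duration for pitch, duration in theme]   # trains the rhythm model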
Audio Pitch Detection
Our system accepts an audio file as input. Pitch detection is performed on the audio file using an open-source command line utility called Aubio (http://www.aubio.org). More precisely, we use the aubionotes Windows binary from version 0.4.0 of Aubio, schmitt pitch detection, kl onset detection, and a threshold of 0.5. Aubio combines note onset detection and pitch detection in order to output a string of notes, in which each note is comprised of a pitch and a duration. The string of detected notes is processed in order to make the sequence more manageable: given a tempo of 120 beats per minute, note durations are quantized to a 32nd-note value, and note pitches are restricted to MIDI note values in the range [55, 85] by adding or subtracting octaves until each pitch is in range.

Image Edge Detection
Images are also used as inspirational inputs for the motif discovery system. We perform edge detection on an image using a Canny edge detector implementation (http://www.tomgibara.com/computer-vision/canny-edge-detector), which returns a new image comprised of black and white pixels. The white pixels (0 value) represent detected edges, and the black pixels (255 value) represent non-edges. We also convert the original image to a greyscale image and divide each pixel value by two, which changes the range from [0, 255] to [0, 127]. We simultaneously iterate through the edge-detected image and the greyscale image one pixel at a time, using a spiral pattern starting from the outside and working its way inward. For each sequence of b contiguous black pixels (delimited by white pixels) in the edge-detected image, we create one note. The pitch of the note is the average intensity of the corresponding b pixels in the greyscale image, and the duration of the note is b 32nd notes. The pitches are restricted to MIDI note values in the range [55, 85] as they were for pitch-detected sequences. Quantization is not performed for edge-detected sequences, since all of the note durations are already multiples of 32nd notes.
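A minimal sketch of this spiral traversal and note construction, assuming the edge-detected and greyscale images are 2D lists using the pixel conventions above (the function names are ours, not the paper's):

def spiral_indices(h, w):
    """Yield (row, col) pairs in a clockwise spiral from the outer edge inward."""
    top, bottom, left, right = 0, h - 1, 0, w - 1
    while top <= bottom and left <= right:
        for c in range(left, right + 1):
            yield top, c
        for r in range(top + 1, bottom + 1):
            yield r, right
        if top < bottom:
            for c in range(right - 1, left - 1, -1):
                yield bottom, c
        if left < right:
            for r in range(bottom - 1, top, -1):
                yield r, left
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1

def fold_to_range(p, lo=55, hi=85):
    """Shift a pitch by octaves until it lies in [lo, hi]."""
    while p < lo:
        p += 12
    while p > hi:
        p -= 12
    return p

def notes_from_images(edges, grey):
    """Turn each run of contiguous black (255, non-edge) pixels into one note:
    pitch is the mean greyscale intensity (already 0-127) folded into [55, 85],
    and duration is the run length b in 32nd notes."""
    notes, run = [], []
    for r, c in spiral_indices(len(edges), len(edges[0])):
        if edges[r][c] == 255:       # black pixel (non-edge) extends the run
            run.append(grey[r][c])
        elif run:                    # white pixel (edge) delimits the run
            notes.append((fold_to_range(sum(run) // len(run)), len(run)))
            run = []
    if run:                          # flush a trailing run, if any
        notes.append((fold_to_range(sum(run) // len(run)), len(run)))
    return notes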
Motif Discovery
After the string of notes is detected and processed, we extract candidate motifs of various sizes (see Algorithm 1). We define the minimum motif length as l_min and the maximum motif length as l_max. All contiguous motifs of length greater than or equal to l_min and less than or equal to l_max are stored. For our experiments, l_min and l_max are set to 4 and 7 respectively.

Figure 1: A high-level system pipeline for motif discovery. An ML model is trained on pre-processed music themes. Pitch detection is performed on an audio file, or edge detection is performed on an image file, in order to extract a sequence of notes. The sequence of notes is segmented into a set of candidate motifs, and only the most probable motifs according to the ML model are selected.

After the candidate motifs are gathered, the motifs with the highest probability according to the model of the training data are selected (see Algorithm 2). The probabilities are computed in different ways according to which ML model is used. For the HMM, the probability is computed using the forward algorithm. For the VMMs, the probability is computed by multiplying all the transitional probabilities of the notes in the motif. For the RNN, the activation value of the correct output note is used to derive a pseudo-probability for each motif. Pitches and rhythms are learned separately, weighted, and combined to form a single probability. The weightings are necessary in order to give equal consideration to both pitches and rhythms: in our system, a particular pitch is generally less likely than a particular rhythm because there are more pitches to choose from. Thus, the combined probability is defined as

    P_{p+r}(m) = Pr(m_p) N_p^|m| + Pr(m_r) N_r^|m|    (1)

where m is a motif, m_p is the motif pitch sequence, m_r is the motif rhythm sequence, and N_p and N_r are constants with N_p > N_r. In this paper we set N_p = 60 and N_r = 4. The resulting value is not a true probability because it can be greater than 1.0, but this is not significant because we are only interested in the relative probability of motifs. For convenience, in what follows, we will use the simpler notation Pr(m) as a shorthand for P_{p+r}(m), as well as the conditional notation Pr(m|M) as a shorthand for P_{p+r}(m|M), where P_{p+r}(m|M) is computed as in Eq. 1, replacing the independent probabilities with their respective conditional counterparts.

Since shorter motifs are naturally more probable than longer motifs, an additional normalization step is taken in Algorithm 2. We would like each motif length to have equal probability:

    P_equal = 1 / (l_max − l_min + 1)    (2)

Since the probability of a generative model emitting a motif of length l is

    P(l) = Σ_{m ∈ C, |m| = l} Pr(m|model)    (3)

we introduce a length-dependent normalization term that equalizes the probability of selecting motifs of various lengths:

    norm(l) = P_equal / P(l)    (4)

This normalization term is used in step 5 of Algorithm 2.

Algorithm 1 extract_candidate_motifs
1: Input: notes, l_min, l_max
2: candidate_motifs ← {}
3: for l_min ≤ l ≤ l_max do
4:   for 0 ≤ i ≤ |notes| − l do
5:     motif ← (notes_i, notes_{i+1}, ..., notes_{i+l−1})
6:     candidate_motifs ← candidate_motifs ∪ motif
7: return candidate_motifs

Algorithm 2 discover_best_motifs
1: Input: notes, model, num_motifs, l_min, l_max
2: C ← extract_candidate_motifs(notes, l_min, l_max)
3: best_motifs ← {}
4: while |best_motifs| < num_motifs do
5:   m* ← argmax_{m ∈ C} [norm(|m|) Pr(m|model)]
6:   best_motifs ← best_motifs ∪ m*
7: return best_motifs
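A minimal runnable Python rendering of Algorithms 1 and 2, assuming a model object exposing a prob(motif) method that implements Pr(m|model); removing each selected motif from the candidate pool is implied by the pseudocode's repeated argmax:

def extract_candidate_motifs(notes, l_min=4, l_max=7):
    """Algorithm 1: all contiguous subsequences of length l_min..l_max."""
    return [tuple(notes[i:i + l])
            for l in range(l_min, l_max + 1)
            for i in range(len(notes) - l + 1)]

def discover_best_motifs(notes, model, num_motifs, l_min=4, l_max=7):
    """Algorithm 2: greedily select the most probable length-normalized motifs."""
    C = set(extract_candidate_motifs(notes, l_min, l_max))
    p_equal = 1.0 / (l_max - l_min + 1)                    # Eq. 2
    P = {l: sum(model.prob(m) for m in C if len(m) == l)   # Eq. 3
         for l in range(l_min, l_max + 1)}
    norm = {l: p_equal / P[l] for l in P}                  # Eq. 4
    best = []
    while C and len(best) < num_motifs:
        m = max(C, key=lambda m: norm[len(m)] * model.prob(m))
        best.append(m)
        C.discard(m)  # avoid re-selecting the same motif
    return best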
Validation and Results
We perform three stages of validation for this system. First, we compare the entropy of pitch-detected and edge-detected music sequences to comparable random sequences, as a baseline sanity check of whether images and audio are better sources of inspiration than random processes. Second, we run our motif discovery system on real music scores instead of media, and we validate the motif discovery process by comparing the discovered motifs to hand-annotated themes for the piece of music. Third, we evaluate the structural value of the motifs by comparing the entropy of the discovered motifs, candidate motifs, and themes in the training set. We also measure the amount of innovation in the motifs by comparing the probability of the selected motifs against the probability of the training themes according to the learned ML model.

Preliminary Evaluation of Inspirational Sources
Although pitch detection is intended primarily for monophonic music signals, interesting results are still obtained on non-musical audio signals. Additionally, interesting musical inspiration can be obtained from image files. We performed some preliminary work on fifteen audio files and fifteen image files and found that these pitch-detected and edge-detected sequences were better inspirational sources than random processes. This evaluation was performed as a sanity check; we did not select motifs or use machine learning at this stage. Instead, we compared the entropy (see Equation 5) of pitch-detected and edge-detected sequences against comparable random sequences and found that there was more rhythm and pitch regularity in the pitch-detected and edge-detected sequences. In our data, the sample space of the random variable X is either a set of pitches or a set of rhythms, so Pr(x_i) is the probability of observing a particular pitch or rhythm:

    H(X) = −Σ_{i=1}^{n} Pr(x_i) log_b Pr(x_i)    (5)

More precisely, for each of these sequences we found the sequence length, the minimum pitch, maximum pitch, minimum note duration, and maximum note duration. Then we created a sequence of notes from two uniform random distributions (one for pitch and one for rhythm) with the same length, minimum pitch, maximum pitch, minimum note duration, and maximum note duration. The average pitch and rhythm entropy measures were lower for pitch-detected and edge-detected sequences. A homoscedastic, two-tailed Student's t-test on the data shows statistical significance, with p-values of 1 × 10⁻⁵ for pitches from images, 1 × 10⁻²³ for rhythms from images, and 0.0003 for rhythms from audio files. In addition, although the p-value for pitches from audio files is not statistically significant (0.175), it is still fairly low. This suggests that there is potential for interesting musical content (Wiggins, Pearce, and Mullensiefen 2009) in the pitch-detected and edge-detected sequences, even though the sequences originate from non-musical sources.
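The entropy comparison can be sketched as follows, with base b = 2 and a uniform random baseline matched on length and value range, as described; the helper names are ours:

import random
from collections import Counter
from math import log2

def entropy(symbols):
    """Eq. 5 with b = 2: Shannon entropy of an observed symbol sequence."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def matched_random(seq):
    """Uniform random sequence with the same length and value range."""
    lo, hi = min(seq), max(seq)
    return [random.randint(lo, hi) for _ in seq]

pitches = [60, 62, 64, 62, 60, 60, 67]  # an illustrative detected sequence
print(entropy(pitches), entropy(matched_random(pitches)))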
Figure 2: An example of a motif inside the theme and a motif outside the theme for a piece of music. The average normalized probability of the motifs inside the theme is compared to the average normalized probability of the motifs outside the theme.

Evaluation of Motif Discovery Process
A test set consists of 15 full music scores with one or more hand-annotated themes for each score. The full scores are fetched from KernScores (http://kern.ccarh.org/), and the corresponding themes are removed from the training data set (taken from the aforementioned Electronic Dictionary of Musical Themes). Each theme effectively serves as a hand-annotated characteristic theme from a full score of music. This process is done manually due to the incongruence of KernScores and The Electronic Dictionary of Musical Themes; in order to ensure an accurate mapping, full scores and themes are matched up according to careful inspection of their titles and contents. We attempt to choose a variety of different styles and time periods in order to adequately represent the training data. For each score in the test set, candidate motifs are gathered into a set C by iterating through the full score, one part at a time, using a sliding window from size l_min to l_max. This is the same process used to gather candidate motifs from audio and image files. C is then split into two disjoint sets, where C_t contains all the motifs that are subsequences of the matching theme(s) for the score, and C_{−t} contains the remaining motifs. See Figure 2 for a visual example of motifs that are found inside and outside of the theme. A statistic Q is computed which represents the mean normalized probability of the motifs in a set S given a model M:

    Q(S|M) = (Σ_{m ∈ S} norm(|m|) Pr(m|M)) / |S|    (6)

Q(C_t|M) informs us about the probability of thematic motifs being extracted by the motif discovery system. Q(C_{−t}|M) informs us about the probability of non-thematic motifs being discovered. A metric U is computed in order to measure the ability of the motif discovery system to discover desirable motifs:

    U(C|M) = (Q(C_t|M) − Q(C_{−t}|M)) / min{Q(C_t|M), Q(C_{−t}|M)}    (7)

U is larger than zero if the discovery process successfully identifies motifs that have motivic or thematic qualities according to the hand-labeled themes. Given our collected set T of 9383 themes, we use leave-one-out cross-validation on a set V of music scores and their hand-labeled themes in order to fine-tune the ML model class hyperparameters to maximize U, as shown in Algorithm 3. For each score s ∈ V, we learn an ML model M_p from the model class M using T \ s as training data (line 7), and using the learned model we calculate the average U value for the set V (lines 8-9). We perform this validation under various hyperparameter configurations for all s ∈ V for each ML model class (lines 2-6). After this is done, we select the hyperparameter configuration that results in the highest average value for U (lines 10-13). Finally, after these hyperparameters are tuned, we calculate U over a separate test set S of scores and themes (disjoint from V) for each model class (lines 14-17).

Algorithm 3 evaluate_discovery_process
T is the set of all 9383 themes; V and S are sets of scores. Each r ∈ V contains a set of themes {t_1...t_n}, t_i ∈ T, and each s ∈ S contains a set of themes {u_1...u_k}, u_i ∈ T; V ∩ S = ∅, and ∀s ∈ S and ∀r ∈ V, s ∩ r = ∅.
1: Input: T, V, S
2: for each ML model class M do
3:   best = −∞
4:   for each setting p of M's hyperparameters do
5:     ave = 0
6:     for each score s ∈ V do
7:       learn M_p using T \ s as training data
8:       ave = ave + U(s|M_p)
9:     ave = ave / |V|
10:    if ave > best then
11:      best = ave
12:      p_best = p
13:  p*_M = p_best
14: for each ML model class M do
15:   for each score r ∈ S do
16:     learn M_{p*_M} using T \ r as training data
17:     results ← U(r|M_{p*_M})
18: return results

Algorithm 4 evaluate_motif_quality
T is the set of all 9383 themes, F is a non-musical (inspirational) media file, and M_{p*_M} is a learned model.
1: Input: T, F, M_{p*_M}
2: allmotifs ← extract candidate motifs from T
3: H_m = average entropy(allmotifs)
4: candidates ← extract candidate motifs from F
5: H_c = average entropy(candidates)
6: best ← discover best motifs from candidates using model M_{p*_M}
7: H_b = average entropy(best)
8: results ← R(T, best|M_{p*_M})
9: return H_m, H_c, H_b, results

The results are shown in Table 1.

Table 1: U values for various score inputs and ML model classes. Positive U values show that the average normalized probability of motifs inside themes is higher than the same probability for motifs outside themes, suggesting that the motif discovery system is able to detect differences between thematic motifs and non-thematic motifs.

Score File Name              CTW      HMM      LSTM      LZMS     PPM      PST
BachBook1Fugue15.krn         4.405    4.015    3.047     2.896    11.657   4.951
BachInvention12.krn         -2.585   -5.609    26.699    1.078    0.534    13.191
BeethovenSonata13-2.krn      1.065   -0.145    7.769     8.876    4.973    9.182
BeethovenSonata6-3.krn      -0.715   -5.320    2.874     0.832    1.283    4.801
ChopinMazurka41-1.krn        6.902    0.808   -7.690     3.057    18.965  -24.363
Corelli5-8-2.krn            -6.398   -1.270   -0.692    -2.395   -1.166    1.690
Grieg43-2.krn                2.366    1.991   -2.622     0.857    8.800   -7.740
Haydn33-3-4.krn             14.370    2.370    1.189     6.155    8.475    0.841
Haydn64-6-2.krn              1.266    2.560   -1.092     0.855    1.809   -0.133
LisztBallade2.krn           -0.763   -0.610   -1.754    -0.046    1.226    0.895
MozartK331-3.krn             0.838    0.912    3.829     0.756    3.222    5.413
MozartK387-4.krn            -4.227   -0.082  -91.960    -2.127   -3.453  -31.614
SchubertImpromptuGFlat.krn  49.132    3.169    0.790     8.985    59.336   1.122
SchumannSymphony3-4.krn      0.666    2.825   -2.154     0.289    1.560   -6.830
Vivaldi3-6-1.krn             7.034    2.905    0.555     7.055    9.633   -0.367
Average                      4.890    0.568   -4.081     2.475    8.457   -1.931

Given the data in the table, a case can be made that certain ML model classes can effectively discover thematic motifs with a higher probability than other motif candidates. Four of the six ML model classes have an average U value above zero, meaning that an average theme is more likely to be discovered than an average non-theme for these four classes. PPM and CTW have the highest average U values over the test set. LSTM has the worst average, but this is largely due to one outlier of -91.960; similarly, PST performs poorly mostly due to two outliers of -24.363 and -31.614. Except for LSTM and PST, all of the models are fairly robust, keeping negative U values to a minimum.
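In code, Q and U reduce to a few lines, reusing the assumed model.prob method and the norm table from the earlier sketch:

def Q(motifs, model, norm):
    """Eq. 6: mean length-normalized probability of a set of motifs."""
    return sum(norm[len(m)] * model.prob(m) for m in motifs) / len(motifs)

def U(C_t, C_not_t, model, norm):
    """Eq. 7: positive when thematic motifs are more probable than the rest."""
    q_t, q_nt = Q(C_t, model, norm), Q(C_not_t, model, norm)
    return (q_t - q_nt) / min(q_t, q_nt)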
Evaluation of Structural Quality of Motifs
We also evaluate both the information content and the level of innovation of the discovered motifs, as shown in Algorithm 4. First, we measure the information content by computing entropy as we did before. We compare the entropy of the discovered motifs (lines 6-7) to the entropy of the candidate motifs (lines 4-5). We also segment the actual music themes from the training set into a set of motifs using Algorithm 1, and we add the entropy of these motifs to the comparison (lines 2-3). In order to ensure a fair comparison, we perform a sampling procedure which requires each set of samples to contain the same proportions of motif lengths, so that our entropy calculation is not biased by the length of the motifs sampled. The results for two image input files and two audio input files are displayed in Table 2, with each column for each input file being the result of running Algorithm 4 twice, once for pitch and once for rhythm. The images and audio files are chosen for their textural and aural variety, and their statistics are representative of other files we tested. Bioplazm2.jpg is a computer-generated fractal while Landscape.jpg is a photograph; Lightsabers.wav is a sound effect from the movie Star Wars while GalwayKinnell-Neverland.wav is a recording of a person reading poetry.

The results are generally as one would expect. The average pitch entropy is always lowest on the training theme motifs; it is higher for the discovered motifs, and higher again for the candidate motifs. With the exception of Landscape.jpg, the average rhythm entropy follows the same pattern as pitch entropy for each input. One surprising observation is that the rhythm entropy for some of the ML model classes is sometimes higher for the discovered motifs than it is for the candidate motifs. This suggests that thematic rhythms are often less predictable than non-thematic rhythms. However, the pitch entropy almost always tends to be lower for the discovered motifs than the candidate motifs, suggesting that thematic pitches tend to be more predictable.

Next, we measure the level of innovation of the best motifs discovered (line 8). We do this by taking a metric R (similar to U) over two Q statistics (see Equation 6), where A is the set of 9383 themes from the training database and E is the set of discovered motifs:

    R(A, E|M) = (Q(A|M) − Q(E|M)) / min{Q(A|M), Q(E|M)}    (8)

When R is greater than zero, A is more likely than E given the ML model M. In this case, we assume that there is a different model that would better represent E. If there is a better model for E, then E must be novel to some degree when compared to A. Thus, if R is greater than zero, we infer that E innovates from A.
The R results for the same four input files are shown along with the entropy statistics in Table 2. Except for PPM, all of the ML model classes produce R values greater than zero for each of the four inputs.

Table 2: Entropy and R values for various inputs. We measure the pitch and rhythm entropy of motifs extracted from the training set, the best motifs discovered, and all of the candidate motifs extracted. On average, the entropy increases from the training motifs to the discovered motifs, and it increases again from the discovered motifs to the candidate motifs. The R values are positive when the training motifs are more probable according to the model than the discovered motifs; higher R values represent higher amounts of innovation from the training data.

Bioplazm2.jpg                     CTW     HMM     LSTM    LZMS    PPM     PST     Average
pitch entropy training motifs     1.894   1.979   1.818   1.816   1.711   1.536   1.793
pitch entropy discovered motifs   2.393   2.426   1.944   1.731   2.057   1.759   2.052
pitch entropy candidate motifs    2.217   2.328   2.097   2.104   1.958   1.784   2.081
rhythm entropy training motifs    1.009   1.051   0.976   0.970   0.927   0.822   0.959
rhythm entropy discovered motifs  2.110   2.295   1.789   2.212   0.684   1.515   1.767
rhythm entropy candidate motifs   2.387   2.466   2.310   2.309   2.132   1.934   2.256
R                                 7.567   13.296  20.667  4.603   -0.276  7.643   8.917

Landscape.jpg                     CTW     HMM     LSTM    LZMS    PPM     PST     Average
pitch entropy training motifs     1.894   1.979   1.818   1.816   1.711   1.536   1.793
pitch entropy discovered motifs   1.974   2.074   2.143   1.833   2.027   1.675   1.954
pitch entropy candidate motifs    2.429   2.531   2.598   2.341   2.271   2.028   2.367
rhythm entropy training motifs    1.009   1.051   0.976   0.970   0.927   0.822   0.959
rhythm entropy discovered motifs  1.984   1.863   2.175   1.983   0.727   1.455   1.698
rhythm entropy candidate motifs   1.549   1.712   1.810   1.509   1.396   1.329   1.551
R                                 0.805   0.236   1.601   0.429   4.624   1.283   1.496

Lightsabers.wav                   CTW     HMM     LSTM    LZMS    PPM     PST     Average
pitch entropy training motifs     1.894   1.979   1.818   1.816   1.711   1.536   1.793
pitch entropy discovered motifs   2.076   1.884   1.881   1.652   2.024   1.586   1.850
pitch entropy candidate motifs    2.225   2.097   2.217   1.876   2.115   1.755   2.048
rhythm entropy training motifs    1.009   1.051   0.976   0.970   0.927   0.822   0.959
rhythm entropy discovered motifs  1.534   1.309   2.024   1.623   0.860   1.225   1.429
rhythm entropy candidate motifs   1.540   1.524   1.541   1.502   1.548   1.276   1.489
R                                 5.637   0.793   27.227  4.812   6.768   7.540   8.796

GalwayKinnell-Neverland.wav       CTW     HMM     LSTM    LZMS    PPM     PST     Average
pitch entropy training motifs     1.894   1.979   1.818   1.816   1.711   1.536   1.793
pitch entropy discovered motifs   1.823   2.480   2.132   1.773   1.997   1.701   1.984
pitch entropy candidate motifs    2.153   2.248   2.250   2.141   2.242   1.839   2.146
rhythm entropy training motifs    1.009   1.051   0.976   0.970   0.927   0.822   0.959
rhythm entropy discovered motifs  1.550   1.587   1.560   1.779   0.289   1.128   1.315
rhythm entropy candidate motifs   1.472   1.469   1.471   1.477   1.469   1.226   1.431
R                                 1.520   10.163  24.968  4.283   0.257   6.865   8.010

While statistical metrics provide some useful evaluation in computationally creative systems, listening to the motif outputs and viewing their musical notation also provide valuable insights for this system. We include six musical notations of motifs discovered by this system in Figure 3, and we invite the reader to listen to sample outputs at http://axon.cs.byu.edu/motif-discovery.

Conclusion and Future Work
The motif discovery system in this paper composes musical motifs that demonstrate both innovation and value. We show that our system innovates from the training data by extracting candidate motifs from an inspirational source without generating data from a probabilistic model; this claim is validated by observing high R values. Additionally, the motif discovery system maintains compositional value by grounding it in a training data set. The motif discovery process is tested by running it on actual music scores instead of audio and image files, and the results show that motifs found inside of themes are on average more likely to be discovered than motifs found outside of themes. Improvements and modifications can be made in the analysis and methodology of our system. We are currently preparing another manuscript which evaluates the difference between motifs discovered by our system and comparable random motifs; the results show that using (non-musical) media as inspiration for the motif discovery process is more efficient at producing musical motifs than randomly generating reasonable motifs. The discovered motifs are the contribution of this system. While the work presented here is a proof of concept for the use of non-musical media sources as inspiration in creating musical motifs, more sophisticated techniques should be explored. In the future, we plan to utilize machine vision to extract meaning from images; we plan to study saccades from human subjects on various images in order to train the computer to see them in a more human, natural way; and we plan to incorporate digital signal analysis on audio files in order to hear audio more like a human would hear it. (While it is certainly not necessary for a computer to be inspired in the same way as a human might be, if the goal is to compose music that people can appreciate, it seems worthwhile to explore human-centric models of musical inspiration.) In addition to improving the motif creation process, future work will investigate combining these motifs, adding harmonization, and creating full compositions. This work is simply the first step in a novel composition system. While there are a number of directions to take with this system as a starting point, we are inclined to compose from the bottom up.
Longer themes can be constructed by combining the motifs from this system using evolutionary or other approaches. Once a set of themes is created, then phrases, sections, and multiple voices can be composed in a similar manner. Contrastingly, another system could compose from the top down, composing the higher-level features first and using the motifs from this system as the lower-level building blocks. This system could also be extended by including additional modes of inspirational input, such as text or video. Our intent is for this system to be the starting point for an innovative, high-quality, well-structured system that composes pieces which a human observer could call creative.

2014_13 !2014 Non-Conformant Harmonization: the Real Book in the Style of Take 6 François Pachet, Pierre Roy Sony CSL Paris, France pachetcsl@gmail.com

Abstract. We address the problem of automatically harmonizing a leadsheet in the style of any arranger. We model the arranging style as a Markov model estimated from a corpus of non-annotated MIDI files. We consider a vertical approach to harmonization, in which chords are all taken from the arranger corpus. We show that standard Markov models, using various vertical viewpoints, are not adapted for such a task, because the problem is basically over-constrained.
We propose the concept of fioriture to better capture the subtleties of an arranging style. Fioritures are ornaments of the given melody during which the arranging style can be expressed more freely than for melody notes. Fioritures are defined as random walks with unary constraints and can be implemented with the technique of Markov constraints. We claim that fioritures lead to musically more interesting harmonizations than previous approaches and discuss why. We focus on the style of Take 6, arguably the most sophisticated arranging style in the jazz genre, and we demonstrate the validity of our approach by harmonizing a large corpus of standard leadsheets.

Introduction
Automatic harmonization has been addressed for decades by computer music research (see Steels, 1986 for an early attempt at machine learning of harmonization and Fernandez and Vico, 2013 for a survey). One reason for the success of this problem in the research community is that it can be considered, in first approximation, as a well-defined problem, a crown jewel in computer music. Automatic harmonization denotes in practice many different problems, depending on the nature of the input (melody, chord labels, bass, song structure given or not) and of the output (chord labels, chord realizations, contrapuntal voices), the constraints concerning the nature of the targeted harmonization (number of voices), and the way the targeted style is modeled (programmed explicitly or learned from examples). A widely studied variant of the automatic harmonization problem is the generation of a four-part (or more) harmonization of a given melody. Such a problem has been tackled in a variety of contexts, though mostly for classical music, Bach chorales in particular, and using virtually all the technologies available, including rules, functions (Koops et al. 2013), grammars, constraints (Anders and Miranda, 2011), and statistical models of all types (Paiement et al. 2006). Today, there are many approaches that work satisfactorily to produce harmonizations in the Classical style with reasonable musical quality. It is remarkable that automatic harmonization has achieved such a status of well-definedness that many papers in this domain consist in variations of existing algorithms, with little or no musical output (a sign, probably, of the maturity of the field). However, there is no system, to our knowledge, that is able to produce truly musically interesting harmonizations, at least for the ears of musically trained listeners such as the first author of this paper. In the context of computational creativity, we claim that there are two problems with the current state of the art which limit its quality, and therefore its potential for generating creative outputs: excess of conformance and excess of agnosticism.

Conformance. Automatic harmonization has so far been envisaged solely from the viewpoint of harmonic conformance: the main criterion of success is that the generated material has to conform to the harmonic constraints of the problem. For instance, a harmonic label of C minor (either imposed or inferred from, say, a soprano) should produce chord realizations that conform to C minor, for instance chords composed of important notes of the scale. Conformance indeed yields a well-defined measure to evaluate systems, because there are well-defined harmonic distances (see Section Harmonic Distance), but it tends to get in the way of creativity, since the best a system can do is to paraphrase harmonic labels.
Such a skill can be impressive for non-musicians, but not for experts. Consequently, many harmonization systems give the impression that they are essentially filling the blanks (inner voices) with correct but uninteresting musical excipients. This is sometimes referred to as the correct versus good problem, but in fact, such harmonizers are basically unable to produce interesting solutions, because of their excess of conformance. Agnosticism (excess of generality). Most works, with the exception of (Ebcioglu, 1986), attempt to model a given style using general methods (such as Markov models, rules, etc.). General methods can be good in general, but are rarely very good in particular. Similarly to the famous glass ceiling problem occurring in MIR (Casey et al., 2008), there seems to be a glass ceiling concerning the musical quality of automatic harmonization. In our view this is caused by the use of overly general methods and by the absence of consideration for the details of what makes a specific style interesting or creative. Most often, these details are not captured by general methods. In this study, we focus on the harmonization style of the American six-voice a cappella band Take 6. Take 6 is the most awarded vocal group in history. Since their first two albums (Take 6, 1988; 1990) they have renewed the genre of gospel, barbershop-like harmonization by pushing it to its harmonic and vocal limits. Their style of arranging is unanimously considered extraordinarily inventive, recognizable, and very difficult to imitate. Even the transcription of their performances is a very difficult task that only harmony experts can perform correctly (see Section Acknowledgements). Most of their works consist in 6-voice note-to-note harmonization of traditional songs, with many dissonances and bold voice movements typical of jazz big bands. The creativity of Take 6, if any, consists precisely in the use of those dissonances and digressions. Of course, their style and specificity are arguably also dependent on the quality of the singing voices (notably the bass), but this dimension is outside the scope of this paper, and we consider here only the symbolic aspects of their arranging style. Most knowledgeable listeners of Take 6 enjoy wow effects due to their spectacular use of harmonic surprises. Figure 1 shows an excerpt of a harmonization by Take 6 of the traditional Hark the Herald Angels Sing. Figure 2 shows an estimation of the corresponding excerpt of the leadsheet (end of section A). It can be seen clearly that the chords used to harmonize the note Bb do not conform to the expected harmony of Bb major: although the performances of Take 6 are not labeled, we can estimate the last realization of the Bb as an instance of a C7dim9#11 (C E G Bb Db F#), which is very far from the expected Bb major scale, or from any scale close by (such as relative minors). Such a harmonic surprise is typical of the style of Take 6. By definition, conformant methods in automatic harmonization are not able to capture this kind of knowledge, especially from non-labeled training data. Our goal is to produce six-voice harmonizations in that style that trigger the same kinds of wow effects as the originals. The key idea of our approach is that most wow effects are obtained by non-conformant harmonizations, i.e., harmonizations that do not conform to the harmonic labels of the original leadsheet, but stay within well-defined constraints.
The technical claim of this paper is that the technology of Markov constraints (Pachet et al., 2011) is particularly well suited to such a task, thanks to the possibility of generating creative sequences within well-defined constraints. Problem Statement The problem we address constitutes a variation on standard harmonization problems such as melody-given or bass-given. It can be defined in terms of inputs/outputs as follows: Inputs: -A leadsheet representing the target melody to harmonize, as well as chord labels in a known syntax (i.e., we know their pitch constituents), -A harmonization style represented by a set of non-annotated scores containing polyphonic content. No annotation of these scores is needed. In practice, arbitrary MIDI files may be used, including files without a fixed tempo coming from, e.g., recordings of real-time performances. The expected output is a fully harmonized score in the given style, i.e. a polyphonic score that maintains the soprano of the leadsheet, and whose harmonies fit with the leadsheet chord labels. Figure 1. Example of a typical non-conformant harmonization by Take 6. Harmonies used (estimated from the score) go from Bb (which conforms to the leadsheet) to a surprising, non-conformant C7dim9#11 (transcription by A. Dessein). Figure 2. Extract of a leadsheet for Hark the Herald Angels Sing (end of section A). The last Bb is supposed to be harmonized in Bb (shortcut for Bb major). Musically, the goal is to produce a harmonization that is reminiscent of the style, i.e., such that knowledgeable listeners can recognize the authors. However, this is not a well-defined problem, for several reasons: listeners may not recognize a style because they do not know the arranger well enough, or because they give more importance to the sound than to the notes, or for many other reasons, including that the arranger may not have any definite style per se. In this paper, we do not attempt to solve the harmonization problem in many styles (though the system can, as exemplified in Section Applications to Other Styles). Rather, we attempt to convince ourselves, as knowledgeable Take 6 listeners, that our system grasps some of their subtle arranging tricks and reproduces them in unknown situations. A scientific evaluation of the system based on style recognition is in progress but is not the subject matter of this paper. Corpora Used The experiments we describe use a comprehensive database of jazz leadsheets described in (Pachet et al., 2013). For each leadsheet we have a melody (a monophonic sequence of notes) and chord labels. For each chord label, the database provides the set of pitch-classes of the chord, in ascending order (that is, the formal definition of the chord, not its realization). In this study we used the Real Book (illegal edition), the most widely used jazz fake book. The Real Book contains about 400 songs, 397 of which are parsed correctly (a few songs with no harmony or no melody are ruled out, for instance). For the harmonization style, we have selected a number of composers, including classical ones (Wagner, Debussy, etc.) and jazz (Take 6 notably, and Bill Evans). Each composer is represented by a set of MIDI files of some of their compositions or performances. All MIDI files have been found on the web, except the 10 MIDI files of Take 6, which were provided to us by a human transcriber (A. Dessein). The Take 6 MIDI files are of excellent quality (i.e., there are virtually no transcription errors). The other MIDI files are of varying quality.
Some of them correspond to actual scores (Wagner), others to performances (Bill Evans). In order to cope with the diversity of tonalities and pitch ranges encountered in the leadsheet melodies, we have systematically transposed the corpus into all 12 keys. Homophonic Harmonization The approach we follow consists in considering the harmonization problem as a vertical problem, as opposed to voice-leading approaches (such as Whorley et al., 2013), and following an older tradition initiated by (Pachet and Roy, 1995) on constraint-based 4-voice harmonization in the Classical style. To compensate for the monotony of strict vertical harmonization, we complement this step with a smoothing procedure that somehow re-establishes voice-leading a posteriori from the vertical skeleton structure, by joining contiguous notes with the same pitch. This second step is completely deterministic, and the central issue we address is the production of the chordal skeleton. Before describing the harmonization process, we introduce a measure of harmonic conformance, which is at the core of the whole process. Harmonic Conformance Because the scores of arrangers are not labeled, we need a way to relate chord realizations found in the arranger corpus to chord labels of a leadsheet. In order to avoid the pitfalls of chord recognition (which works well for simple chords, but much less well for the complex chords found in jazz), we use a simple but robust measure of the harmonic conformance between unlabeled chords. This measure, called α-conformance, is based on pitch class histograms. For any chord realization C, i.e., a set of MIDI pitches, we build a pitch class frequency count as an array of 12 integers, where each integer represents the number of occurrences of the corresponding pitch class in the chord (starting with C up to B), normalized by the total number of pitches. For instance, the circled chord in Figure 1 has a pitch-class frequency count with a count of 1 for each of its pitch classes. The histogram $h_C$ is the frequency count $f_C$ divided by its module: $h_C = f_C / \|f_C\|$. The harmonic distance between two chords $C_1$ and $C_2$ can then be defined from the scalar product of the pitch class histograms: $d(C_1, C_2) = 1 - h_{C_1} \cdot h_{C_2}$, where $h_{C_1}$ (resp. $h_{C_2}$) is the pitch-class histogram of chord $C_1$ (resp. $C_2$). Such a distance takes its values in $[0, 1]$. In practice, this distance enables us to categorize chord realizations appearing in the arranger corpus with regard to a given chord label. For each chord label, we can define an ideal prototype consisting of its pitch class definition, and then consider the ball centered on this ideal prototype, of radius α. α represents the harmonic conformance of a chord realization to a chord label. Increasing values of α provide increasingly large sets of chords that are more or less conformant to the label. Figure 3 shows examples of chords at various distances to C7 for various values of α in the Take 6 corpus. Another way to relate chord realizations to chord labels is to consider the best match for a given corpus: the chord in the arranger corpus with the minimal harmonic distance to the ideal realization of the label. We then consider the ball centered around this best match, of radius α. In any case, pitch class histograms provide us with a robust way to fetch chord realizations for any chord label, in non-annotated corpora.
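To make the measure concrete, here is a minimal sketch in Python. Since the formula is described only in prose, the sketch assumes the distance is one minus the scalar product of the two normalised pitch-class histograms, so that identical pitch-class content gives 0 and disjoint content gives 1; all function names and the example chord are illustrative, not taken from the paper.

```python
from collections import Counter

def pc_histogram(pitches):
    """Normalised pitch-class histogram of a chord given as MIDI pitches."""
    counts = Counter(p % 12 for p in pitches)
    vec = [counts.get(pc, 0) for pc in range(12)]
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec

def harmonic_distance(chord_a, chord_b):
    """One minus the scalar product of the two histograms: 0 for identical
    pitch-class content, 1 for disjoint content (an assumed reconstruction)."""
    ha, hb = pc_histogram(chord_a), pc_histogram(chord_b)
    return 1.0 - sum(a * b for a, b in zip(ha, hb))

def alpha_conformant(realization, label_pitch_classes, alpha):
    """True if the realization lies inside the ball of radius alpha centred
    on the ideal prototype of the chord label (its bare pitch-class set)."""
    return harmonic_distance(realization, label_pitch_classes) <= alpha

# Example: is C-E-G-Bb-D (a C9 voicing) within alpha = 0.15 of a C7 label?
print(alpha_conformant([48, 52, 55, 58, 62], [0, 4, 7, 10], 0.15))
```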
Unary Markov Constraints Equipped with a harmonic distance, we can generate new chordal skeletons. The idea is to estimate a Markov model of the sequences of chord realizations from the arranger corpus. The leadsheet (soprano movement and chord labels) is represented as a set of unary constraints holding on the sequence to generate. The framework of Markov constraints (Pachet et al., 2011) is precisely designed to handle such cases, and provides an efficient algorithm to generate those sequences, as well as a guarantee that all sequences satisfying the constraints will be found, with their correct probability in the original model. Solving a Markov constraint problem is strictly equivalent to sampling the sequences in the space of solutions. Each sequence $s = s_1, \ldots, s_n$ has a probability $p(s) = p(s_1) \cdot p(s_2 \mid s_1) \cdots p(s_n \mid s_{n-1})$ according to the considered Markov model (see next section). The unary Markov constraint algorithms guarantee that all sequences satisfying the constraints are drawn with their probability in the original model. Figure 3. Various chord realizations from the Take 6 corpus for several values of α (0.01, 0.1 and 0.2), representing increasing harmonic distance to a C7 chord label. As α increases, more notes outside of the legal notes of C7 (C, E, G, Bb) are added. For the maximum distance, all possible chords of the corpus are considered. In practice, reasonable, conformant realizations lie within a distance of about 0.15. Viewpoints Such a process raises an important issue concerning the choice of the viewpoint, i.e. the actual data used to estimate the Markov model. The most demanding viewpoint is the actual set of notes (MIDI pitches) of the chord. This is called here the Identity viewpoint, since it contains all the information we have on a chord. Degraded viewpoints are also considered: BassTenorSoprano is the viewpoint consisting of the bass, tenor and soprano pitches (and ignoring the others). We define similarly the BassSoprano and Soprano viewpoints. For the sake of comparison, we also introduce the Constant viewpoint, which assigns a constant value to any chord (and serves as a baseline for our experiments). Note that we do not consider duration information, as we do not want to rely on the quality of the MIDI files. Of course there is a tradeoff here between 1) harmonic conformance, represented here by α, and 2) style conformance, which manifests itself in the presence of chord transitions that actually occurred in the corpus. Such a tradeoff between adaptation and continuity is not novel, and has been studied in automatic accompaniment (Cabral et al., 2006; Marchini and Purwins, 2010). In our context, it is formulated as a tradeoff between α and viewpoint selectiveness. The most demanding viewpoint generates chord sequences that sound more natural in the given style, since they replicate actual transitions of chord realizations occurring in the corpus. However, such chord transitions will produce a sparse Markov model. The consequence is that only a very small number of leadsheets can be harmonized in that way for small values of α. By degrading the viewpoints, more transitions become available, so smaller (more conformant) values of α can be considered.
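To make the notion of viewpoint concrete, the following sketch shows how the transition statistics behind such a model could be collected under each viewpoint; the chord encoding (a tuple of MIDI pitches sorted from bass to soprano, with the tenor approximated as the second voice) and the function names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical viewpoint functions: each maps a chord, encoded as a tuple of
# MIDI pitches sorted from bass to soprano, to the data actually used to
# estimate the Markov model.
VIEWPOINTS = {
    "Identity":         lambda c: c,
    "BassTenorSoprano": lambda c: (c[0], c[1], c[-1]),
    "BassSoprano":      lambda c: (c[0], c[-1]),
    "Soprano":          lambda c: (c[-1],),
    "Constant":         lambda c: 0,   # baseline: every chord looks the same
}

def transition_table(chord_sequences, viewpoint):
    """Collect the viewpoint-to-viewpoint transitions observed in a corpus.
    The more selective the viewpoint, the sparser the resulting table."""
    table = defaultdict(set)
    for seq in chord_sequences:
        for a, b in zip(seq, seq[1:]):
            table[viewpoint(a)].add(viewpoint(b))
    return table
```

With this encoding, degrading the viewpoint simply means applying a more forgetful projection before counting transitions.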
Harmonizing the Real Book In order to illustrate the harmonic conformance / viewpoint tradeoff, we describe a basic experiment that has, to our knowledge, never been conducted, at least on such a scale. For several values of α we study the sparsity of the four viewpoints introduced above, by counting how many songs from the Real Book can be harmonized entirely with the viewpoint. More precisely, for each leadsheet taken from the Real Book (397), we build a Markov constraint problem consisting of the following constraints: -Generate a sequence of chord realizations taken exclusively from the Take 6 corpus, transposed into all 12 keys (variable domains), -Each note of the leadsheet is harmonized by one chord realization (homophonic note-to-note harmonization), -Transitions between two consecutive chord realizations $c_i$ and $c_{i+1}$ are all Markovian for the considered viewpoint, i.e. $p(c_{i+1} \mid c_i) > 0$, -Each chord realization $c_i$ has a soprano which is the leadsheet note, -Each chord realization $c_i$ must be α-conformant to the corresponding leadsheet chord label, for the chosen value of α. These constraints can all be implemented as a unary Markov constraint problem. The experiment consists in counting, for each value of α in $[0, 1]$ and for each of the four viewpoints, how many songs from the Real Book can be fully harmonized. The results are presented in Figure 4. It can be seen clearly that with non-trivial viewpoints (i.e. all viewpoints but Soprano), solutions are found only for high values of α. For those values, harmonic conformance is lost. Only the basic Soprano viewpoint leads to many solutions (160, a value insensitive to α). It can be noted that with the Constant viewpoint (a trivial viewpoint that basically amounts to removing the Markovian constraint), solutions are found for 262 songs. This means that there are 102 songs for which the Soprano viewpoint does not lead to any solution, for any value of α. This corresponds to songs that contain pitch transitions that never occur between two consecutive realized chords in the Take 6 corpus. It is important to note here that when no solution is found for a given leadsheet / viewpoint / value of α combination, this does not necessarily imply that the leadsheet contains a transition for which there is no match in the corpus (for the given viewpoint). It means that there is no complete solution, i.e. transitions compatible with each other so as to make up a complete solution sequence. This experiment shows clearly that harmonic conformance is somewhat incompatible with precise Markov models of chord realizations, for a realistic corpus (Take 6) on a realistic test database (the Real Book). However, we can use the Soprano viewpoint as a basis for producing interesting harmonizations of most reasonable leadsheets, with clear control over harmonic conformance. Figure 5, Figure 6 and Figure 7 show homophonic harmonizations of the first four bars of Giant Steps with various values of α. It can be noted that while harmonic conformance can be used as a parameter to generate more or less conformant realizations, the results are academically correct, but rarely very interesting musically. The style of the arranger is hard to recognize, because there are not enough actual transitions being reused from the corpus. The control of harmonic conformance can generate surprises, but at the price of losing the essence of the style.
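The difference between a locally missing transition and the absence of a complete solution can be illustrated with a simple feasibility check; the sketch below is a plain forward filtering pass over the chain of per-note candidate sets, not the Markov constraint solver used in the paper, and its inputs are assumed to be precomputed.

```python
def harmonizable(domains, allowed):
    """Feasibility check for one leadsheet. domains: one set of candidate
    chord realizations per melody note, each already satisfying the unary
    constraints (soprano note, alpha-conformance). allowed(a, b): True if
    the transition a -> b occurs in the arranger corpus under the chosen
    viewpoint. Returns True iff at least one complete chain of realizations
    exists; a single locally matchable transition is not enough."""
    reachable = set(domains[0])
    for dom in domains[1:]:
        reachable = {b for b in dom if any(allowed(a, b) for a in reachable)}
        if not reachable:
            return False
    return True
```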
Figure 4. Graph showing the number of successful harmonizations from the Real Book (illegal edition) using a Markov model of chord realizations, and various viewpoints of decreasing precision (identity, bass/tenor/soprano, bass/soprano, soprano). Figure 5. The beginning of Giant Steps harmonized with a small value of α. All realizations come from the Take 6 corpus and satisfy exactly the chord labels. The overall harmonization is conformant but not very interesting. Figure 6. The beginning of Giant Steps with a larger value of α. The chords are less conformant and more interesting, but the whole harmonization still lacks surprise. Figure 7. The beginning of Giant Steps with a still larger value of α. Chords are clearly farther away from the label, while retaining some flavor of the labels. However, the decrease in harmonic conformance is musically not very interesting. In order to express the harmonization style more clearly, and simultaneously bring creativity into the harmonization process, we introduce the concept of fioriture. Fioritures as a stylistic device The idea of fioriture comes from a simple observation of polyphonic scores written by masters: it is difficult to be inventive on short-duration notes. However, long notes offer opportunities to express a style: the longer a note is, the more possibilities of invention the arranger has. In the context of leadsheet-based harmonization, we therefore introduce the concept of fioriture as a free variation, in the style of the arranger, occurring exactly during a long note, and making sense within its context. A Simple Fioriture Example We illustrate the concept of fioriture with a simple example. The task is to harmonize the melody shown in Figure 8: two notes with simple chord labels (both notes belong to the chord triads). Figure 8. A simple melody to harmonize with fioritures. This melody can be harmonized homophonically as described above, as illustrated in Figure 9. Figure 9. Two homophonic harmonizations of the melody in Figure 8, with a smaller and a larger value of α respectively. With the higher value of α, the second one is more jazzy, with a 9th added to the first chord and a 6th to the second one. We can generate here a fioriture on the first note, since its duration is 4 beats. The Markov constraint problem corresponding to this fioriture is the following: -First, select a rhythm for a note starting on the first beat of a 4/4 bar, and lasting 4 beats (rhythm selection is described in the next section). Let n be the number of notes; we generate n + 1 chord realizations, so as to include the chord on the following note (here a D). -The domain of the first chord contains only chords whose soprano is the first melody note (here, A). -We can choose here a demanding viewpoint such as the Identity viewpoint, because in most cases the constraints above are not too hard. Figure 10 shows various solutions, with an increasing number of notes in the fioriture. It should be noted that all fioritures start from a soprano A on an A min chord and end on a soprano D on a D7 chord. However, some of them, in particular the last ones, deviate substantially from the chord labels. In short, they achieve musically meaningful harmonic non-conformance. To our knowledge, only Markov constraints can quickly compute distributions of solutions to such problems. Figure 10. Fioritures with various numbers of notes. The first one introduces an interesting chromaticism (E to Eb then to D); the second example (3 notes) introduces a clearly non-conformant chord that resolves nicely to the D; the third example (4 notes) consists in a bold chromatic descent from A minor to D; the fourth example (5 notes) uses an interesting triplet-based rhythm that also departs substantially from the A minor chord label; the last example is a remarkable jazzy sequence of chords.
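For illustration only, the fioriture problem can be mimicked with naive rejection sampling over a transition table of the Identity viewpoint; the actual system relies on Markov constraints, which sample such constrained walks exactly and efficiently, so the brute-force function below is merely a sketch of the problem's shape, and all its names are assumptions.

```python
import random

def fioriture(start_soprano, end_chord, n_notes, transitions, max_tries=10000):
    """Brute-force stand-in for the Markov constraint solver: sample random
    walks of n_notes + 1 chord realizations until one starts with the given
    soprano and ends on the fixed chord of the following melody note.
    transitions maps a chord (tuple of MIDI pitches, bass to soprano) to the
    set of chords that may follow it in the arranger corpus."""
    firsts = [c for c in transitions if c[-1] == start_soprano]
    if not firsts:
        return None
    for _ in range(max_tries):
        walk = [random.choice(firsts)]
        for _ in range(n_notes):
            nexts = transitions.get(walk[-1])
            if not nexts:
                break
            walk.append(random.choice(list(nexts)))
        if len(walk) == n_notes + 1 and walk[-1] == end_chord:
            return walk
    return None  # no satisfying walk found within the try budget
```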
Common-sense rhythms One difficulty that arises when creating fioritures is to find an adequate rhythm for the generated chords. One solution would be to try to imitate the rhythms found in the arranger corpus, but this implies that the corpus used is perfectly reliable, and that metrical information is provided, which is not the case with MIDI files obtained from performances. More importantly, generating Markov sequences with durations raises sparsity issues that do not have general solutions. Another argument is that the rhythm of the fioriture should comply with the genre of the leadsheet more than with that of the arranger's corpus. In this study, we have exploited the statistical properties of the leadsheet database to find common-sense rhythms that fit the leadsheet to harmonize. For each rhythm to generate, we query the database to retrieve all the melodic rhythms that occur in all jazz standards, at the given metrical position. For a given leadsheet note to harmonize, we retrieve all melodic extracts starting at the same metrical position in the bar, and of the same duration. We then draw a rhythm at random, weighted by its probability in the database. Such a method can be parameterized in many ways (imposing the number of notes or the presence of rests, filtering by composer, genre, etc.). Figure 11 and Figure 12 show the most frequent rhythms found by such a query on the Real Book, for two different configurations (starting beat in bar and duration). Figure 12. The four most frequent rhythms for a note starting on the last beat of a 4/4 bar with a 2-beat duration, and their respective frequencies. The query found 3943 occurrences of 111 different rhythms.
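Such a frequency-weighted draw is simple to sketch, assuming the rhythm database can be reduced to (starting beat, duration, rhythm) triples; the function name and the toy corpus below are illustrative.

```python
import random
from collections import Counter

def draw_rhythm(rhythm_corpus, start_beat, duration):
    """Draw a common-sense rhythm: collect all melodic rhythms in the
    leadsheet database that start at the same metrical position and span
    the same duration, then sample one weighted by its corpus frequency."""
    matches = Counter(r for s, d, r in rhythm_corpus
                      if s == start_beat and d == duration)
    if not matches:
        return None
    rhythms, weights = zip(*matches.items())
    return random.choices(rhythms, weights=weights, k=1)[0]

# Toy corpus of (start beat, duration, rhythm) triples; a rhythm is a tuple
# of note durations in beats. Cf. Figure 12: last beat of a 4/4 bar, 2 beats.
corpus = [(3, 2, (1, 1)), (3, 2, (1, 1)), (3, 2, (0.5, 0.5, 1)), (3, 2, (2,))]
print(draw_rhythm(corpus, start_beat=3, duration=2))
```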
Full Examples Two examples of Giant Steps harmonized with fioritures are given in the annex: one in the style of Take 6, and another in the style of Richard Wagner's tetralogy. In both cases, it can be said that the musical quality is high, compared to previous approaches in automatic harmonization. Preliminary experiments were conducted by playing some harmonizations to highly trained experts (a world-famous Brazilian composer, a harmony professor at Goldsmiths College, a talented jazz improviser and teacher, and a professional UK jazz pianist): all of them acknowledged that the system produces highly interesting outputs. A full evaluation is in progress to assess precisely the impact of fioritures on the perception of the piece, but it seems reasonable to say that they increase the musical creativity of the software in a significant manner. Applications to Other Styles This paper has focused on the style of Take 6, because of the acknowledged difficulty of modeling their productions. Our approach clearly improves on previous attempts at modeling barbershop harmonization such as (Roberts, 2005), who concludes his study with: 'although it is possible to formalize the creative process into rules, it does not yield good arrangements'. We think we have reached a reasonable level of musical quality here. Our approach, however, is applicable to other styles, as this paper shows with the case of Wagner. Technically, our approach is able to harmonize most leadsheets in any style defined by at least one polyphonic MIDI file, but we have not conducted any specific musical evaluation in other styles yet. Conclusion We have introduced the concept of fioriture to harmonize leadsheets in the style of any arranger. Fioritures are controlled random walks within well-defined boundaries defined by long notes in the melody to harmonize. Fioritures could be envisaged under the framework of HMMs (as in Farbood and Schoner, 2001). However, HMMs use chord labels as hidden states, so we would need an annotated corpus, which we do not have. Furthermore, annotating Take 6 scores with chord labels is in itself an ill-defined problem. Finally, HMMs cannot be controlled as precisely and meaningfully as Markov constraints. Our approach works with non-annotated, non-voice-separated corpora for modeling the arranging style. It only requires a definition of the chord labels used in the leadsheet (as sets of pitch classes). As with all music generation systems, a rigorous evaluation of our approach is difficult. We claim that our system works remarkably well for most cases, as it rarely makes blatant musical errors, and most often produces musically interesting and challenging outputs. Beyond automatic harmonization, the possibility of controlling fioritures manually (when, and with which parameters) paves the way for a new generation of assisted composition systems. Our approach could easily be extended to exploit social preferences, to help the system choose chords that sound right to listeners and rule out the ones that do not. Fioritures can also be used as a creative device. By forcing fioritures to have many notes, or by manually substituting chosen leadsheet notes with others, one can generate harmonizations in which the original melody becomes less and less recognizable, and the style of the arranger becomes increasingly salient. Finally, we want to stress that using fioritures to express style is a paradox: fioritures (from the Italian fioritura, flowering) are supposed to be decorative, as opposed to core melody notes, i.e. they are not considered primary musical elements. But in our highly constrained context, they can become a device for creative expression. Acknowledgements This research is conducted within the Flow Machines project, which received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 291156. We thank A. Dessein for providing us with perfect transcriptions of Take 6 recordings. The accompanying web site gives examples of harmonizations of jazz standards. 2014_14 !2014 A Musical Composition Application Based on a Multiagent System to Assist Novel Composers Maria Navarro, Computer Science Department, Salamanca University, Pza Merced, Salamanca 37005, Spain, mar90ali94@usal.es Juan Manuel Corchado, Computer Science Department, Salamanca University, Pza Merced, Salamanca 37005, Spain, corchado@usal.es Yves Demazeau, Laboratoire d'Informatique de Grenoble, 110 avenue de la Chimie, Domaine Universitaire de Saint-Martin-d'Hères, BP53, 38041 Grenoble cedex 9, France, Yves.Demazeau@imag.fr Abstract This paper presents a solution to help new composers make harmonies. A multiagent approach based on virtual organizations has been used to construct this application; the model is built using a multiagent system. This study presents a Multi-Agent System (MAS) built with PANGEA, a platform for developing different multiagent systems, capable of composing music following the HS algorithm. The results show the success of this application in correctly composing a classical harmony. Introduction Interest in computational creativity has been increasing in the scientific community. Although this interest is recent, there are a number of algorithms, schemas and procedures to develop an intelligent machine capable of creating new ideas or new artistic compositions. Many music students, or even musicians, have problems composing or improvising melodies with their own instrument.
They may find it difficult to practice their improvisation or to compose their own melodies because they usually need to work with other musicians who are too busy to collaborate with them. This system was designed to assist these music students in improving their abilities. The goal of the system is to show that a simple and general agent framework such as PANGEA (Platform for Automatic coNstruction of orGanizations of intElligent Agents) (Zato et al. 2012) can build a proper and scalable music composition system. A multiagent system based on virtual organizations is used because it permits making changes in the problem specification, and can modify the music style or add new rules without altering the structural composition. Only the agents' behavior needs modification. The BDI architecture was chosen for these reasons. We will evaluate the results by considering two types of criteria. First, we will consider mathematical criteria, which include an optimization function to minimize. The smaller the value of this function for one chord, the better the chord. This function considers constraint rules that evaluate the chord obtained. These rules and the evaluation method are detailed in Section 3. In Western music, dissonance is the quality of sounds that seem unstable and need to resolve to a stable sound called a consonance. The definition of dissonance is culturally conditioned, which is why a classical, occidental music culture is considered for the evaluation of consonance. According to this criterion, we can consider these consonant intervals (in order of consonance): octaves; perfect fourths and perfect fifths; major thirds and minor sixths; minor thirds and major sixths. We will also evaluate whether the system helps composers to make their melodies or to improvise a melody by just listening to the harmonies. This evaluation consists of rating the system with a number from 1 to 10. The second section contains a brief review of algorithms in music composition, multiagent systems and basic concepts of virtual organizations. The third section presents our model and our particular solution, attempting to solve the problem of harmony composition with an unknown melody, and how Virtual Organizations (VO) can help to improve this system. The last section shows some results of the system, and proposes new lines of improvement. Background This section presents general information about composition algorithms, concepts about MAS and VO, and a brief explanation of the background of agents. Review of composition algorithms While grammar-based systems were initially widely used in composition tasks, today there are many other algorithms attempting to compose music. Some of these are called live algorithms (Bown 2011). One of the most successful algorithms involves Markov models (Eigenfeldt and Pasquier 2013). There are also algorithms that use lyrics as a variable in their compositions, for example (Monteith, Martinez, and Ventura 2012). One interesting and notable study is that of F. Pachet (Pachet 2003). (Hoover, Szerlip, and Stanley 2011) focused on evolving a single monophonic accompaniment for a multipart MIDI by using a compositional pattern producing network (CPPN), a special type of artificial neural network (ANN). Agents and creativity are two disciplines that have interacted in several case studies (Martin, Jin, and Bown 2011; Lacomme, Demazeau, and Dugdale 2010). Harmony Search Algorithm Music improvisation aims to produce an ideal state determined by aesthetic parameters, i.e., consonance or sound balance.
The procedure has five steps, described here (Geem and Choi 2007). First, it is necessary to choose the optimization function and to consider a memory called the Harmony Memory (HM, a matrix filled with as many generated solution vectors as the HMS (harmony memory size)). The new harmony is generated by a random selection, a memory consideration (using the HM), or a pitch adjustment (Geem and Choi 2007). The choice of one or another is conditioned by two probabilistic parameters: PAR (Pitch Adjustment Rate) and HMCR (Harmony Memory Considering Rate). Once the new harmony is built, the constraint rules that evaluate the obtained chord must also be taken into account. For this, a threshold is established. If the chord exceeds this value, it is dismissed, and the process starts again with a new chord that replaces the rejected chord. Finally, if the new harmony vector x has a better value for the fitness function than the worst harmony in the HM, the new harmony is included in the HM. This process is repeated over and over until the stopping criterion (maximum number of improvisations) is reached.
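As a rough illustration of this loop, here is a minimal sketch of Harmony Search for one chord; the PAR and HMCR values follow those reported below, while the chord representation (a list of voice pitches), the semitone-sized pitch adjustment and the threshold handling are assumptions.

```python
import random

def harmony_search(fitness, random_pitch, hms=10, hmcr=0.2, par=0.3,
                   n_voices=4, max_improvisations=200, threshold=None):
    """Sketch of the Harmony Search loop described above: fitness is the
    optimization function to minimise; random_pitch() returns a random
    pitch for one voice; hms is the harmony memory size."""
    hm = [[random_pitch() for _ in range(n_voices)] for _ in range(hms)]
    for _ in range(max_improvisations):
        new = []
        for v in range(n_voices):
            if random.random() < hmcr:          # memory consideration
                pitch = random.choice(hm)[v]
                if random.random() < par:       # pitch adjustment
                    pitch += random.choice([-1, 1])
            else:                               # random selection
                pitch = random_pitch()
            new.append(pitch)
        if threshold is not None and fitness(new) > threshold:
            continue                            # constraint check: dismiss chord
        worst = max(hm, key=fitness)
        if fitness(new) < fitness(worst):       # replace worst harmony in HM
            hm[hm.index(worst)] = new
    return min(hm, key=fitness)

# Example: minimise the distance of each voice to middle C, pitches in C3-C5.
best = harmony_search(lambda chord: sum(abs(p - 60) for p in chord),
                      lambda: random.randint(48, 72))
print(best)
```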
Virtual Organizations In the initial development of multiagent systems, agents were seen as autonomous and dynamic entities that evolve according to their own objectives, without external explicit restrictions on their behavior and communications (Demazeau and Muller 1990). In recent years, developers have directed their interest to the organizational aspects of the society of agents (Hubner et al. 2010). Thus, two descriptive levels are set: the organization and the agent. Agents are now seen as dynamic entities that evolve within organizations. The following sections present a description of the system, as well as the algorithm and the agent structures used to solve the problem. Classical Harmony Composition Modeling musical composition is difficult because musical objects do not have any pre-assigned connotation. That means there are as many definitions of the same object as there are belief systems in musical history. For this reason, our efforts were centered on composing music from the classical period. In this period, there were many rules for composing classical music. In particular, the following main norms are considered. R1 - 8th and 5th parallels: these are produced when the interval between the i-note and the j-note of chord n and the interval between the i-note and the j-note of chord n+1 are both a 5th or an 8th. R2 - Leading-note resolution. There is a rule that requires a resolution of the leading-note to the tonic. R3 - Voice crossing. An ideal harmony must avoid voice i getting above voice j, when j = i + 1. R4 - Movements between tensions. Each chord has a peculiar role that produces stability or instability, depending on its function (tonic, dominant or subdominant). It is the tension that permits the music to evolve in the composition. For this reason, our desire is to produce movement between chords, to prevent the music from becoming boring. Thus, the repetition of the same function over time must be penalized in some way. R5 - Avoid a large interval between two pitches in a chord. This is important because if there is a big gap between pitches in the same chord, the connection between the pitches can break. R6 - Avoid a large interval between two pitches in the same voice. This rule allows building more cantabile melodies, in general. With all of these constraints and rules, the following optimization function was built to minimize:

$$\min \; F = \sum_{i=1}^{N} \sum_{j=1}^{3} \mathrm{Rank}(x_{ij}) + \sum_{i=1}^{N} \sum_{j=1}^{3} \mathrm{Penalty}(x_{ij}) \quad (1)$$

where:

$$\mathrm{Rank}(x_{ij}) = \mathrm{iRank}(x_{ij}, x_{i(j-1)}) + \ln(\mathrm{Tension}_i) + |x_{ij} - x_{(i-1)j}| \quad (2, 3)$$

Tension(x) values are considered on a discrete scale from 1 to 3, depending on the tension role: if the chord is subdominant, the tension is 1; if it is dominant, the tension is 3; and if it is tonic, the tension is 2. The values of iRank(x) for a specific harmonic interval are: 3rd or 8th interval: value of 1; 6th interval: value of 1.5; 4th interval: value of 2; 5th interval: value of 2.5; unison interval: value of 3; 2nd or 7th interval: value of 4. Penalty(x) values are shown in equations (4)-(8), keeping in mind the constraints considered previously:

$$x_{(i-1)j} = \mathrm{SI} \;\wedge\; x_{ij} \neq \mathrm{DO} \;\Rightarrow\; \mathrm{Penalty}(x_{ij}) = 5 \quad (4)$$

$$x_{i(j-1)} \geq x_{ij} \;\Rightarrow\; \mathrm{Penalty}(x_{ij}) = 4 \quad (5)$$

$$\mathrm{Tension}_{i-1} = 3 \;\wedge\; \mathrm{Tension}_i = 1 \;\Rightarrow\; \mathrm{Penalty}(x_{ij}) = 2 \quad (6)$$

$$\mathrm{interval}(x_{(i-1)j}, x_{(i-1)(j-1)}) = \mathrm{interval}(x_{ij}, x_{i(j-1)}) \in \{5\mathrm{th}, 8\mathrm{th}\} \;\Rightarrow\; \mathrm{Penalty}(x_{ij}) = 3 \quad (7, 8)$$

The algorithm starts with an initialization of the Harmony Memory (HM) matrix, which is stored in the repository. Several PAR and HMCR values were also tested, and we chose the best ones: 0.3 for PAR and 0.2 for HMCR. In the next section, both the structure of the MAS based on VO and its advantages will be explained.
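As a numeric illustration of equations (1)-(3), the following sketch accumulates the Rank terms over a sequence of chords; the MIDI pitch encoding, the semitone-to-interval mapping, and the lumping of the tritone with the 5th are assumptions not specified in the paper.

```python
import math

# Hypothetical rendering of the Rank part of equations (1)-(3):
# harmony[i][j] is the MIDI pitch of voice j in chord i, and tension[i] is
# the tension role of chord i (1 = subdominant, 2 = tonic, 3 = dominant).
IRANK_BY_SEMITONE = {0: 3,            # unison
                     1: 4, 2: 4,      # 2nds
                     3: 1, 4: 1,      # 3rds
                     5: 2,            # 4th
                     6: 2.5, 7: 2.5,  # 5ths (tritone lumped with the 5th)
                     8: 1.5, 9: 1.5,  # 6ths
                     10: 4, 11: 4}    # 7ths

def irank(a, b):
    diff = abs(a - b)
    if diff and diff % 12 == 0:
        return 1                      # octave (8th)
    return IRANK_BY_SEMITONE[diff % 12]

def rank_cost(harmony, tension):
    """Sum of the Rank terms of equation (1); the Penalty terms of
    equations (4)-(8) would be accumulated analogously, rule by rule."""
    total = 0.0
    for i, chord in enumerate(harmony):
        for j, pitch in enumerate(chord):
            if j > 0:
                total += irank(pitch, chord[j - 1])      # iRank(x_ij, x_i(j-1))
            total += math.log(tension[i])                # ln(Tension_i)
            if i > 0:
                total += abs(pitch - harmony[i - 1][j])  # |x_ij - x_(i-1)j|
    return total

# Example: two four-voice chords (C major to G major), tonic then dominant.
print(rank_cost([[48, 52, 55, 60], [43, 50, 55, 59]], [2, 3]))
```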
Multiagent System Structure Virtual organizations were used to implement and develop our model. Virtual organizations provide a certain number of roles easily replaceable by an agent, depending on the context. This allows the system to be very flexible. Besides, a methodology based on VO can provide us with a global vision of the problem, the model and the possible solutions. To design the virtual organization it is necessary to analyze the needs and expectations of the system. The result of this analysis will be the roles of the entities involved in the proposed system. The following specific roles were found: Composer Role: this role creates the harmonic music following its rules to achieve a goal (desire). Evaluator Role: this role evaluates the result of the composer role and decides if it is good enough to present to the user. Interface Role: this role allows the user to interact with the system. Data Supplier Role: this role is an agent that accesses and stores all or most of the information needed to manage the actions that govern this system. Control Role: the agents that exercise this role have overall control of the system. To implement the roles of the VO we chose to develop a MAS. For the composer and evaluator agents, we chose a BDI agent architecture (Corchado et al. 2004), for two reasons: firstly, it is one of the most common deliberative agent architectures, and one of the simplest; and secondly, this structure is perfectly adapted to our requirements. The BDI agent process involves two fundamental activities: a) determining which goals should be achieved (deliberation) and b) deciding how to reach these goals (planning). Both processes should be carried out by taking into account the limited resources of each agent. The schema in Figure 1 shows how client agents are connected to model our problem. To begin, the composer agent has as a goal or desire to minimize the value of the optimization function. To achieve this goal, it has to make some rules or intentions (that is, the algorithm), starting from its beliefs or its initial state. As we can see, the BDI architecture is perfectly suited to the agent. Additionally, the evaluator agent has as a desire to classify the chord made by the composer agent. To achieve this goal, it has to follow its intentions, starting with its beliefs. Finally, the remaining agents are given communication, coordination and representation tasks. The system was developed on PANGEA (Zato et al. 2012), which provides us with certain advantages. PANGEA is a service-oriented platform that allows the open multiagent system to take maximum advantage of the distribution of resources. With PANGEA, we can change our musical agent in order to change the composition algorithm or behavior. We can even change an agent and replace it with a multiagent system capable of communicating to compose new music. Second, we can change our Constraint Agent. Figure 2: Harmony achieved with 45 iterations. However, the more iterations we performed, the better the results we obtained. We have a new line with 200 iterations, noticeably better than the previous one (see Figure 3). The first chord is perfect, taking into account the intervals between the notes. Figure 3: Harmony achieved after 200 iterations. This means that we have an evolutionary algorithm. The result depends not only on the iterations we perform, but also on the parameters PAR and HMCR, which indicate the probability of taking a random value for a pitch in a chord, as explained in the previous section. The fitness of the results is evaluated by studying the way the rules and constraints are followed. In other words, the more the rules are followed, the better the harmony will sound. The mathematical evaluation is to study the value of the optimization function as well as the number of constraints that are violated. Nevertheless, in music there is also a qualitative way to evaluate the model. This method of evaluation is based on acoustic perception, and therefore depends on the listener. We conducted tests with two experts in classical music (composers) and two non-experts in classical music to rate both harmonies above. The evaluation criteria were: completely dissonant, dissonant, a bit consonant, consonant, completely consonant. Experts number 1 and number 2 evaluated the first harmony between a bit consonant and dissonant, and the others evaluated it as dissonant. All four rated the second harmony as consonant. In our small study, two composers used our method and evaluated the results on a scale of 1-10. The first evaluated the result with a 6 and the second with a 7.5, which we consider acceptable for our first approach to the system. With regard to the virtual organization, the process of identifying and organizing roles helped to improve the management and thus to improve efficiency. The MAS structure allows us to make an extensible and scalable system, as we can change rules, constraints and behavior with little effort, searching for new ways of mixing different techniques, or even tools, in the composition. The BDI architecture is perfectly suited to the solution we were seeking. BDI has a clear methodology that facilitates the development stage, with many theories that suit our problem. This architecture enables us to easily introduce a learning mechanism, as we can see in our case study. Moreover, using PANGEA as the platform allowed fluid communication between agents, which is evident in the design of the application, improving the modularity and the separation between client and provider as well. As future work, we propose incorporating rhythms. This model can also evolve to learn and self-check its own mistakes in harmony composition.
2014_15 !2014 Empirically Grounding the Evaluation of Creative Systems: Incorporating Interaction Design Oliver Bown Design Lab, University of Sydney, NSW, 2006, Australia oliver.bown@sydney.edu.au Abstract In this paper I argue that the evaluation of artificial creative systems in the direct form currently practiced is not in itself empirically well-grounded, hindering the potential for incremental development in the field. I propose an approach to evaluation that is grounded in thinking about interaction design, and inspired by an anthropological understanding of human creative behaviour. This requires looking at interactions between systems and humans using a richer cultural model of creativity, and the application of empirically better-grounded methodological tools that view artificial creative systems as situated in cultural contexts. The applicability of the concepts of usability and user experience is considered for creative systems evaluation, and existing evaluation frameworks, including Colton's creativity tripod and Ritchie's 18 criteria, are reviewed from this perspective. Introduction: Evaluation, Creativity and Empiricism This paper is concerned with the evaluation of creative systems, specifically in the area of artistic creativity (not to be confused with evaluation by creative systems). Whilst AI researchers in other application domains are able to observe and measure incremental improvements in their algorithms, computational creativity researchers are burdened by the inherent ambiguity in the field regarding whether algorithm or system X is better than algorithm or system Y. Incremental developments in the field are also relatively obscure to the outsider: the figurative artworks created by Harold Cohen's celebrated automated artist, AARON, in the 1980s (see AARON's online biography at http://www.usask.ca/art/digital culture/wiebe/moving.html) look like the work of a competent and creative artist. As far as the artwork itself is concerned, this would appear to be as good as it gets: problem solved. But most in the field believe that we are only just beginning to develop good creative systems. Such appearances foster confusion about where we are at in the development of significant artistic creativity in computers, between a far-off goal on the one hand, and a solved problem on the other. Cardoso, Veale, and Wiggins (2009) characterise the field as taking a pragmatic, demonstrative approach to computational creativity practice, which sees the construction of working models as the most convincing way to drive home a point (Cardoso, Veale, and Wiggins, 2009, p. 19). This tradition has kept the focus on innovation, distinguishing it from more theoretical studies of creativity. Nevertheless, the discussions and demonstrations that surround such an approach depend on a firm relationship between empirical observations and what we claim about systems. Hence, understandably, a significant portion of the literature in the field focuses on the necessary theoretical distraction of how to go about evaluating systems. Wiggins' notion of evaluation (Wiggins, 2006), widely adopted in the field, requires that a system performs tasks in a way that would be deemed creative if performed by a human. But whilst simple to state, the task of concretely drawing such a conclusion about a given system maintains an opaque and vexing relationship to the various forms of empirical observations available to us.
In light of these issues, the purpose of this paper is to examine the empirical grounding underlying the evaluation of systems. Empirical grounding is defined as the practice of anchoring theoretical terms to scientifically measurable events, and is necessary for the effectiveness of the application of knowledge (Goldkuhl, 2004) that is essential for transforming discussions about system designs and methods into incremental scientific progress. I argue that whilst the essential incompatibility between evaluation in computational creativity and the objective nature of optimisation found in AI may have been acknowledged from the outset, there remains a gap that has still not been plugged by a positive theory of evaluation in computational creativity. Further to this, I propose that the standard model of creativity in art, derived largely from Boden's concepts, has not provided a suitable framework for thinking about where and how the evaluation of creativity applies in human artistic behaviour. To address this, it is proposed that a human-centred view, specifically the use of design-based approaches such as interaction design, can give computational creativity a thorough empirical grounding. An interaction design approach can be applied easily to existing work in computational creativity, viewing the understanding and measurement of system behaviours in terms of their interaction with human users. It offers a practical route to bringing a much-needed human and social dimension to studies of creative systems without rejecting aspirations towards autonomy in computational creativity software. The Soft Side of Computational Creativity The adjectives hard and soft have been used, controversially, to refer to different areas of scientific enquiry (as a precaution, they remain in quotes throughout this paper!). Diamond (1987) explains that some areas are given the highly flattering name of hard science, because they use the firm evidence that controlled experiments and highly accurate measurements can provide, whereas soft sciences, as they're pejoratively termed, are more difficult to study for obvious reasons... You can't start... and stop [experiments] whenever you choose. You can't control all the variables; perhaps you can't control any variable. You may even find it hard to decide what a variable is (Diamond, 1987, p. 35). Although many theoreticians such as Diamond reject the tone of the terms (here Diamond is arguing that soft sciences are in fact harder than hard sciences), the definitions given here usefully describe a continuum of what he understands as degrees of operationalisation. Whilst the terms may connote tough and weedy respectively, they also connote rigid and well-defined levels of operationalisation versus more flexible and loosely-defined levels of operationalisation. This distinction remains useful. A key point is that there are appropriate ways to deal with soft concepts, above all of which is to acknowledge them as such in order to apply suitable methods. A popular perception is that soft sciences harden as their theory and practice coevolve, with psychology and sociology given as typical examples (Nature, 2005). But doing quality soft science would appear to be the first step towards this ambition. Computational creativity necessarily deals with both sorts of concepts, and researchers must therefore know how to work across this spectrum. I discuss as an example Colton's creativity tripod (2008).
Colton proposes to include in his formulation of evaluation a set of internal properties of systems, due to the limited information available when using only the end products of an automated creative process to evaluate that process (as advocated by Ritchie (2007)). He proposes that we look inside the system itself in order to gain a fuller description of the system's processes along with its products, and thus make a more informed decision about the creativity of the system. This, he argues, is more in line with how we evaluate human creativity: A classic example... is Duchamp's displaying of a urinal as a piece of art. In situations like these, consumers are really celebrating the creativity of the artist rather than the value of the artefact (Colton, 2008, p. 15). Colton suggests breaking down creativity into three components, a creativity tripod of skill, appreciation and imagination, that can be sought in creative systems. He defines each of these as necessary conditions for the identification of creativity, and proposes that creativity evaluation could be built around an analysis of these properties. He performs such an analysis of his own systems, HR and The Painting Fool, and identifies the existence of each component in both systems (although he clarifies that they do not occur simultaneously in the same version of the Painting Fool system). In Colton's analysis, skill, appreciation and imagination are not formalised, and are treated as intuitive ideas taken in the manner of Wiggins' creativity-as-recognised-by-a-human criterion. Accordingly, Colton's application of the terms is impressionistic. For example, he says of The Painting Fool's imagination that we wrote a scene generation module that uses an evolutionary approach to build scenes containing objects of a similar nature, such as city skylines and flower arrangements (Colton, 2008, p. 21). From this, the reader has little hope of determining whether the imagination criterion has been satisfied, let alone what the sub-criteria are for imagination. A further problem is that, in empirical terms, the expected order of knowledge discovery has clearly been put in reverse: imagination has been defined first as a kind of internal scene generation process, then implemented into the system, the conclusion being drawn that the system contains imagination. This abandons the critical step of enquiry into whether, having defined imagination as such and implemented it accordingly, this is actually a sufficient definition of imagination. Under these circumstances, the concepts skill, appreciation and imagination cannot be distinguished from trivial pseudo-versions of themselves. Accordingly, reduction to triviality provides an easy rebuttal to such claims, and this has been performed by Ventura on Colton's criteria (Ventura, 2008). Ventura presents a clearly trivial, unanimously uncreative computer program, and applies a similar analysis to that performed originally by Colton, concluding that the mock system has skill, appreciation and imagination. If Ventura's system has these features, and they are sufficient for the attribution of creativity, then we must either accept the system as creative or reject the criteria as they currently stand. Can such vague concepts be used at all, or should they be dropped altogether if they can't be precisely formalised?
I prefer to support both Colton's initial premise that an understanding of the inner workings of systems is as necessary to evaluating creativity as the outputs the system produces, and his identification of skill, appreciation and imagination as critical features of advanced creative systems. They are things that we would expect to see well implemented in our finest systems, and there is nothing wrong with making this intuitive step. But unfortunately they are clumsy terms, and as Ventura's analysis demonstrates, they don't look like hopeful performers at a formal level. In Diamond's terms, they are far from being effectively operationalised, and they may never be operationalised, because in the process we would reasonably expect to devise concepts that are far removed from folk terminology, just as physicists and neuroscientists have done. A more generic scientific strategy for how to work with both rigid and flexible objects alike comes from the definitive hard scientist Richard Feynman (1974), who makes a simple appeal to what he describes as an unspoken law of science, a kind of utter honesty, a kind of leaning over backwards to face the problem of how not to fool ourselves (Feynman, 1974). He draws an analogy between forms of habitual scientific practice and the famed cargo cults of the South Pacific, who carved wooden headphones and bamboo antennae in the hope of attracting cargo planes to land, imitating the troops they had seen during WWII. He calls upon scientists across disciplines to ask themselves: am I making symbolic wooden headphones or real working headphones? In the spirit of Feynman's call to utter honesty, an overlooked first step is to acknowledge that these terms, given our current knowledge, are extremely flexible and far-from-operationalised, which places very different demands on how we address and manipulate them as concepts. Their treatment is implicitly argument-based, meaning that no neat proof or direct basis in evidence is available to us. This makes for a very messy equivalence to the process of checking the steps of a proof or repeating a simulation experiment, with each step containing unknowns and vagaries: flexible rather than rigid science. Computational creativity needs to learn to work with vague concepts that are not easily subject to formal treatment. Other examples of slips into the space of soft science that are likely to occur in computational creativity discourse include describing a system as doing something on its own when discussing the autonomy of systems, but remaining imprecise about what the it and the doing specify (e.g., to say a program composes a piece of music on its own requires quite a detailed analysis of the sequence of events leading to the specific configuration of musical content), and cases of comparing exploratory and transformational creativity in an interpretive manner (e.g., to classify any historical creative act as transformational requires the imposition of our own chosen categories onto incomplete historical data) (see Ritchie, 2006, for an interesting discussion). For this reason soft sciences, such as social anthropology, subject the use of language to great scrutiny. The meanings of terms that cannot easily be made measurable or mathematically manipulable are instead treated with an acknowledgement of their fragility. As a part of their data gathering, anthropologists immerse themselves in cultural situations in order to be able to fully understand and successfully interpret what they observe.
Immersion is necessary in order to expose the cultural content of these situations, which is not directly accessible through hard science methods such as surveys, lab tests or recordings. For example, the difference between a twitch of the eye, a wink, a fake wink, a parodied wink, a burlesque of a parodied wink, and so on, might only be fully accessible to someone who has an intimate understanding of the sociocultural context in which the act occurs (Geertz, 1973). Misinterpretation of such acts is a clear source of error in the development of theory. In the 1980s, borrowing from philosopher Gilbert Ryle, anthropologist Clifford Geertz (Geertz, 1973) developed these practices into a method of thick description that gave new impetus to, and validation of, the interpretative (soft) side of anthropology as a science. Such thinking is more relevant to computational creativity than it may appear. The empirical material underlying Wiggins' creativity-as-recognised-by-a-human criterion is in the first instance anthropological rather than psychological, revolving around interpretations of culturally-situated human behaviour: in particular, that we establish a shared understanding of what creative means. Geertz's advice on grounding methodology is that if you want to understand what a science is, you should look in the first instance not at its theories or findings, and certainly not at what its apologists say about it; you should look at what the practitioners of it do (Geertz, 1973, p. 5). This is a call to work the science's methods around the data and practices that are practically available. This may be helpful given what computational creativity practitioners do. Cardoso, Veale and Wiggins' characterisation of computational creativity practice as the construction of working models as the most convincing way to drive home a point (Cardoso, Veale, and Wiggins, 2009, p. 19) breaks down into two parts: the engineering excellence to create advanced creative systems, and the analysis of human social interaction in creative contexts that will be used to round off the argument. Thus a necessary direction for computational creativity is to fuse excellence in the hard science area of algorithms and the soft science of understanding human social interaction. The terms skill, appreciation and imagination are things that we should be seeking to better define through (soft) computational creativity research, and cannot at the same time be used as the basis for a (hard) test for creativity. Characterising Artistic Creativity Using Generative and Adaptive Creativity Value or utility is included in the vast majority of definitions of creativity (most notably (Boden, 1990)), and is critical to many applications of creativity research, such as improving organisational creativity and building creative cities. But non-cognitive processes such as biological evolution are also viewed as creative. Here, value cannot have the same meaning as it does in the context of human cognition-based creativity, because there is no agent to do the valuing. And yet this difference has not been explored in any depth. The application of theoretical concepts has tended to focus on Boden's (1990) two key distinctions in her analysis of creativity: between personal and historical creativity as indications of scope; and between combinatorial, exploratory and transformational creativity as forms of creative succession.
From this point of view, creativity is tightly bound to individual human goals, and is primarily conceived of as a cognitive process that is used to discover new things of value. This lack of attention to the variable nature of value in creativity causes confusion and has led to a poor empirical grounding for evaluation in computational creativity, precisely because much creativity occurs outside of the process of human creative cognition (in the narrower sense given above). A distinction based on different relations to value has not been taken up by the community. I draw on a distinction (Bown, 2012) between generative and adaptive creativity, and argue that this distinction clarifies and resolves the confusion about how value is manifest in the arts. In Bown (2012) I propose a distinction between two forms of creativity based on their relationship to value: generative and adaptive creativity. Generative creativity is defined with a very broad scope: it occurs wherever new types of things come into existence. It does not require cognition: non-human processes such as biological evolution are capable of creating new types of things, and, I argue, there are also examples of human activity in which things emerge autopoietically without being planned or conceived of by individual humans. The role of generative creativity in art will be discussed below. Generative creativity offers an expanded view of creativity in which the production of new types of thing is the sole criterion for creativity to have occurred, and the process by which those things are produced, whether by deities, human minds or autopoietic processes, is secondary. In human creativity, this liberates us from the possibly misleading premise that the creative mind is necessary and sufficient for the act of creation. A framework that distinguishes between those entities can properly address the issue of when and how human thinking is associated with new things coming into existence. Adaptive creativity, on the other hand, is that in which something is created by an intelligent agent in response to a need or opportunity. The distinguishing feature here is that of value or benefit; generative creativity is value-free. In adaptive creativity, the agent doing the creation stands to benefit from the creative act: a link must exist between the creative agent and the beneficial return of the creative act in order for adaptive creativity to have occurred. Uncontroversial examples include solving everyday problems, such as using a coat-hanger to retrieve something from behind a wardrobe. Adaptive creativity is understood as requiring certain cognitive abilities such as mental representation, whereas generative creativity is completely blind, as in biological evolution. Generative and adaptive creativity are not extremes at either end of a continuum, but distinct and mutually exclusive categories: either there was a preceding purpose or there was not. However, the appearance of new things may be the sum of different episodes of generative and adaptive creativity. Given these terms, I argue that the existing notion of the evaluation of creative systems is entirely, indeed inherently, geared towards adaptive creativity, and is unable to accommodate generative creativity at all. Adaptive creativity alone is compatible with computational creativity's AI legacy, which preferences an optimisation or search approach to discovering valuable artefacts. This is not without powerful applications.
Evolutionary optimisation regularly discovers surprising designs in response to engineering problems. Thaler's Creativity Machine, for example, was used to discover novel toothbrush designs using a relatively traditional optimisation approach involving a clear objective function (Plotkin, 2009). It is only generative creativity that is incompatible with optimisation.

Adaptive and Generative Creativity in the Arts

For the purpose of evaluating creative systems, it has been considered reasonable to assume that we can treat artistic domains entirely in terms of adaptive creativity, and that the act of creating artworks is an adaptively creative act. Accordingly one can view the production of an artwork as an optimisation or search problem. This simplification is built into the premise of an agent designed to evaluate its output in order to find good solutions. For such an agent to incorporate generative creativity into its behaviour would mean that the value of its output was indeterminate and evaluation would be frustrated. But evidence suggests that this view of art does not hold when one considers its social functions. I will focus on music for the purpose of this discussion, and take what I believe is an uncontroversial understanding of music insofar as sociologists of music are concerned. Hargreaves and North (1999) identify three principal social functions for music: self-identity, interpersonal relationships and mood. These in turn, they argue, shape musical preference and practice. For example, research on the sociocultural functions of music suggests that it provides "a means of defining ethnic identity" (Hargreaves and North, 1999, p. 79). The evidence they gather shows the perceived aesthetic value of music not to be determined purely by exposure to a corpus or inspiring set, but also by a set of existing social relationships. More recent research in experimental psychology reveals an increasingly complex story behind how we give value to creative artefacts. Salganik, Dodds, and Watts (2006), for example, show that music ratings are directly influenced by one's perception of how others rated the music, not just in the long term but at the moment of making the evaluation. Newman and Bloom (2012) examine the underlying causes of the attachment of value to originals rather than copies, finding, amongst other things, that the value given to an original is associated with its physical contact with the artist. Both studies suggest a form of winner-takes-all process whereby success begets further success. Such phenomena place limits on the importance of the creative content in evaluation. Admittedly artistic success is not the same as artistic creativity, but the overlap is great enough, in any practical sense of evaluating creativity, to carry the argument from one domain to the other. Csikszentmihalyi's (1999) domain-individual-field theory has long held that individuals influence domains and alter fields, but such observations have on the whole only been acknowledged, not actually applied, in computational creativity. Coming close, Charnley, Pease, and Colton (2012) present framing as a way to deal with the process of adding additional information that may influence the value of a creative output. According to the idea of framing, I might provide information alongside an artwork, such as an exhibition catalogue entry, that influences its perception. In its simple form framing would embellish an artwork, perhaps explaining some hidden symbolism behind the materials used.
But in this sense it is simply a part of the system output along with the artwork. By comparison, verbal statements and other social actions can have effects with respect to value that are categorically different from this, for example by provoking people to alter their perception of value in general. Framing takes steps towards the idea that value can be manipulated, even created, but continues to assume a fixed frame of reference. Taking these additional processes into account, when an individual produces an artwork, some amount of the value of that artwork may have already been determined by factors that are not controlled by the individual, or be later determined by factors that are unrelated to the content of the work. The creativity invested in the creation is not entirely the product of the individual, whose artistic behaviour may be more associated with habit and enculturation than discovery, but is imposed upon the individual through their context and life history. The anthropological notion of the dividual, or porous subject (Smith, 2012), has been used to capture this idea of a person as being composed of cultural influences, indicating their ongoing permeability to influence. According to this view, the flux of influence between individuals may have an equivalence to the interaction between submodules within a single brain, meaning that isolating individuals as units of study is no better a division than focusing on couples, tuples, larger groups or cognitive submodules. Given this understanding of individual human behaviour in relation to culture in general, and the arts in particular, computational creativity can be seen to place too much emphasis on the idea of individuals being independent creators. From this alternative point of view it is argued that artistic behaviour has a significant generative creativity element by which new forms spring up, not because individuals think of them, but through a jumble of social interaction. Such emergent forms may have structural properties related to the process that produced them, but they were not made with purpose. By analogy, consider a classic debate about adaptationism and form in evolutionary theory: the shape of a snail shell, as described in Thompson's On Growth and Form (Thompson, 1992), comes about through the process of evolutionary adaptation. But this is not purely a product of the selective pressures acting on the species. It results from an interaction between selective pressures and naturally-occurring structure. Likewise, human acts of creation are constrained by structural factors that guide the creator, augmenting agency. The notion that a system possesses a level of creativity is riddled with complexity, owing to the fact that creativity is as much something that is enacted upon individual systems as enacted by them. In computational creativity, this means that the goal of evaluating virtual autonomous artists is not empirically well-grounded when performed in isolation. Empirical grounding requires a strong coherence between our theories and practices, and the things we can observe. In the following section, I will argue that an interaction design approach delivers this coherence, bringing together system development with a thorough understanding of the culturally-situated human. I will suggest that interaction design shouldn't be viewed merely as an add-on or a form of research used only at the application stage, but that it has a central role to play in improving methodology in computational creativity.
Towards Empirical Grounding

To reiterate the argument so far, empirical grounding is defined as the process of anchoring theoretical terms to scientifically measurable events. Computational creativity characteristically employs a maker's approach to innovating new ideas and building better systems, but the idea of asking how creative these systems are is not empirically well-grounded. Then what can we ask? I have examined the need simply to elaborate on terms and concepts during the process of evaluation, adopting appropriate soft science ways of thinking alongside the existing engineering mindset, but although a well-grounded approach needs to take this into account, it does not provide a grounding itself. Two research methodologies already well integrated into computational creativity offer a basis for empirically well-grounded research. These are interaction design and multi-agent systems modelling. In both cases the imbalance between generative creativity and adaptive creativity is addressed. In the interaction design approach, creative systems are treated as objects that are inevitably situated in interaction with humans. The nature of that interaction, including its efficacy, is treated as the primary concern. Here the empirical grounding comes from the fact that properties of interaction and experience related to the analysis of usability and user experience can be observed and measured, whilst existing notions of creativity evaluation can easily be incorporated into theories of interaction design. This need not be limited to a creative professional working with a piece of creative software, but could apply to any form of interaction between person and creative system. In the modelling approach, artificial creative systems are treated as models of human creative systems. For the reasons discussed above, it does not suffice to test the success of model systems by attempting to evaluate their output, but many other observable and measurable aspects of human creativity can be studied. Multi-agent models of social networks are particularly appealing in this regard because generatively creative processes fall inside the scope of the system being studied, alleviating the tension between adaptive and generative creativity. In this paper I only elaborate on the interaction design approach, firstly because it is more immediately applicable to computational creativity practice, and secondly because much of what can be said about empirically grounded modelling is well known to researchers.

Interaction Design

Discussions of humans evaluating machines are commonplace in the computational creativity literature. But a lot less attention is paid to the wider range of ways in which humans can interact with creative systems. The word "interaction", applied in the context of humans interacting with creative systems, was only used in three out of 41 papers in the 2013 ICCC proceedings (and six papers out of 46 in 2012). Interaction design is a large field of research and is not presented in any depth here (a good introduction is the textbook by Rogers, Preece, and Sharp (2007)). The following discussion considers computational creativity in light of some core topics from the field, and looks beyond to how a study of interaction in its widest sense could be usefully applied to computational creativity. A number of computational creativity studies are already explicitly user-focused owing to their specific research goals. For example, DiPaola et al.
(2013) examined the use of evolutionary design software in the hands of professional designers, looking at usability through the integration with the creative process, and ultimate creative productivity. A human-centred approach to the evaluation of creative systems shifts the nature of the enquiry very slightly, by asking not how creative a system is, or whether it is creative by some measure, but how its creative potential is practically manifest in interactions with people. However, this does not require researchers to repurpose their systems as tools for artists, designers or end users, or abandon the goal of automating creativity, but to take a pluralistic approach to the application of creativity as something that is realised through interaction. As addressed in the work of DiPaola et al. (2013), described above, an obvious instance is to look at usability in the case of creativity support tools. This is the classical locus of interaction between interaction design and computational creativity. But even researchers working towards fully autonomous artificial artists are building systems that will ultimately interact with people, albeit in non-standard ways. Examples include artists such as Paul Brown (Brown, 2009), who has wrestled with the notion of maximising the agency of a system to the exclusion of the human artist's signature. As the discussions surrounding such system design show, there is no shortage of interaction between systems and the social worlds they inhabit, any of which can be considered a source of rich data. Beyond usability, a key concept in interaction design is user experience (Hassenzahl and Tractinsky, 2006). User experience looks beyond efficacy with respect to function to consider a host of subjective qualities to do with interaction more generally, such as desirability, credibility, satisfaction, accessibility, boredom and so on (Rogers, Preece, and Sharp, 2007). Analysis of user experience includes understanding users' desires, expectations and assumptions, and their overall conceptual model of the system. These diverse and quite vague concepts in user experience are arguably of greater importance than usability in a wide number of circumstances, and can also be at odds with it. For example, in game development pleasure can be seen to be contrary to usability (Rogers, Preece, and Sharp, 2007): dysfunctional ways of doing things, as embodied in interface design choices, may be more fun than more functional choices. By comparison, computational creativity need not be reduced to issues of function. Such analytical concepts present a striking match with the most ambitious goals of computational creativity. Returning to Wiggins' definition, it would not be surprising to find that a human's appraisal of machine creativity is subject to a complex of user-experience design factors. Concepts such as surprise are already established in computational creativity theory, whereas other notions, such as the role of music and art in the development of social identification, are not, but may form part of the design of a successful computational creativity experience. To acknowledge and make explicit the design component in creating autonomous systems may help remove the perceived paradox that the system is an autonomous agent supposedly independent of its creators, by examining what designed autonomy would actually mean.
Often successful computationally creative systems involve some kind of puppetry, such as the subtleties of fine tuning described by Colton, Pease, and Ritchie (2001). Many working in this area have embraced the idea of creative software either as a tool, as a collaborator that is not capable of full autonomy, or as a creative domain in its own right. In these cases the interaction between human artist and software agent is treated as a persevering and explicitly acknowledged state of affairs, rather than as a temporary stop on the way to fully autonomous creative systems. In such cases it is again fruitful to think of the relationship between the developer/artist and the system in terms of usability, even if the working interface is simply a programming environment. Such a view may lead to better knowledge about effective development practices that in turn speed up the creation of more impressive creative systems. Accepting the role of developers and artists also enables a better grasp of the attribution of authorship and agency, asking instead a question of degree (how much, and in what way, did the system contribute to the creative outputs?), rendering unimportant the ideal of full autonomy.

From Evaluating System Creativity to Analysing Situated Creativity

Taking an interaction design approach reveals a wealth of empirically grounded questions that can be asked about creative systems without changing the basic designs and objectives of practitioners, and without an overly narrow focus on the question of how creative the system is. But in order not to throw out the baby with the bathwater, since our interest is in systems that act creatively, the creativity of systems must remain the focus of an interaction design approach. We require enriched ways to question the nature of creative efficacy and creative agency in systems. For example, an interaction design approach can better frame our evaluation of the issue of the software's autonomy, which might otherwise be occluded. A number of existing approaches to evaluation already give ample space for domain-specific and application-dependent variation in their use, but do not go so far as to preference design and interaction studies over direct evaluation in computational creativity. Jordanous' (2011) proposal for creativity evaluation measures that are domain-specific suggests a design approach which is targeted at specific user-groups and specific needs, rather than an objective notion of what creativity is. A number of other researchers have proposed objective or semi-objective (depending on human responses) measures that are associated with creativity (they are not necessarily measures of creativity). Kowaliw, Dorin, and McCormack (2012), for example, compare formal definitions of creativity, written into a system, with human evaluations, so as to examine the accuracy of these definitions. One of the most widely applied and discussed examples is Ritchie's (2001; 2007) set of criteria. Ritchie proposes 18 criteria for attributing creativity to a computer program. The criteria derive from two core pieces of information that apply wherever a machine produces creative outputs: the inspiring set I (the input to the system) and the system's output R.
An evaluation scheme (often multi-person surveys in the implementations examined by Ritchie) is then used to form two key measures for each output in R: typicality is a measure of how typical the output is of the kind of artefact being produced; quality is a measure of the perceived or otherwise computed quality of that artefact. From these scores, Ritchie organises the outputs into sets according to whether they fall into given ranges of typicality and quality. These sets are then applied in various ways in the calculation of the resulting Boolean criteria. For example, criterion 5 states that the number of outputs that are both high-quality and typical, divided by the number of outputs that are just typical, is greater than some given threshold (this threshold, plus the thresholds required to determine the high-quality and typical sets, are left to the implementer to specify). As with all of Ritchie's criteria, criterion 5 corresponds to a natural usage of the term "creativity", in this case that a system whose set of typical outputs rarely includes valuable outputs is in some sense creatively lacking. One practical problem with Ritchie's criteria, as illustrated by the examples of their application to creative systems reported in Ritchie (2007), is the difficulty with which implementers establish their evaluation scheme. For example, Pereira et al. (2005) measure typicality based on closeness to I, calculated using edit distance. The appropriateness of this choice is hard to determine. Others use human responses to surveys, providing a form of empirical grounding. But such surveys may have wide variance, and the formulations of the criteria have no way of incorporating variance, which would represent a more complex model of the social system in which the creative agent operates. This belies the fact that typicality is a slippery, soft science concept in reality, and its relationship to a measure of quality more so, despite the clarity of Ritchie's mathematics. Thus, as with Colton's tripod, Ventura (2008) points to shortcomings in the criteria by showing that trivial programs can reveal instances of inherent insufficiency in their outcomes when compared with intuitive analysis of the same systems. The underlying problem is that of how to empirically ground the choice of evaluation scheme itself, such that it might provide an empirical grounding for the criteria, suggesting that the mathematics has simply shifted the hard problem from one place to another. The best we can do is to see how the various evaluation schemes and criteria relate in practice to other observables, thus the critical point: using human responses about creativity or related features of a system, alone, does not itself provide an empirical grounding for understanding the system, but rather a data point about the wider interaction. Further studies of behaviour are required to empirically ground our understanding of what these human responses mean. A related issue in the discussion surrounding Ritchie's criteria is what to do with the results obtained. The criteria have, in Ritchie's view, often been misunderstood as some sort of multivariate test for creativity. Confusingly, Ritchie unintentionally encourages this misunderstanding in his description of them as criteria "for attributing creativity to a computer program" (Ritchie, 2007). In fact he cautions against their direct use in this way. Thus the criteria offer different analytical windows onto the creative nature of systems.
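To make the shape of such a criterion concrete, here is a minimal sketch of criterion 5 as paraphrased above. The function name, the assumption of scores in [0, 1], and the default thresholds are all illustrative choices, not Ritchie's formulation; as the text notes, the thresholds are left to the implementer.

```python
from typing import List

def criterion_5(typicality: List[float], quality: List[float],
                theta_typ: float = 0.5, theta_qual: float = 0.5,
                threshold: float = 0.5) -> bool:
    """Sketch: the ratio of outputs that are both high-quality and typical
    to outputs that are typical must exceed a threshold. All three
    thresholds are implementer-chosen, since Ritchie leaves them open."""
    typical = [i for i, t in enumerate(typicality) if t > theta_typ]
    good_and_typical = [i for i in typical if quality[i] > theta_qual]
    if not typical:
        return False  # no typical outputs: the ratio is undefined
    return len(good_and_typical) / len(typical) > threshold
```

Even in this toy form, the sensitivity of the Boolean outcome to three arbitrary thresholds illustrates how much of the result is determined by the evaluation scheme rather than by the system under study.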
We are invited to preference some criteria over others, but given no advice on how to do so. However, from the point of view of interaction design, such ambiguity is expected and desirable. In application, we may value systems that are good at producing a high ratio of quality to overall output, or typicality to overall output, or quality within the typical set. Alternatively, other approaches to creativity may suggest counter-intuitive additions or alterations to Ritchie's criteria, such as novelty search (e.g., Lehman and Stanley, 2011), which attempts to chart an output space by relentlessly searching for atypicality. The result of this is a broad representative spread of prototypes, not a concentration of high-value or typical outputs, so it would score low on many of Ritchie's criteria but may prove to be the basis for powerful automated creativity. An interaction design approach is implicit in Ritchie's treatment of systems as tools. For example, in defining typicality, he refers to the system as having "a job to do", producing artefacts of the required sort (Ritchie, 2007, p. 73). This is not, on reflection, a requirement associated with being creative, but with performing some function required by the user or designer. With this in mind, is it possible that the final step of attributing creativity to a computer program has caused more confusion than clarity, and should be quietly dropped? I suggest that it should and, echoing Jordanous (2011), that the criteria are better suited to specific creative scenarios. A jingle composer may preference typicality and require only an average degree of value, whereas an experimental artist may have little or no interest in typicality but is willing to hold out for rare instances of exceptional value. Both have different time-demands, resources, goals, aesthetic preferences and notions of the role of creativity in their work. It would not be unusual to view the experimental artist as the more creative of the two, but this is clearly only an assumption given our present theoretical understanding of creativity. The same applies to end users. Even a consumer may want typicality sometimes, and extraordinary experiences at other times. Thus the problems raised concerning Ritchie's criteria and their application are very easily addressed by taking a human-centred view of creative systems. Applications in the domains of both generative and adaptive creativity can be devised, and examination of the creative behaviour of systems can then be empirically well-grounded in the methods of interaction design.

Conclusion

The main argument of this paper is that the evaluation of systems as it is currently typically conceived in the computational creativity literature is not in itself empirically well-grounded. The data provided by performing human evaluations should instead be understood as one potential source of information that can feed into studies of the interaction between creative systems and people in order to be well-grounded. Systems may only be understood as creative by looking at their interaction with humans using appropriate methodological tools. A suitable methodology would include: (i) the recognition and rigorous application of soft science methods wherever vague unoperationalised terms and interpretative language are used; and (ii) an appropriate model of creativity in culture and art that includes the recognition of humans as porous subjects, and the significant role played by generative creativity in the dynamics of artistic behaviour.
For the time being at least, terms such as creativity and imagination do not describe things that we can readily measure or objectively identify; they are concepts that frame other kinds of measurable and objectively identifiable things, as part of a loose theoretical framework.

2014_16 !2014

What to expect when you're expecting: The role of unexpectedness in computationally evaluating creativity

Kazjon Grace and Mary Lou Maher {k.grace,m.maher}@uncc.edu The University of North Carolina at Charlotte

Abstract

Novelty, surprise and transformation of the domain have each been raised, alone or in combination, as accompaniments to value in the determination of creativity. Spirited debate has surrounded the role of each factor and their relationships to each other. This paper suggests a way by which these three notions can be compared and contrasted within a single conceptual framework, by describing each as a kind of unexpectedness. Using this framing we argue that current computational models of novelty, concerned primarily with the originality of an artefact, are insufficiently broad to capture creativity, and that other kinds of expectation, whatever the terminology used to refer to them, should also be considered. We develop a typology of expectations relevant to computational creativity evaluation and, through it, describe a series of situations where expectations would be essential to the characterisation of creativity.

Introduction

The field of computational creativity, perhaps like all emergent disciplines, has been characterised throughout its existence by divergent, competing theoretical frameworks. The core contention unsurprisingly surrounds the nature of creativity itself. A spirited debate has coloured the last several years' conferences concerning the role of surprise in computational models of creativity evaluation. Feyerabend (1963) argued that scientific disciplines will by their nature develop incompatible theories, and that this theoretical pluralism beneficially encourages introspection, competition and defensibility. We do not go so far as to suggest epistemological anarchy as the answer, but in that pluralistic mindset this paper seeks to reframe the debate, not quell it. We present a way by which three divergent perspectives on the creativity of artefacts can be placed into a unifying context (creative processes are another matter entirely, one beyond the scope of this paper). The three perspectives on evaluating creativity are that, in addition to being valuable, 1) creative artefacts are novel, 2) creative artefacts are surprising, or 3) creative artefacts transform the domain in which they reside. We propose that these approaches can be reconceptualised to all derive from the notion of expectation, and thus be situated within a framework illustrating their commonalities and differences. Creativity has often been referred to as the union of novelty and value, an operationalisation first articulated (at least to the authors' knowledge) in Newell, Shaw, and Simon (1959). Computational models of novelty (e.g., Berlyne, 1966, 1970; Bishop, 1994; Saunders and Gero, 2001b) have been developed to measure the originality of an artefact relative to what has come before. Newell and others (e.g., Abra, 1988) describe novelty as necessary but insufficient for creativity, forming one half of the novelty/value dyad. Two additional criteria have been offered as an extension of that dyad: surprisingness and transformational creativity.
Surprise has been suggested as a critical part of computational creativity evaluation because computational models of novelty do not capture the interdependency and temporality of experiencing creativity (Macedo and Cardoso, 2001; Maher, 2010; Maher and Fisher, 2012), but has also been considered unnecessary in creativity evaluation because it is merely an observer's response to experiencing novelty (Wiggins, 2006b). Boden's transformational creativity (Boden, 2003), operationalised in Wiggins (2006a), has been offered as an alternative by which creativity may be recognised. In both cases the addition is motivated by the insufficiency of originality (the comparison of an artefact to other artefacts within the same domain) as the sole accompaniment to value in the judgement of creativity. Thus far these three notions, novelty, surprise and transformativity, have been considered largely incomparable, describing different parts of what makes up creativity. There has been some abstract exploration of connections between them, such as Boden's (2003) connection of fundamental novelty to transformative creativity, but no concrete unifying framework. This paper seeks to establish that there is a common thread amongst these opposing camps: expectations play a role in not just surprise but novelty and transformativity as well. The foundation of our conceptual reframing is that the notions can be reframed thusly: Novelty can be reconceptualised as occurring when an observer's expectations about the continuity of a domain are violated. Surprise occurs in response to the violation of a confident expectation. Transformational creativity occurs as a collective reaction to an observation that was unexpected to participants in a domain. We will expand on these definitions through this paper. Through this reframing we argue that unexpectedness is involved in novelty, surprise and domain transformation, and is thus a vital component of computational creativity evaluation. The matter of where in our field's pluralistic and still-emerging theoretical underpinnings the notion of unexpectedness should reside is for now one of terminology alone. This paper sidesteps the issue of whether expectation should primarily be considered the stimulus for surprise, a component of novelty, or a catalyst for transformative creativity. We discuss the connections between the three notions, describe the role of expectation in each, and present an exploratory typology of the ways unexpectedness can be involved in creativity evaluation. We do not seek to state that novelty and transformativity should be subsumed within the notion of surprise due to their nature as expectation-based processes. Instead we argue that the notions of novelty, surprise and transformativity are all related by another process, expectation, about whose role we yet know little. We as a field have been grasping at the trunk and tail of the proverbial poorly-lit pachyderm, and we suggest that expectation might let us better face the beast.

The eye of the beholder

Placing expectation at the centre of computational creativity evaluation involves a fundamental shift away from comparing artefacts to artefacts. Modelling unexpectedness involves comparing the reactions of observers of those artefacts to the reactions of other observers. This reimagines what makes a creative artefact different, focussing not on objective comparisons but on subjective perceptions.
This "eye of the beholder" framing is compatible with formulations of creativity that focus not on artefacts but on their artificers and the society and cultures they inhabit (Csikszentmihalyi, 1988). It should be noted that no assumptions are made about the nature of the observing agent: it may be the artefact's creator or not, it may be a participant in the domain or not, and it may be human or artificial. The observer-centric view of creativity permits a much richer notion of what makes an artefact different: it might relate to the subversion of established power structures (Florida, 2012), the destruction of established processes (Schumpeter, 1942), or the transgression of established rules (Dudek, 1993; Strzalecki, 2000). These kinds of cultural impacts are as much part of an artefact's creativity as its literal originality, and we focus on expectation as an early step towards their computational realisation. The notion of transformational creativity (Boden, 2003) partially addresses this need by the assumption that cultural knowledge is embedded in the definition of the conceptual space, but to begin computationally capturing these notions in our models of evaluation we must be aware of how narrowly we define our conceptual spaces. The notion common to each of subversion, destruction and transgression is that expectations about the artefact are socio-culturally grounded. In other words, we must consider not just how an artefact is described, but its place in the complex network of past experiences that have shaped the observing agent's perception of the creative domain. A creative artefact is unexpected relative to the rules of the creative domain in which it resides. To unravel these notions and permit their operationalisation in computational creativity evaluation we focus not on novelty, surprise or transformativity alone but on the element common to them all: the violation of an observer's expectations.

Novelty as expectation

Runco (2010) documents multiple definitions of creativity that give novelty a central focus, and notes that it is one of the only aspects used to define creativity that has been widely adopted. Models of novelty, unlike models of surprise, are not typically conceived of as requiring expectation. We argue that novelty can be described using the mechanism of expectation, and that doing so is illuminative when comparing novelty to other proposed factors. Novelty can be considered to be expectation-based if the knowledge structures acquired to evaluate novelty are thought of as a model with which the system attempts to predict the world. While these structures (typically acquired via some kind of online unsupervised learning system) are not being built for the purpose of prediction, they represent assumptions about how the underlying domain can be organised. Applying those models to future observations within the domain is akin to expecting that those assumptions about domain organisation will continue to hold, and that observations in the future can be described using knowledge gained from observations in the past. The expectation of continuity is the theoretical underpinning of computational novelty evaluation, and can be considered the simplest possible creativity-relevant expectation. Within the literature the lines between novelty and surprise are not always clear-cut, a conflation we see as evidence of the underlying role of expectation in both.
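As a minimal sketch of this reading, assume (purely for illustration; this is not a model from the literature) that artefacts are represented as feature vectors and that the system's "knowledge structure" is simply its memory of past artefacts. The nearest-neighbour distance can then be read as the degree to which an implicit expectation of continuity is violated:

```python
import numpy as np

def novelty(artefact: np.ndarray, memory: np.ndarray) -> float:
    """Novelty as a violated expectation of continuity: the stored
    artefacts (memory, shape (n, d)) act as an implicit predictive model
    ("new artefacts will resemble past ones"), and the nearest-neighbour
    distance measures how badly that prediction fails."""
    if len(memory) == 0:
        return float("inf")  # no experience yet: maximally unexpected
    return float(np.linalg.norm(memory - artefact, axis=1).min())
```

The point of the sketch is not the distance measure itself but the reframing: the memory plays the role of a prediction, and the score is a congruence measure between that prediction and the new observation.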
Novelty in the Creative Product Semantic Scale (O'Quin and Besemer, 1989), a creativity measurement index developed in cognitive psychology, is defined as the union of originality and unexpectedness. The model of interestingness in Silberschatz and Tuzhilin (1995) is based on improbability with respect to confidently held beliefs. The model of novelty in Schmidhuber (2010) is based on the impact of observations on a predictive model, which some computational creativity researchers would label a model of transformativity, while others would label a model of surprise. Each of these definitions suggests a complex relationship that goes beyond the notion of originality as captured by simple artefact-to-artefact comparisons.

Surprise as expectation

Many models of surprise involve the observation of unexpected events (Ortony and Partridge, 1987). In our previous work we give a definition of surprise as the violation of a confidently-held expectation (Maher and Fisher, 2012; Grace et al., 2014a), a definition derived from earlier computational models both within the domain of creativity (Macedo and Cardoso, 2001) and elsewhere (Ortony and Partridge, 1987; Peters, 1998; Horvitz et al., 2012; Itti and Baldi, 2005). Models of surprise have previously looked at a variety of different kinds of expectation: predicting trends within a domain (Maher and Fisher, 2012), predicting the class of an artefact from its features (Macedo and Cardoso, 2001), or the effect on the data structures of a system when exposed to a new piece of information (Baldi and Itti, 2010). The first case concerns predicting attributes over time, and involves an expectation of continuity of trends within data. The second case concerns predicting attributes relative to a classification, and is an expectation of continuity of the relationships within data. The third case concerns the size of the change in a predictive mechanism, and is based on an expectation of continuity, but measured by the post-observation change rather than the prediction error. In each of these cases it is clear that a related but distinct expectation is central to the judgement of surprisingness, but as of yet no comprehensive typology of the kinds of expectation relevant to creativity evaluation exists. The expectations of continuity that typically make up novelty evaluation can be extended to cover the above cases. This paper investigates the kinds of expectation that are relevant to creativity evaluation independent of whether they are an operationalisation of surprise or some other notion.

Transformativity as expectation

Boden's transformational creativity can be reconceptualised as unexpectedness. We develop a notion of transformativity grounded in an observer's expectations that their predictive model of a creative domain is accurate. This requires a reformulation of transformation to be subjective to an observer: Boden wrote of the transformation of a domain, but we are concerned with the transformation of an observer's knowledge about a domain. To demonstrate the role of expectation in this subjective transformativity, we consider the operationalisation of Boden's transformative creativity proposed by Wiggins (2006b,a), and extend it to the context of two creative systems rather than one. One system, the creator, produces an artefact and chooses to share it with the second creative system, the critic. For the purposes of this discussion we investigate how the critic evaluates the object and judges it transformative.
In Wiggins' formalisation the conceptual space is defined by two sets of rules: R, the set of rules that define the boundaries of the conceptual space, and T, the set of rules that define the traversal strategy for that space. Wiggins uses this distinction to separate Boden's notion of transformational creativity into R-transformational, occurring when a creative system's rules for bounding a creative domain's conceptual space are changed, and T-transformational, occurring when a creative system's rules for searching a creative domain's conceptual space are changed. In the case of our critic it is the set R that we are concerned with: the critic does not traverse the conceptual space to generate new designs, it evaluates the designs of the creator. Once we assume the presence of more than one creative agent then R, the set of rules bounding the conceptual space, cannot be ontological in nature: it cannot be immediately and psychically shared between all creative systems present whenever changes occur. R must be mutable to permit transformation, and individual to permit situations where critic and creator have divergent notions of the domain. Divergence is not an unusual case: if a transformational artefact is produced by creator and judged R-transformational by it, and then shared with critic, there must by necessity be a period between the two evaluations where the two systems have divergent R, even with only two systems that share all designs. With more systems present, or when creative systems only share selectively, divergence will be greater. To whom, then, is such creativity transformational? To reflect the differing sets belonging to the two agents we refer to R as it applies to the two agents as criticR and creatorR. If a new artefact causes a change in criticR, then we refer to it as criticR-transformational. This extends Boden's distinction between P- and H-creativity: a creative system observing a new artefact (whether or not it was that artefact's creator) can change only its own R, and thus can exhibit only P-transformativity. We distinguish P-transformativity from P-creativity to permit the inclusion of other necessary qualities in the judgement of the latter: novelty, value, etc. We can now examine the events that lead critic to judge a new artefact to be criticR-transformational. The rules that make up criticR cannot have been prescribed; they must have developed over time, changing in response to the perception of P-transformational objects. The rules that make up Wiggins' set R must be inferred from the creative system's past experiences. The rules in criticR cannot be descriptions of the domain as it exists independently of the critic system; they are merely critic's current best guess at the state of the domain. The rules in R are learned estimates that make up a predictive model of the domain: they can only be what the creative system critic expects the domain to be. A kind of expectation, therefore, lies at the heart of both the transformational and the surprise criteria for creativity. The two approaches both concern the unexpectedness of an artefact. They differ, however, in how creativity is measured with respect to that unexpectedness. Transformational creativity occurs when a creative system's expectations about the boundaries of the domain's conceptual space (Wiggins' R) are updated in response to observing an artefact that broke those boundaries. Surprisingness occurs when a creative system's expectations are violated in response to observing an artefact.
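A toy sketch may make the distinction concrete. Assume, purely for illustration (Wiggins' R is a rule set, not an interval), that the critic's R is a learned interval in a one-dimensional conceptual space: an observation outside the interval violates the critic's expectation, and absorbing it into the interval is a criticR-transformation.

```python
class Critic:
    """Toy model: criticR as a learned interval [low, high] in a 1-D
    conceptual space. Everything here is an illustrative assumption."""

    def __init__(self, low: float, high: float):
        self.low, self.high = low, high  # critic's current best guess at the domain

    def violates_expectation(self, artefact: float) -> bool:
        """Surprise: the artefact falls outside the expected boundaries."""
        return not (self.low <= artefact <= self.high)

    def observe(self, artefact: float) -> bool:
        """Returns True when the observation is criticR-transformational,
        i.e. R had to change to accommodate it."""
        if self.violates_expectation(artefact):
            self.low = min(self.low, artefact)
            self.high = max(self.high, artefact)
            return True
        return False
```

Because only the critic's own interval ever changes, the sketch exhibits only P-transformativity, mirroring the argument above that an observer can transform only its own R.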
Transformation, then, occurs in response to surprisingness, but both can occur in the same situations. This is not to say that all expectations are alike: surprise as construed by various authors as a creativity measure has involved a variety of kinds of expectation. The purpose of this comparison is to demonstrate that there is a common process between the two approaches, and we suggest that this commonality offers a pathway for future research.

From individual to societal transformativity

A remaining question concerns the nature of H-transformativity in a framework that considers all conceptual spaces to be personal predictive models. This must be addressed for an expectation-based approach to model transformation at the domain level, that which Boden originally proposed. If all R and transformations thereof occur within a single creative system, then where does the domain as a shared entity reside? Modelling creativity as a social system (Csikszentmihalyi, 1988) is one way to answer that question, with the notion that creativity resides in the interactions of a society, between the creators, their creations and the culture of that society. This approach argues that the shared domain arises emergently out of the interactions of the society (Saunders and Gero, 2001b; Sosa and Gero, 2005; Saunders, 2012), and that it is communicated through the language and culture of that society. The effect of this is that overall historical creativity can be computationally measured, but only if some bounds are placed on history. Specifically, the transformativity of an artefact can be investigated with respect to the history of a defined society, not all of humanity. One approach to operationalising this socially-derived H-creativity would be through a multi-agent systems metaphor: for an artefact to be judged H-creative it would need to receive a P-creative judgement from a majority of the pool of influence within the society, assuming that each agent possesses personal processes for judging the creativity of artefacts and the influentialness of other creative agents. This very simple formalisation does not model any of the influences discussed in Jennings (2010), but is intended to demonstrate how it would be possible to arrive at H-transformativity within a society given only P-transformativity within individual agents.

A framework for creative unexpectedness

The notion of expectation needs to be made more concrete if it is to be the basis of models of creativity evaluation. We develop a framework for the kinds of expectation that are relevant to creativity evaluation, and situate some prior creativity evaluation models within that framework. The framework is designed to describe what to expect when modelling expectation for creativity. The framework is based on six dichotomies, an answer to each of which categorises the subject of an expectation relevant to the creativity of an artefact. These six questions are not intended to be exhaustive, but they serve as a starting point for exploration of the issue. First we standardise a terminology for describing expectations. The predicted property is what is being expected, the dependent variable(s) of the artefact's description. For example, in the expectation "it will fit in the palm of your hand" the size of the artefact is the predicted property. The prediction property is the information about the predicted, such as a range of values or distribution over values that is expected to be taken by artefacts.
For example, in the expectation "the height will be between two and five metres" the prediction is the range of expected height values. The scope property defines the set of possible artefacts to which the expectations apply. This may be the whole domain or some subset, for example "luxury cars will be comfortable". The condition property is used to construct expectations that predict a relationship between attributes, rather than predict an attribute directly. These expectations are contingent on a relationship between the predicted property and some other property of the object: the condition. For example, the expectation "width will be approximately twice length" predicts a relationship between those two attributes in which the independent variable length affects the dependent variable width. In other expectations the prediction is unconditional and applies to artefacts regardless of their other properties. The congruence property is the measure of fit between an expectation and an observation about which it makes a prediction: a low congruence with the expectation creates a high unexpectedness and indicates a potentially creative artefact. Examples of congruence measures include proximity (in attribute space) and likelihood. Using this terminology, an expectation makes a prediction about the predicted given a condition that applies within a scope. An observation that falls within that scope is then measured for congruence with respect to that expectation. The six dichotomies of the framework categorise creativity-relevant expectations based on these five properties.
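As a minimal sketch, the five properties just described could be encoded as a record. The field types and the example expectation are assumptions for illustration only, not a data structure proposed by the paper:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Expectation:
    predicted: str                              # dependent attribute(s) being expected
    prediction: Any                             # e.g. a range or distribution of values
    scope: Callable[[dict], bool]               # which artefacts the expectation covers
    condition: Optional[Callable[[dict], Any]]  # independent attribute(s); None if unconditional
    congruence: Callable[[Any, Any], float]     # fit between prediction and observation

# A hypothetical scope-restricted, reductionist, unconditional example:
luxury_comfort = Expectation(
    predicted="comfort",
    prediction=(7, 10),  # expected comfort-rating range
    scope=lambda car: car.get("class") == "luxury",
    condition=None,
    congruence=lambda rng, obs: 1.0 if rng[0] <= obs <= rng[1] else 0.0,
)
```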
1. Holistic vs. reductionist

Expectations can be described as either holistic, where what is being predicted is the whole artefact, or reductionist, where the expectation only concerns some subset of features within the artefact. Holistic expectations make predictions in aggregate, while reductionist expectations make predictions about one or more attributes of an artefact, but less than the whole. An example of a holistic expectation is "I expect that new mobile phones will be similar to the ones I've seen before". This kind of expectation makes a prediction about the properties of an artefact belonging to the creative domain in which the creative system applies. The attribute(s) of all artefacts released within that domain will be constrained by that prediction. In this case what is being predicted is the whole artefact and the prediction is that it will occupy a region of conceptual space. The scope is all possible artefacts within the creative domain of the system. The congruence measure calculates distance in the conceptual space. This kind of expectation is typically at the heart of many computational novelty detectors: previously experienced artefacts cause a system to expect future artefacts to be similar within a conceptual space. One example is the Self-Organising Map based novelty detector of Saunders and Gero (2001a), where what is being predicted is the whole artefact, the scope is the complete domain, the prediction is a hyperplane mapped to the space of possible designs, and the congruence is the distance between a newly observed design and that hyperplane. An example of a reductionist expectation is "I expect that new mobile phones will not be thinner than ones I've seen before". This is a prediction about a single attribute of an artefact, but otherwise identical to the holistic originality prediction above: it is an expectation about all members of a creative domain, but about only one of their attributes. What is being predicted is the depth attribute, the form of that prediction is an inequality over that attribute, and the scope is membership in the domain of mobile phones. Macedo and Cardoso (2001) use reductionist expectations in a model of surprise. An agent perceives some attributes of an artefact and uses these in a predictive classification. Specifically, the agent observes the facades of buildings and constructs an expectation about the kind of building it is observing. The agent then approaches the building and discovers its true function, generating surprise if the expectation is violated. In this case the predicted property is the category to which the building belongs and the prediction is the value that property is expected to take.

2. Scope-complete vs. scope-restricted

Expectations can also be categorised according to whether they are scope-complete, in which case the scope of the expectation is the entire creative domain (the universe of possibilities within which the creative system is working), or scope-restricted, where the expectation applies only to a subset of possible artefacts. The subset may be defined by a categorisation that is exclusive or non-exclusive, hierarchical or flat, deterministic or stochastic, or any other way of specifying which designs are to be excluded. The mobile phone examples in the previous section are scope-complete expectations. An example of a scope-restricted expectation would be "I expect smartphones to be relatively tall, for a phone". In this case the predicted property is device height (making this a reductionist expectation) and the prediction is a region of the height attribute bounded by the average for the domain of phones. The scope of this expectation, however, is artefacts in the category "smartphones", a strict subset of the domain of mobile phones in which this creative system operates. This kind of expectation could be used to construct hierarchical models of novelty. Peters (1998) uses this kind of hierarchy of expectations in a model of surprise: each level of their neural network architecture predicts temporal patterns of movement among the features identified by the layers below it, and surprise is measured as the predictive error. At the highest level the expectations concern the complete domain, while at lower levels the predictions are spatially localised.

3. Conditional vs. unconditional

Conditional expectations predict something about an artefact contingent on another attribute of that artefact. Unconditional expectations require no such contingency, and predict something about the artefacts directly. This is expressed in our framework via the condition property, which contains an expectation's independent variables, while the predicted property contains an expectation's dependent variable(s). A conditional expectation predicts some attribute(s) of an artefact conditionally upon some other attribute(s) of an artefact, while an unconditional expectation predicts attribute(s) directly. In a conditional expectation the prediction is that there will be a relationship between the independent attributes (the condition) and the dependent attributes (the predicted). When an artefact is observed this can then be evaluated for accuracy. Grace et al. (2014a) details a system which constructs conditional expectations of the form "I expect smartphones with faster processors to be thinner". When a phone is observed with greater than average processing power and greater than average thickness this expectation would be violated. In this case the predicted property is the thickness (making this a reductionist expectation), the prediction is a distribution over device thicknesses, and the scope is all smartphones (making this a scope-restricted expectation given that the domain is all mobile devices). The difference from previous examples is that this prediction is conditional on another attribute of the device, its CPU speed. Without first observing that attribute of the artefact the expectation cannot be evaluated. In Grace et al. (2014a) the congruence measure is the unlikelihood of an observation: the chance, according to the prior probability distribution calculated from the prediction, of observing a device at least as unexpected as the actual observation.
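A sketch of such a conditional expectation, assuming a linear model with Gaussian errors fit to past devices (the model family and parameters are illustrative assumptions; Grace et al. (2014a) use a richer formulation):

```python
import math

def unlikelihood(cpu_speed: float, thickness: float,
                 slope: float, intercept: float, sigma: float) -> float:
    """Congruence as unlikelihood for "smartphones with faster processors
    will be thinner". slope, intercept and sigma are assumed to come from
    a regression over previously observed devices. Returns the chance of
    observing a device at least as far from the prediction as this one;
    small values indicate a surprising observation."""
    predicted = slope * cpu_speed + intercept
    z = abs(thickness - predicted) / sigma
    return math.erfc(z / math.sqrt(2))  # two-tailed tail probability under Gaussian errors
```

Note that the prediction here cannot be evaluated until the conditioning attribute (CPU speed) has been observed, which is exactly what distinguishes conditional from unconditional expectations.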
4. Temporal condition vs. atemporal condition

A special case of conditional expectations occurs when the conditional property concerns time: the age of the device, its release date, or the time it was first observed. While all expectations are influenced by time, in that they are constructed about observations in the present from experiences that occurred in the past, temporally conditional expectations are expectations where time is the contingent factor. Temporal conditions are used to construct expectations about trends within domains, showing how artefacts have changed over time and predicting that those trends will continue. Maher, Brady, and Fisher (2013) detail a system which constructs temporally conditional expectations of the form "I expect the weight of more newly released cars to be lower". Regression models are constructed of how the attributes of personal automobiles have tended to fluctuate over time. In this case the predicted property is the car's weight, the prediction is a weight value (the median expected value), and the scope is all automobiles in the dataset. The conditional is the release year of the new vehicle: a weight prediction can only be made once the release year is known. The congruence measure in this model is the distance of the new observation from the expected median.

5. Within-artefact temporality vs. within-domain temporality

The question of temporally conditional expectations requires further delineation. There are two kinds of temporally contingent expectation: those where the time axis concerns the whole domain, and those where the time axis concerns the experience of an individual artefact. The above example of car weights is the former kind: the temporality exists within the domain, and individual cars are not experienced in a strict temporal sequence. Within-artefact temporality is critically important to the creativity of artefacts that are perceived sequentially, such as music and narrative. In this case what is being predicted is a component of the artefact yet to be experienced (an upcoming note in a melody, or an upcoming twist in a plot), and that prediction is conditional on components of the artefact that have been experienced (previous notes and phrases, and previous plot events). Pearce et al. (2010) describes a computational model of melodic expectation which probabilistically expects upcoming notes. In this case the predicted property is the pitch of the next note (an attribute of the overall melody), and the prediction is a probability distribution over pitches. While the scope of the predictive model is all melodies within the domain (in that it can be applied to any melody), the conditional is the previous notes in the current melody. Only once some notes early in the sequence have been observed can the pitch of the next notes be estimated.
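A toy within-artefact temporal expectation can be sketched as a first-order Markov model over pitches, a deliberately crude stand-in for the much richer model of Pearce et al. (2010):

```python
from collections import Counter, defaultdict

def train(melodies):
    """First-order Markov model over pitches: each observed pitch
    conditions a probability distribution over the next pitch."""
    counts = defaultdict(Counter)
    for melody in melodies:
        for prev, nxt in zip(melody, melody[1:]):
            counts[prev][nxt] += 1
    return {prev: {p: n / sum(c.values()) for p, n in c.items()}
            for prev, c in counts.items()}

def unexpectedness(model, prev_pitch, next_pitch):
    """Congruence as predicted probability: an unheard continuation
    (probability zero) is maximally unexpected."""
    return 1.0 - model.get(prev_pitch, {}).get(next_pitch, 0.0)
```

Here the conditioning variable is the previously heard pitch, so the expectation only becomes evaluable once the melody is underway, which is the defining feature of within-artefact temporality.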
6. Accuracy-measured vs. impact-measured
The first five categorisations in this framework concern the expectation itself, while the last concerns how unexpectedness is measured when those expectations are violated. Expectations make predictions about artefacts. When a confident expectation proves to be incorrect there are two strategies for measuring unexpectedness: how incorrect was the prediction, and how much did the predictive model have to adjust to account for its failure? The first strategy is accuracy-measured incongruence, and aligns with the probabilistic definition of unexpectedness in Ortony and Partridge (1987). The second strategy is impact-measured incongruence, and aligns with the information-theoretic definition of unexpectedness in Baldi and Itti (2010). In the domain of creativity evaluation the accuracy strategy has most often been invoked in models of surprise, while the impact strategy has been most associated with measures of transformativity. Grace et al. (2014b) proposes a computational model of surprise that incorporates impact-measured expectations. Artefacts are hierarchically categorised as they are observed by the system, with artefacts that fit the hierarchy well being neatly placed and artefacts that fit the hierarchy poorly causing large-scale restructuring at multiple levels. The system maintains a stability measure of its categorisation of the creative domain, and its expectation is that observations will affect the conceptual structure proportionally to the current categorisation stability (which can be considered the system's confidence in its understanding of the domain). Measuring the effect of observing a mobile device on this predictive model of the domain is a measure of impact. These expectations could be converted to a measure of accuracy by instead calculating the classification error for each observation, not the restructuring that results from it. The system would then resemble a computational novelty detector.

Experiments in expectability
To further illustrate our framework for categorising expectation we apply it to several examples from our recent work modelling surprise in the domain of mobile devices (Grace et al., 2014b,a). This system measures surprise by constructing expectations about how the attributes of a creative artefact relate to each other, and the date on which a particular artefact was released is considered as one of those attributes. Surprise is then measured as the unlikelihood of observing a particular device according to the predictions about relationships between its attributes. For example, mobile devices over the course of the two decades between 1985 and 2005 tended, on average, to become smaller. This trend abruptly reversed around 2005-6 as a result of the introduction of touch screens, and phone sizes have been increasing since. The system observes devices in chronological order, updating its expectations about their attributes as it does so. When this trend reversed the system expressed surprise of the form `The height of device A is surprising given expectations based on its release date'. Details of the computational model can be found in earlier publications. Figure 1 shows a plot of the system's predictions about device CPU speed based on year of release. At each date of release the system predicts a distribution over expected CPU clock speeds based on previous experiences.
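As a minimal sketch of this kind of dated prediction and its accuracy-based congruence, the snippet below fits a Gaussian to invented pre-2007 clock speeds; the Gaussian is only a stand-in for the system's actual predictive model:

```python
import math
import statistics

# Invented CPU clock speeds (MHz) of devices released before 2007.
prior_speeds = [104, 120, 150, 168, 201, 220, 240, 260, 300, 330]
mu = statistics.mean(prior_speeds)
sigma = statistics.stdev(prior_speeds)

def unlikelihood(speed):
    # Accuracy-based congruence: the chance, under the predictive
    # distribution, of a device at least as far from expectation as this one.
    z = abs(speed - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided tail

print(f"expected around {mu:.0f}MHz")
print(f"P(device as extreme as 806MHz) = {unlikelihood(806):.6f}")  # ~0: very surprising
```

Figure 1 visualises this kind of dated prediction directly.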
The blue contours represent the expected distribution, with the thickest line indicating the median. The white dots indicate mobile devices. The gradient background indicates hypothetical surprise were a device to be observed at that point, with black being maximally surprising. The vertical bands on the background indicate the effect of the model's confidence measure: when predictions have significant error the overall surprise is reduced, as the model is insufficiently certain in its predictions and may encounter unexpected observations because of inaccurate predictions rather than truly unusual artefacts. An arrow indicates the most surprising device in the image, the LG KC-1, released in 2007 with a CPU speed of 806MHz, considered by the predictive model to be less than 1% likely given the distribution of phone speeds before that observation. Note that soon after 2007 the gradient of the trend increases sharply, as mobile devices started to become general-purpose computing platforms. The KC-1 was clearly ahead of its time, but without the applications and touch interface to leverage its CPU speed it was never commercially successful.

Figure 1: Expectations about the relationship between release year and CPU speed within the domain of mobile devices. The LG KC-1, a particularly unexpected mobile device, is marked.

This is a reductionist, scope-complete, within-domain temporally conditional expectation, with congruence measured by accuracy. It is reductionist as the predicted attribute is only CPU speed. It is scope-complete because CPU speeds are being predicted for all mobile devices, the scope of this creative system. It is conditional because it predicts a relationship between release year and CPU speed, rather than predicting the latter directly, and that condition is temporal as it is based on the date of release. It is within-domain temporal, as the time dimension is defined with respect to the creative domain, rather than within the observation of the artefact (mobile phones are typically not experienced in a strict temporal order, unlike music or narrative). It is accuracy-measured as incongruence is calculated based on the likelihood of the prediction, not the impact of the observation on the predictive model. Figure 2 shows another expectation of the same kind as in Figure 1, this time plotting a relationship between device width and release year. The notation is the same as in Figure 1, although without the background gradient. The contours represent the expected distribution of device widths for any given release date. Here, however, the limits of the scope-complete approach to expectation are visible. Up until 2010 the domain of mobile devices was relatively unimodal with respect to expected width over time. The distribution is approximately Poisson, tightly clustered around the 40-80mm range with a tail of rare wider devices. Around 2010, however, the underlying distribution changes as a much wider range of devices running on mobile operating systems are released. The four distinct clusters of device widths that emerge, namely phones, phablets (phone/tablet hybrids), tablets and large tablets, are not well captured by the scope-complete expectation. If a new device were observed located midway between two clusters it could reasonably be considered unexpected, but under the unimodality assumption of the existing system this would not occur.
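A toy numerical contrast makes this limitation, and the per-cluster remedy discussed next, concrete; the width values below are invented:

```python
import statistics

# Invented post-2010 device widths (mm), grouped by an assumed categorisation.
clusters = {"phone": [62, 65, 68, 71], "phablet": [82, 85, 88], "tablet": [118, 122, 126]}
all_widths = [w for ws in clusters.values() for w in ws]

def z_score(width, sample):
    return abs(width - statistics.mean(sample)) / statistics.stdev(sample)

new_width = 100  # a device midway between the phablet and tablet clusters

# Scope-complete: one model over the whole domain barely registers it...
print("domain-wide z:", round(z_score(new_width, all_widths), 2))  # ~0.46
# ...whereas scope-restricted per-cluster models find it far from every cluster.
for name, widths in clusters.items():
    print(f"{name} z:", round(z_score(new_width, widths), 2))      # all >= 5
```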
A set of scope-restricted temporally conditional expectations could address this by predicting the relationship between width and time for each cluster individually. Additionally, a measure of the impact of the devices released in 2010 on this predictive model could detect the transformational creativity that occurred here. Figure 3 shows a plot of the system's predictions about device mass based on device volume. Note that, unsurprisingly, there is a strong positive correlation between mass and volume, and that the distribution of expected values is broader for higher volumes. Two groups of highly unexpected devices emerge: those around 50-100 cm³ in volume but greater than 250 g in mass, and those in the 250-500 cm³ range of volumes but less than 250 g in mass. Investigations of the former suggest they are mostly ruggedised mobile phones or those with heavy batteries, and investigations of the latter suggest they are mostly dashboard-mounted GPS systems (included in our dataset as they run mobile operating systems). This is a reductionist, scope-complete, atemporally conditional expectation, with congruence measured by accuracy. By our framework, the difference between the expectations modelled in Figure 1 and Figure 3 is that the former's conditional prediction is contingent on time, while the latter's is contingent on an attribute of the artefacts. Figure 4 shows the results of a different model of surprise, contrasted with our earlier work in Grace et al. (2014b). An online hierarchical conceptual clustering algorithm (Fisher, 1987) is used to place each device, again observed chronologically, within a hierarchical classification tree that evolves and restructures itself as new and different devices are observed. The degree to which a particular device affects that tree structure can then be measured, indicating the amount by which it transformed the system's knowledge of the domain. The most unexpected device according to this measure was the Bluebird Pidiom BIP-2010, a ruggedised mobile phone which caused a redrawing of the boundary, based on physical dimensions, between tablet and phone, and caused a large number of devices to be recategorised as one or the other (although it must be noted that such labels are not known to the system). The second most unexpected device was the ZTE U9810, a 2013 high-end smartphone which put the technical specs of a tablet into a much smaller form factor, challenging the system's previous categorisation of large devices as also being powerful. The third most unexpected device was the original Apple iPad, which combined a large length and width with a low thickness, and had more in common internally with previous mobile phones than with previous tablet-like devices.

Figure 2: Expectations about the relationship between the release year and width of mobile devices. Note that the distribution of widths was roughly unimodal until approximately 2010, when four distinct clusters emerged.

Figure 4: Incongruence of mobile devices with respect to their impact on the learnt conceptual hierarchy. Three particularly unexpected devices are labelled.

This is a reductionist, scope-complete, unconditional expectation with congruence measured by impact. It is reductionist: it does not predict all attributes of the device, only that certain categories exist within the domain. It is scope-complete as it applies to all devices within the domain. It is unconditional as the prediction is not contingent on observing some attribute(s) of the device. The primary difference from the previous examples of expectation is the congruence measure, which measures not the accuracy of the prediction (which would be the classification error), but the degree to which the conceptual structure changes to accommodate the new observation.
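The contrast between the two congruence measures can be sketched crudely. The real system restructures a hierarchical concept tree (Fisher 1987); the flat one-dimensional centroids over invented widths below are a deliberately simplified stand-in:

```python
def nearest(x, centroids):
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

# Invented device widths (mm) forming two learnt categories.
categories = {0: [60, 64, 68], 1: [120, 124, 128]}
centroids = [sum(v) / len(v) for v in categories.values()]  # [64.0, 124.0]

new = 95  # a device sitting between both categories

# Accuracy-measured incongruence: how wrong was the best prediction?
accuracy_incongruence = min(abs(new - c) for c in centroids)  # 29.0

# Impact-measured incongruence: how far does the structure move to absorb it?
categories[nearest(new, centroids)].append(new)
moved = [sum(v) / len(v) for v in categories.values()]
impact_incongruence = sum(abs(a - b) for a, b in zip(centroids, moved))  # 7.25

print("accuracy-measured:", accuracy_incongruence)
print("impact-measured:", round(impact_incongruence, 2))
```

On these invented numbers the two measures clearly diverge: the prediction error is large, while the structural shift needed to absorb the observation is modest.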
Novelty, surprise, or transformativity?
Our categorisation framework demonstrates the complexity of the role of expectation in creativity evaluation, motivating the need for a deeper investigation. We argue that expectation underlies novelty, surprise, and transformativity, but further work is needed before there is consensus on what kinds of expectation constitute each notion. Macedo and Cardoso (2001) adopt the definition from Ortony and Partridge (1987) in which surprise is an emotion elicited by the failure of confident expectations, whether those expectations were explicitly computed beforehand or generated in response to an observation. By this construction all forms of expectation can cause surprise, meaning that surprise and novelty have considerable overlap. Wiggins (2006a) goes further, saying that surprise is always a response to novelty, and thus need not be modelled separately to evaluate creativity. Schmidhuber (2010) takes the opposite approach, stating that all novelty is grounded in unexpectedness, and that creativity can be evaluated by the union of usefulness and improvement in predictability (which would, under our framework, be a kind of impact-based congruence). Wiggins (2006b) would consider Schmidhuber's `improvement in predictability' to be a kind of transformation, as it is a measure of the degree of change in the creative system's rules about the domain. Maher and Fisher (2012) state that the dividing line between novelty and surprise is temporality: surprise involves expectations about what will be observed next, while novelty involves expectations about what will be observed at all. Grace et al. (2014a) expand that notion of surprise to include any conditional expectation, regardless of temporality. We do not offer a conclusive definition of what constitutes novelty, what constitutes surprise, and what constitutes transformativity, only that each can be thought of as expectation-based. It may well be that, even should we all come to a consensus set of definitions, the three categories are not at all exclusive. We offer some observations on the properties of each as described by our framework:
- Surprise captures some kinds of creativity-relevant expectation that extant models of novelty do not, namely those concerned with trends in the domain and relationships between attributes of artefacts.
- Models of surprise should be defined more specifically than `violation of expectations' if the intent is to avoid overlap with measures of novelty, as novelty can also be expressed as a violation of expectations.
- The unexpectedness of an observation and the degree of change in the system's knowledge as a response to that observation can be measured for any unexpected event, making (P-)transformativity a continuous measure.
- Models of transformative creativity should specify the kind and degree of change that are necessary to constitute creativity.

Conclusion
We have sought to build theoretical bridges between the notions of novelty, surprise and transformation, reconceptualising all three as forms of expectation. This approach is designed to offer a new perspective on debates about the roles of those disparate notions in evaluating creativity.
We have developed a framework for characterising expectations that apply to the evaluation of creativity, and demonstrated that each of novelty evaluation, surprise evaluation, and transformational creativity can be conceived in terms of this framework. Given the wide variety of kinds of expectation that should be considered creativity-relevant, we argue that originality alone is not a sufficient accompaniment to value to constitute creativity. This insufficiency is a critical consideration for computational models that can recognise creativity. The expectation-centric approach provides a framing device for future investigations of creativity evaluation. Expectation both serves as a common language by which those seeking to computationally model creativity can compare their disparate work, and provides an avenue by which human judgements of creativity might be understood.

2014_17 !2014 Stepping Back to Progress Forwards: Setting Standards for Meta-Evaluation of Computational Creativity Anna Jordanous Centre for e-Research, Department of Digital Humanities King's College London 26-29 Drury Lane, London WC2B 5RL, UK anna.jordanous@kcl.ac.uk

Abstract
There has been increasing attention paid to the question of how to evaluate the creativity of computational creativity systems. A number of different evaluation methods, strategies and approaches have been proposed recently, causing a shift in focus: which methodology should be used to evaluate creative systems? What are the pros and cons of using each method? In short: how can we evaluate the different creativity evaluation methodologies? To answer this question, five meta-evaluation criteria have been devised from cross-disciplinary research into good evaluative practice. These five criteria are: correctness; usefulness; faithfulness as a model of creativity; usability of the methodology; generality. In this paper, the criteria are used to compare and contrast the performance of five different evaluation methods. Together, these meta-evaluation criteria help us explore the advantages and disadvantages of each creativity evaluation methodology, helping us develop the tools we have available to us as computational creativity researchers.

Introduction
Computational creativity evaluation repeatedly appears as a theme in the calls for papers for the ICCC conference series. Such emphasis underlines the growing importance of evaluation to the computational creativity research community. For transparent and repeatable evaluative practice, it is necessary to state clearly what standards/methods are used for evaluation (Jordanous 2012a). Despite, or perhaps because of, a lack of creativity evaluation being employed in the computational creativity research community until recently (Jordanous 2011), a number of creativity evaluation strategies have been proposed in recent years (Pease, Winterstein, and Colton 2001; Ritchie 2007; Colton et al. 2010; Colton, Charnley, and Pease 2011; Jordanous 2012b). Herein lies a decision for a computational creativity researcher: which evaluation strategy should be adopted to evaluate computational creativity systems? What are the benefits and disadvantages of each? Such questions have not previously been examined to any detailed extent in computational creativity research. In various other research fields, though, issues around evaluating evaluation, or meta-evaluation, have been considered in some detail. Meta-evaluation has been considered from philosophical and more practical standpoints.
As a burgeoning research community, computational creativity researchers can learn from such considerations, as they apply to our own research efforts. This paper proposes five standards for meta-evaluation of creativity evaluation methodologies, informed by the wider literature and by evaluative practices outside of the computational creativity field. These standards are offered as factors for assessment and comparison of creativity evaluation methodologies, to help us develop good evaluative practice in computational creativity research. The five meta-evaluation standards are applied to a case study on creative system evaluation, comparing different evaluation methodologies against each other. Results are reported below. It is proposed that these five standards should help guide us in refining our work on computational creativity evaluation, as we progress in the development of this important area of computational creativity research.

The need to evaluate creativity evaluation
We have an intuitive but tacit understanding of the concept of creativity that we can access introspectively (Kaufman 2009; Jordanous 2012a). For comparative purposes and methodical, transparent evaluation, this intangible understanding is not sufficient to help us identify and learn from our successes and failures in computational creativity research. To solve the problem of how to evaluate creative systems, various evaluation methodologies or strategies have been offered, including the tests offered by Pease, Winterstein, and Colton, Ritchie's empirical criteria, the creative tripod model, the FACE model and the SPECS methodology (Pease, Winterstein, and Colton 2001; Ritchie 2007; Colton et al. 2010; Colton, Charnley, and Pease 2011; Jordanous 2012b, respectively).1 But which should computational creativity researchers use? One should note here that we are unlikely to find one single fully-specified, detailed, step-by-step methodology to suit all types of creative system. What we can do is understand the strengths and weaknesses of different methodologies. Through trial, application of, and comparison between different methodologies, we can refine and develop our evaluation strategies within computational creativity so that we can mutually learn from our advances and mistakes; the very essence of what evaluation offers researchers, after all.
1 See Jordanous 2012a for full discussion of these methodologies and strategies.
How can these methodologies be compared against each other? Reviewing various features of the methodologies and comparing them against each other helps us to learn through comparison. Below, five meta-evaluation standards are identified for comparison and evaluation of creativity evaluation methodologies. These five meta-evaluation standards are drawn from cross-disciplinary reviews of evaluative practice. The meta-evaluation standards are applied in a practical case study, reported below. From this application of the standards, we can appreciate the strengths and weaknesses of each creativity evaluation methodology, guiding us in our evaluative choices when developing computational creativity research. With these meta-evaluation criteria, we can now compare evaluative results obtained through different methods and discuss how useful each of these evaluations is to the computational creativity researcher.
Gathering effective evaluative feedback, using solidly developed evaluation methodologies, assists further computational creativity research development and helps identify more clearly the contributions to knowledge made by our research.

Criteria for meta-evaluation of creativity evaluation methodologies
Criteria for evaluation should be clearly stated and justified (Jordanous 2012a). This theme also applies to meta-evaluation criteria for comparing various creativity evaluation methodologies. Certain areas suggest themselves as meta-evaluation criteria for assessing creativity evaluation methodologies, such as the accuracy and usefulness of the feedback to a researcher, or ease of applicability. Pease, Winterstein, and Colton (2001) identify two candidate meta-evaluation criteria: `Firstly, to what extent do they reflect human evaluations of creativity, and secondly, how applicable are they?' (Pease, Winterstein, and Colton 2001, p. 9). More recently, Pease has suggested the set of {generality, usability, faithfulness, value of formative feedback} as candidate criteria (Pease, 2012, personal communications). In relevant literature on evaluation and related literature on proof of hypotheses in scientific method, other contributions could also be used as criteria for measuring the success of computational creativity evaluation methodologies, as outlined below.

Criteria for testing scientific hypotheses and explanatory theories
Sloman (1978) outlined seven types of interpretative aims of science (Sloman 1978, p. 26, my emphasis added), of which the third aim is the forming of explanatory theories for things we know exist. In the context of this current work, an example of the explanatory theories mentioned in the third aim would be a theory that allows us to explain if or why a computational creativity system is creative. Ten criteria were offered by Sloman (1978) for the comparison of explanatory theories: `a good explanation of a range of possibilities should be definite, general (but not too general), able to explain fine structure, non-circular, rigorous, plausible, economical, rich in heuristic power, and extendable' (Sloman 1978, p. 53). Within these criteria there is some significant interdependence, and Sloman advises that the criteria are best treated as a set of inter-related criteria rather than distinct yardsticks, with some criteria (such as plausibility, generality and economy) to be used with caution. This may help to explain why Sloman's list of criteria is longer than others mentioned in this section. Thagard (1988) defined a good theory as `true, acceptable, confirmed' (Thagard 1988, p. 48). These criteria were later expressed in the form of the criteria of consilience, simplicity and analogy (Thagard 1988, p. 99) as essential criteria for theory evaluation:
- Consilience - how comprehensive the theory is, in terms of how much it explains.
- Simplicity - keeping the theory simple so that it does not try to over-explain a phenomenon. Thagard mentions in particular that a theory should not try to achieve consilience by means of ad hoc auxiliary hypotheses (Thagard 1988, p. 99). In other words, the main explanatory power of the theory should map closely to the main part of that theory, without needing extensive correction and supplementation.
- Analogy - boosting the `explanatory value' (Thagard 1988, p. 99) of a theory by enabling it to be applied to other domains.
This is especially appropriate where theories can be cross-applied in more established domains where knowledge of facts is more developed.

Guidelines for good practice in research evaluation
Suggestions for good practice in performing evaluation in research can be interpreted as criteria that identify such good practice. For example, in his Short Course on Evaluation Basics, John W. Evans identifies four characteristics of a good evaluation:2 a good evaluation should be objective, replicable, generalisable and as methodologically strong as circumstances will permit.
2 http://edl.nova.edu/secure/evasupport/evaluationbasics.html, last accessed Feb 2014.
In considering what constitutes good evaluation practice, the MEERA website (My Environmental Education Evaluation Resource Assistant)3 describes good evaluation as being: `tailored to your program ... crafted to address the specific goals and objectives [of your program]'; `[building] on existing evaluation knowledge and resources'; inclusive of as many diverse viewpoints and scenarios as reasonable; replicable; as unbiased and honest as possible; and as rigorous as circumstances allow.
3 All quotes from the MEERA website are taken from http://meera.snre.umich.edu/plan-an-evaluation/evaluation-what-it-and-why-do-it#good, last accessed Feb 2014.
From a slightly different perspective on research evaluation, the European Union FP6 Framework Programme describes how FP6-funded projects are evaluated in terms of three criteria: a project's rationale relative to funding guidelines and resources; implementation effectiveness, appropriateness and cost-effectiveness; and achievements and impact of contributions of objectives and outputs.

Dealing with subjective and/or fuzzy data: Blanke's specificity and exhaustivity
In computational creativity evaluation, the frequency of data being returned is low and the correctness of that data is generally subjective and/or fuzzy in definition, rather than being discretely categorisable as either correct or incorrect, or as either present or missing. Blanke (2011) looked at how to evaluate the success of a methodology for measuring aspects like precision and recall, in cases where the results being returned were somewhat difficult to pin down to exact matches due to fuzziness in what could be returned as a correct result. The specific case Blanke considered was XML retrieval evaluation, where issues such as hierarchical organisation and overlap of elements, and the identification of what was an appropriate part of an XML document to return, caused problems with using precision and recall measures. There was also an issue with relatively low frequencies in what was being returned. As an evaluation solution, Blanke (2011) proposed component specificity and topical exhaustivity, following from Kazai and Lalmas (2005). Exhaustivity is measured by `the size of overlap of query and document component information' (Blanke 2011, p. 178). Specificity is determined by counting `the rest of the information in the component [of an XML document] that is not about the query' (Blanke 2011, p. 178), such that minimising such information will maximise the specificity value, as more relevant content is returned.
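Read as set overlap, the two measures admit a very small sketch; the term sets and the ratio normalisations below are our own illustrative reading, not Blanke's exact formulation:

```python
# Invented term sets for a query and one retrieved XML component.
query = {"jazz", "improvisation", "creativity"}
component = {"jazz", "improvisation", "harmony", "bebop"}

overlap = query & component
exhaustivity = len(overlap) / len(query)      # how much of the query is covered
specificity = len(overlap) / len(component)   # how little off-query content remains

print(f"exhaustivity = {exhaustivity:.2f}, specificity = {specificity:.2f}")
# exhaustivity = 0.67, specificity = 0.50
```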
Identifying meta-evaluation criteria
Drawing all the above contributions together, five criteria can be identified for meta-evaluation of computational creativity evaluation methodologies. These are presented here, with relevant points from the comments above being grouped under the most relevant criterion, as far as possible. Some overlap across criteria is acknowledged; for example, Thagard's analogy criterion can be interpreted as being concerned with both usefulness and generality.

Correctness: how accurately and comprehensively the evaluation findings reflect the system's creativity.
- MEERA's honesty of evaluation criterion.
- MEERA's inclusiveness of diverse relevant scenarios criterion.
- Evans' objectiveness criterion.
- MEERA's avoidance of bias in results criterion.
- Sloman's definiteness criterion.
- Sloman's rigorousness criterion.
- Sloman's plausibility criterion.
- Thagard's consilience criterion.
- Blanke's exhaustivity criterion.
- Evans' methodological strength criterion.

Usefulness: how informative the evaluative findings are for understanding and potentially improving the creativity of the system.
- Pease's value of formative feedback criterion.
- FP6's rationale, implementation and achievements criteria.
- Sloman's heuristic power criterion.
- Thagard's analogy criterion.

Faithfulness as a model of creativity: how faithfully the evaluation methodology captures the creativity of a system (as opposed to other aspects of the system).
- Pease, Winterstein, and Colton's (2001) reflection of human evaluations of creativity criterion.
- Pease's faithfulness criterion.
- MEERA's tailoring of the method to specific goals and objectives criterion.
- Blanke's specificity criterion.

Usability of the methodology: the ease with which the evaluation methodology can be applied in practice, for evaluating the creativity of systems.
- Pease, Winterstein, and Colton's (2001) applicability criterion.
- Pease's usability criterion.
- Evans' replicability criterion.
- MEERA's replicability and rigorousness of a methodology criteria.
- Sloman's non-circularity criterion.
- Sloman's rigorousness and explicitness criteria (in how to apply the methodology).
- Sloman's economy of theory criterion.
- Thagard's simplicity criterion.

Generality: how generally applicable this methodology is across various types of creative systems.
- Pease's generality criterion.
- MEERA's inclusiveness of diverse relevant scenarios criterion.
- Evans' generalisability criterion.
- Sloman's generality criterion.
- Sloman's extendability criterion.
- Thagard's analogy criterion.

Applying the criteria: a case study
Now we have identified these five meta-evaluation criteria, we can use them to evaluate the performance of computational creativity evaluation methodologies. Previously, three different musical improvisation computer systems were evaluated using various computational creativity evaluation methodologies, to compare how creative each system was (Jordanous 2012a; 2012b). The task in this current work is to consider how well the creativity evaluation methodologies performed for this assessment. For an independent assessment of the relative performance of the evaluation methodologies, external evaluation was sought to consider and perform meta-evaluation on five key existing evaluative approaches (Ritchie 2007; Colton 2008; Colton, Charnley, and Pease 2011; Jordanous 2012b; surveys of human opinion). The invited external evaluators were the key researchers involved in creating the musical improvisation systems examined in the above-mentioned creativity evaluation case study (Jordanous 2012a): Al Biles (GenJam) and George Lewis (Voyager).
Bob Keller was also invited because of his research into and development of a related musical improvisation system, the Impro-Visor system (Gillick, Tang, and Keller 2010).4 Evaluators were asked to view all the evaluative feedback obtained. They were then asked to give their opinions (as developers of musical improvisation systems) on various aspects of each methodology and on the results obtained. Below, the methodology used for the meta-evaluation is briefly described, and the obtained meta-evaluations are reported and discussed. Fuller details can be found in Jordanous (2012a).
4 The author of one evaluated system (GAmprovising) was not included, due to being the author of one of the evaluation methods being examined (and the researcher conducting this work).

Methodology for obtaining external evaluation
Each external evaluator was given a feedback sheet reporting the evaluation feedback obtained for their system from each creativity evaluation methodology being investigated: Ritchie's criteria; Colton's creative tripod; survey of human opinion; the FACE model; and SPECS+cc. (N.B. SPECS+cc is used here to indicate the use of Jordanous's SPECS methodology with the 14 creativity components (Jordanous 2012a) as the adopted definition of creativity, as recommended (Jordanous 2012b).) For each methodology, the sheets also included brief comparisons between systems according to the systems' evaluated creativity. An example of these feedback sheets, given in (Jordanous 2012a, Appendices), presents the sheet provided to Al Biles to report the evaluation results for GenJam. A similar set of feedback was prepared and sent to George Lewis as evaluative feedback relating to Voyager. Methodologies were presented under anonymous identifiers in the feedback sheet to avoid any bias from being introduced, as far as possible. Evaluators were first asked if they had any initial comments on the results. They were then asked to provide full feedback for each methodology in turn, on the five criteria derived above. They looked at all five criteria for the current methodology and then were asked for any final comments on that methodology before moving on to the next methodology. Methodologies were presented to the evaluators in a randomised order, to avoid introducing any ordering bias. For each criterion, questions and illustrating examples were composed to present the criterion in a context appropriate for computational creativity evaluation. These questions and examples, listed below, were put to external evaluators to gather their feedback on each criterion as meta-evaluation of the various evaluation methodologies.

Correctness: How correct do you think these results are, as a reflection of your system? For example: are the results as accurate, comprehensive, honest, fair, plausible, true, rigorous, exhaustive, replicable and/or as objective as possible?

Usefulness: How useful do you find these evaluation results, as an/the author of the system? For example: do the results provide useful information about your system, give you formative feedback for further development, identify contributions to knowledge made by your system, or give other information which you find helpful?

Faithfulness as a model of creativity: How faithfully do you think this methodology models and evaluates the creativity of your system?
For example: do you think the methodology uses a suitable model(s) of creativity for evaluation, does the methodology match how you expect creativity to be evaluated, how specifically does the methodology look at creativity (rather than other evaluative aims)?

Usability of the methodology: How usable and user-friendly do you think this methodology is for evaluating the creativity of computational systems? For example: would you find the methodology straightforward to use if wishing to evaluate the creativity of a computational creativity system (or systems), is the methodology stated explicitly enough to follow, is the method simple, could you replicate the experiments done with this methodology in this evaluation case study?

Generality: How generally do you think this methodology can be applied, for evaluation of the creativity of computational systems? For example: can the methodology accommodate a variety of different systems, be generalisable and extendable enough to be applied to diverse examples of systems, and/or different types of creativity?

For each criterion, evaluators were asked to rate the system's performance on a 5-point Likert scale (all of a format ranging from positive extreme to negative extreme, such as: [Extremely useful, Quite useful, Neutral, Not very useful, Not at all useful]). They could also add any comments they had for each criterion. Evaluators were asked about the correctness and usefulness of the methodology's results before learning how the methodology worked. This gave the advantage of being able to hear the evaluators' opinions on the feedback results in isolation, without any influence from how the results were obtained. Nonetheless, the process by which a product was generated is important to consider alongside that product, for a more rounded and informed evaluation (Rhodes 1961). Evaluators were given details on how each methodology worked after evaluating the correctness and usefulness criteria. They were then asked to provide feedback for the final three criteria (faithfulness, usability and generality). The details provided to explain each methodology are reproduced in Jordanous (2012a, Appendices).5 Finally, evaluators were asked to rank the evaluation methodologies according to how well they thought the methodologies evaluated the creativity of their system overall. Although the formative feedback is, again, probably more useful in terms of developing the various methodologies, it was interesting to see evaluators' opinions on how the methodologies compared to each other. The rankings, completed by Al Biles and Bob Keller, are reported in Table 1. At this point, evaluators were also given a chance to add any final comments, before finishing the study. Al Biles completed a full evaluation of all methodologies and (due to time constraints) George Lewis provided evaluations of two methodologies: Colton's creative tripod and the SPECS+cc methodology. Bob Keller also provided comments on some aspects of all methodologies.

Results and discussion of meta-evaluation
Al Biles summarised the meta-evaluation of the five different methodologies with: `Five very different approaches, and each bring something to the table'. In the comparisons between methodologies and the overall rankings listed in Table 1, SPECS+cc was either considered the best methodology overall (ahead of the creative tripod) or the second best (behind Ritchie's criteria) for evaluating a system's creativity.
The more useful information, though, comes from the more detailed formative feedback and comments rather than a single summative ranking as given in Table 1. SPECS+cc was evaluated by both Biles and Lewis, with some additional comments from Keller. SPECS+cc generated extremely useful and quite correct results, in both of the main evaluators' opinions. One evaluator found SPECS+cc to be an extremely faithful model of creativity, though the other was neutral on this matter. While one evaluator found SPECS+cc quite user-friendly, the other questioned how user-friendly the SPECS+cc methodology would be, given the steep learning curve in understanding the components. In terms of generality, evaluators disagreed on how generally SPECS+cc could be applied; further comments illustrated how methods like SPECS+cc were more appropriate for taking into account other system goals, compared to more limited views on creativity such as in the FACE model. Biles and Keller in particular commented on the lack of accommodation of other system goals in the FACE model, though it is to be acknowledged that such accommodation does not form one of the goals of the FACE model and is more of an unintended but useful consequential result in models such as SPECS+cc.
5 It is worth noting that methodologies may well perform differently against the five criteria when applied to different systems (a meta-application of the generality criterion?). The evaluators cannot be expected to give rigorous feedback on the potential of the methodologies in evaluating any possible type of system, and we should refrain from drawing too-broad conclusions from their feedback. Nonetheless, with careful consideration of the evaluators' feedback, we gain valuable insights on the methodologies.
FACE was placed third in the overall rankings by Keller and last by Biles (see Table 1). Biles, the main evaluator for FACE, found the results generated by FACE to be completely correct, but gave a neutral opinion (neither positive nor negative) on the usefulness of FACE model feedback, the generality of the FACE model across domains and the faithfulness of the FACE model as a model of creativity. FACE was deemed quite user-friendly due to its simplicity; this opinion was repeated, more strongly, for the other creativity evaluation framework Colton was involved in, the creative tripod. Lewis and Biles both evaluated the tripod; they disagreed as to whether the tripod would be generally applicable across many domains, and also as to how faithfully the tripod modelled creativity. Both evaluators agreed, however, that the feedback from the tripod was extremely useful and either completely correct or quite correct. Biles ranked the creative tripod as the second best creativity evaluation methodology overall, though Keller placed it last. Ritchie's criteria methodology was fully evaluated by Biles. Biles found the criteria to produce quite correct, quite useful feedback that was quite faithful to creativity (despite raising issues with enforced simplifications of the data due to the Boolean rather than continuous nature of the feedback). Biles was neutral on the usability of applying the criteria for creativity evaluation and on their generality, questioning how the generic terminology used to solicit ratings of typicality and value could be applied to different domains successfully. Keller considered Ritchie's criteria to be the best methodology overall for creativity evaluation, though Biles gave it a middling ranking.
The opinion survey was ranked overall to be the fourth best methodology out of the five. It received a few negative comments from Biles, the main evaluator for this approach, despite Biles noting that `nothing is simpler than just ... asking whether something is creative or not' and that the survey solicited `spontaneous, unadulterated' opinions rather than restructuring the feedback (though Biles also noted that the tripod feedback was clearer than the survey feedback due to its more structured presentation). Biles was guided in a number of comments by an observation that the opinion survey sacrificed reliability/consistency of results for greater validity in terms of the personal qualitative feedback. He thought that the survey approach could be applied quite generally and was quite user-friendly and quite faithful to what it means to be creative. The success of this methodology would depend on the type of person participating, and whether they were clear on what `creative' means. Given that the GenJam system has been publicly presented many times before, though, Biles felt he learned nothing new from the feedback from the survey, unlike the other methodologies. He was neutral on the correctness of the methodology, confirming observations made in Jordanous 2012a that human opinion cannot be relied on as a ground truth to measure evaluations against, due to varying viewpoints.

Table 1: Judges were asked to rank the methodologies according to how well overall they thought the methodologies evaluated the system's creativity.

Position      Al Biles            Bob Keller
1st (best)    SPECS+cc            Ritchie's criteria
2nd           Creative Tripod     SPECS+cc
3rd           Ritchie's criteria  FACE
4th           Opinion survey      Opinion survey
5th (worst)   FACE                Creative Tripod

Comparing and contrasting methodologies
Five meta-evaluation criteria have now been identified for meta-evaluation of creativity evaluation methodologies and have been used for evaluation by external evaluators, as reported above. Next, the criteria were applied for further analysis of all the methodologies investigated earlier in this paper, using the full findings from the Jordanous (2012a) case study evaluating the creativity of musical improvisation systems. Such considerations on the methodologies allow us to compare if, and how, a particular evaluation methodology marks a development of our evaluation toolkit as computational creativity researchers. Here, the considerations are focused towards evaluating how well the SPECS+cc methodology (Jordanous 2012a) performed, to gain feedback as to how to improve SPECS+cc and what its strengths were in comparison to other methods. The considerations below also complement the evaluative case study findings by accounting for more detailed information and observations that may not have been detected by the external evaluators, but which should still be considered.

Correctness
Showing that human opinion cannot necessarily be relied on as a ground truth, even on a large scale, some participants in the opinion surveys admitted that they were likely to be evaluating the systems based on how highly they rated a system's performance overall rather than specifically how creative they thought it was, which would affect the overall correctness of the results of evaluations from the human opinion survey. SPECS+cc performed better than Ritchie's criteria for correctness.
Although Ritchie's 18 criteria have a comprehensive coverage of observations over the products of the system, criteria evaluation is based solely on the products of the creative system, not accounting for the system's process, or observations on the system or how it interacted with its environment. Colton's tripod model was found to be reasonably accurate in terms of identifying and evaluating important aspects in the case study, but it disregarded aspects such as social interaction, communication and intention, which have been shown to be very important in understanding how musical improvisation creativity is manifested (Jordanous and Keller 2014). It should be noted that correctness does not imply that the results from evaluation match common human consensus as a ground truth, or `right answer'; Jordanous (2012a) demonstrated that these are not reliable goals in creativity evaluation. Instead, correctness is concerned with how appropriate the feedback is and how accurately and realistically the feedback describes the system.

Usefulness
The methodologies differed in the amount of feedback generated through evaluation. A fairly large volume of qualitative and quantitative feedback was returned through the application of SPECS+cc. This is unlike Ritchie's criteria, which only returned a set of 18 Boolean values, one for each criterion, with some interpretation effort needed to understand how each criterion influences creativity within the system.6 Colton's creative tripod generated feedback for 3 components rather than 14 components, so was shorter than SPECS+cc. The human opinion surveys generated similar quantities of feedback to SPECS+cc, from more people but at a shallower level of detail. The human opinion surveys returned less detailed feedback than SPECS+cc, which generated a large amount of detailed formative feedback. The opinion surveys' feedback also often concentrated on aspects of the systems other than their creativity, according to participant feedback (Jordanous 2012a). Ritchie's criteria returned a set of 18 Boolean values rather than any formative feedback, in a fairly opaque form given the formal abstraction of the criteria specification; if there were no output examples, Ritchie's criteria would not generate any feedback at all, even based on other observations about the system. Colton's creative tripod returned information at the same level of detail as SPECS+cc per component/tripod quality, but less information overall, as several useful components of SPECS+cc were overlooked because they did not map onto the set of tripod aspects.

Faithfulness as a model of creativity
Participant feedback for the human opinion surveys acknowledged that evaluations may have related more to the quality of the system, not its creativity, with several participants requesting a definition of creativity to refer to when evaluating how creative the systems were (Jordanous 2012b). The SPECS methodology requires researchers to base their evaluations on a researched and informed understanding of creativity that takes into account both domain-specific and domain-independent aspects of creativity. In this way it is the only methodology that directly accounts for specific informed requirements for creativity in a particular domain. Human opinion surveys would acknowledge this but only tacitly, without these requirements necessarily being identifiable or explainable.
Although the parameters and weights in Ritchie's criteria could be customised to reflect differing requirements for creative domains, in practice no researchers have attempted this when applying Ritchie's criteria, probably due to the formal and abstracted presentation of the criteria.
6 One reviewer of this paper pointed out that Ritchie (2007) also briefly considered how his criteria could be adapted to return measurements of each criterion in the range [0,1], rather than Boolean values, although Ritchie's main presentation of the criteria is as statements which generate Boolean values. This alternative usage gives slightly more information, but the issues of interpreting these criteria's contribution to overall creativity still remain.
In Colton's creative tripod, all three tripod qualities are treated equally in previous examples (including those in Colton (2008)), regardless of their contribution in a specific creative domain, and no further qualities can be introduced into the tripod framework.

Usability of the methodology
Less information needed to be collected for Colton's creative tripod than for the other methodologies, taking less time to collect. Coupled with the informal nature of performing creativity evaluation with the tripod framework, Colton's creative tripod emerged as the most easy-to-use of the methodologies evaluated. Data collection for the other methodologies was of a similar magnitude, although data analysis for Ritchie's criteria was slightly more involved and more specialist than for the other methodologies, requiring a specific understanding of the criteria. Feedback reflected on the volume of data generated by using the components as a base model of creativity, as recommended for SPECS. If SPECS is applied without using the Jordanous (2012b) components as the basis for the adopted definition of creativity, then SPECS becomes more involved and more demanding in terms of researcher effort, negatively affecting its usability. Hence the recommendation in Jordanous (2012b) for using the components within SPECS (i.e. SPECS+cc) becomes further strengthened. One issue is with who/what performs the evaluation, and what effect that has on how usable the evaluation methodology is. Using external evaluators increases the time demands of the experiment in the human opinion surveys, as this requires studies to be carried out and introduces extra work such as planning experiments for participants or applying for ethical clearance for conducting experiments with people. While the use of external evaluators is not a formal requirement for the SPECS+cc methodology (indeed, evaluation can be performed using quantitative tests rather than subjective judgements if deemed most appropriate), the accompanying commentary to SPECS+cc strongly encourages researchers to use independent evaluation methods in order to capture more independent and unbiased results (Jordanous 2012b). In the application of SPECS+cc that is being reviewed here, external judges were consulted to give feedback on the creative systems being evaluated. Hence SPECS+cc in this case is subject to similar criticisms, in terms of ease of use, as when conducting opinion surveys. These extra demands are not necessarily encountered when performing evaluation as recommended using Colton's tripod, Ritchie's criteria, or FACE evaluation, where no specific demands or recommendations are made for evaluation to be performed independently of the project team behind the creative software.
It is important to acknowledge, though, that should independent evaluation be sacrificed in order to make an evaluation methodology easier to use, there is a worrying knock-on effect, in terms of potential biases being introduced if evaluation is not being performed by independent evaluators.

Generality
SPECS+cc, Colton's tripod and, to some extent, Ritchie's criteria and the human opinion surveys could all be applied to different types of system, providing that the system produces the appropriate information relevant to the individual methodologies.7 Ritchie's criteria cannot be applied to systems that produce no tangible outputs, making this approach less generally applicable across creative systems. There is also some question of whether opinion surveys could be carried out for evaluating all types of creativity, particularly where creativity is not manifested outwardly in production of output, affecting the generality of opinion surveys.
7 This is illustrated further in Case Study 2 in Jordanous 2012a.

Overall comparisons
Considering all the observations made above from the perspective of the five meta-evaluation criteria presented in this paper, SPECS+cc performed well in comparison with the other evaluation methodologies on its faithfulness in modelling creativity. SPECS+cc also performed better than Ritchie's criteria for usefulness and correctness, and produced larger quantities of useful feedback than Colton's creative tripod (because less information was collected for Colton's creative tripod). A consequence of the smaller information collection was that Colton's creative tripod was the easiest to use of the methodologies evaluated. Somewhat counterintuitively, all the methodologies were more likely to generate correct results compared to the surveys of human opinion. A number of participants in the opinion surveys reported that they evaluated systems based on factors other than creativity, due to difficulties in evaluating the creativity of the Case Study systems without a definition of creativity to refer to. There is also some question of whether human opinion surveys could be carried out for evaluating all types of creativity (particularly where creativity is not manifested outwardly in copious production of output); this affects the general applicability of using opinion surveys. Reliance on the existence of output examples also affects the usability and generalisability of Ritchie's criteria.

Conclusions
Several evaluation methods were applied to three musical improvisation systems. Human opinion was consulted to try to capture a ground truth for creativity evaluation (Zhu, Xu, and Khot 2009). Four key existing methodologies for computational creativity were also applied (Ritchie 2007; Colton 2008; Colton, Charnley, and Pease 2011; Jordanous 2012b; Ritchie's criteria, the Creative Tripod, the FACE model and the SPECS+cc methodology, respectively). Results were compared; it was noted that few right answers or ground truths for creativity were found. For the purposes of progressing in research, learning from advances and improving what has been done, how well did each evaluation methodology perform? To assist in answering this question, external evaluation was solicited from the authors of the evaluated musical improvisation systems and one other researcher with interests in creative musical improvisation systems.
Five criteria were identified from relevant literature sources for meta-evaluation of important aspects of the evaluation methodologies:
- Correctness
- Usefulness
- Faithfulness as a model of creativity
- Usability of the methodology
- Generality
The methodologies were compared based on the external evaluators' feedback concerning the evaluations performed on their system and the comparative feedback generated by each methodology considered so far. Further comments could be made using the meta-evaluation criteria, based on detailed study of the methodologies themselves. These results are too small in number to be a comprehensive evaluation but they do help to give us some feedback on the compared methodologies. The results showed that SPECS+cc and Ritchie's empirical criteria compared favourably to the other methodologies overall. SPECS+cc performed well on most of the five meta-evaluation criteria, though the volume of data produced by SPECS+cc raised questions on SPECS+cc's usability compared to more succinct presentations. Colton's creative tripod was the easiest to use, although there were some concerns about the generality of the tripod across creative domains and its faithfulness as a general model of creativity. Ritchie's criteria were considered accurate but there were usability issues with the abstract nature of the criteria and accompanying function definitions. The FACE model was considered quite user-friendly but perhaps limited in how it could incorporate aspects of creativity that were important to the system domain but outside of the FACE model. Each of the evaluation methodologies proved to be an improvement (in at least some ways) over the approach of simply asking people's opinions on how creative the systems were. The development of creativity evaluation methods is clearly a key current area of interest in the computational creativity research community, as partly illustrated by the prominent inclusion of requests for papers on evaluation in the call for papers for ICCC 2014. The five meta-evaluation criteria offered in this paper are taken from a cross-disciplinary review of good practice in evaluation in areas relevant to computational creativity research. These five criteria help us to contrast different evaluation methodologies against each other.

Acknowledgments
Thanks to Alison Pease, Steve Torrance and Nick Collins for helpful comments during the formulation of these ideas. Also thanks to Al Biles, Bob Keller and George E. Lewis for willingly offering their time and helpful comments as evaluators for this work, and to the three anonymous reviewers for their useful remarks on the original version of this paper.

2014_18 !2014 Assessing Progress in Building Autonomously Creative Systems Simon Colton*, Alison Pease, Joseph Corneli*, Michael Cook* and Teresa Llano* *Computational Creativity Group, Department of Computing, Goldsmiths, University of London, UK School of Computing, University of Dundee, UK ccg.doc.gold.ac.uk

Abstract
Determining conclusively whether a new version of software creatively exceeds a previous version or a third-party system is difficult, yet very important for scientific approaches in Computational Creativity research. We argue that software product and process need to be assessed simultaneously in assessing progress, and we introduce a diagrammatic formalism which exposes various timelines of creative acts in the construction and execution of successive versions of artefact-generating software.
The formalism enables estimations of progress or regress from system to system by comparing their diagrams and assessing changes in quality, quantity and variety of creative acts undertaken; audience perception of behaviours; and the quality of artefacts produced. We present a case study in the building of evolutionary art systems, and we use the formalism to highlight various issues in measuring progress in the building of creative systems.

Introduction

Creativity, we believe, relates to a perception that others have of certain behaviours exhibited by some person or system, rather than an inherent property of people or software: in this sense it is a secondary quality. Moreover, we believe that, just as the endless debates about `is it art?' fuel innovation in the arts, the endless debates about `is it creative?' are a force for good: they drive forward creative practices and Computational Creativity research. A longer discussion of this philosophical position is given in (Colton et al. 2014), and an exposition of creativity as being essentially contested (Gallie 1956) is given in (Jordanous 2012). In such a context of energetic and subjective debate about creativity, it has been difficult to derive systematic approaches to assessing progress in the building of software for creative purposes. One main issue has been the cross-purposes of the creativity project(s) for which software is developed. A useful analogy with the notions of weak and strong AI has arisen recently in Computational Creativity research. Focusing on software which generates artefacts such as poems, paintings or games, we can say that weak Computational Creativity objectives emphasise the production of increasingly higher-valued artefacts, whereas strong Computational Creativity objectives emphasise increasing the perception of creativity people have of the system. This is similar to the distinction put forward in (al-Rifaie and Bishop 2012). In many projects, there are both strong and weak objectives, and often they are not complementary. For instance, increasing autonomy in software may lead simultaneously to higher perception of creativity and lower-value artefacts being produced. This is described as the latent heat problem in (Colton and Wiggins 2012), and is analogous to U-shaped learning, where to get better, we first have to get worse. The objectives for a project usually influence the assessment methods employed. In particular, to assess progress with respect to weak objectives, it makes sense to evaluate the quality of the artefacts produced. In contrast, for strong objectives, it makes more sense to assess what software actually does and how and why people perceive it as creative or not. To this end, in (Colton, Pease, and Charnley 2011) we introduced the FACE descriptive model to formalise descriptions of the creative acts undertaken by software, and the IDEA model to formalise the impact those creative acts might have on people. Subsequent attempts to use these models to describe particular systems have highlighted another major issue: the assignment of programmer/software ownership of creative acts. Along with other issues in applying it to describe systems, we have found the FACE model to be inadequate for fully capturing the interplay of creative acts between programmer and program in this respect. We describe here the next stage of our formalism for capturing notions of progress in building creative systems. We first provide a potted history of how progress has been evaluated in Computational Creativity research, and lay out some intuitive notions of progress.
Given our philosophical and practical standpoints, we place less emphasis on asking whether artefacts are better than previously. We also avoid direct questions about creativity in computational systems. Instead, we integrate (i) aspects of the FACE and IDEA models, (ii) objective measures of quality, quantity and variety of creative acts, and (iii) audience perceptions of software behaviour and quality of output. We present a two-stage method for estimating whether obvious or potential progress or regress has occurred when building a new system. This involves diagrammatically capturing various timelines in the building and execution of a system, then comparing diagrams. We use the method to describe the progress of an evolutionary art system, leading to a general discussion about how the approach could be used in practice. We conclude by describing future directions for this formalism.

Background in Assessing Creative Progress

The assessment of progress in building creative systems has been a bespoke and multi-faceted endeavour, driven by various, often opposing objectives, ranging from understanding human creativity, to practical generation of artefacts, to the raising of philosophical questions. The majority of practical researchers who engineer and test software joined the Computational Creativity field with objectives in the weak sense of getting software to produce quality artefacts. Hence the first way in which progress was assessed was Boolean: if software reliably produces artefacts of a particular type, then this is progress over software which was unreliable or unable to produce artefacts of the required form. In such a context, Turing-style discrimination tests indicated a particularly strong milestone: if certain artefacts (usually hand-selected) looked/sounded so like human-authored counterparts that observers couldn't tell the difference, progress had certainly been made. This approach was pioneered by Pearce and Wiggins (2001), who were among the first to emphasise the importance and role of evaluation in Computational Creativity, and to propose a concrete way of applying Popperian falsificationism. However, despite their urging caution about depending on the discrimination test to evaluate creativity, direct comparison of human-produced and computer-generated artefacts has frequently been used to assess progress. We further criticised such Turing-style tests in Computational Creativity for, among other reasons, encouraging naivety in software and the generation of pastiches (Pease and Colton 2012). Moreover, we question whether this methodology, while beneficial for short-term scientific progress, is actually detrimental to the longer-term goal of embedding creative software in society (Colton et al. 2014). The work of Ritchie (2007) was an important step away from simplistic discrimination tests, establishing an approach to assessing the value of artefacts according to their novelty, typicality, and quality within a genre. A number of practitioners have used this approach to compare and contrast their systems, e.g., (Pereira et al. 2005). As the field matured, attention moved from mere generation to programs able to assess, critique and select from their output. Often searching large spaces, software was required to find the best artefacts using mathematically derived or machine-learned aesthetic/utilitarian calculations (Wiggins 2006).
If a later version of software with more sophisticated internal assessment techniques was able to produce higher yields of higher-quality artefacts when assessed externally, then clear progress had been made. Audience perceptions of software became a focus as the field further matured. Jordanous used methods from linguistics to determine how people use the word `creativity', and which other concepts are associated with it, and then used crowdsourcing techniques to evaluate a creative system in terms of the associated concepts (Jordanous 2012). As a complement to Jordanous's work, in which she tried to capture society's perception of creativity, researchers began investigating ways to influence people's perception of creativity in software. Software assessing its own work made it appear more intelligent and seem more creative. This led to the engineering of software that framed its processes and outputs by producing titles, commentaries and other material. Charnley, Pease, and Colton (2012) propose that this may increase the perception of creativity, and that audiences would possibly appreciate the artefacts produced more. Studying audience perceptions of creativity in software opened many research avenues, but raised an important problem: the original product-based assessment methods no longer capture all intuitions of what constitutes progress in the field. From a strong perspective, some researchers, including ourselves, are not content to accept the underlying assumption of product-based evaluation methods: that if better artefacts are produced, the software must have been improved, hence people will project higher perceptions of creativity onto the software and progress will have been made. As mentioned previously, the main problem here is that increasing autonomy, which must happen if strong objectives are to be met, can decrease artefact value. Conversely, when the objectives of a project are weak, it is perfectly natural to decrease software autonomy to produce artefacts of presentation quality, especially when a concert/exhibition is looming, but this is unlikely to increase any perceptions of creativity. Concentrating on understanding perceptions of software creativity by the general public, we introduced the creativity tripod in (Colton 2008b) as three types of behaviours which are necessary (but not necessarily sufficient) for software to avoid being labelled as uncreative. We proposed that people are influenced by their understanding of what software does when assessing its output. We argue that it is easy to ascribe uncreativity to software which is not simultaneously seen as skillful, appreciative and imaginative. Focusing on assessment of progress by peers, we introduced the FACE and IDEA descriptive models in (Colton, Pease, and Charnley 2011) and (Pease and Colton 2011). The FACE model categorises generative acts by software into those at (g)round level, during which base objects are produced, and (p)rocess level, during which methods for generating base objects are produced. These levels are subdivided by the types of objects/processes they produce: Fg denotes a generative act producing some framing information, Ag denotes an act producing an aesthetic measure, Cg denotes an act producing a concept and Eg denotes an act producing an example of a concept. Generative acts producing new processes are defined accordingly as Fp, Ap, Cp and Ep.
Tuples of generative acts are compiled as creative acts, and various calculations and recommendations are suggested in the model with which to compare creative systems. We developed the IDEA model so that creative acts and any impact they might have could be properly separated. We defined various stages of software development and used an `ideal audience' notion, where people are able to quantify changes in well-being and the cognitive work required to appreciate a creative act and the resulting artefact/process. We have arrived at a very observer-centric situation in the assessment of progress towards creative systems, in which progress can only be measured using feedback from independent observers about both the quality of artefacts produced and their perceptions of creativity in the software. Unfortunately, the majority of researchers develop software using only themselves as an evaluator, because observer-based models are too time-consuming to use on a day-to-day basis. These informal in-house evaluation techniques generally do not capture the global aims of the research project, or of the field (e.g. producing culturally important artefacts and/or convincing people that software is acting in a creative fashion). In many cases, systems are presented as feats of engineering, with little or no evaluation at all (Jordanous 2012). We argue that assessing progress is inherently a process-based problem. We focus here on modelling diachronic change across multiple levels.

A Formal Assessment of Progress

We combine the most useful aspects of the IDEA and FACE models, an enhanced creativity tripod, and aspects of assessing artefact value into a diagrammatic formalism for evaluating progress in the building of creative systems. We focus on the creative acts that software performs, the artefacts it produces and the way in which audiences perceive it and consume its output. We simplify by assuming a development model where a single person or team develops the software, with various major points where the program is sufficiently different for comparisons with previous versions. We aim for the new formalism to be used on a daily basis without audience evaluations, to determine short-term progress, but also to enable fuller audience-level evaluations at the major development points. We also aim for the formalism to help determine progress in projects where there are both weak and strong objectives. We found that the original FACE model didn't enable us to properly express the process of building and executing generative software. Hence another consideration for our new model is that it can capture various timelines, both in the development and in the running of software, in such a way that it is obvious where the programmer contributed creatively and where the software did likewise. With the above aims in mind, we envisage a scenario where we are comparing two versions of creative software, v1 and v2. At the highest level, we split the assessment method into a two-stage process as follows: 1. Diagrams are drawn for both v1 and v2 which capture the interplay of programmer and program behaviours as timelines during both the development phase and the run-time execution of both versions of the software. 2. The diagrams for v1 and v2 are compared by an audience to determine if the second system represents progress over the first in terms of process. Similarly, the output from v1 and v2 is compared, to see if progress has been made.
Stage 1: Diagrammatic Capture of Timelines

Taking a realistic but abstracted view of generative software development and deployment, we identify four types of timeline. Firstly, generative programs are developed in system epochs, with new versions being regularly signed off. Secondly, each process a program undertakes will have been implemented during a development period where creative acts by programmer and program have interplayed. Thirdly, at run-time, data will be passed from process to process in series of creative and administrative subprocesses performed by software and programmer. Finally, each subprocess will comprise a sequence of generative or administrative acts.

[Figure 1: (a) Key showing four types of timelines; (b) progression of a poetry system; (c) progression of the HR system.]

We capture these timelines diagrammatically, highlighted with coloured arrows in Figure 1(a). The blue arrow from box α to box β represents a change in epoch at system level. The red arrows overlapping a process stack represent causal development periods. The green arrows represent data being passed from one subprocess to another at run-time. The brown arrows represent a series of generative/administrative acts which occur within a subprocess. Inside each subprocess box is either a <creative act> from the FACE model (i.e., a sequence of generative acts), or an [administrative act] which doesn't introduce any new concept, example, aesthetic or framing information/method. Administrative acts were not originally described in the FACE model, but we needed them to describe certain progressions of software. For our purposes here, we use only T to describe a translation administrative act, often involving programming, and S to describe when an aesthetic measure is used to select the best from a set of artefacts. To add precision, we indicate the output from which generative act the administrative routine is applied, and to which examples a ground aesthetic is applied. To enable this, we employ the FACE model usage of lower-case letters to denote the output from the corresponding upper-case generative acts. We extend the FACE notion of (g)round and (p)rocess level generative acts with (m)eta level acts, during which process generation methods are invented. As in the original description of the FACE model, we use bar notation to indicate that a particular act was undertaken by the programmer. We use a superscripted asterisk (*) to indicate repetition. As a simple example diagram, Figure 1(b) shows the progression from poetry generator version P1 to P2. In the first version, there are two process stacks, hence the system works in two stages. In the first, the software produces some example poems, and in the second the user chooses one of the poems (to print out, say). The first stack represents two timesteps in development, namely that (a) the programmer had a creative act <Cg> whereby he/she came up with a concept in the form of some code to generate poems, and (b) the programmer ran the software to produce poems in creative acts of the form <Eg>*. The second stack represents the user coming up with an idea for an aesthetic, e.g., much rhyming, in creative act <Ag>, and then applying that aesthetic ag him/herself to the examples, eg, produced by the software, in the selection administrative act [S(ag(eg))], which maps the aesthetic ag : {eg} → [0, 1] over the generated examples, and picks the best one.
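To make the notation concrete, here is a minimal Python sketch. It is our illustration, not part of the paper's formalism, and all names in it are invented for the example: it represents FACE-style generative acts with the bar convention, and implements the selection administrative act [S(ag(eg))] as a function that maps an aesthetic ag : {eg} → [0, 1] over generated examples and keeps the best one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Act:
    """A FACE-style generative act, e.g. kind='A', level='g' for Ag.
    barred=True marks an act performed by the programmer rather than
    by the software (the bar notation in the text)."""
    kind: str     # 'F'raming, 'A'esthetic, 'C'oncept or 'E'xample
    level: str    # 'g'round, 'p'rocess or 'm'eta
    barred: bool = False

def select_best(aesthetic, examples):
    """The administrative act [S(ag(eg))]: apply an aesthetic
    ag : {eg} -> [0, 1] to each generated example and keep the best."""
    return max(examples, key=aesthetic)

# Illustrative use: the user's `much rhyming' aesthetic from Figure 1(b),
# reduced to a toy scoring function over generated poems.
poems = ["roses are red, violets are blue",
         "free verse with no rhyme at all"]
rhyme_score = lambda poem: 1.0 if poem.endswith("blue") else 0.0
print(select_best(rhyme_score, poems))   # the rhyming poem survives
print(Act('C', 'g', barred=True))        # the programmer's Cg act
```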
In the P2 version of the software, the programmer undertakes translation act [T(ag)], writing code that allows the program to apply the rhyming aesthetic itself, which it does at the bottom of the second stack in box P2. Figure 1(c) shows a progression in the HR automated theory formation system (Colton 2002) which took the software to a meta-level, as described in (Colton 2001). HR operates by applying production rules which invent concepts that categorise and describe input data. Each production rule was invented by the programmer during creative acts of the type <Cp>, then at run-time, HR uses the production rules to invent concepts and examples of them in <Cg, Eg>* acts. In the meta-HR version, during the <Cm> creative act, the programmer had the idea of getting HR to form theories about theories, and in doing so, generate concept-invention processes (production rules) in acts of the form <Cp>. The programmer took meta-HR's output and translated it [T(cp)] into an implemented production rule that HR could use, which it does at the bottom of the stack in box H2.

Stage 2: Comparing Diagrams and Output

In both simple cases of Figure 1, it is clear that progress has been made in the strong sense, but not clear in the weak sense, as the output could easily be degraded by the more sophisticated processing of the systems. The diagrams help us to capture the creative interplay between software and programmer at design time and run time. However, given that the ultimate aim of both strong and weak projects is to impress audiences with process and product, any assessment of progress must be done in a context of audience evaluation. However, as mentioned previously, audience evaluation is too expensive to help assess progress on a day-to-day basis. Hence, it seems sensible for the programmer to step in and act as a proxy for a perceived audience: we advocate the programmer putting themselves in the position of the type of person they would expect to form their audience, and answering questions about the products and processes accordingly. Examining the transition from one diagram to another should provide some shortcuts to estimate audience reactions, especially when there are strong project objectives. In particular, as with the original FACE model, the diagrams make it obvious where creative or administrative responsibility has been handed over to software, namely where an act which used to be barred has become unbarred, i.e., the same type of generative act still occurs, but it is now performed by software rather than programmer. This happened when the barred S became an unbarred S in Figure 1(b) and when the barred Cp became an unbarred Cp in Figure 1(c). At the very least in these cases, an unbiased observer would be expected to project more autonomy onto the software, and so progress in the strong sense has likely happened. In addition, the diagrams make it obvious when software is doing more processing, in the sense of having more stacks, bigger stacks or larger tuples of acts in the stack entries. Moreover, the diagrams make it clear when more varied or higher-level creative acts are being performed by the software; again, this was one of the benefits of the original FACE model. Both of these have the potential to convince audience members that software is being more sophisticated with respect to various behaviours described below, and hence can be a shorthand for progress.
When dealing with actual external evaluation, where people don't know what software does, we suggest that the diagrams above (and verbalisations/simplifications of them) can be used to describe to audiences what the software and what the programmer have done in a project. In this way, using also their judgements about the artefacts produced, people can make fully informed decisions in evaluation studies. As a general philosophical standpoint, we suggest not asking people if they believe software is behaving creatively, but rather concentrating on whether they perceive the software as acting uncreatively. Our argument for this is that the concept of creativity is essentially contested (Gallie 1956); hence, no matter how sophisticated our software gets, we should not expect consensus on such matters. However, we have found that people agree much more on notions of uncreativity: if a program doesn't exhibit behaviours onto which certain words like `intentionality' can be projected, then it is very easy to condemn it as being uncreative. Hence, we advocate not asking a set of questions from which we can conclude that an audience member thinks that software is creative, but rather asking questions from which we can determine whether they think that software is acting uncreatively. It may seem like rather a negative admission, but we believe that the best way to get people to accept software as being creative is for them to eventually realise that there is no good reason to call it uncreative. Even then, people would be perfectly at liberty to say that while software is not uncreative, it is not creative either: creativity and uncreativity do not appear to be exact opposites. With this in mind, we have boiled down audience evaluation of behaviour to asking people whether they would project certain words onto software in reaction to understanding what it did in the context of a particular project. We then tentatively conclude that they believe the software is uncreative if they don't project onto it some or all of these words, as originally intended in the creativity tripod proposition (Colton 2008b). In the five years since the introduction of the creativity tripod, we have slowly added additional behaviours which we have found to be important in the perception of creativity in software. That is, for people to take software seriously as being not uncreative, we believe it needs to exhibit behaviours onto which people can meaningfully project (at least) the following words: skill, appreciation, imagination, learning, intentionality, accountability, innovation, subjectivity and reflection. We have found that assessing the level of projection of these words onto the behaviours of software can help us to gauge people's opinions about (the lack of) important higher-level aspects of software behaviour, such as autonomy, adaptability and self-awareness.

Table 1: Guidelines for using change in evaluation of product and process in gauging (O)bvious or (P)otential (P)rogress or (R)egress, in both weak and strong agendas.

Product change | Process change | Weak | Strong
Up   | Up   | OP | OP
Up   | Down | PP | PR
Up   | Same | OP | PP
Down | Up   | PR | PP
Down | Down | OR | OR
Down | Same | OR | PR
Same | Up   | PP | OP
Same | Down | PR | OR
Same | Same | PP | PP
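Read row-wise, Table 1 is just a lookup from the two audience judgements to a verdict per agenda. The sketch below is our own restatement of the table as code, not anything from the paper; the function name and string representation are invented for illustration.

```python
# Table 1 as a lookup: (product change, process change) ->
# (weak-agenda verdict, strong-agenda verdict), where OP/OR are
# obvious progress/regress and PP/PR are potential progress/regress.
GUIDELINES = {
    ("up",   "up"):   ("OP", "OP"),
    ("up",   "down"): ("PP", "PR"),
    ("up",   "same"): ("OP", "PP"),
    ("down", "up"):   ("PR", "PP"),
    ("down", "down"): ("OR", "OR"),
    ("down", "same"): ("OR", "PR"),
    ("same", "up"):   ("PP", "OP"),
    ("same", "down"): ("PR", "OR"),
    ("same", "same"): ("PP", "PP"),
}

def estimate_progress(product_change, process_change, agenda="strong"):
    """Gauge progress from audience evaluations of output and process."""
    weak, strong = GUIDELINES[(product_change, process_change)]
    return weak if agenda == "weak" else strong

# Output improved but the process was judged worse:
print(estimate_progress("up", "down", agenda="weak"))    # PP
print(estimate_progress("up", "down", agenda="strong"))  # PR
```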
The method we suggest for estimating progress from version v1 of a creative system to version v2 is to: (a) show audience members the diagrams for v1 and v2 as above, and explain the acts undertaken by the software; then (b) show audience members the output from v1 and v2; and (c) ask each person to compare the pair of product and process for v1 with that of v2. A statistical analysis could then be used to see whether the audience as a whole evaluates the output as being better, worse or the same, and whether they think that the processing is better, worse or the same in terms of the software seeming less uncreative. This takes into account the phenomenon described in (Colton 2008b) whereby the process can influence value judgements for artefacts. To use this analysis to estimate progress, it's important to first prioritise objectives for the project locally in terms of strong and weak agendas. Then, taking the audience evaluation of change in output and in process, we suggest using the guidelines in Table 1. Here, we have stipulated that certain evaluation pairs indicate obvious progression (OP) or obvious regression (OR). For instance, in the weak sense, when the evaluation of output goes up and the evaluation of process increases or stays the same, it seems clear to indicate obvious progress. Other cases are not so clear-cut, for instance when evaluation of artefacts goes up, but evaluation of process goes down. In this case, we suggest that this is potential progress (PP) in a weak agenda, and potential regress (PR) in a strong agenda. In such cases, we give our judgements for whether it is likely, after more development, that v2 will be viewed retrospectively as a progressive success or a step backwards. Note that we have tended to be optimistic, e.g., when evaluation of output and process stay the same, we say that this is potential progress in both weak and strong agendas. Note also that this table is meant to be used flexibly, possibly in a context of more fine-grained analysis. For instance, the focus of a subproject might be to increase audience perception of intentionality, and if this increases while audience perception of the value of the process as a whole reduces, it should still be seen as progress.

A Case Study in Evolutionary Art

Evolutionary art, where software is evolved to generate abstract art, has been much studied within Computational Creativity circles (Romero and Machado 2007). Based on actual projects which we reference, we hypothesise here the various timelines of progress that could lead from a system with barely any autonomy to one with nearly full autonomy. Figure 2 uses our diagrammatic approach to capture three major lines of development, with the final (hypothetical) system in box 8 representing finality, in the strong sense that the software can do very little more creatively in generating abstract art. Since features from earlier system epochs are often present in later ones, we have colour-coded individual creative acts as they are introduced, so the reader can follow their usage through the systems. If an element repeats with a slight variation (such as the removal of a bar), this is highlighted. The figure includes a key, which describes the most important creative and administrative acts in the systems. Elements in the key are indexed with a dot notation: system.process-stack.subprocess (by number, from left to right, and top to bottom, respectively).
System diagrams have repetitive elements, so that the timelines leading to a system's construction and what it does at run-time can be read in a stand-alone fashion. Following the first line of development, system 1 of Figure 2 represents an entry point for many evolutionary art systems: the programmer invents (Cp) (or borrows) the concept formation process of crossing over sets of mathematical functions to produce offspring sets. He/she also has an idea (Ep) for a wrapper routine which can use such a set of functions to produce images. He/she then uses the program to generate (Cg) a set of functions and employ the wrapper to produce (Eg) an image which is sent to the (P)rinter. The crossover and subsequent image generation are repeated multiple times in system 2, and then the programmer, who has invented (Ag) their own aesthetic, chooses a single image to print. In system 3, as in the poetry example above, the programmer translates their aesthetic into code so the program can select images. This is a development similar to that for the NEvAr system (Machado and Cardoso 2002). Following the second line of development, in system 4, the programmer selects multiple images using his/her own aesthetic preferences, and these become the positives for a machine learning exercise as in (Li et al. 2013). This enables the automatic invention (Ag) of an aesthetic function, which the programmer translates by hand T(ag) from the machine learning system into the software, as in (Colton 2012), so the program can employ the aesthetic without user intervention. In system 5, more automation is added, with the programmer implementing their idea (Cm) of getting the software to search for wrappers, then implementing this (Em), so that the software can invent (Ep) new example generation processes for the system. Following the final line of development, in system 6, we return to aesthetic generation. Here the programmer has the idea (Ap) of getting software to mathematically invent fitness functions, as we did in (Colton 2008a) for scene generation, using the HR system (Colton 2002) together with The Painting Fool (Colton 2012b). In system 7, the programmer realises (Cm) that crossover is just one way to combine sets of functions, and gives (Em) the software the ability to search a space of combination methods (Cp). The software does this, and uses the existing wrapper to turn the functions into images. System 8 is the end of the line for the development of the software, as it brings together all the innovations of previous systems. The software invents aesthetic functions, innovates with new concept formation methods that combine mathematical functions, and generates new wrappers which turn the functions into images. Finally, the programmer has the idea (Fp) of getting the software to write commentaries, as in (Colton, Goodwin, and Veale 2012), about its processing and its results, which it does in generative act Fg.

Figure 2: The progression of an evolutionary art program through eight system epochs. The key below describes the most important creative and administrative acts in the systems.

ID | Event | Explanation
1.1.1 | Cp | The programmer invents the idea of crossing over two sets of mathematical functions to produce a new set of mathematical functions.
1.1.1 | Ep | The programmer implements a wrapper method that takes a set of mathematical functions and applies them to each (x, y) co-ordinate in an image to produce an RGB colour.
1.1.2 | Cg | The software generates a new set of functions by crossing over two pairs of functions.
1.1.2 | Eg | The software applies these functions to the (x, y) co-ordinates of an image, to produce a piece of abstract art.
2.2.1 | Ag | The programmer had in mind a particular aesthetic (symmetry) for the images.
2.2.2 | S(ag(eg)) | The programmer uses his/her aesthetic to select a preferred image for printing.
3.2.2 | T(ag) | The programmer took their aesthetic and turned it into code that can calculate a value for images.
3.2.3 | S(ag(eg)) | The software applies the aesthetic to select one of a set of images produced by crossover and the wrapper.
4.3.1 | Ag | The software uses machine learning techniques to approximate the programmer's aesthetic.
4.3.2 | T(ag) | The programmer hand-translates the machine-learned aesthetic into code.
4.3.3 | S(ag(eg)) | The software applies the new aesthetic to choose the best image from those produced.
5.1.2 | Cm | The programmer has the idea of getting the software to search through a space of wrapper routines.
5.1.2 | Em | The programmer implements this idea.
5.1.3 | Ep | The software invents a new wrapper.
5.4.2 | T(ag) | The software translates the machine-learned aesthetic itself into code.
6.2.1 | Ap | The programmer has the idea of getting the software to invent a mathematical fitness function.
6.2.2 | Ag | The software invents a novel aesthetic function.
6.2.3 | S(ag(eg)) | The software selects the best artefact according to its aesthetic function.
7.1.1 | Cm | The programmer has the idea of getting the software to invent and utilise novel combination techniques for sets of functions, generalising crossover.
7.1.1 | Em | The programmer implements this idea so that the software can invent new combination techniques.
7.1.2 | Cp | The software invents a novel combination technique.
8.4.1 | Fp | The programmer has the idea of getting the software to produce a commentary on its process and artwork by describing its invention of a new aesthetic, combination method and wrapper.
8.4.2 | Fg | The software produces a commentary about its process and product.

Tracking how the system diagrams change can be used to estimate how audiences might evaluate the change in processing of the software, in terms of the extended creativity tripod described above. Intuitively, each system represents progress from the one preceding it, justified as follows:

1 → 2 (<Cg, Eg> becomes <Cg, Eg>*): simple repetition means that the software has more skill, and the introduction of independent user selection shouldn't change perceptions about autonomy.
2 → 3 (a barred S becomes an unbarred S): by reducing user intervention in choosing images, the software should appear to have more skill and autonomy.
1 → 4 (introduction of Ag and S(ag(eg)) acts): machine learning enables the generation of novel aesthetics (albeit derived from human choices), which should increase perception of innovation, appreciation and learning, involving more varied creative acts.
4 → 5 (introduction of an Ep act; a barred T becomes an unbarred T): wrapper generation increases the variety of creative acts, and may increase perception of skill and imagination.
1 → 6 (introduction of Ag and S(ag(eg)) acts): the software has more variety of creative acts, and the invention and deployment of its own aesthetic, this time without any programmer intervention, should increase perception of intentionality in the software.
6 → 7 (introduction of a Cp act): changes in the evolutionary processes should increase perceptions of innovation and autonomy.
5, 7 → 8 (introduction of an Fg act): framing its work should increase perceptions of accountability and reflection. With all strands brought together, the programmer does nothing at run-time and can contribute little more at design time.
The software exhibits behaviours onto which we can meaningfully project words like skill, appreciation, innovation, intentionality, reflection, accountability and learning, which should raise impressions of autonomy, and make it difficult to project uncreativity onto the software.

Discussion

Capturing what programmers and software do creatively over long periods and during complicated program executions is difficult and open to variability. The systems in the above case study could easily have been interpreted and presented differently. In essence, we have provided some tools for presenting software development in terms of creative acts, and suggested a mechanism for turning audience perceptions into estimates of progress. We advise flexible application in both cases. In particular, the difference between potential progress and potential regress is quite subtle. Both mean that it is too early to determine whether progress or regress has been made, and the programmer should proceed with caution: the former suggests cautious optimism and the latter, cautious pessimism. Practically speaking, the programmer may want to review longer-term goals, archive previous versions, and/or clarify research directions. Our approach is currently more tailored to capturing progress in software behaviour than in its output. We would understand some resistance to the approach, particularly from researchers with agendas for Computational Creativity in the weak sense. For example, if product evaluations remain the same, yet processing evaluations go up, this is presumably because the software is performing more sophisticated routines. From a weak perspective, the simpler version of the software clearly has advantages, as it produces the same results in a more understandable way. In certain application domains, for instance mathematical discovery, where aesthetics like truth are of paramount importance, a simpler method for finding a result is usually preferred. While reducing the complexity of processing normally requires considerable invention or intervention, unless such invention is done by the software itself, the resulting simplicity would tend to increase perceptions of uncreativity in software, regardless of (or, indeed, because of) how easy it is to understand what it has done. Our approach is also more tailored towards capturing progress from version to version of the same software than to comparing different programs. However, we have used the formalism to compare systems in the same application domains, such as the mathematical discovery systems AM (Lenat 1976) and HR (Colton 2002), and various poetry and art generators. The comparative approach works somewhat here, because it was possible to compare diagrams meaningfully to suggest where one system would likely be perceived as an improvement over the other. However, full application of the approach may be difficult, as the context for evaluating artefacts (and the processes producing them) can change greatly with small changes in artefact composition. For instance, we recently attempted to compare one-line `What if...?' ideas produced textually by three systems. We found that it was not possible to conceive a fair approach involving an audience to determine which system's artefacts or processes were the best. Fields like Machine Learning have largely homogenised the testing of their systems in a problem-solving paradigm.
Given the tacit requirements for software to surprise us through its output and processing, and to innovate on many levels, it seems unlikely that such standardisation could apply in Computational Creativity research.

Related Work

Diagrammatic approaches to software modelling have been extensively studied in the last two decades. The best-known example is the Unified Modelling Language (UML), managed by the Object Management Group (OMG), a standard that is widely used to visualise the design of systems (www.omg.org/spec). The main objectives of modelling with UML are to represent the architecture of a system, including use cases, deployment, information flow diagrams, etc., and to model system behaviour and data flow via activity diagrams, state machines, sequence diagrams, etc. Progress at the process level can be modelled with UML by diagramming the steps used to complete a task within the system. However, UML is not typically applied to model progress at the level of system epochs, although two UML diagrams can of course be compared on the basis of the functionality they describe. Some diagrams created using the UML model, such as use case diagrams, enable designers to specify the agents that participate in the development of a system: people, external processes, other systems and the system itself can all be modelled as agents. However, there is no formal notation to distinguish between the different agents; rather, they are simply assigned a label which is meaningful for the system designer. The OMG has also developed other graphical notations specialised for other aspects of systems modelling. For instance, the Business Process Model and Notation (BPMN) is used to model business processes by extending the original activity diagrams of UML. The specific objective of BPMN is to provide a high-level overview of business systems, rather than detailed information about how the system works. UML diagrams have also been used in the context of formal methods. In particular, the UML-B language (Said, Butler and Snook 2009) enables the modelling of Event-B specifications as UML-like diagrams. Event-B is a formalism based on set theory for the modelling and verification of systems (Abrial 2010). One of the main aspects of Event-B is the use of refinement to handle the complexity of systems at different levels of abstraction. UML-B can be used to diagrammatically model a system at increasing levels of refinement, and system consistency can then be verified through mathematical proof. However, UML-B considers one system at a time, so it is not possible to use this formalism to model creative change as system development progresses. Using the Event-B formalism, it is possible to model aspects of the environment, such as external systems that affect the behaviour of the modelled system. The aim is to ensure that the designed system will work in harmony with its operating environment. However, there is no clear way to delimit the aspects of the model that are related to the environment and those that are part of the final system. Again, the environment is simply identified by the designer assigning meaningful names to the state representing it. Other related approaches include Z-notation (Spivey 1992), the Vienna Development Method (Jones 1990) and the B-method (Abrial 1996). The objective of these approaches is to verify properties of systems. Progress would be meaningful at the modelling level, i.e., by building models that offer increasing detail (and assurance) about how a given system works.
Petri nets provide a graphical notation used primarily to model systems with concurrency (Girault and Valk 2003). With Petri nets, progress at the process level is modelled in the form of state transitions, and data is represented by abstract tokens, with no data values assigned. An extension, called coloured Petri nets (Jensen and Kristensen 2009), allows data values to be assigned to tokens. Neither type of Petri net is used for modelling changes through versions of a system. Petri nets are an event-based modelling language, and representations of agents (such as the programmer or the system) are not included in the formalism.

Conclusions and Future Work

We have presented a new diagrammatic formalism for assessing progress in building creative systems. Our aims were to enable more precise understanding of progress in Computational Creativity in general, and in mapping the progress of particular systems. In doing so, we aimed to bring closer together public/peer appreciation of progress, strong/weak agendas, and day-to-day/milestone progress assessments. The new approach involves producing diagrams of systems that depict creative acts in timelines, which are compared in a context of audience evaluation of process and product. When applied, the formalism captures some intuitive notions, including: quality of artefacts; quantity, level and variety of creative acts performed; and audience perception of software behaviour. To enable better understanding of process, and more informed audience judgements about (un)creativity, the diagrams explicitly separate creative acts coming from the programmer and the program. Even in the absence of audience participation, the diagrams themselves can be used, in combination with straightforward assumptions about audience reactions to system design features, to perform low-cost estimates of progress in a strong agenda. We motivated the approach throughout with various philosophical standpoints, as per (Colton et al. 2014), supported by a critical review of the ways in which progress in building creative systems has been measured historically. To highlight the potential of the formalism, we presented a case study where the progress through eight versions of evolutionary art software was mapped and justified. Our audience evaluation model is far from complete. We plan to employ the criteria specified in (Ritchie 2007) for more fine-grained evaluations of the quality, novelty and typicality of artefacts. We will also import audience reflection evaluation schemes from the IDEA descriptive model, e.g., change in well-being, cognitive effort and emotional responses such as surprise and amusement. We have so far used the diagrammatic approach to fully depict timelines in the building of generative software producing mathematics, visual art, poetry and video games, including dozens of system diagrams (omitted for space reasons). This has worked well, but there are still some subtle improvements required to capture better the functioning of the software at run-time. Gabriel and Goldman (2000) describe system development environments with many contributing programmers, and multiple interacting, self-programming, and self-updating distributed systems (Gabriel and Goldman 2006). It would be straightforward to modify our formalism to deal with multiple agents, for example by turning bars into superscripts. However, this does complicate the notion of progress: if system α chooses to hand off
creative control to system β, this would amount to changing a superscript, but it's not immediately clear that this should count as progress in the same way that removing bars does. If the agents are considered to be full partners in the creative process, α and β may well have their own perspectives on what counts as progress, and this needs to be formalized. Broadly speaking, we expect that the distinction between strong and weak agendas will eventually disappear: in order to produce higher-quality artefacts, more sophisticated systems involving behaviours perceived as creative will be required, and audiences will expect to project notions of creativity onto software to fully appreciate its output. In such a context, assessing processes and products simultaneously will be important, and we hope versions of this diagrammatic approach will enable this. In (Colton, Goodwin, and Veale 2012), we used the FACE model as a driving force for poetry generation software, rather than as a descriptive tool. We hope that system developers will similarly begin to think about their software in the above diagrammatic terms, in order to suggest interesting new avenues for implementation.

Acknowledgements. This work has been supported by EPSRC grants EP/J004049 and EP/L00206X, and through EC funding for the project COINVENT 611553 by FP7, the ICT theme, and the Future Emerging Technologies FET programme. We would like to thank the anonymous reviewers for their helpful comments.

2014_19 !2014 Can a Computationally Creative System Create Itself? Creative Artefacts and Creative Processes Diarmuid P. O'Donoghue, James Power, Sian O'Briain, Feng Dong*, Aidan Mooney, Donny Hurley, Yalemisew Abgaz, Charles Markham, Department of Computer Science, NUI Maynooth, Co. Kildare, Ireland. *Department of Computer Science and Technology, University of Bedfordshire, Luton, UK.

Abstract. This paper begins by briefly looking at two of the dominant perspectives on computational creativity, focusing on the creative artefacts and the creative processes respectively. We briefly describe two projects, one focused on (artistic) creative artefacts, the other on a (scientific) creative process, to highlight some similarities and differences in approach. We then look at a 2-dimensional model of Learning Objectives that uses independent axes of knowledge and (cognitive) processes. This educational framework is then used to cast artefact and process perspectives into a common framework, opening up new possibilities for discussing and comparing creativity between them. Finally, arising from our model of creative processes, we propose a new and broad 4-level hierarchy of computational creativity, which asserts that the highest level of computational creativity involves processes whose creativity is comparable to that of the originating process itself.

Introduction

Creativity is frequently seen through the search space metaphor (Boden, 1992; O'Donoghue and Crean, 2002; Wiggins, 2006; O'Donoghue et al., 2006; Ritchie, 2012; Veale, 2012; Pease et al., 2013). The space of possible products is represented as physical space, where each location represents a different product. Other search processes have been through this space previously, so a creative search process attempts to focus on regions of this space that have not yet been explored. The space of all search products carries different, often unpredictable values (including novelty).
Boden (1992) identified three levels of creativity, with improbable creativity exploring regions of this search space that are unlikely to have been visited previously. Exploratory creativity deliberately attempts to explore the boundaries of that search space. Transformational creativity attempts to identify and explore new search spaces, to identify products that did not exist in the original search space. Viewing computational creativity through this search space metaphor, we can see that many artistic forms of creativity are adequately described. Artistic styles of creativity can be seen to explore the space of possible creative artefacts from one of the traditional creative domains like art, music, creative writing etc. (as used in Carson et al., 2005). Highly creative individuals transform accepted search spaces to create new possibilities, such as impressionism or cubism. Creative artefacts and creative processes are generally discussed quite separately, with creative products/artefacts attracting the most attention. One criticism often levelled at the discipline of computational creativity is that it is overly focused on creative products, paying too little attention to the process (Stojanov and Indurkhya, 2012; O'Donoghue and Keane, 2012). Analogy and metaphor are often seen as the dominant approaches to process-centred creativity, though evolutionary computing approaches are also popular. These creative processes appear to be generally associated with creativity within scientific or engineering types of disciplines. Thus, the starting point for this paper concerns the two distinct perspectives on computational creativity, focusing on artistic products and scientific processes. Later in this paper we shall use an educational assessment framework to cast both perspectives into a common framework, in order to bring resolution to these apparently conflicting perspectives. It should be noted that even the basic distinction between artistic creativity and scientific creativity is not universally accepted. The noted 19th-century mathematician (and poet) W. R. Hamilton regarded mathematics as an aesthetic creation, akin to poetry, with its own mysteries and moments of profound revelation (from Hankins, 1980). Mathematicians have also compared the aesthetic beauty of various equations, with Euler's identity (e^(iπ) + 1 = 0) ranked the most beautiful equation in mathematics (Wells, 1990). Conversely, the process of analogical reasoning is generally seen as a driving force of scientific creativity (Brown, 2003), but at least one study has shown that analogical reasoning appears to play a part in some contemporary artistic creativity (Okada et al., 2009). Despite these overlaps, we shall proceed with the two basic categories of creative products and creative processes for the purposes of this paper.

Creative Products and Creative Processes

We briefly compare and contrast creative products (or artefacts) and creative processes using two projects that serve to highlight some commonalities and help identify some differences. The first is ImageBlender, which creates new images using complex transformations of two given input images. The second, RegExEvolver, represents simple processes (finite automata) as regular expressions, creating new regular expressions from a given expression. Another criticism often levelled at computationally creative systems is that `Most of them are given, in advance, a detailed (hardcoded) description of the domain' (Stojanov and Indurkhya, 2012).
The two models presented in this paper make minimal assumptions about their relevant problem domains. ImageBlender is based on the assumption that the inspiring set contains images, regardless of what those images depict. RegExEvolver assumes only that the input is a valid regular expression, again with no additional limits. Additionally, the two models take very small inspiring sets of two items and one item respectively. Both systems use the search-and-evaluate strategy of evolutionary computation to explore the space of possible outputs. Both adopt a multi-objective selection strategy (Luke, 2013) to promote the emergence of high-quality outputs. Multi-objective evaluation uses several independent objective functions to evaluate individuals in the population. Evolution then proceeds under the guidance of a Pareto-optimal selection strategy. Finally, both projects use interestingness as one of the objective functions to guide evolution towards the creation of solutions. In both cases interestingness is estimated by the Kolmogorov complexity of the created output. This use of Kolmogorov complexity is slightly different to that discussed by McGregor (2007). Other metrics are used to ensure that the results have some measurable novelty compared to the given inspiring set, by measuring the dissimilarity between an evolved output and the given input(s). These two metrics of interestingness and novelty are used as simple, general-purpose estimates of the quality and novelty (Ritchie, 2001) that are sought by creative systems. We shall now see if these minimal assumptions can prove useful for computational creativity in the absence of more detailed information on the problem domain.

Creative Artefacts from ImageBlender

ImageBlender creates new images by combining two given input images. Well-known techniques exist for combining two images, such as superpositioning those images, or selecting and combining sub-regions of the images using image manipulators like rotation, translation, scaling, reflection etc. Many such techniques can be considered as collage generation, selectively combining parts of (two or more) given images. However, ImageBlender does not operate directly upon the images but explores the space of possible images produced by combining transformed representations of those images. This process might be considered transformational, in that it explores a space of possible images that has not been explicitly explored before (as far as the authors can ascertain). ImageBlender currently focuses on the Fast Fourier Transform (FFT) of those images, creating a new image by combining portions of the phase and frequency information from those images. ImageBlender explores the space of possible images produced by various combinations of FFTs, using the inverse transform (FFT^-1) to produce the resulting image. No restrictions are placed on the input images other than those inherent to the FFT transform. Thus images may be black and white, greyscale, or colour; representing geometric figures, paintings, photographs etc. or any combination of these. ImageBlender uses evolutionary computation to produce creative images, guided by a Pareto-optimal selection strategy. Among the metrics used are a number of estimates of the Kolmogorov complexity of the output image, ensuring there is some appropriate level of interestingness associated with the output images. Other metrics favour new images that are different from both input images.
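The core FFT combination can be illustrated in a few lines of NumPy. This is our own minimal sketch under stated assumptions, not the authors' implementation: it takes the frequency (magnitude) spectrum of one greyscale image and the phase spectrum of the other, inverts the transform, and uses compressed size as a crude stand-in for the Kolmogorov-complexity interestingness estimate.

```python
import zlib
import numpy as np

def blend_images(img_a, img_b):
    """Combine the magnitude spectrum of img_a with the phase spectrum
    of img_b and invert the transform (2-D greyscale arrays, same shape)."""
    fa, fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    blended = np.abs(fa) * np.exp(1j * np.angle(fb))
    out = np.real(np.fft.ifft2(blended))
    out = (out - out.min()) / (out.max() - out.min() + 1e-12)  # rescale
    return (255 * out).astype(np.uint8)

def complexity(img):
    """Crude Kolmogorov-complexity proxy: size of the compressed image."""
    return len(zlib.compress(img.tobytes()))

# Toy inputs echoing the checkerboard-and-circle example of Figure 1.
n = 128
checker = (np.indices((n, n)).sum(axis=0) % 2 * 255).astype(np.uint8)
yy, xx = np.ogrid[:n, :n]
circle = (((xx - n/2)**2 + (yy - n/2)**2 > (n/4)**2) * 255).astype(np.uint8)
out = blend_images(checker.astype(float), circle.astype(float))
print(complexity(checker), complexity(circle), complexity(out))
```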
Interestingly, some of these measures also have a role in assessing the beauty of images. Forsythe et al. (2010) found that visual complexity can be adequately assessed using GIF compression, and that the fractal dimension of an image often appears to be an adequate predictor of people's judgements of beauty. Figure 1 shows two input images formed from black and white pixels only: a checkerboard of alternating black and white pixels (top left) and a black circle on a white background (top right of Figure 1). The grey appearance of the first image is caused by the low-resolution reproduction of alternating black and white pixels. The final image was formed by combining the phase information from one image with the frequency information from the other, forming the third (bottom) image in Figure 1. Surprisingly, the output image has a far higher Kolmogorov complexity than either input image, suggesting a more interesting product. We argue that this output is creative in that it has the properties most frequently associated with creativity: it is novel, interesting, unexpected and (arguably) has some aesthetic, if geometric, beauty. Appendix 1 contains a few more sample images created by ImageBlender.

Creative Processes with RegExEvolver

Computational creativity has addressed process-centred creativity under three main categories: traditional GOFAI (Good Old Fashioned Artificial Intelligence) search processes, evolutionary search, and analogy/metaphor/blending (Veale and O'Donoghue, 2000) approaches. However, instead of focusing on specific processes we look instead at general Turing Machine models of computational processes. In this section we consider the case of creating outputs that are themselves processes. Creating a process rather than an artefact shouldn't in principle be that much of a change, since computational processes are easily represented as strings of characters, parse trees or other structures. Such representations allow a traditional creative search to explore the space of possible artefacts/processes. In fact, evolutionary programming, genetic programming and grammatical evolution regularly output new programs in some executable programming language, though their focus is not normally on creative outputs. This situation, where the creative output is itself a process, also underpins the later section (below) that integrates creative processes and products through a theory of Educational Assessment. A number of previous projects have looked at creating outputs that are themselves processes. Procedural content generation (Togelius et al., 2011) is an emerging area devoted to the creation of game content for playable computer games. Cook et al. (2013) discuss the Mechanic Miner system that generates the game mechanics for platform games using evolutionary computation. However, Mechanic Miner and other procedural content generators are very focused on the domain of platform games and not on general-purpose software development. The Ars model (Pitu et al., 2013) creates formal specifications (in Spec#) for a given implementation (in C#) using analogical reasoning. Due to the creative and arguably unreliable nature of analogical reasoning, Ars uses a theorem prover to validate the inferences it automatically accepts. But unverified specifications may also spur the workaday little-c creativity (Gardner, 1993) of human specification writers.
Finally, we note that Ars is also (potentially) capable of operating in the reverse direction, creating new source code (a process) for a given specification. Many practitioners of computational creativity use the concept of inspiring sets to describe both the creative domain and (a sample of) the artefacts that have already been generated within that domain. In this section we briefly look at the creation of simple computational processes, as represented by Regular Expressions (RegEx). Each regular expression defines a language, and any regular expression can be converted to a Finite State Machine (FSM) that recognises strings from this language. The RegExEvolver project uses just one regular expression for its inspiring set and attempts to create new and potentially useful expressions from it. As a simple example, a regular expression for the registration numbers of Irish vehicles before 2013 would be: [0-9]{2}[A-Z]{1,2}[0-9]{1,5} After this date, a new system was introduced conforming to the following regular expression: [0-9]{2}[1-2]{1}[A-Z]{1,2}[0-9]{1,5} As a second example we consider the rules for valid passwords used in a computer system. Valid passwords may be specified by a regular expression, with different strengths associated with different expressions. A weak expression might accept any combination of letters and numbers, but a stronger expression might require at least one of each of: a lower case letter, an upper case letter, and a digit. RegExEvolver could also be used to create a new password specification given a pre-existing expression. The process is similar to that used in fuzz-testing (Godfried et al, 2012), a software engineering technique used to find bugs in a program. One approach to ('black-box') fuzz testing involves analysing existing test inputs and then generating different, new inputs that may expose previously unknown vulnerabilities. A more sophisticated ('white-box') approach involves analysing the program's source code in order to generate test inputs that cause unexpected combinations of the program's flow of control. Common to both approaches is the goal of creating new combinations that had not been previously envisaged by the testers. RegExEvolver uses evolutionary computation techniques to guide formation of the new RegEx under the guidance of a Pareto-optimal selection technique. The objective functions focus on the original and evolved expressions and also assess the languages that are generated by these expressions. To this end RegExEvolver uses the Xeger tool to generate random strings for any given RegEx. This is achieved by employing standard algorithms to convert the RegEx to an equivalent FSM and then choosing random transitions through this machine. Although repetition (denoted by the Kleene star '*') in a RegEx can theoretically generate a string of infinite length, this is not an issue in practice as it would require the same transition to be chosen every time. In addition to evaluating the generated strings (products) we also evaluate the processes themselves. The generated RegEx is compared to the original (input) RegEx by calculating the intersection of their corresponding FSMs using the dk.brics.automaton package. In this way, evaluation of the new process (the RegEx itself) ensures it overlaps the input expression, while also ensuring it contains some novelty compared to the input expression (a sampling-based sketch of this comparison follows below). However, in the absence of a problem domain, we do not evaluate the usefulness of the generated expressions.
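The project itself performs this comparison in Java, via Xeger and dk.brics.automaton. As a rough Python approximation of the same evaluation idea (ours; it assumes the third-party rstr package, whose xeger() mimics the Xeger tool), one can sample strings from the evolved expression and test how many the original accepts:

    import re
    import rstr  # third-party; rstr.xeger() generates a random match

    def language_overlap(rx_old, rx_new, samples=1000):
        # Monte Carlo stand-in for the FSM-intersection test: the
        # fraction of strings sampled from the evolved expression
        # that the original expression also accepts.
        old = re.compile(rx_old)
        hits = sum(1 for _ in range(samples)
                   if old.fullmatch(rstr.xeger(rx_new)))
        return hits / samples

    pre2013 = r"[0-9]{2}[A-Z]{1,2}[0-9]{1,5}"
    post2013 = r"[0-9]{2}[1-2]{1}[A-Z]{1,2}[0-9]{1,5}"
    print(language_overlap(pre2013, post2013))
    # 0.0 here: the new half-year digit breaks the old two-digit
    # prefix, so the two registration languages are disjoint.

An overlap near 0 signals high novelty; an overlap near 1 suggests the evolved language is (empirically) contained in the original.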
RegExEvolver is focused on generating novel and potentially useful processes at level 3 of the Chomsky hierarchy. However, it is easy to see that other computationally creative processes could generate creative processes at any level of the Chomsky hierarchy. It has been shown that the set of regular languages corresponding to regular expressions (or produced by a regular grammar) at level 3 is a subset of the set of context-free languages at level 2, which in turn is a subset of the set of context-sensitive languages at level 1, and that these in turn are a subset of the set of recursively enumerable languages at level 0 (Chomsky, 1959). Evaluating Creativity Both ImageBlender and RegExEvolver create new outputs without the benefit of any specific context or the constraints and values that frequently arise from such contexts. Thus, evaluating their outputs can be considered all the more difficult. While this might be seen as a weakness, we see it as positive support for the generality of our approach. That is, some creativity is possible without making detailed assumptions about the target domain and without committing to low-level detail that will later limit the breadth or flexibility (Guilford, 1950) of our creative system. Defeasible Creativity Newell, Shaw and Simon (1963) highlighted that one criterion for creativity is that a given answer should cause us to reject an answer that we had previously accepted. From this perspective computational creativity should place its highest value on creativity that contradicts some existing belief, leading to the shock and amazement often associated with H-Creativity. Evaluation plays a central role in computational creativity. We identify two distinct types of evaluation: subjective evaluation and objective evaluation. Subjective evaluation is carried out by a computationally creative process to ensure the quality and novelty of the output. However, the real value of a creative output can only ever be truly determined by an independent group of evaluators; a true determination of its novelty and/or quality can only be made by an independent adjudicator. Objective evaluation relies heavily on consensus reality and thus on some target population of evaluators, either the general public or some target group of critics. To this end, a comprehensive model of computational creativity must, either implicitly or explicitly, incorporate a model of the beliefs of that target group of evaluators. Thus, a Theory of Mind (ToM) is a fundamental issue in computational creativity, be it an explicit theory or one implicitly instantiated in the model and its use of data (such as the inspiring set). Any ToM will suffer inaccuracies and other problems, especially when it is used within the context of creative reasoning. Thus, we conclude that a defining characteristic of computational creativity is that the output can only be truly evaluated and assessed by an independent adjudicator. In effect, the objective metrics used in the two projects described above implicitly incorporate a simple ToM, in terms of the interestingness value estimated by the multi-objective Pareto-optimal values, including the Kolmogorov complexity of the created products. Integrating Creative Products and Creative Processes Creative products and creative processes appear to bring different perspectives to computational creativity.
Often, it appears that these perspectives are almost irreconcilable in terms of their values and objectives. We now explore one means of resolving the apparent differences between the product and process perspectives of computational creativity. The integration we explore is at the cognitive level, but it also bears relevance to other levels of creativity, from the neurological to the sociological. In this section we review some work on education, as this is another discipline that values the creativity of its outputs, promoting the creativity of the students produced by educational systems. Bloom's (1956) taxonomy of Learning Objectives (top of Figure 2) tried to get away from simple rote learning and promote higher forms of learning such as evaluating and analysing. The taxonomy was primarily aimed at informing education and assessment activities. It was aimed at supporting objective assessment of educational activities and thus focuses on measurable and quantifiable properties. While rote learning was seen as the lowest form of educational attainment, synthesis and evaluation were seen as the highest achievements in the original (1956) taxonomy. Creation was only included in this original taxonomy as part of the Synthesis category and, surprisingly, Synthesis was seen as a lower level of attainment than Evaluation. Figure 2: Bloom's Revised Taxonomy (below) places greater emphasis on the role of creativity in educational attainment. Bloom's Revised Taxonomy A subsequent revision of this taxonomy (Anderson and Krathwohl, 2001) (bottom of Figure 2) introduced a number of changes, such as moving from a noun-based to a verb-based form. One of the most significant changes involved the introduction of Create as the highest level of educational attainment and a demotion of Evaluation below the Create level. Figure 3: A 3D representation of Krathwohl's 2D matrix of educational assessment. This is used to view the artefact and process perspectives of computational creativity within a common framework. This image has been reproduced from: A Model of Learning Objectives based on A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives by Rex Heer, Center for Excellence in Learning and Teaching (CELT), Iowa State University. Learning Objectives Matrix As noted by Krathwohl (2002), the unidimensional hierarchy of Bloom's Revised Taxonomy incorporated both noun/knowledge and verb/process and thus was essentially dual in nature. Anderson and Krathwohl (2001) overcame this problem by separating the noun (knowledge) dimension from the verb (process) dimension. This resulted in a two dimensional matrix, with one axis called The Knowledge Dimension, representing the noun related information. The other axis is called The Process Dimension, and this represents verb related information. Towards the origin of this array are found some of the simplest forms of educational attainment, involving rote learning and the listing of facts. Furthest from the origin, then, are the highest forms of educational attainment, notably including "create". At this point we should acknowledge that Krathwohl's original diagram was a simple 2D matrix. However, in Figure 3 we depict a representation due to Rex Heer's model that uses the third dimension to highlight the different levels of attainment.
Thus, the simpler forms of learning are depicted with the least height, while the highest forms of learning are depicted by greater heights. Heer's model and this paper make the assumption, by using the third dimension, that the Cognitive Process and Knowledge Dimensions are represented to the same scale. However, relative heights are merely suggestive of the levels of educational attainment. Learning Objectives are typically stated in the form "The learner will be able to do X with Y", where X is a verb representing the relevant cognitive process and Y is a noun representing the corresponding knowledge. Of course, both X and Y are sourced from the two axes of Figure 3. For example, "The learner will be able to remember the law of supply and demand", where X is "remember" and Y is "the law of supply and demand". The nouns and verbs on the two axes, along with the verbs contained in each vertex of the matrix, provide a terminology and reference points to describe and discuss different creative systems. We propose an adaptation of this taxonomy for the purposes of informing work on computational creativity. Adapting the typical statement of Learning Objectives to the domain of computational creativity, we suggest that we read this as "A computationally creative system should be able to do X with Y", where X and Y are identified from the diagram in Figure 3. Of course, we acknowledge that adopting this matrix is contingent upon accepting some similarity between an artefact and the knowledge that it embodies. We feel that allowing this comparison may provide a new and useful perspective on computational creativity. The Knowledge Dimension Firstly we look at the Knowledge Dimension of Figure 3. This we liken to the artefact perspective of computational creativity, as both are concerned with the production of new ideas in the form of knowledge, or artefacts that represent that knowledge. Factual: knowledge of the basic elements of the discipline, essential facts, terminology and details. Factual knowledge details the basic elements required to function in some discipline: music, art, maths, etc. Conceptual: knowledge of classifications, categories and generalisations; knowledge of theories, models, and structures. This is knowledge about how factual elements can be related and combined to form low level structures, and might include ontological and other knowledge (warm colours, emotive words). Procedural: knowledge of genre-specific skills, algorithms and techniques, and knowledge of criteria for determining when to use appropriate procedures; it details how to do something: skills, algorithms, techniques and methods, including their use. Metacognitive: strategic knowledge, knowledge about cognitive tasks including appropriate contextual and conditional knowledge, self-knowledge, and awareness of one's own cognition (or the system's own cognition). The Cognitive Process Dimension depicted in Figure 3 highlights different levels of cognitive processes. While simple cognitive processes are identified (like remember and understand), our concern is with the create level. Figure 3 depicts create as the highest level of cognitive process. However, it is interesting to note that create and evaluate are seen as distinct regions on the cognitive dimension, given their joint roles in many creative systems.
We shall examine how the creative process interacts with (or relies upon) the four levels of knowledge: factual, conceptual, procedural and metacognitive. Cognitive Processes and Create Before we look at the create level of the Cognitive Process Dimension, we note that the adjacent level of process is evaluate. This would appear to highlight the close relationship between creation and evaluation. For example, at the metacognitive level of evaluation we see the reflect verb, with reflection often being seen as a precursor to creativity. However, this paper is focused on the differing levels of the create cognitive process. Generate: Create Factual Outputs While we may not frequently think of producing new facts as a creative challenge, we can see creativity as sometimes being involved, even when there is a known technique to help generate these facts. Let us consider the domain of prime numbers: whole natural numbers divisible only by themselves and 1. Prime numbers play an important role in cryptography and other domains. A non-creative process may simply list the known prime numbers. However, looking at the creative dimension, we can see that generating a new prime number might be considered a creative task. Let us restrict the set of numbers even further to the set of Mersenne primes, that is, prime numbers that are also Mersenne numbers of the form M_n = 2^n - 1. While this equation looks like it can generate arbitrary prime numbers, in fact most Mersenne numbers are not prime. The Great Internet Mersenne Prime Search project is devoted to discovering ever larger Mersenne prime numbers. Among the reasons for considering this to be a creative task are the enormity of the space of Mersenne numbers and the difficulty of verifying that a given candidate is actually prime (a primality-check sketch follows below).
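As a small, concrete illustration of this Generate level (our example, using the standard Lucas-Lehmer primality test rather than anything from the systems above):

    def is_mersenne_prime(p: int) -> bool:
        # Lucas-Lehmer test: for an odd prime p, M_p = 2**p - 1 is
        # prime iff s == 0 after p - 2 iterations of s -> s*s - 2
        # (mod M_p), starting from s = 4. M_2 = 3 is a special case.
        if p == 2:
            return True
        m = (1 << p) - 1
        s = 4
        for _ in range(p - 2):
            s = (s * s - 2) % m
        return s == 0

    print([p for p in (2, 3, 5, 7, 11, 13) if is_mersenne_prime(p)])
    # [2, 3, 5, 7, 13] -- 2**11 - 1 = 2047 = 23 * 89 is not prime

Even with the test in hand, the sheer size of the candidates is what gives the search its creative character.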
Assemble: Create Conceptual Outputs Creating new concepts might be achieved by combining previously existing concepts, by appropriately assembling a new construct using the lower, factual level of knowledge. This could involve finding or creating new similarities between existing knowledge. Here the creation process is already known or relatively straightforward, with the focus being on the concepts and their creation. That is, the assembly process is already known and is used to create the new knowledge. Many creative systems appear to produce artefacts that introduce new concepts and facts, using systems that do not change while that artefact is being created. Even powerful systems like analogical reasoning and evolutionary computation typically create new concepts in an assembly-like manner. Design: Create Procedural Outputs The next level of creativity aims to design new procedures that might operate on existing or new facts. This level of creativity introduces additional flexibility and creative power, in that the range of possible outputs and artefacts is greatly increased relative to the lower level. Analogical reasoning, evolutionary computation and other approaches might be seen as involving metacognitive creation were they to reflect upon their own processes and use this reflection to guide further progress (while evolutionary strategies take their progress into account through mechanisms like adaptive mutation, such reactions do not usually take the form of metacognitive or reflective modifications to the creative process). Create: Create Meta-Cognitive Outputs These often involve self-knowledge and reflection on that knowledge. The authors are not aware of any computational models addressing this level of computational creativity. Metacognitive and reflective processes may well encompass a Theory of Mind (ToM), as mentioned earlier. However, meta-cognitive aspects are generally not made explicit in most creative systems. Levels of Computational Creativity In this section we build on this joint perspective of knowledge/artefacts and processes. We begin by re-visiting computational creativity, but bearing in mind that creativity is also valued among thinking, processing students. Creating Outputs that themselves Create Artefacts One significant feature of the generated RegEx is that it has a dynamic, productive quality: the created product is itself capable of generating products. In this case the created regular expression is at the least expressive level (level 3) of the Chomsky hierarchy; however, a similar approach can in principle be adopted to generate automata at any level of the Chomsky hierarchy. Interestingly, from a creativity perspective it is relatively straightforward to generate an output process that is at a more complex level than the input expression. That is, an FSA can easily be transformed into a pushdown automaton by introducing an additional rule from a higher level automaton, or by introducing higher level rules that overlap with the pre-existing grammar. While there has been some discussion of the Turing Test and its potential use and adaptation for computational creativity (Boden, 2010; Pease et al, 2012), there have been surprisingly few references to Turing Machines in the various discussions on computational creativity. What limits can we see on the artefacts that are produced by a computationally creative process? Similarly, what limits can we see in the creative processes generated by a creative system? Let us consider a creative system that outputs new and interesting Turing Machines. Earlier in this paper we saw a creative system that created a very simple Turing Machine (a regular expression). Is it possible to generate a creative Turing Machine whose output could be (or at least include) a creative Turing Machine? Turing Machine TM1 can be considered creative only if it generates an output string that was not produced by other machines in its inspiring set; alternatively, it might produce the same output but do so using a different grammar. That is, either the language or the grammar must be different in some novel and useful way. We now look at four levels of computationally creative system that arise from our focus on creative processes. 1. Direct Computational Creativity (DCC): In direct computational creativity the outputs (artefacts or processes) display the novelty and quality attributes associated with creativity. This category includes the majority of work in computational creativity, where the (direct) output of the computational process is seen as creative. The directly created output might be an image, a poem, a piece of music, a recipe, or it might be a computational process such as a regular expression or an evolved program. In terms of the search space metaphor, direct computational creativity searches through the space of novel and useful outputs. 2. Direct Self-Sustaining Creativity (DSC): In direct self-sustaining creativity, the outputs are added to the inspiring set and serve to drive subsequent creative episodes. Supporting this type of creativity involves two distinct factors.
Firstly, the process must be capable of generating multiple creative artefacts, and secondly, the quality of the creative outputs must be adequately judged before inclusion in the inspiring set. Figure 4: Levels and Limits of Computational Creativity. 3. Indirect Computational Creativity (ICC): Indirect computational creativity outputs a creative process, and that process is itself creative. That is, ICC outputs processes, and those processes can be considered as computationally creative systems. We see this as a form of indirect computational creativity, where we attribute creativity to the created process (as well as to its creator). We do not see these created processes as simple variants on some successful template, outputting a family of closely related creative models. Instead, the ICC should itself display an ability to produce processes with the attributes of novelty and quality. 4. Recursively Sustainable Creativity (RSC): This is a further restriction on ICC, where the RSC learns from its own outputs to maintain its own creativity. This would appear to be a very challenging level of computational creativity, creating highly creative processes. RSC represents the most significant challenge for computational creativity arising from this discussion. It would appear that techniques like evolutionary and genetic programming are best suited to producing such creative models. Conclusion The search space metaphor pervades most work on computational creativity but appears to have led towards a divide, between a strong focus on creative artefacts and a weaker focus on creative processes. Two projects are briefly described to highlight some differences between artefact centred and process centred computational creativity. ImageBlender creates new images by combining complex mathematical transformations of two input images. RegExEvolver takes just one regular expression as its input and creates new expressions that differ from it, either in terms of the language produced or in terms of the expression itself. Kolmogorov complexity and other general purpose compression algorithms appear to offer very useful and widely applicable mechanisms for assessing the quality of output artefacts. In particular, they offer a means of assessing the interestingness of creative outputs. In recent work it has been shown that interestingness as estimated by the fractal dimension is closely correlated with judgements of artistic quality (Forsythe et al, 2010). To help clarify the apparent friction between artefact and process centred creativity we turned to educational assessment, as this is another discipline that values creativity among its outputs. We suggest that the 2-dimensional model of Learning Objectives by Anderson and Krathwohl (2001) can offer guidance in comparing creative artefacts and processes. Among its advantages is its 2D matrix, elucidating different levels of attainment achieved along the Cognitive Process Dimension and the Knowledge Dimension. We argue that these two dimensions can be seen as loosely analogous to the Creative Process and the Creative Artefact perspectives that are common to computational creativity. Four increasing levels of creative process were identified, described using the verbs generate, assemble, design and create. Each of these four levels impacts on increasing levels of the knowledge (or artefact) dimension.
Finally, our focus on computationally creative processes allowed us to identify a four-level hierarchy of computational processes. We suggest that the majority of work on computational creativity is at the level of Direct Computational Creativity, and arguably some work approaches the level of Direct Self-Sustaining Computational Creativity. However, we also define two higher levels, the first being Indirect Computational Creativity, which outputs processes that are themselves creative. The final level we call Recursively Sustainable Computational Creativity, and only this highest level is capable of outputting creative processes that are akin in their creative potential to the originating process. Acknowledgements Some of the research leading to these results has received funding from the European Union Seventh Framework Programme [FP7/2007-2013] under grant agreement 611383. We would like to thank John McDonald, Tom Naughton, Ronan Reilly and Stephen Brown for their contributions to the ImageBlender project, and we would like to thank Amy Wall for her assistance with RegExEvolver. 2014_2 !2014 The Social Impact of Self-Regulated Creativity on the Evolution of Simple versus Complex Creative Ideas Liane Gabora University of British Columbia Department of Psychology, Okanagan campus, Arts Building, 3333 University Way Kelowna BC, V1V 1V7, CANADA liane.gabora@ubc.ca Simon Tseng University of British Columbia Department of Engineering, 5000-2332 Main Mall Vancouver BC, V6T 1Z4, CANADA s.tseng@alumni.ubc.ca Abstract Since creative individuals invest in unproven ideas at the expense of propagating proven ones, excess creativity can be detrimental to society; moreover, some individuals benefit from creativity without being creative themselves, by copying creators. This paper builds on previous studies of how societies evolve faster by tempering the novelty-generating effects of creativity with the novelty-preserving effects of imitation. It was hypothesized (1) that this balance can be achieved through self-regulation (SR) of creativity, by varying how creative one is according to the value of one's creative outputs, and (2) that the social benefit of SR is affected by the openness of the space of possible ideas. These hypotheses were tested using EVOC, an agent-based model of cultural evolution in which each agent self-regulated its invention-to-imitation ratio as a function of the fitness of its inventions. We compared SR to non-SR societies, and compared societies in which the space of possible ideas was open-ended, because agents could chain simple ideas into complex ones, to societies without chaining, for which the space of possible ideas was fixed. Agents in SR societies gradually segregated into creators and imitators, and changes in diversity were more rapid and more pronounced than in non-SR societies. The mean fitness of ideas was higher in SR than in non-SR societies, but this difference was temporary without chaining whereas it was permanent with chaining. We discuss limitations of the model and possible social implications of the results. Keywords: agent-based model; creativity; imitation; individual differences; self-regulation; cultural evolution; EVOC. Introduction It is commonly assumed that creativity is desirable, and that the more creative one is, the better. Our capacity for self-expression, problem solving, and making aesthetically pleasing artifacts all stem from our creative abilities. However, individuals often claim that their creativity is stifled by social norms, policies, and institutions.
Moreover, our educational systems do not appear to prioritize the cultivation of creativity, and in some ways discourage it. Perhaps there is an adaptive value to these seemingly mixed messages that society sends about the social desirability of creativity. Perhaps what is best for society is that individuals vary widely with respect to how creative they are, so as to ensure that the society as a whole both generates novel variants and preserves the best of them. This paper provides a computational test of the following hypotheses. The first hypothesis is that society as a whole benefits when individuals can vary how creative they are in response to the perceived effectiveness of their ideas. In theory, if effective creators create more, and ineffective creators create less, the ideas held by society should collectively evolve faster. The second hypothesis is that the space of possible ideas has to be open-ended in order to benefit from this self-regulation mechanism. In theory, the effectiveness of such self-regulation should vary with the extent to which some ideas are fitter or more effective than others. Definition and Key Features of Creativity There is a plethora of definitions of creativity in the literature; nevertheless, it is commonly accepted that a core characteristic of creativity is the production of an idea or product that meets two criteria: originality or novelty, and appropriateness, adaptiveness, or usefulness, i.e., relevance to the task at hand (Guilford 1950; Moran 2011). Not only are humans individually creative, but we build on each other's ideas such that over centuries, art, science, and technology, as well as customs and folk knowledge, can be said to evolve. This cumulative building of new innovations on existing products is sometimes referred to as the ratchet effect (Tomasello, Kruger, and Ratner 1993). Creativity has long been associated with personal fulfillment (May 1975; Rogers 1959), self-actualization (Maslow 1959), and maintaining a competitive edge in the marketplace. Thus it is often assumed that more creativity is necessarily better. However, there are significant drawbacks to creativity (Cropley et al. 2010; Ludwig 1995). Generating creative ideas is difficult and time consuming, and a creative solution to one problem often generates other problems, or has unexpected negative side effects that may only become apparent after much effort has been invested. Creativity is correlated with rule bending, law breaking, and social unrest (Sternberg and Lubart 1995; Sulloway 1996), aggression (Tacher and Readdick 2006), group conflict (Troyer and Youngreen 2009), and dishonesty (Gino and Ariely 2012). Creative individuals are more likely to be viewed as aloof, arrogant, competitive, hostile, independent, introverted, lacking in warmth, nonconformist, norm doubting, unconscientious, and unfriendly (Batey and Furnham 2006; Qian, Plucker, and Shen 2010; Treffinger et al. 2002). They tend to be more emotionally unstable, more prone to affective disorders such as depression and bipolar disorder, and have a higher incidence of schizophrenic tendencies than other segments of the population (Andreason 1987; Eysenck 1993; Flaherty 2005). They are also more prone to drug and alcohol abuse, as well as suicide (Jamison 1993; Goodwin 1998; Rothenberg 1990; Kaufman 2003). This suggests that there is a cost to creativity, both to the individual and to society.
Balancing Novelty with Continuity Given the correlation between creativity and personality traits that are potentially socially disruptive, it is perhaps fortunate that in a group of interacting individuals, not all of them need be particularly creative for the benefits of creativity to be felt throughout the group. The rest can reap the rewards of the creators' ideas by copying them, buying from them, or simply admiring them. Few of us know how to build a computer, or write a symphony, but they are nonetheless ours to use and enjoy. Of course, if everyone relied on the strategy of imitating others rather than coming up with their own ideas, the generation of cultural novelty would grind to a halt. On the other hand, if everyone were as creative as the most creative amongst us, the frequency of the above-mentioned antisocial tendencies of creative people might be sufficiently high to interfere with cultural stability, i.e., the perpetuation of cultural continuity. It is well known in theoretical biology that both novelty and continuity are essential for evolution, that is, for cumulative, open-ended, adaptive change over time. This need for both novelty and continuity was demonstrated in an agent-based model of cultural evolution (Gabora 1995). Novelty was injected into the artificial society through the invention of new actions, and continuity was preserved through the imitation of existing actions. When agents never invented, there was nothing to imitate, and there was no cultural evolution at all. If the ratio of invention to imitation was even marginally greater than 0, not only was cumulative cultural evolution possible, but eventually all agents converged on optimal cultural outputs. When all agents always invented and never imitated, the mean fitness of cultural outputs was also sub-optimal, because fit ideas were not dispersing through society. The society as a whole performed optimally when the ratio of creating to imitating was approximately 2:1. Although results obtained with a simple computer model may have little bearing on complex human societies, the finding that extremely high levels of creativity can be detrimental to society suggests that there may be an adaptive value to society's ambivalent attitude toward creativity. This suggested that society as a whole might benefit from a distinction between the conventional workforce and what has been called a creative class (Florida 2002). This was investigated in the model by introducing two types of agents: imitators, which only obtained new actions by imitating neighbors, and creators, which obtained new actions either by inventing or imitating (Gabora and Firouzi 2012). It was possible to vary the probability that creators create versus imitate; thus, whereas a given agent was either a creator or an imitator throughout the entire run, the proportion of creators innovating or imitating in a given iteration fluctuated stochastically. The mean fitness of ideas across the artificial society was highest when not all agents were creators. Specifically, there was a tradeoff between C, the proportion of creators to imitators in the society, and p, how creative the creators were. This provided further support for the hypothesis that society as a whole functions optimally when creativity is tempered with continuity. We then hypothesized that society as a whole might perform even better if individuals are able to adjust how creative they are over time in accordance with their perceived creative success.
For example, this could result from mechanisms such as selective ostracization of deviant behaviour unless accompanied by the generation of valuable novelty, and encouragement or even adulation of those whose creations are successful. In this way society might self-organize into a balanced mix of novelty generating creators and continuity perpetuating imitators, both of which are necessary for cumulative cultural evolution. A first step in investigating this hypothesis was to determine whether it is algorithmically possible to increase the mean fitness of ideas in a society by enabling individuals to self-regulate how creative they are, and to investigate the conditions under which this is possible. The Computational Model We investigated this using an agent-based model of cultural evolution referred to as EVOlution of Culture, abbreviated EVOC (Gabora 2008)1. It uses neural network based agents that (1) invent new ideas, (2) imitate actions implemented by neighbors, (3) evaluate ideas, and (4) implement successful ideas as actions. EVOC is an elaboration of Meme and Variations, or MAV (Gabora 1995), the earliest computer program to model culture as an evolutionary process in its own right, as opposed to modeling the interplay of cultural and biological evolution2. The goal behind MAV, and also behind EVOC, was to distil the underlying logic of cultural evolution, i.e., the process by which ideas adapt and build on one another in the minds of interacting individuals. 1The code is freely available; to gain access please contact the first author by email at liane.gabora@ubc.ca. 2The approach can thus be contrasted with computer models of how individual learning affects biological evolution (Best 1999; Higgs 1992; Hinton and Nowlan 1992; Hutchins and Hazelhurst 1991). Agents do not evolve in a biological sense, as they neither die nor have offspring, but do in a cultural sense, by generating and sharing ideas for actions. In cultural evolution, the generation of novelty takes place through invention. EVOC was originally developed to compare and contrast the processes of biological and cultural evolution, but has subsequently been used to address such questions as how the presence of leaders, or barriers to the diffusion of ideas, affects cultural evolution. We now summarize the architecture of EVOC in sufficient detail to explain our results; for further details we refer the reader to previous publications (Gabora 2008; Leijnen and Gabora 2009). Agents Agents consist of (1) a neural network, which encodes ideas for actions and detects trends in what constitutes a fit action, (2) a perceptual system, which observes and evaluates neighbours' actions, and (3) a body, consisting of six body parts which implement actions. The neural network is composed of six input nodes and six corresponding output nodes that represent concepts of body parts (LEFT ARM, RIGHT ARM, LEFT LEG, RIGHT LEG, HEAD, and HIPS), and seven hidden nodes that represent more abstract concepts (LEFT, RIGHT, ARM, LEG, SYMMETRY, OPPOSITE, and MOVEMENT). Input nodes and output nodes are connected to hidden nodes of which they are instances (e.g., RIGHT ARM is connected to RIGHT). Each body part can occupy one of three possible positions: a neutral or default position, and two other positions, which are referred to as active positions. Activation of any input node activates the MOVEMENT hidden node.
Same-direction activation of symmetrical input nodes (e.g., positive activation, which represents upward motion, of both arms) activates the SYMMETRY node. The entire reason for the neural network is to enable agents to learn trends over time concerning what general types of actions tend to be valuable, and to use this learning to invent new actions more effectively. Without the neural network, agents invent at random and the fitness of their inventions increases much more slowly (Gabora, 2008). Invention An idea for a new action is a pattern consisting of six elements that dictate the placement of the six body parts. Agents generate new actions by modifying their initial action or an action that has been invented previously or acquired through imitation. During invention, the pattern of activation on the output nodes is fed back to the input nodes, and invention is biased according to the activations of the SYMMETRY and MOVEMENT hidden nodes. We emphasize that were this not the case there would be no benefit to using a neural network. To invent a new idea, for each node of the idea currently represented on the input layer of the neural network, the agent makes a probabilistic decision as to whether the position of that body part will change, and if it does, the direction of change is stochastically biased according to the learning rate. If the new idea has a higher fitness than the currently implemented idea, the agent learns and implements the action specified by that idea. When chaining is turned on, an agent can keep adding new sub-actions and thereby execute a multi-step action, so long as the most recently-added sub-action is both an optimal sub-action and different from the previous sub-action of that action (Gabora, Chia, and Firouzi 2013). Imitation The process of finding a neighbour to imitate works through a form of lazy (non-greedy) search. The imitating agent randomly scans its neighbours, and adopts the first action that is fitter than the action it is currently implementing. If it does not find a neighbour that is executing a fitter action than its own current action, it continues to execute the current action. Evaluation: The Fitness Function Following Holland (1975), we refer to the success of an action in the artificial world as its fitness, with the caveat that, unlike its usage in biology, here the term is unrelated to the number of offspring (or ideas derived from a given idea). The fitness function used in these experiments rewards activity of all body parts except for the head, symmetrical limb movement, and positive (upward) limb movement. The fitness of a single-step action, Fn, is determined as per Eq. 1. Total body movement, m, is calculated by adding the number of active body parts, i.e., body parts not in the neutral position. Fn = m + 5(sa + st) + 2(pa + pt) + 10ah + 2ap (1) where sa = 1 if the arms move symmetrically, 0 otherwise; st = 1 if the legs move symmetrically, 0 otherwise; pa = 1 if both arms move upwards, 0 otherwise; pt = 1 if both legs move upwards, 0 otherwise; ah = 1 if the head is stationary, 0 otherwise; and ap = the number of body parts moving upwards. Note that there are multiple optima. (For example, an action can be optimal if either both arms move up or both arms move down.) The fitness Fc of a multi-step action with n chained single-step actions (the k-th with fitness Fk) is calculated by Eq. 2: Fc = Σ(k=1..n) Fk / 1.2^(k-1) (2)
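A direct transcription of the two fitness equations (ours, under the reconstructed reading of Eq. 2 given above; an action is taken to be a 6-tuple of positions with 0 the neutral position and +1/-1 the two active positions, +1 read as upward):

    def fitness_single(action):
        # Eq. 1. action = (left arm, right arm, left leg, right leg,
        # head, hips), each in {-1, 0, +1}.
        la, ra, ll, rl, head, hips = action
        m = sum(1 for part in action if part != 0)    # active body parts
        s_a = 1 if la == ra != 0 else 0               # arms symmetrical
        s_t = 1 if ll == rl != 0 else 0               # legs symmetrical
        p_a = 1 if la == ra == 1 else 0               # both arms upward
        p_t = 1 if ll == rl == 1 else 0               # both legs upward
        a_h = 1 if head == 0 else 0                   # head stationary
        a_p = sum(1 for part in action if part == 1)  # parts moving up
        return m + 5 * (s_a + s_t) + 2 * (p_a + p_t) + 10 * a_h + 2 * a_p

    def fitness_chained(steps):
        # Eq. 2: each successive sub-action contributes with a 1/1.2
        # discount per step (k starts at 0, i.e. 1.2**(k-1) for 1-based k).
        return sum(fitness_single(step) / 1.2 ** k
                   for k, step in enumerate(steps))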
Learning Invention makes use of the ability to detect, learn, and respond adaptively to trends. Since no action acquired through imitation or invention is implemented unless it is fitter than the current action, new actions provide valuable information about what constitutes an effective idea. Knowledge acquired through the evaluation of actions is translated into educated guesses about what constitutes a successful action by updating the learning rate. For example, an agent may learn that more overall movement tends to be either beneficial (as with the fitness function used here) or detrimental, or that symmetrical movement tends to be either beneficial (as with the fitness function used here) or detrimental, and bias the generation of new actions accordingly. The Artificial World These experiments used a default artificial world: a toroidal lattice with 1024 cells, each occupied by a single, stationary agent, and a von Neumann neighborhood structure. Creators and imitators were randomly dispersed. A Typical Run Fitness and diversity of actions are initially low because all agents are initially immobile, implementing the same action, with all body parts in the neutral position. Soon some agent invents an action that has a higher fitness than immobility, and this action gets imitated, so fitness increases. Fitness increases further as other ideas get invented, assessed, implemented as actions, and spread through imitation. The diversity of actions increases as agents explore the space of possible actions, and then decreases as agents hone in on the fittest actions. Thus, over successive rounds of invention and imitation, the agents' actions improve. EVOC thereby models how descent with modification occurs in a purely cultural context. Method To test the hypothesis that the mean fitness of cultural outputs across society increases faster with social regulation (SR) than without it, we increased the relative frequency of invention for agents that generated superior ideas, and decreased it for agents that generated inferior ideas. To implement this, the computer code was modified as follows. Each iteration, for each agent, the fitness of its current action relative to the mean fitness of actions for all agents at the previous iteration was assessed. Thus we obtained the relative fitness (RF) of its cultural output. The agent's personal probability of creating, p(C), was a function of RF, calculated as follows: p(C)n = 1 if p(C)n-1 · RFn-1 > 1, and p(C)n = p(C)n-1 · RFn-1 otherwise. (3) The probability of imitating, p(I), was 1 - p(C). Thus when SR was on, if relative fitness was high the agent invented more, and if it was low the agent imitated more. p(C) was initialized at 0.5 for both SR and non-SR societies. We compared runs with SR to runs without it, both with and without the capacity to chain simple ideas into more complex ones.
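Eq. 3 amounts to a one-line multiplicative update, clipped at 1 (a sketch in our notation):

    def update_p_create(p_prev, rel_fitness):
        # Eq. 3: scale the probability of inventing by last
        # iteration's relative fitness, capping it at 1.
        return min(1.0, p_prev * rel_fitness)

    p_c = 0.5                      # initial p(C) for every agent
    for rf in (1.3, 0.6, 1.8):     # illustrative relative-fitness values
        p_c = update_p_create(p_c, rf)
        p_i = 1.0 - p_c            # p(I) = 1 - p(C)

Agents whose outputs beat the population mean (RF > 1) thus drift toward pure invention, while below-average inventors drift toward pure imitation.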
Results All data are averages across 250 runs. We first present the results of experiments in which chaining was turned off and thus only simple inventions were possible. Second, we present the results of experiments with chaining turned on, such that simple ideas could be combined into increasingly complex inventions. The Effect of Social Regulation with No Chaining With chaining turned off, the mean fitness of the cultural outputs of societies with SR (the ability to self-regulate inventiveness as a function of inventive success) was higher than that of societies without SR, as shown in Figure 1. However, the difference between SR and non-SR societies is only temporary; it lasts for the duration that the space of possible ideas is being explored. In both SR and non-SR societies the mean fitness of actions plateaued when all agents converged on optimally fit ideas. Thus the value of segregating into creators and imitators is short-lived. Figure 1: This graph plots the mean fitness of implemented actions across all agents over the duration of the run without chaining, with and without social regulation. The diversity, or number of different ideas, exhibited an increase as the space of possibilities was explored, followed by a decrease as agents converged on fit actions, as shown in Figure 2. This pattern is typical in evolutionary scenarios where outputs vary in fitness. What is of particular interest here is that this pattern occurred earlier, and was more pronounced, in societies with SR than in societies without it. Inferior creators were evidently inventing the same ideas, so decreasing their creativity had little effect on diversity. On the other hand, superior creators were diverging in a variety of different directions, so making them more creative did increase diversity. As illustrated in Figure 3, in societies with SR, while all agents initially invented and imitated with equal frequency, encouraging effective creators to create and discouraging ineffective creators did eventually cause them to segregate into two distinct groups: one that invented, and one that imitated. Thus, whereas any point along the Pareto frontier was optimal behaviour from an individual standpoint, agents all piled up at the extreme ends, and the society as a whole benefited from this division of labour. Figure 2: This graph plots the mean diversity of implemented actions across all agents over the duration of the run without chaining, with and without social regulation. Thus the observed increase in fitness can indeed be attributed to increasingly pronounced individual differences in degree of creativity over the course of a run; agents that generated superior cultural outputs had more opportunity to do so, while agents that generated inferior cultural outputs became more likely to propagate proven effective ideas rather than reinvent the wheel. The Effect of Social Regulation with Chaining With chaining turned on, cultural outputs got increasingly fitter over the course of a run, as shown in Figure 4. This is because a fit action could always be made fitter by adding another sub-action. Note that with chaining turned on, although the number of different actions decreases, the agents do not converge on a static set of actions; the set of implemented actions changes continuously as they find new, fitter actions. As was the case without chaining, the diversity of ideas with chaining turned on exhibited an increase as the space of possibilities was explored, followed by a decrease as agents converged on fit actions, and once again this pattern was more pronounced in societies with SR than in societies without it, as shown in Figure 5. Interestingly, however, diversity no longer peaks later for non-SR than for SR. Because, with the capacity to chain simple ideas into increasingly complex ideas, the pool of possible ideas is now unconstrained, it no longer makes sense to converge quickly on optimal ideas. Indeed, there no longer is a fixed set of optimal ideas. As was the case in the experiments without chaining, societies with SR ended up separating into two distinct groups: one that primarily invented, and one that primarily imitated.
Discussion The goal of this paper was not to develop a realistic model of creativity but to investigate whether, with respect to creativity, there can be too much of a good thing. Are the needs of the individual for creative expression at odds with society's need to reinforce conventions and established protocols? Figure 3: This graph plots the fitness of actions obtained through invention on the y axis and through imitation on the x axis. Fitness values are given as a proportion of the fitness of an optimally fit action. The Pareto frontier indicates the range of possible ways an agent can behave optimally: either by always inventing optimally (upper left corner), by always implementing an optimal action obtained by imitating a neighbour (bottom right corner), or by implementing optimal actions obtained through some combination of inventing and imitating (all other points along the curve). Each small red circle shows the mean fitness of an agent's actions obtained through invention and imitation, averaged across ten iterations: iterations 1 to 10 in the top graph, 25 to 35 in the middle graph, and 90 to 100 in the bottom graph. Since by iteration 90 all values were piled up in two spots, the upper left and the bottom right, they are indicated by large red circles at these locations. Figure 5: This graph plots the mean diversity of implemented actions across all agents over the duration of the run with chaining, with and without social regulation. EVOC agents are too rudimentary to suffer the affective penalties of creativity, but the model incorporates another drawback of creativity: time spent inventing is time not spent imitating. Because creative agents spend their time inventing new ideas at the expense of social learning of proven ideas, they effectively rupture the fabric of the artificial society; they act as insulators that impede the diffusion of proven solutions. Imitators, in contrast, serve as a cultural memory that ensures the preservation of successful ideas. When effective inventors created more and poor inventors created less, the society as a whole could capitalize on the creative abilities of the best inventors and on the efforts of the rest to disseminate fit cultural outputs. This effect was temporary when agents were limited to a finite set of simple ideas; in other words, when the set of possible ideas was finite, the benefits of self-regulated creativity were short-lived. However, when agents were able to chain simple ideas into complex ideas, and thus the space of possible ideas was open-ended, the benefits of self-regulation of creativity increased throughout the duration of a run. The results suggest that it can be beneficial for a social group if individuals are allowed to follow different developmental trajectories in accordance with their demonstrated successes, but only if the space of possible ideas is open-ended enough that there are always avenues for new creative ideas to explore. It has been suggested that the capacity to chain together ideas for simple actions to generate ideas for complex actions, such that the space of possible ideas was open-ended, emerged some 1.7 million years ago, around the time of the transition from Homo habilis to Homo erectus (Donald 1991). This hypothesis is supported by mathematical (Gabora and Aerts 2009; Gabora and Kitto 2013) and computational (Gabora and Saberi 2011; Gabora and DiPaola 2012; Gabora, Chia, and Firouzi 2013) modelling.
The fact that self-regulation of creativity was only found to be of lasting value in societies composed of agents capable of chaining suggests that there may have been insufficient selective pressure for the self-regulation of creativity before this transition. Thus, prior to this time there would have been little variation in creativity across individuals in a social group, with pronounced individual differences in creativity emerging after this time. These results do not prove that in real societies successful creators invent more and unsuccessful creators invent less; they merely show that this kind of self-regulation is a feasible means of increasing the mean fitness of creative outputs. However, the fact that strong individual differences in creativity exist (Kaufman 2003; Wolfradt and Pretz 2001) suggests that this occurs in real societies. Whether prompted by individuals themselves or mediated by way of social cues, families, organizations, or societies may spontaneously self-organize to achieve a balance between the creative processes that generate innovations and the imitative processes that disseminate these innovations. In other words, they evolve faster by tempering novelty with continuity. A more complex version of this scheme is that individuals find a task at which they excel, such that for each task domain there exists some individual in the social group who comes to be best equipped to explore that space of possibilities. The social practice of discouraging creativity until the individual has proven him- or herself may serve as a form of social self-regulation, ensuring that creative efforts are not squandered. Individuals who are tuned to social norms and expectations may over time become increasingly concerned with imitating and cooperating with others in a manner that promotes cultural continuity. Their thoughts travel increasingly well-worn routes, and they become less likely to innovate. Others might be tuned to the demands of creative tasks, and less tethered to social norms and expectations, and thereby more likely to see things from unconventional perspectives. Thus they are more likely to come up with solutions to problems or unexpected challenges, find new avenues for self-expression, and contribute to the generation of cultural novelty. In other words, what Cropley et al. (2010) refer to as the dark side of creativity may reflect that the creative individual is tuned to task needs at the expense of human needs. Although in the long run this benefits the group as a whole because it results in creative outputs, in the short run the creative individual may be less likely to obey social norms and live up to social expectations, and may experience stigmatization or discrimination as a result, particularly in his/her early years (Craft 2005; Scott 1999; Torrance 1963). Once the merits of such individuals' creative efforts become known, they may be supported or even idolized. Limitations of this work include that the fitness function was static throughout a run, and that agents had only one action to optimize. In real life, there are many tasks, and a division of labor such that each agent specializes in a few tasks and imitates other agents to carry out other tasks. It may be that no one individual is an across-the-board creator or imitator but that different individuals find different niches for domain-specific creative outputs. Another limitation is that currently EVOC does not allow an agent to imitate some features of an idea and not others.
This would be useful because cultural outputs, both in EVOC and in the real world, exhibit a version of what in biology is referred to as epistasis, wherein what is optimal with respect to one component depends on what is going on with respect to another. Once both components have been optimized in a mutually beneficial way (in EVOC, for example, symmetrical arm movement), excess creativity risks breaking up co-adapted partial solutions. In future studies we will investigate the effects of enabling partial imitation. Acknowledgments This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Flemish Fund for Scientific Research, Belgium. 2014_20 !2014 Automatic Detection of Irony and Humour in Twitter Francesco Barbieri Horacio Saggion Pompeu Fabra University Pompeu Fabra University Barcelona, Spain Barcelona, Spain francesco.barbieri@upf.edu horacio.saggion@upf.edu Abstract Irony and humour are just two of many forms of figurative language. Approaches to identify humorous or ironic statements in vast volumes of data such as the internet are important not only from a theoretical view point but also for their potential applicability in social networks or human-computer interactive systems. In this study we investigate the automatic detection of irony and humour in social networks such as Twitter, casting it as a classification problem. We propose a rich set of features for text interpretation and representation to train classification procedures. In cross-domain classification experiments our model achieves and improves upon state-of-the-art performance. Introduction Irony and humour are just two examples of figurative language (Reyes, Rosso, and Veale 2013). Approaches to identify humorous or ironic statements in vast volumes of data such as the internet are important not only from a theoretical view point but also for their potential applicability in social network analysis and human-computer interactive systems. Systems able to select humorous/ironic statements on a given topic to present to a user are important in human-machine communication. It is also important for a system to be able to recognise when users are being ironic/humorous in order to deal appropriately with their requests. Irony also has relevance in the field of sentiment analysis and opinion mining (Pang and Lee 2008), since it can be used to express a negative statement in an apparently positive way. However, irony detection appears to be a difficult problem, since ironic statements are used to express the contrary of what is being said (Quintilien and Butler 1953), and are therefore a tough nut to crack for current systems. Reyes et al. (2013) approach the problem as one of classification, training machine learning algorithms to separate ironic from non-ironic statements. Humour has been studied for a number of years in computational linguistics in terms of both humour generation (Stock and Strapparava 2006; Ritchie and Masthoff 2011) and interpretation (Mihalcea and Pulman 2007; Taylor and Mazlack 2005). In particular it has also been approached as classification by Mihalcea and Strapparava (2005), who created a specially designed corpus of one-liners (i.e., one-sentence jokes) as the positive class and headlines and other short statements as a negative class. Following these lines of research, we first try to detect
these topics separately; then, since they are both figurative language and may have some correlation, we also try to detect them at the same time (we use the union of them as positive examples). This last experiment is interesting as it will give us hints for figurative language detection, and hence will help us explore new aspects of creativity in language (Veale and Hao 2010b). This experiment can be seen as a small step toward the design of a machine capable of evaluating creativity, and with further work also capable of generating creative utterances. Our dataset is composed of text retrieved from the microblogging service Twitter1. For the experiments presented in this paper we use a dataset created for the study of irony detection, which allows us to compare our findings with recent state-of-the-art approaches (Reyes, Rosso, and Veale 2013). The dataset also contains humorous tweets, therefore being appropriate for our purpose. The contributions of this paper are as follows: the evaluation of our irony detection model (Barbieri and Saggion 2014) on humour classification; a comparison of our model with the state of the art; and a novel set of experiments to demonstrate cross-domain adaptation. The paper will show that our model achieves and improves state-of-the-art performance, and that it can be applied to different domains.
1 https://twitter.com/
Related Work Verbal irony has been defined in several ways over the years, but there is no consensual agreement on its definition. The standard definition considers it saying the opposite of what you mean (Quintilien and Butler 1953), where the opposition of literal and intended meanings is very clear. Grice (1975) believes that irony is a rhetorical figure that violates the maxim of quality: Do not say what you believe to be false. Irony is also defined (Giora 1995) as any form of negation with no negation markers (as most of the ironic utterances are affirmative, and ironic speakers use indirect negation). Wilson and Sperber (2002) defined it as an echoic utterance that shows a negative aspect of someone else's opinion. Finally, irony has been defined as a form of pretence by Utsumi (2000) and Veale and Hao (2010b). Veale states that ironic speakers usually craft their utterances in spite of what has just happened, not because of it. The pretence alludes to, or echoes, an expectation that has been violated. Past computational approaches to irony detection are scarce. Carvalho et al. (2009) created an automatic system for detecting irony relying on emoticons and special punctuation. They focused on detection of ironic style in newspaper articles. Veale and Hao (2010a) proposed an algorithm for separating ironic from non-ironic similes, detecting common terms used in this ironic comparison. Reyes et al. (2013) have recently proposed a model to detect irony in Twitter, which is based on four groups of features: signatures, unexpectedness, style, and emotional scenarios. Their classification results support the idea that textual features can capture patterns used by people to convey irony. Among the proposed features, skip-grams (part of the style group), which capture word sequences that contain (or skip over) arbitrary gaps, seem to be the best one. Computational approaches to humour generation include, among others, the JAPE system (Ritchie 2003) and the STANDUP riddle generator program (Ritchie and Masthoff 2011), which are largely based on the use of a dictionary for humorous effect.
It has been argued that humorous discourse depends on the fact that it can have multiple interpretations, that is, it is ambiguous. These characteristics are explored in approaches to humour detection. Mihalcea and Strapparava (2005) study classification of a restricted type of humorous discourse: one-liners, which have the purpose of producing a humorous effect in very few words. They created a dataset semi-automatically by retrieving itemized sentences from web sites whose URLs contain words such as oneliner, humour, joke, etc. Non-humorous data was created using Reuters titles, proverbs, and sentences extracted from the British National Corpus. They use two types of models to separate humorous from non-humorous texts. On the one hand, a specially designed set of features is created to model alliteration, antonymy, and slang of a sexually oriented nature. On the other hand, they tried a word-based text classification algorithm. Unsurprisingly, the word-based classifier is much more effective than the specially designed features. In (Mihalcea and Pulman 2007) additional features to model violated expectations, human-oriented activities, and polarity are introduced. Veale (2013) also created a dataset of humorous similes by querying the web with specific simile patterns. Data and Text Processing The dataset used for the experiments reported in this paper was prepared by Reyes et al. (2013). It is a corpus of 40,000 tweets equally divided into four different topics: Irony, Education, Humour, and Politics. The tweets were automatically selected by looking at Twitter hashtags (#irony, #education, #humour, and #politics) added by users in order to link their contribution to a particular subject and community. The hashtags are removed from the tweets for the experiments. According to Reyes et al. (2013), these hashtags were selected for three main reasons: (i) to avoid manual selection of tweets, (ii) to allow irony analysis beyond literary uses, and (iii) because the irony hashtag may reflect a tacit belief about what constitutes irony and humour. Another corpus is employed in our approach to measure the frequency of word usage. We adopted the Second Release of the American National Corpus Frequency Data2 (Ide and Suderman 2004), which provides the number of occurrences of a word in the written and spoken ANC. From now on, by the frequency of a term we mean the absolute frequency the term has in the ANC. In order to process the tweets we used the GATE plugin Twitie (Bontcheva et al. 2013), an open-source information extraction pipeline for microblog text. We used it as tokeniser and part-of-speech tagger. We also adopted the Rita WordNet API (Howe 2009) and the Java API for WordNet Searching (Spell 2009) to perform operations on WordNet synsets (Miller 1995). Methodology We approach the detection of irony and humour as a classification problem, applying supervised machine learning methods to the Twitter corpus previously introduced. When choosing the classifiers we avoided those requiring features to be independent (e.g. Naive Bayes), as some of our features are not. Since we approach the problem as a binary decision (deciding whether a tweet is ironic or not) we picked two tree-based classifiers: Random Forest and Decision Tree (the latter allows us to compare our findings directly to Reyes et al. (2013)). We use the implementations available in the Weka toolkit (Witten and Frank 2005). To represent each tweet we use seven groups of features.
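As a rough preview of how such a representation might be assembled, here is a minimal Python sketch. Every function and feature name in it is a hypothetical stand-in (the paper's actual pipeline runs on GATE/Twitie and Weka), and several of the group functions are fleshed out in the sketches that follow.

```python
def frequency_features(tokens):       # gap between rare and common words
    return {}

def written_spoken_features(tokens):  # written vs. spoken style use
    return {}

def intensity_features(tokens):       # intensity of adverbs/adjectives
    return {}

def structure_features(tweet, tokens):  # length, punctuation, emoticons, links
    return {}

def sentiment_features(tokens):       # gap between positive/negative terms
    return {}

def synonym_features(tokens):         # common vs. rare synonym use
    return {}

def ambiguity_features(tokens):       # number of possible meanings
    return {}

def represent(tweet):
    """Merge the seven feature groups into one flat feature dict."""
    tokens = tweet.split()  # the real pipeline uses the GATE Twitie tokeniser
    features = structure_features(tweet, tokens)
    for group in (frequency_features, written_spoken_features,
                  intensity_features, sentiment_features,
                  synonym_features, ambiguity_features):
        features.update(group(tokens))
    return features
```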
Some of them are designed to detect imbalance and unexpectedness, others to detect common patterns in the structure of the tweets (like type of punctuation, length, emoticons). Below is an overview of the groups of features in our model:
- Frequency (gap between rare and common words)
- Written-Spoken (written-spoken style uses)
- Intensity (intensity of adverbs and adjectives)
- Structure (length, punctuation, emoticons, links)
- Sentiments (gap between positive and negative terms)
- Synonyms (common vs. rare synonyms use)
- Ambiguity (measure of possible ambiguities)
In the following sections we describe the theoretical motivations behind the features and how they have been implemented.
2 The American National Corpus (http://www.anc.org/) is, as we read on the web site, a massive electronic collection of American English words (15 million).
Frequency Unexpectedness and incongruity can be signals of irony and humour (Lucariello 2007; Venour 2013). In order to study these aspects we explore the frequency imbalance between words, i.e. register inconsistencies between terms of the same tweet. The intuition is that the use of many words commonly used in English (i.e. high frequency in the ANC) together with only a few terms rarely used in English (i.e. low frequency in the ANC) in the same sentence creates an imbalance that may cause unexpectedness, since within a single tweet only one kind of register is expected. We are able to explore this aspect using the ANC Frequency Data corpus. Three features belong to this group: frequency mean, rarest word, frequency gap. The first one is the arithmetic average of all the frequencies of the words in a tweet, and it is used to detect the frequency style of a tweet. The second one, rarest word, is the frequency value of the rarest word, designed to capture the word that may create imbalance. The third one, frequency gap, is the difference between the first two. Written-Spoken Twitter is composed of written text, but an informal spoken English style is often used. We designed this set of features to explore the unexpectedness and incongruity created by using spoken-style words in a mainly written-style tweet, or vice versa (formal words usually adopted in written text employed in a spoken-style context). We can analyse this aspect with the written and spoken parts of the ANC, as using these corpora we can see whether a word is more often used in written or spoken English. There are three features in this group: written mean, spoken mean, written-spoken gap. The first and second are the means of the frequency values of all the words in the tweet in the written and spoken ANC corpora respectively. The third one, written-spoken gap, is the absolute value of the difference between the first two, designed to see whether ironic writers use both styles (creating imbalance) or only one of them. A low difference between written and spoken styles means that both styles are used. Structure With this group of features we want to study the structure of the tweet: whether it is long or short (length), whether it contains long or short words (mean word length), and also what kind of punctuation is used (exclamation marks, emoticons, etc.). This is a powerful feature, as ironic and humorous tweets in our corpora present specific structures: for example, ironic tweets are longer (the mean length of an ironic tweet is 94.7 characters against 82.0467, 86.5776, and 86.5307 for the other topics), and humorous tweets use more emoticons than the other domains (the mean number of emoticons in a humorous tweet is 0.012, while in the other corpora it is only 0.003, 0.001, 0.002).
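Before the Structure group is detailed, here is a minimal sketch of the Frequency and Written-Spoken features just described. ANC_FREQ, ANC_WRITTEN and ANC_SPOKEN are placeholder names for word-to-count dicts built from the ANC Frequency Data files; out-of-vocabulary words get frequency 0, and tweets are assumed non-empty.

```python
ANC_FREQ, ANC_WRITTEN, ANC_SPOKEN = {}, {}, {}  # load from the ANC data files

def frequency_features(tokens):
    freqs = [ANC_FREQ.get(t.lower(), 0) for t in tokens]
    mean, rarest = sum(freqs) / len(freqs), min(freqs)
    return {"frequency_mean": mean, "rarest_word": rarest,
            "frequency_gap": mean - rarest}

def written_spoken_features(tokens):
    written = sum(ANC_WRITTEN.get(t.lower(), 0) for t in tokens) / len(tokens)
    spoken = sum(ANC_SPOKEN.get(t.lower(), 0) for t in tokens) / len(tokens)
    return {"written_mean": written, "spoken_mean": spoken,
            "written_spoken_gap": abs(written - spoken)}
```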
The Structure group includes several features that we describe below. The length feature is the number of characters that compose the tweet, n. words is the number of words, and words length mean is the mean of the word lengths. Moreover, we use the number of verbs, nouns, adjectives and adverbs as features, naming them n. verbs, n. nouns, n. adjectives and n. adverbs. From these last four features we also computed the ratio of each part of speech to the number of words in the tweet; we called them verb ratio, noun ratio, adjective ratio, and adverb ratio. All these features have the purpose of capturing the style of the writer. Inspired by Davidov et al. (2010) and Carvalho (2009), we designed features related to punctuation. These features are: the number of commas, full stops, ellipses, exclamation marks and quotation marks that a tweet contains. We also added the feature laughs, which is the number of occurrences of hahah, lol, rofl, and lmao. Additionally, there is the emoticon feature, which is the number of :), :D, :(, and ;) in a tweet. This feature works well in the Humour corpus, as it contains four times more emoticons than the other corpora. The ironic corpus is the one with the fewest emoticons (there are only 360 emoticons in the Irony corpus, while in the Humour, Education, and Politics tweets there are 2065, 492, and 397 respectively). In the light of these statistics we can argue that ironic authors avoid emoticons and let words be the central thing: the audience has to understand the irony without explicit signs, like emoticons. Humour seems, on the other hand, more explicit. Finally, we added a simple but powerful feature, web-links, which simply says whether or not a tweet includes an internet link. This feature performs well for Humour and excellently for Irony, where internet links are not used frequently. Intensity We also study the intensity of adjectives and adverbs. We adopted the intensity scores of Potts (2011), who uses naturally occurring metadata (star ratings on service and product reviews) to construct adjective and adverb scales. An example of an adjective scale (with the relative scores in brackets) could be the following: horrible (-1.9), bad (-1.1), good (0.2), nice (0.3), great (0.8). With these scores we evaluate four features for adjective intensity and four for adverb intensity (implemented in the same way): adj (adv) tot, adj (adv) mean, adj (adv) max, and adj (adv) gap. The sum of the AdjScale scores of all the adjectives in the tweet is called adj tot. adj mean is adj tot divided by the number of adjectives in the tweet. The maximum AdjScale score within a single tweet is adj max. Finally, adj gap is the difference between adj max and adj mean, designed to see how far the most intense adjective is out of context. Synonyms As previously said, irony conveys two messages to the audience at the same time (Veale 2004). It follows that the choice of a term (rather than one of its synonyms) is very important in order to send the second, non-obvious, message. The choice of synonym is an important feature for humour as well, and it seems that authors of humorous tweets prefer using common terms. For each word of a tweet we get its synonyms with WordNet (Miller 1995), then we calculate their ANC frequencies and sort them into a decreasing ranked list (the actual word is part of this ranking as well). We use these rankings to define the four features which belong to this group.
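The Synonyms features are defined formally next; first, a sketch of the Structure and Intensity groups described above. The laugh and emoticon inventories and the tiny ADJ_SCALE excerpt of the Potts (2011) scores are illustrative assumptions, and the POS-based counts (n. verbs, noun ratio, etc.) are omitted because they come from the Twitie POS tagger in the actual pipeline.

```python
LAUGHS = ("haha", "lol", "rofl", "lmao")
EMOTICONS = (":)", ":D", ":(", ";)")
ADJ_SCALE = {"horrible": -1.9, "bad": -1.1, "good": 0.2,
             "nice": 0.3, "great": 0.8}  # toy excerpt, not the full lexicon

def structure_features(tweet, tokens):
    low = tweet.lower()
    return {
        "length": len(tweet),
        "n_words": len(tokens),
        "word_length_mean": sum(len(t) for t in tokens) / len(tokens),
        "n_commas": tweet.count(","),
        "n_full_stops": tweet.count("."),
        "n_ellipsis": tweet.count("..."),
        "n_exclamations": tweet.count("!"),
        "n_quotes": tweet.count('"'),
        "laughs": sum(low.count(l) for l in LAUGHS),
        "emoticons": sum(tweet.count(e) for e in EMOTICONS),
        "web_link": int("http" in low),
    }

def intensity_features(tokens):
    scores = [ADJ_SCALE[t.lower()] for t in tokens if t.lower() in ADJ_SCALE]
    if not scores:
        return {"adj_tot": 0.0, "adj_mean": 0.0, "adj_max": 0.0, "adj_gap": 0.0}
    tot, mx = sum(scores), max(scores)
    mean = tot / len(scores)
    return {"adj_tot": tot, "adj_mean": mean, "adj_max": mx, "adj_gap": mx - mean}
```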
The first one is syno lower, which is the number of synonyms of the word w_i with frequency lower than the frequency of w_i. It is defined as in Equation 1:

sl_{w_i} = |\{syn_{i,k} : f(syn_{i,k}) < f(w_i)\}|  (1)

Averaging and maximising this count over the words of a tweet t gives syno lower mean and word lowest syno:

slm_t = \frac{\sum_{w_i \in t} sl_{w_i}}{\text{n. words of } t}  (2)

wls_t = \max_{w_i \in t} sl_{w_i}  (3)

We are now able to describe syno lower gap, which detects the imbalance created by a common synonym in a context of rare synonyms. It is the difference between word lowest syno and syno lower mean. Finally, we detect the gap of very rare synonyms in a context of common ones with syno greater gap. It is the difference between word greatest syno and syno greater mean, where syno greater mean is the following:

sgm_t = \frac{\sum_{w_i \in t} |\{syn_{i,k} : f(syn_{i,k}) > f(w_i)\}|}{\text{n. words of } t}  (4)

Ambiguity Another interesting aspect of irony and humour is ambiguity. We noticed that ironic tweets include the greatest arithmetic average of the number of WordNet synsets, and humorous tweets the least; this indicates that ironic tweets present words with more meanings, and humorous tweets words with fewer meanings. In the case of irony, our assumption is that if a word has many meanings, the possibility of saying something else with this word is higher than with a term that has only a few meanings, and thus there is a higher possibility of sending more than one message (literal and intended) at the same time. There are three features that aim to capture these aspects: synset mean, max synset, and synset gap. The first one is the mean of the number of synsets of each word of the tweet, to see if words with many meanings are often used in the tweet. The second one is the greatest number of synsets that a single word has; we consider this word the one with the highest possibility of being used ironically (as multiple meanings are available to say different things). In addition, we calculate synset gap as the difference between the number of synsets of this word (max synset) and the average number of synsets (synset mean), assuming that if this gap is high the author may have used that inconsistent word intentionally. Sentiments We also analyse the sentiments of irony and humour by using the SentiWordNet sentiment lexicon (Esuli and Sebastiani 2006), which assigns to each synset of WordNet sentiment scores of positivity and negativity. There are six features in the Sentiments group. The first one is named positive sum and is the sum of all the positive scores in a tweet; the second one is negative sum, defined as the sum of all the negative scores. The arithmetic average of the previous two is another feature, named positive negative mean, designed to reveal the sentiment that better describes the whole tweet. Moreover, there is positive negative gap, which is the difference between the first two features, as we also wanted to detect the positive/negative imbalance within the same tweet. The imbalance may be created by a single very positive (or negative) word in the tweet, and the previous features would not be able to detect it, so we needed to add two more. For this purpose the model includes positive single gap, defined as the difference between the most positive word and the mean of all the sentiment scores of all the words of the tweet, and negative single gap, defined in the same way but with the most negative one. Experiments and Results The experiments described in this section aim at verifying: (i) the discriminative power of our model, (ii) the portability of the model across domains, and (iii) its state-of-the-art status.
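Before the experimental setup is described, here is a sketch of the Synonyms and Ambiguity groups defined above, using NLTK's WordNet interface as a stand-in for the Rita WordNet API and JAWS used in the paper; ANC_FREQ is the same placeholder frequency table as before, and tokens are assumed non-empty.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

ANC_FREQ = {}  # placeholder word -> count table, as in the Frequency sketch

def synonyms_of(word):
    return {l.name().lower() for s in wn.synsets(word)
            for l in s.lemmas()} - {word.lower()}

def synonym_features(tokens):
    # sl_wi of Eq. 1: how many synonyms are rarer than the word itself
    sl = [sum(1 for s in synonyms_of(t)
              if ANC_FREQ.get(s, 0) < ANC_FREQ.get(t.lower(), 0))
          for t in tokens]
    slm = sum(sl) / len(sl)  # syno lower mean (Eq. 2)
    wls = max(sl)            # word lowest syno (Eq. 3)
    return {"syno_lower_mean": slm, "word_lowest_syno": wls,
            "syno_lower_gap": wls - slm}

def ambiguity_features(tokens):
    n_synsets = [len(wn.synsets(t)) for t in tokens]
    mean, mx = sum(n_synsets) / len(n_synsets), max(n_synsets)
    return {"synset_mean": mean, "max_synset": mx, "synset_gap": mx - mean}
```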
In order to carry out experimentation and to be able to compare our approach to that of (Reyes, Rosso, and Veale 2013), we use several datasets derived from the corpus used in that paper. Irony Detection Our first experiment addresses the problem of irony detection, comparing the performance of our model with that of Reyes et al. (Reyes, Rosso, and Veale 2013). In order to replicate their experimental setting, three balanced datasets were created from the corpus: (i) Irony vs Humour, (ii) Irony vs Education, and (iii) Irony vs Politics. Each dataset is composed of 10,000 examples of irony and 10,000 examples of a different topic. A 10-fold cross-validation experiment was run on each dataset and precision, recall, and f-measure computed. The results of the experiments are presented in Table 1. Cross-domain Irony and Humour Detection Our second experiment addresses cross-domain adaptation, which has not been addressed in previous work. We designed three balanced training sets composed of 7500 positive tweets (irony or humour) and 7500 of each negative topic that remains available (Education/Humour/Politics when the positive is Irony and Education/Irony/Politics when the positive is Humour), and three balanced test sets composed of 2500 positive and 2500 of each negative topic (again Education/Humour/Politics when the positive is Irony and Education/Irony/Politics when the positive is Humour). We carried out all possible Train/Test combinations to verify how the model behaves when the domain is changed (one such instance is to train on the Irony/Politics dataset and evaluate on the Irony/Education dataset). The results of the experiments are presented in Tables 2 and 3.

Model          Education        Humour           Politics
               P    R    F1     P    R    F1     P    R    F1
Reyes et al.   .76  .66  .70    .78  .74  .76    .75  .71  .73
Our model      .87  .87  .87    .88  .88  .88    .87  .87  .87

Table 1: Precision, Recall, and F-Measure over the three corpora Education, Humour, and Politics. Both our and Reyes et al.'s results are shown; the classifier used is Decision Tree for both models.

Training set   Test: Education            Test: Humour               Test: Politics
               P        R        F1       P        R        F1       P        R        F1
Education      .87/.89  .87/.89  .87/.89  .86/.86  .86/.85  .86/.85  .86/.87  .86/.87  .86/.87
Humour         .78/.79  .77/.74  .77/.74  .88/.89  .88/.89  .88/.89  .78/.79  .77/.74  .76/.74
Politics       .82/.83  .82/.83  .82/.82  .83/.83  .82/.82  .82/.82  .88/.89  .88/.89  .88/.89

Table 2: Results of Experiment 2 when the positive topic is Irony and the negative topics are Education, Humour and Politics. The table includes Precision, Recall and F-Measure for each Training/Testing topic combination, written in the form Decision Tree / Random Forest, as we used these two algorithms as classifiers.

Figurative Language Filtering Our third experiment consists of treating irony and humour as a single class representing figurative language; here we want to verify whether our model can separate figurative from non-figurative language. We designed one balanced Training set composed of 15000 positive tweets (7500 of Irony and 7500 of Humour) and 15000 negative examples (7500 of Education and 7500 of Politics), and a balanced Test set composed of 5000 positive tweets (2500 of Irony and 2500 of Humour) and 5000 negative examples (2500 of Education and 2500 of Politics). Table 4 presents the results of this experiment, comparing two classification algorithms: Decision Tree and Random Forest. Feature Analysis Finally, in order to have a clear understanding of the contribution of each feature of our model, we also studied the behaviour of information gain in each dataset.
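A sketch of this experimental loop is shown below. The paper uses the Weka toolkit; scikit-learn is a stand-in here, and mutual information plays the role of the information-gain analysis discussed next. X is the tweet-by-feature matrix produced by the representation above, and y holds the binary topic labels.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support
from sklearn.feature_selection import mutual_info_classif

def evaluate_in_domain(X, y):
    """10-fold cross-validation with the two tree-based classifiers."""
    for clf in (DecisionTreeClassifier(), RandomForestClassifier()):
        pred = cross_val_predict(clf, X, y, cv=10)
        p, r, f1, _ = precision_recall_fscore_support(y, pred, average="binary")
        print(type(clf).__name__, round(p, 2), round(r, 2), round(f1, 2))

def evaluate_cross_domain(X_tr, y_tr, X_te, y_te):
    """Train on one topic pairing, test on another (Tables 2 and 3)."""
    clf = RandomForestClassifier().fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return precision_recall_fscore_support(y_te, pred, average="binary")

def rank_features(X, y, names):
    """Mutual information as the information-gain analogue."""
    mi = mutual_info_classif(X, y)
    return sorted(zip(names, mi), key=lambda kv: -kv[1])
```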
We compute information gain over the three Training sets of our cross-domain experiments. The information gain results are directly correlated with the classification results: as we are using tree-based classifiers, features with high information gain will be at the top of the tree, i.e. important discriminators. Figure 1 shows the information gain when the positive topic is Irony, and Figure 2 when the positive topic is Humour. Table 5 (a) and (b) show the Pearson correlation between the information gain of each feature over different topics when training on Irony and on Humour respectively. The correlation was calculated to determine whether the system uses similar features for different negative topics (if the correlation is low, we are likely to have cross-domain problems); it tells us how well correlated two topics are. Discussion Looking at the figures obtained in our irony detection experiments, it appears that our model is more balanced in terms of precision and recall, and that our overall f-measure improves over previous work, with the additional advantage of the features being easy to compute. Turning to the cross-domain experiments, we observe that our model performs reasonably well across domains, except when we try to identify humorous tweets having trained on irony. This is in fact an interesting result, which may indicate that not all features of our model are appropriate for humorous discourse, requiring the design of additional features for this type of figurative language. With respect to the figurative language filtering experiments, the results seem promising. Our experiments cannot be compared with previous approaches directly because of differences in datasets, but we point out that in humour classification (Mihalcea and Strapparava 2005) using specially designed humour characteristics, accuracy results are around 76%. Finally, in our feature analysis experiments (Figures 1 and 2) we observe that features of structure, frequency, and synonymy are discriminators of irony, although there is great variability across domains, which is also shown in the correlations of Table 5. Where humour is concerned, we see that features of structure, synonymy, frequency and intensity are also good discriminators, again with great variability across domains. Features belonging to ambiguity and sentiment have little discriminative power. Regarding the figurative versus non-figurative experiment, the best features are syno lower, rarest val, word length and adj/adv max. In comparison to education and politics, humour and irony include longer (word length) and more common words (syno lower, rarest val). Moreover, the intensity of adjectives and adverbs (adj/adv max) is an important characteristic, as humour and irony include more intense terms.

Training set   Test: Education            Test: Irony                Test: Politics
               P        R        F1       P        R        F1       P        R        F1
Education      .78/.81  .78/.81  .78/.81  .55/.57  .53/.53  .46/.43  .72/.77  .71/.75  .71/.75
Irony          .72/.64  .71/.61  .71/.58  .88/.89  .88/.88  .88/.88  .60/.67  .69/.63  .69/.61
Politics       .73/.77  .73/.76  .73/.76  .60/.61  .56/.55  .51/.48  .80/.84  .80/.84  .80/.84

Table 3: Results of Experiment 2 when the positive topic is Humour and the negative topics are Education, Irony and Politics. The table includes Precision, Recall and F-Measure for each Training/Testing topic combination, written in the form Decision Tree / Random Forest, as we used these two algorithms for the classifications.

P        R        F1
.80/.83  .80/.83  .80/.83

Table 4: Figurative language filtering results.
Precision, Recall, and F-measure numbers correspond to two algorithms: Decision Tree / Random Forest.

Figure 1: Information gain of each feature of the model. The Irony corpus is compared to the Education, Humour, and Politics corpora. High values of information gain help to better discriminate ironic from non-ironic tweets.

Figure 2: Information gain of each feature of the model. The Humour corpus is compared to the Education, Irony, and Politics corpora. High values of information gain help to better discriminate humorous from non-humorous tweets.

(a)          Education  Humour  Politics
Education    1          0.76    0.96
Humour                  1       0.76
Politics                        1

(b)          Education  Irony   Politics
Education    1          0.48    0.89
Irony                   1       0.36
Politics                        1

Table 5: Pearson correlation between the information gain of each feature over different topics when training on Irony (a) or Humour (b).

Conclusion and Future Work In this article we have proposed a novel, linguistically motivated set of features to detect irony and humour in the social network Twitter. The features take into account frequency, written/spoken differences, sentiments, ambiguity, intensity, synonymy and structure. We have designed many of them to model unexpectedness and incongruity, a key characteristic of both genres. We have performed controlled experiments with an available corpus used in previous work, which allowed us to carry out experimentation in different scenarios. First, we carried out experiments to verify the performance of our set of features compared with previous work, obtaining promising results. Second, we carried out cross-domain experiments to show that the model can be used across domains. This experiment also shows that additional features are needed, because irony and humour have their own particular characteristics. Third, we performed an experiment to classify figurative language, obtaining reasonable initial results. There is, however, much space for improvement. The ambiguity aspect is still weak in this research, and it needs to be improved. Experiments adopting different topics may also be useful in order to explore the system's behaviour in a more realistic situation. We plan to model additional features to better distinguish between the two forms of figurative language. Acknowledgments We are grateful to three anonymous reviewers for their comments and suggestions that helped improve our paper. The research described in this paper is partially funded by fellowship RYC-2009-04291 from Programa Ramon y Cajal 2009 and project number TIN2012-38584-C06-03 (SKATERUPF-TALN) from Ministerio de Economia y Competitividad, Secretaria de Estado de Investigacion, Desarrollo e Innovacion, Spain. We also acknowledge partial support from the EU project Dr. Inventor (FP7-ICT-2013.8.1 project number 611383). 2014_21 !2014 Knowledge Discovery of Artistic Influences: A Metric Learning Approach Babak Saleh, Kanako Abe, Ahmed Elgammal Computer Science Department Rutgers University New Brunswick, NJ USA {babaks,kanakoabe,elgammal}@rutgers.edu Abstract We approach the challenging problem of discovering influences between painters based on their fine-art paintings. In this work, we focus on comparing paintings of two painters in terms of visual similarity. This comparison is fully automatic and based on computer vision approaches and machine learning. We investigated different visual features and similarity measurements based on two different metric learning algorithms to find the most appropriate ones that follow artistic motifs.
We evaluated our approach by comparing its results with ground-truth annotations for a large collection of fine-art paintings. Introduction How do artists describe their paintings? They talk about their works using several different concepts. The elements of art are the basic ways in which artists talk about their works. Some of the elements of art include space, texture, form, shape, color, tone and line (Fichner-Rathus). Each work of art can, in the most general sense, be described using these seven concepts. Another important descriptive set is the principles of art. These include movement, unity, harmony, variety, balance, contrast, proportion, and pattern. Other topics may include subject matter, brush stroke, meaning, and historical context. As seen, there are many descriptive attributes with which works of art can be discussed. One important task for art historians is to find influences and connections between artists. By doing so, the conversation of art continues and new intuitions about art can be made. An artist might be inspired by one painting, a body of work, or even an entire genre of art; this is influence. Which paintings influence each other? Which artists influence each other? Art historians are able to find which artists influence each other by examining the same descriptive attributes of art mentioned above. Similarities are noted and inferences are suggested. It must be mentioned that determining influence is always a subjective decision. We will not know if an artist was ever truly inspired by a work unless he or she has said so. However, for the sake of finding connections and progressing through movements of art, a general consensus is agreed upon if the argument is convincing enough. Figure 1 represents a commonly cited comparison for studying influence. Figure 1: An example of an often cited comparison in the context of influence. Diego Velazquez's Portrait of Pope Innocent X (left) and Francis Bacon's Study After Velazquez's Portrait of Pope Innocent X (right). Similar composition, pose, and subject matter, but a different view of the work. Is influence a task that a computer can measure? In the last decade there have been impressive advances in developing computer vision algorithms for different object recognition-related problems, including instance recognition, categorization, scene recognition, pose estimation, etc. When we look at an image we not only recognize object categories and scene category, we can also infer various cultural and historical aspects. For example, when we look at a fine-art painting, an expert or even an average person can infer information about the genre of that painting (e.g. Baroque vs. Impressionism) or even guess the artist who painted it. This is an impressive ability of human perception for analyzing fine-art paintings, which we approach in this paper as well. Besides the scientific merit of the problem from the perception point of view, there are various application motivations. With the increasing volumes of digitized art databases on the internet comes the daunting task of organization and retrieval of paintings. There are millions of paintings present on the internet. It will be of great significance if we can infer new information about an unknown painting using an already existing database of paintings and, in a broader view, infer high-level information like influences between painters.
Figure 2: Gustav Klimt's Hope (Top Left) and nine most similar images across different styles based on the LMNN metric.
Top row from left to right: Countess of Chinchon by Goya; Wing of a Roller by Durer; Nude with a Mirror by Miro; Jeremiah lamenting the destruction of Jerusalem by Rembrandt. Lower row, from left to right: Head of a Young Woman by Leonardo da Vinci; Portrait of a condottiere by Bellini; Portrait of a Lady with an Ostrich Feather Fan by Rembrandt; Time of the Old Women by Goya; and La Schiavona by Titian.
Although there has been some research on automated classification of paintings (Arora and Elgammal 2012; Cabral et al. 2011; Carneiro 2011; Li et al. 2012; Graham 2010), there is very little research on measuring and determining influence between artists, e.g. (Li et al. 2012). Measuring influence is a very difficult task because of the broad criteria for what influence between artists can mean. As mentioned earlier, there are many different ways in which paintings can be described. Some of these descriptions can be translated to a computer. Some research includes brushwork analysis (Li et al. 2012) and color analysis to determine a painting's style. For the purpose of this paper, we do not focus on a specific element of art or principle of art, but instead focus on finding new comparisons by experimenting with different similarity measures. Although the meaning of a painting is unique to each artist and is completely subjective, it can to some extent be measured by the symbols and objects in the painting. Symbols are visual words that often express something about the meaning of a work as well. For example, the works of Renaissance artists such as Giovanni Bellini and Jan Van Eyck use religious symbols such as a cross, wings, and animals to tell stories from the Bible. One important factor in finding influence is therefore having a good measure of similarity. Paintings do not necessarily have to look alike, but if they do, or if they have reoccurring objects (high-level semantics), then they will be considered similar. However, similarity in fine-art paintings is not limited to the co-occurrence of objects. Two abstract paintings may look quite similar even though there is no object in either of them. This clarifies the importance of low-level features for painting representation as well. These low-level features are able to model artistic motifs (e.g. texture, decomposition and negative space). If influence is found by looking at similar characteristics of paintings, the importance of finding a good similarity measure becomes prominent. Time is also a necessary factor in determining influence. An artist cannot influence another artist in the past. Therefore the chronology of paintings cuts down the possibilities of influence. By including a computer's intuition about which artists and paintings may have similarities, we not only find new knowledge about which paintings are connected by mathematical criteria, but also keep the conversation going for artists. It challenges people to consider possible connections in the timeline of art history that may never have been seen before. We are not asserting truths, but instead suggesting a possible path towards the difficult task of measuring influence. The main contribution of this paper is working on the interesting task of determining influence between artists as a knowledge discovery problem. Toward this goal we propose two approaches to represent paintings. On one hand, high-level visual features that correspond to objects and concepts in the real world have been used.
On the other hand, we extracted low-level visual features that are meaningless to humans but powerful for discriminating paintings using computer vision algorithms. After image representation, we need to define similarity between pairs of artists based on their artworks. This results in finding similarity at the level of images. Since the first representation is meaningful by its nature (a set of objects and concepts in the images), we do not need to learn a semantically meaningful way of comparison. However, for the case of the low-level representation, we need a metric that covers the absence of semantics in this type of image representation. For the latter case we investigated a set of complex metrics that need to be learned specifically for the task of influence determination. Because of the limited size of the available influence ground-truth data and the lack of negative examples in it, it is not useful for comparing different metrics. Instead, we resort to a highly correlated task, which is classifying painting style. The assumption is that metrics that are good for style classification (which is a supervised learning problem) would also be good for determining influences (which is an unsupervised problem). Therefore, we use painting style labels to learn the metrics. Then we evaluate the learned metrics for the task of influence discovery by verifying the output using well-known influences.
Figure 3: Gustav Klimt's Hope (Top Left) and nine most similar images across different styles based on the Boost metric. Top row from left to right: Princesse de Broglie by Ingres; Portrait, Evening (Madame Camus) by Degas; The Birth of Venus (detail of face) by Botticelli; Danae and the Shower of Gold by Titian. Lower row from left to right: The Burial of Count Orgaz by El Greco; Diana and Callisto by Titian; The Starry Night by Van Gogh; Baroness Betty de Rothschild by Ingres; and St Jerome in the Wilderness by Durer.
Related Works Most of the work done in the area of computer vision and painting analysis utilizes low-level features such as color, shades, texture and edges for the task of style classification. Lombardi (Lombardi 2005) presented a comprehensive study of the performance of such features for painting classification. Sablatnig et al. (R. Sablatnig and Zolda 1998) use brush-stroke patterns to define a structural signature to identify the artist's style. Khan et al. (Fahad Shahbaz Khan 2010) use a Bag of Words (BoW) approach with low-level features of color and shades to identify the painter among eight different artists. In (Sablatnig, Kammerer, and Zolda 1998) and (I. Widjaja and Wu. 2003) similar experiments with low-level features were also conducted. Carneiro et al. (Carneiro et al. 2012) recently published the PRINTART dataset of paintings, along with preliminary experiments on image retrieval and painting style classification. They define artistic image understanding as a process that receives an artistic image and outputs a set of global, local and pose annotations. The global annotations consist of a set of artistic keywords describing the contents of the image. Local annotations comprise a set of bounding boxes that localize certain visual classes, and pose annotations consist of a set of body parts that indicate the pose of humans and animals in the image. Another process involved in artistic image understanding is the retrieval of images given a query containing an artistic keyword. In (Carneiro et al.
2012), an improved inverted label propagation method was proposed that produced the best results, both in the automatic (global, local and pose) annotation and in the retrieval problems. Graham et al. (Graham 2010) pose the question of finding the way we perceive two artworks as similar to each other. Toward this goal, they acquired strong supervision from human experts to label similar paintings. They applied multidimensional scaling methods to pairs of similar paintings from either landscape or portrait/still life, and showed that similarity between paintings can be interpreted in terms of basic image statistics. In their experiments they show that for landscape paintings, basic grey image statistics are the most important factor for two artworks to be similar. For the case of still life/portrait, the most important element of similarity is a semantic variable, for example the representation of people. Extracting visual features from paintings is very challenging and should be treated differently from feature representation of natural images. This is because, first, unlike regular images (e.g. personal photographs), paintings are created around abstract ideas, and second, digitization affects the computational analysis of paintings, an effect investigated in great depth by Polatkan et al. (Gungor Polatkan 2009). Cabral et al. (Cabral et al. 2011) approach the problem of ordering paintings and estimating their time period. They formulate this problem as embedding paintings into a one-dimensional manifold. They applied unsupervised embedding using Laplacian Eigenmaps (Belkin and Niyogi 2002). To do so they only need visual features, and they defined a convex optimization to map paintings to a manifold. Influence Framework Consider a set of artists, denoted by A = \{a^l : l = 1, \dots, N_a\}, where N_a is the number of artists. For each artist a^l we have a set of images of paintings, denoted by P^l = \{p_i^l : i = 1, \dots, N_l\}, where N_l is the number of paintings by the l-th artist. For clarity of presentation, we reserve the superscript for the artist index and the subscript for the painting index. We denote by N = \sum_l N_l the total number of paintings. Each image p_i^l \in R^D is a D-dimensional feature vector that is the outcome of the Classemes classifiers, which defines the feature space. To represent the temporal information, for each artist we have a ground-truth time period in which he/she produced their work, denoted by t^l = [t^l_{start}, t^l_{end}] for the l-th artist, where t^l_{start} and t^l_{end} are the start and end year of that time period respectively. We do not consider the date of a given painting, since for some paintings the exact time is unknown. Painting Similarity: To encode similarity/dissimilarity between paintings, we consider two different categories of approaches. On one hand, we applied simple distance metrics (note that a distance is a dissimilarity measure) on top of high-level visual features (we used Classemes features), as these are understandable by humans. On the other hand, we applied complex metrics to low-level visual features that are powerful for machine learning but do not make sense to humans. Details on the features used are given in the experiments section. Predefined Similarity Measurement Euclidean distance: The distance d_E(p_i^l, p_j^k) is defined to be the Euclidean distance between the Classemes feature vectors of paintings p_i^l and p_j^k. Since Classemes features
are high-level semantic features, the Euclidean distance in the feature space is expected to measure dissimilarity in subject matter between paintings. Painting similarity based on the Classemes features showed some interesting cases, several of which have not been studied before by art historians as potential comparisons. Metric Learning Approaches: Despite its simplicity, Euclidean distance does not take into account expert supervision for comparing two paintings. We approach measuring similarity between two paintings by enforcing expert knowledge about fine-art paintings. The purpose of metric learning is to find some pairwise real-valued function d_M(x, x') which is nonnegative, symmetric, obeys the triangle inequality, and returns zero if and only if x and x' are the same point. Training such a function in a general form can be seen as the following optimization problem:

\min_M \; \ell(M, D) + \lambda R(M)  (1)

This optimization has two sides: it minimizes the loss \ell incurred by the metric M over the data samples D, while adjusting the model via the regularization term R(M). The first term reflects the accuracy of the trained metric, and the second estimates its capability on new data and avoids overfitting. Based on the enforced constraints, the resulting metric can be linear or non-linear; likewise, based on the amount of labels used, training can be supervised or unsupervised. For consistency across the metric learning algorithms, we first fix the notation. We learn the matrix M that will be used in the generalized Mahalanobis distance:

d_M(x, x') = \sqrt{(x - x')^\top M (x - x')},

where M by definition is a positive semi-definite matrix. Dimension reduction methods can be seen as learning the metric where M is a low-rank matrix. There has been some research on unsupervised dimension reduction for fine-art paintings. We will show how supervised metric learning algorithms beat the unsupervised approaches on different tasks. More importantly, there is significantly important information in the ground-truth annotation associated with paintings, which we use to learn a more reliable metric in a supervised fashion for both the linear and the non-linear case. Considering the nature of our data, which has high variation due to the complex visual features of paintings and the labels associated with paintings, we consider the following approaches, which differ in the form of M or the amount of regularization. Large Margin Nearest Neighbors (Weinberger and Saul 2009) LMNN is a widely used approach for learning a Mahalanobis distance due to its globally optimal solution and its superior performance in practice. The learning of this metric involves a set of constraints, all of which are defined locally. That is, LMNN enforces that the k nearest neighbors of any training instance belong to the same class (these instances are called target neighbors), while all instances of other classes, referred to as impostors, are kept away from that point. The target neighbors are found by applying Euclidean distance to each pair of samples, resulting in the following formulation:

\min_M \; (1 - \mu) \sum_{(x_i, x_j) \in T} d_M^2(x_i, x_j) + \mu \sum_{i,j,k} \xi_{i,j,k}
s.t. \; d_M^2(x_i, x_k) - d_M^2(x_i, x_j) \ge 1 - \xi_{i,j,k}, \quad \xi_{i,j,k} \ge 0, \quad \forall (x_i, x_j, x_k) \in I,

where T stands for the set of target neighbors and I represents the impostor triplets.
Figure 4: Map of artists based on the LMNN metric between paintings. Color coding indicates artists of the same style.
Since these constraints are locally defined, this optimization leads to a convex formulation and a global solution.
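As a concrete reading of the formulas above, here is a minimal numpy sketch of the generalized Mahalanobis distance that both LMNN and BoostMetric parameterize. Learning M itself is the job of the respective solvers and is out of scope here, so M is simply constructed to be positive semi-definite for illustration.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Generalized Mahalanobis distance d_M(x, y) for a PSD matrix M."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

# With M = I this reduces to Euclidean distance; writing M = L.T @ L
# gives the equivalent view of learning a linear map L and taking
# Euclidean distance in the projected space.
rng = np.random.default_rng(0)
L = rng.normal(size=(5, 10))   # toy projection a solver might learn
M = L.T @ L                    # positive semi-definite by construction
x, y = rng.random(10), rng.random(10)
print(mahalanobis(x, y, M))
```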
This metric learning approach is related in principle to Support Vector Machines, which theoretically motivates its use alongside Support Vector Machines for different tasks, including style classification. Due to its popularity, different variations of this method have been developed, including a non-linear version called gb-LMNN (Weinberger and Saul 2009), which we will use in our experiments as well. Boost Metric (Shen et al. 2012) This approach is based on the fact that a positive semi-definite matrix can be decomposed into a linear combination of trace-one rank-one matrices. Shen et al. (Shen et al. 2012) use this fact and, instead of learning M directly, find a set of weaker metrics that can be combined to give the final metric. They treat each of these matrices as a weak learner, as used in the literature on boosting methods. The resulting algorithm applies the idea of AdaBoost to the Mahalanobis distance, which is quite efficient in practical use. This method is of particular interest to us, since we can learn an individual metric for each style of painting and finally merge these metrics to obtain the final one. In theory the final metric can also perform well in finding similarities within each painting style. We considered the aforementioned types of metrics (Boost metric and LMNN) for measuring similarity between paintings. On one hand, it has been stated (Weinberger and Saul 2009) that Large Margin Nearest Neighbors outperforms other metrics for the task of classification. This is rooted in the fact that this metric imposes the largest margin between different classes. Considering this property of LMNN, we expect it to outperform other methods for the task of painting style classification. On the other hand, as mentioned in the introduction, artists compare paintings based on a list of criteria. Assuming we can model each criterion via a weak learner, we can combine these metrics using Boost metric learning. We argue that searching for similar paintings based on this metric is more realistic and intuitive. Artist Similarity: Once painting similarity is encoded using any of the aforementioned methods, we can design a suitable similarity measure between artists. There are two challenges in achieving this task. First, how to define a measure of similarity between two artists, given their sets of paintings: we need to define a proper set distance D(P^l, P^k) to encode the distance between the work of the l-th and k-th artists. This relates to how to define influence between artists in the first place, for which there is no clear definition. Should we declare an influence if one painting of artist k has strong similarity to a painting of artist l? Or if a number of paintings show similarity? And what should that number be? Mathematically speaking, for a given painting p_i^l \in P^l we can find its closest painting in P^k using a point-set distance:

d(p_i^l, P^k) = \min_j d(p_i^l, p_j^k).

We can find one painting by artist l that is very similar to a painting by artist k, and that can be considered an influence. This dictates defining an asymmetric distance measure of the form

D_{min}(P^l, P^k) = \min_i d(p_i^l, P^k).

We denote this measure minimum-link influence. On the other hand, we can consider a central tendency in measuring influence, where we measure the average or median of painting distances between P^l and P^k; we denote this measure central-link influence.
Figure 5: Map of artists based on the Boost metric between paintings. Color coding indicates artists of the same style.
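These set distances can be sketched in a few lines of numpy; the arrays and the Euclidean painting distance here are toy stand-ins for the Classemes/HOG/GIST features and the learned metrics, and the percentile version anticipates the q-percentile Hausdorff distance made exact in Eq. 2 below.

```python
import numpy as np

def point_set(p, B, d):
    """Distance from painting p to its closest painting in set B."""
    return min(d(p, b) for b in B)

def min_link(A, B, d):
    return min(point_set(a, B, d) for a in A)

def central_link(A, B, d):
    return float(np.median([point_set(a, B, d) for a in A]))

def percentile_link(A, B, d, q):
    # q ~ 0 gives minimum link, q = 50 a central tendency,
    # q = 100 the directed Hausdorff maximum (Eq. 2 below)
    return float(np.percentile([point_set(a, B, d) for a in A], q))

euclid = lambda x, y: float(np.linalg.norm(x - y))
A = np.random.rand(8, 10)    # 8 paintings by artist l (toy features)
B = np.random.rand(12, 10)   # 12 paintings by artist k
print(min_link(A, B, euclid), central_link(A, B, euclid))
```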
Alternatively, we can think of the Hausdorff distance (Dubuisson and Jain 1994), which measures the distance between two sets as the supremum of the point-set distances, defined as

D_H(P^l, P^k) = \max(\max_i d(p_i^l, P^k), \max_j d(p_j^k, P^l)).

We denote this measure maximum-link influence. The Hausdorff distance is widely used in matching spatial points; unlike a minimum distance, it captures the configuration of all the points. While the intuition behind the Hausdorff distance is clear from a geometrical point of view, it is not clear what it means in the context of artist influence, where each point represents a painting. In this context, the Hausdorff distance measures the maximum distance between any painting and its closest painting in the other set. The discussion above highlights the challenge in defining similarity between artists: each of the suggested distances is in fact meaningful and captures some aspect of similarity, and hence influence. In this paper, we do not take a position in favor of any of these measures; instead we propose to use a measure that can vary through the whole spectrum of distances between two sets of paintings. We define the asymmetric distance between artist l and artist k as the q-percentile Hausdorff distance:

D_{q\%}(P^l, P^k) = \max_i^{q\%} d(p_i^l, P^k).  (2)

Varying the percentile q allows us to evaluate different settings, ranging from a minimum distance, D_{min}, through a central tendency, to a maximum distance, D_H. Experimental Evaluation Evaluation Methodology: We used the dataset of fine-art paintings of (Abe, Saleh, and Elgammal 2013) for our experiments. This collection contains color images of 1710 paintings by 66 artists, created during the time period 1400-1935. It covers all genres and thirteen styles of painting (e.g. classic, abstract). The dataset comes with some known influences between artists within the collection, drawn from multiple resources such as The Art Story Foundation and The Metropolitan Museum of Art. For example, there is a general consensus among art historians that Paul Cezanne's use of fragmented spaces had a large impact on Pablo Picasso's work. In total, there are 76 pairs of one-directional artist influences, where a pair (a_i, a_j) indicates that artist i is influenced by artist j. Generally, it is a sparse list that contains only the influences that are consensual among many. Some artists do not have any influences in our collection, while others may have up to five. We use this list as ground truth for measuring the accuracy of our experiments. There is agreement that influence happens mostly when two paintings belong to the same style (e.g. both are classic). Inspired by this fact, we used the annotation of paintings to place paintings from the same style close to each other when learning a metric for similarity measurement between paintings. Learning the Painting Similarity Measure We experimented with the Classemes features (Torresani, Szummer, and Fitzgibbon 2010), which represent high-level information in terms of the presence/absence of objects in the image. We also extracted GIST descriptors (Oliva and Torralba 2001) and Histograms of Oriented Gradients (HOG) (Dalal and Triggs 2005), since they are the main ingredients of the Classemes features. For the task of measuring the similarity between paintings, we followed two approaches. First, we investigated the result of applying a predefined metric (Euclidean) to the extracted visual features.
Second, for the low-level visual features (HOG and GIST), we learned a new set of metrics to place similar images from the same style close to each other. These metrics are learned in such a way that we expect paintings from the same style to be the most similar pairs of paintings. However, it is also interesting to look at the most similar pairs of paintings when their styles are different. Toward this goal we computed the distance between all possible pairs of paintings based on the learned Boost metric and the LMNN metric. Some of the most similar pairs across different styles (with the smallest distances) are depicted in figure 9 (for the LMNN metric) and figure 8 (for the Boost metric approach). We also evaluated these metrics for the task of painting retrieval. Figure 2 shows the top nine closest matches for Hope by Klimt when we used the LMNN metric to learn the measure of similarity between paintings. Figure 3 shows the results of the same task when we used the Boost metric approach instead of LMNN. Although the retrieved results are from different styles, they show different aspects of similarity: in color, texture, composition, subject matter, etc. Painting Style Classification To verify the performance of these learned metrics for measuring similarity, we compared their accuracy on the task of style classification of paintings. We train a set of one-vs-all classifiers using Support Vector Machines (SVM) after applying the different similarity measurements. Each classifier corresponds to one painting style, and in total we trained 13 classifiers using the LIBSVM package (Chang and Lin 2011). The performance of these classifiers is reported in Table 1 in terms of the mean and standard deviation of the accuracy. We compared our implementations with the method of (Arora and Elgammal 2012) as the baseline. Both variations of the LMNN method (linear and non-linear) trained on low-level visual features outperform the baseline. However, the classifier trained on the Boost metric similarity measure performs slightly worse than the baseline.

Table 1: Style Classification Accuracy
Method             LMNN   gb-LMNN  Boost Metric  Baseline
Accuracy mean (%)  69.75  68.16    64.71         65.4
std                 4.13   3.52     3.06          4.8

Influence Discovery Validation As mentioned earlier, based on the similarity between paintings, we measure how close the works of one artist are to another's, and build an influenced-by graph by considering the temporal information. The constructed influenced-by graph is used to retrieve the top-k potential influences for each artist. If a retrieved influence concurs with an influence ground-truth pair, that is considered a hit. The hits are used to compute the recall, which is defined as the ratio between the correct influences detected and the total known influences in the ground truth. The recall is used for the sake of comparing the different settings relatively. Since detected influences can be correct despite not being in our ground truth, there is no meaning in computing precision.
Figure 6: Recall curves of top-k (x-axis values) influences for different approaches when q = 50.
In all cases, we computed the recall figures using the influence graph for the top-k similar artists (k = 5, 10, 15, 20, 25) with different q-percentiles for the artist distance measure in Eq. 2 (q = 1, 10, 50, 90, 99%). Figure 6 shows this recall curve for the case of q = 50, and figure 7 depicts the recall curve of influence finding when q = 90. We also computed the performance of the different approaches for the task of influence finding when the value of k is fixed (k = 5).
Since these are supposed to be the most similar artists, they can suggest potential influences. Table 2 compares the performance of these approaches for different values of the percentile (q) for the given k. Except for the case of q = 10, gb-LMNN gives the best performance.

Table 2: Comparison of Different Methods for Finding Top-5 Influences
                                       q%
Method                           1      10     50     90     99
Euclidean on Classemes features  25     26.3   29     21.1   23.7
Euclidean on GIST features       21.05  31.58  32.89  28.95  23.68
Euclidean on HOG features        22.37  22.37  22.37  25     26.32
gb-LMNN on low-level features    27.63  22.37  36.84  35.53  30.26
LMNN on low-level features       23.68  22.37  35.53  35.53  28.95
Boost on low-level features      21.05  28.95  31.58  30.26  27.63

Figure 7: Recall curves of top-k (x-axis values) influences for different approaches when q = 90.
As mentioned earlier, based on the similarity of paintings and following the time period of each artist, we are able to build a map of painters. For computing the similarity between the collections of paintings of two artists, we used the 50th percentile of the works (q = 50) and built the map of artists based on the LMNN metric (shown in figure 4) and the Boost metric (figure 5). For the sake of better visualization, we depict artists from the same style with one color. The fact that artists from the same style stay close to each other verifies the quality of these maps. Conclusion In this paper we explored the interesting problem of finding potential influences between artists. We considered painters and tried to find who could be influenced by whom, based on their artworks and without any additional information. We approached this problem as a similarity measurement problem in the area of computer vision and investigated different metric learning methods for representing paintings and measuring their similarity to each other. This similarity measurement is in line with human perception and artistic motifs. We experimented on a diverse collection of paintings and reported interesting findings. Acknowledgment We appreciate the valuable input of Mr. Shahriar Rokhgar and his comments on painting analysis. We also thank Dr. Laura Morowitz for her comments on finding the influence path in art history. 2014_22 !2014 Nehovah: A Neologism Creator Nomen Ipsum Michael R. Smith, Ryan S. Hintze and Dan Ventura Department of Computer Science, Brigham Young University, Provo, UT 84602 msmith@axon.cs.byu.edu, ventura@cs.byu.edu Abstract In this paper, we describe a system called Nehovah that generates neologisms from a set of base words provided by a user. Nehovah focuses on creating good neologisms by evaluating various attributes of a neologism, such as how well it communicates the source concepts and how catchy it is. Because Nehovah depends on the user to weight the importance of various attributes of the neologism and to choose the source concepts, it is at this point most appropriately considered a collaborative system rather than an autonomous one. To demonstrate the utility of the system, we show several examples of system output and discuss the creativity of Nehovah with respect to several characteristics critical for any computational creative system: appreciation, imagination, skill and accountability. Introduction Boden (1994) made one of the first attempts to formalize the notion of creativity.
Based on her formalization, computational creativity is often thought of as an exploration of a conceptual space and has been examined in a number of different areas, including visual art (Colton, Valstar, and Pantic 2008; Norton, Heath, and Ventura 2011), music (Cope 2005), cooking (Morris et al. 2012), poetry (Rahman and Manurung 2011), metaphor generation (Veale and Hao 2007), and sentence generation (Mendes, Pereira, and Cardoso 2004). In this paper, we describe Nehovah, a computational system that generates neologisms. The generation of neologisms is an important task in many businesses, used to create a unique brand or company name that distinguishes it from its competitors. This often comes in the form of a trademark. Trademarks include words, phrases, symbols and/or designs that identify and distinguish the goods of one party from those of others1. According to the United States Patent and Trademark Office, 433,651 trademark applications were filed in 2013, a 4.5% increase from 2012 (The United States Patent and Trademark Office 2014). Thus, developing trademarkable phrases and words is an important step in many businesses. Additionally, neologisms are often used as a literary device in novels and books to convey meaning more concisely. For example, cyberspace was introduced in 1982 by William Gibson to combine the words cybernetics and space (Gibson 1982). In some cases, neologisms are used to add humor and interest. This technique was used heavily in the many works of Dr. Seuss to help children with limited vocabularies enjoy reading (Baker 1999). Neologisms have previously been examined computationally, both from an interpretive standpoint and from a generative one. For example, Cook and Stevenson (2010) propose finding the meaning of neologisms using a statistical model that draws on observed linguistic properties of blends, while Duch and Pilichowski (2007) create neologisms using a neurocognitive model (though, unfortunately, many of the generated neologisms exhibit little to no linguistic/conceptual/cognitive value). Veale's Zeitgeist system rather impressively exhibits both interpretive and generative abilities and is available as a web application. It can be used as a tool for enriching lexical resources such as WordNet (Fellbaum 1998) with modern words that are found in everyday speech (Veale 2006), by utilizing Wikipedia2 to identify neologisms and by reverse-engineering their source words using ideas from concept blending (Veale, O'Donoghue, and Keane 2000). In addition, the Zeitgeist system can be used to generate neologisms by combining prefix and suffix morphemes that overlap by at least one letter (Veale and Butnariu 2006). Morphemes are hand-annotated with their semantic interpretations, giving each morpheme a word gloss (such as astro=star and ology=study) and a WordNet identifier that indicates where in the WordNet noun taxonomy a neologism with a morphemic suffix should be placed. Given two source words from predefined lists for prefixes and suffixes, the Zeitgeist system creates a set of neologisms that convey the chosen concepts by combining the prefix and suffix morphemes for the source words. The generated neologisms generally have valid word forms and convey the concepts well. On the other hand, Zeitgeist is limited to the morphemes that are annotated. As many of the morphemes are of Greek origin, some of the neologisms are somewhat predictable. For example, if food is chosen as a source prefix word, then gastro is almost always used.
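As a concrete illustration of the overlap constraint just described, the following sketch joins a prefix morpheme and a suffix morpheme wherever the end of one overlaps the start of the other by at least one letter. This is our own minimal reading of the idea, not Zeitgeist's actual implementation; the morphemes come from the gloss examples above.

```python
# An illustrative sketch of morpheme blending with an overlap of at
# least one letter at the seam (our reading of the Zeitgeist idea).

def overlap_blends(prefix, suffix, min_overlap=1):
    """Return blends where the end of `prefix` overlaps the start of `suffix`."""
    blends = []
    for k in range(min_overlap, min(len(prefix), len(suffix)) + 1):
        if prefix[-k:] == suffix[:k]:
            blends.append(prefix + suffix[k:])   # merge the shared letters once
    return blends

print(overlap_blends("astro", "ology"))  # ['astrology'] (shared letter 'o')
```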
The use of morphemes also requires a knowledge of Greek or Latin word derivatives to understand the neologism. The neologism ornithoencephalon is a neologism for bird-brain, but the meaning is obvious only to the user who knows that the morpheme ornitho relates to birds and encephalon relates to the brain.

1 http://www.uspto.gov/trademarks/basics/definitions.jsp
2 www.wikipedia.com

Figure 1: A high-level pipeline view of the process Nehovah uses to generate neologisms through finding synonyms, blending words, and scoring.

Our system for generating neologisms, Nehovah, is similar to Zeitgeist in that it attempts to preserve the source concepts through blending (as opposed to generating neologisms that represent entirely new ideas by themselves, e.g. Google). It differs from Zeitgeist by focusing on blending free-form, user-provided words and their synonyms and by incorporating dynamic web sources of popular cultural information. In addition, the web interface allows a user to weight the importance of several attributes of a neologism, facilitating a creative collaboration between the user and the system.

A Framework for Blending Concepts

The goal of generating neologisms by blending concepts from source words is to convey multiple concepts in a single plausible word, sometimes known as a portmanteau (Carroll 1871). We present a framework, containing three major steps, for generating such portmanteau neologisms from two source words:

1. Finding Synonyms. Synonyms increase the potential novelty of the neologisms by enriching the set of possible blends that convey the source concept. A greater diversity of synonyms expresses more imagination in the neologism. For example, the word God is arguably a more diverse/interesting synonym for creator than is the word maker. We call the set of synonyms for a source word wi the concept set for wi and denote it as C(wi). Note that it is always the case that wi ∈ C(wi).

2. Blending Words. Once the concept sets for the source words have been generated, the words from each concept set are blended together to create a set of neologisms. Blending the words from the two concept sets consists of three steps. First, each word from the concept sets is split into sets of prefixes and suffixes. Then, each prefix from one concept set is joined with each suffix from the other concept set. Finally, Nehovah checks that the word structure of the neologism is plausible. By plausible, we mean that the letter sequence produced from blending the words is natural compared to other real words. Any implausible neologism is discarded. The set of neologisms generated from two concept sets C(w1) and C(w2) is denoted N(C(w1),C(w2)).

3. Scoring/ranking the Neologisms. Once a set of neologisms N(C(w1),C(w2)) is created, the neologisms are scored or ranked such that a subset of best neologisms can be identified, allowing a potentially large set of neologisms to be quickly filtered. Scoring criteria can be adapted for a particular application and can also potentially incorporate feedback, facilitating online learning and thus dynamic qualification of neologisms.

Nehovah

A functional overview of Nehovah and its implementation of the three steps are shown in Figure 1 and described in more detail in the following sections. The blue boxes represent each step in the framework for blending concepts and the gray boxes represent sets of words. An on-line version of Nehovah is available at http://axon.cs.byu.edu/~nehovah, from which a screen shot is shown in Figure 2.
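The three-step framework can be summarized as a small pipeline. The sketch below is a skeleton under our own assumptions: find_synonyms, blend, and score are hypothetical stand-ins for the components described in the following sections, and the toy usage exists only to make the control flow concrete.

```python
# A skeletal sketch of the three-step framework: build concept sets,
# blend all pairs, then rank. Not Nehovah's actual implementation.

def generate_neologisms(w1, w2, find_synonyms, blend, score, top=10):
    c1 = set(find_synonyms(w1)) | {w1}   # a concept set always contains its source word
    c2 = set(find_synonyms(w2)) | {w2}
    candidates = {n for u in c1 for v in c2 for n in blend(u, v)}
    return sorted(candidates, key=score, reverse=True)[:top]

# Toy usage with trivial stand-ins for the three components:
result = generate_neologisms(
    "neologism", "creator",
    find_synonyms=lambda w: set(),
    blend=lambda u, v: [u[:3] + v[-5:]],
    score=len)
print(result)  # ['neoeator']
```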
Finding Synonyms

In order to populate the set C(wi), Nehovah searches for synonyms from two different sources: WordNet (Fellbaum 1998), a lexical database, and TheTopTens3, a website of pop culture-inspired top ten lists. Nehovah queries WordNet with each source word wi (and with its stem) as a noun, verb, adjective, and adverb. If a source word or its stem is defined in WordNet, Nehovah adds to C(wi) the words contained in the synset for all senses of the word for all parts-of-speech for which it is defined.

3 www.thetoptens.com

Figure 2: A screenshot of the web interface for Nehovah. Two source words are input in the upper left. The lower left contains sliders that allow relative weighting of the four scoring attributes. On the right is a list of generated neologisms with their scores, in descending order; these can be expanded to see the base words that Nehovah used to create the neologism and how the neologism is scored for each of the attributes.

For example, the word school as a noun has the following senses: school → educational institution; school, schoolhouse → building, edifice; school, schooling → education; school → body; school, schooltime, school day → time period, period of time, period; school, shoal → animal group; and, additionally, as a verb it has the following senses: school → educate; educate, school, train, cultivate, civilize, civilise → polish, refine, fine-tune, down; school → swim (it has no senses as either an adjective or adverb). Therefore, the set of WordNet-derived synonyms for the word school is C(school) = {school, educational institution, schoolhouse, building, edifice, schooling, education, body, school time, school day, time period, period of time, period, shoal, animal group, educate, train, cultivate, civilize, polish, refine, fine-tune, down, swim}. Because a source word is specified without context, neither its part-of-speech nor its intended sense can be inferred; as a result the space of possible synonyms is increased, providing greater creative potential in the generated neologisms at the risk of potentially conveying an awkward or unintended conceptual blend.

Nehovah queries TheTopTens with each source word wi using a custom API that returns lists of words from a set of top ten lists that match the query. For example, a query to TheTopTens using the source word car would return lists with titles such as Top Ten Best Car Companies, Best Car Brands, Greatest Songs by the Cars and Best Car Insurance Companies. Of course, some lists will be much more relevant than others. To minimize the number of included irrelevant words, Nehovah determines which of the returned lists are relevant based on their titles, by identifying descriptive and plural words in the title. Descriptive words are identified as words that end with -est, as is common practice on TheTopTens. If a descriptive word in a list title directly precedes the source word, then the list is deemed relevant. For example, the list Top Ten Best Car Companies would be accepted since the descriptive word best describes the source word car. Also, if there are multiple plural words in a list title, Nehovah assumes the first plural word in the title identifies the subject of the list. For example, in the list Greatest Songs by the Cars, there are two plural words: Songs and Cars. The list is determined to be about songs rather than cars, since Songs appears before Cars and because the descriptive word greatest precedes Songs rather than Cars. Nehovah also includes lists that have the source word directly before the first plural word, such as Top Ten Car Movies, inferring that the source word is being used as a descriptor for the plural word; this heuristic is sketched below.
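A minimal sketch of the list-relevance heuristic as we read it from the description above: a list title is relevant if a descriptive word (ending in -est) directly precedes the source word, or if the source word directly precedes the first plural word. The tokenisation and the naive plural test are simplifications of our own, not Nehovah's actual code.

```python
# A hedged reconstruction of the list-title relevance heuristic.

def is_relevant(title, source_word):
    words = title.lower().split()
    sw = source_word.lower()
    # Rule 1: a descriptive word (-est) directly precedes the source word.
    for i, w in enumerate(words[:-1]):
        if w.endswith("est") and words[i + 1] == sw:
            return True
    # Rule 2: the source word directly precedes the first plural word
    # (plurality crudely approximated by a trailing 's').
    plurals = [i for i, w in enumerate(words) if w.endswith("s") and w != sw]
    if plurals:
        first = plurals[0]
        if first > 0 and words[first - 1] == sw:
            return True
    return False

print(is_relevant("Top Ten Best Car Companies", "car"))  # True
print(is_relevant("Greatest Songs by the Cars", "car"))  # False
print(is_relevant("Top Ten Car Movies", "car"))          # True
```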
Once a list is determined to be relevant, the list items also need to be processed. Because TheTopTens is composed of user-defined free-form lists, some list items are more descriptive than others. For example, the Best Muscle Cars list may contain items such as 1961 Ford GT Mustang From Gone in 60 Seconds. While this information is beneficial for determining why an item made the list, it is difficult to use to generate neologisms. To compensate, Nehovah parses the list items so that any words or symbols that indicate descriptive information (from, in, punctuation such as parentheses and commas, etc.) and any words that follow them are not included. Another issue with user-defined lists is the lack of quality control. To filter out obscure (and/or misspelled) words and references, Nehovah only keeps list items that are also found in Wikipedia. Any list entries that survive this level of parsing and filtering are also included in C(wi). Note that using the words from TheTopTens adds hyponyms (e.g. Ford Mustang for car) rather than synonyms in some cases. We allow the use of hyponyms because the pop culture reference adds to the creativity and uniqueness of Nehovah and because it is difficult to distinguish between hyponyms and synonyms.

Blending Words

Given two concept sets C(w1) and C(w2), Nehovah blends the words from the two concept sets to create a set of neologisms N(C(w1),C(w2)). Each word u ∈ C(wi) is split into a set of prefixes P(u) and a set of suffixes S(u). The words are split between syllables to maintain conceptual coherence and to reduce the likelihood of introducing invalid letter combinations during blending. Unfortunately, for English it is a non-trivial task to algorithmically identify syllable boundaries, because pronunciation information is not (consistently) encoded in the spelling of the word. For example, io could create two separate vowel sounds as in lion or be a diphthong as in motion. To account for this, Nehovah conservatively splits each word u after every vowel (except the last) and between any two consecutive consonants (with the exception of sh, th, and ch) after the first vowel and before the last vowel. Each such split yields one prefix to be added to the set P(u) and one suffix to be added to the set S(u). In addition, u itself is also added to both P(u) and S(u). For example, the word track would be split up into the prefixes track and tra and the suffixes ack and track. See Table 1 for additional examples.

Table 1: Examples of how Nehovah splits words into prefix/suffix pairs by attempting to split on syllable boundaries.

computational: co|mputational, com|putational, compu|tational, computa|tional, computati|onal, computatio|nal, computation|al, plus the whole word computational
method: me|thod, meth|od, plus the whole word method

Slightly abusing notation, we define the set of neologisms formed by blending two words u and v using the sets P(u), S(u), P(v) and S(v) as

N(u, v) = {yz | y ∈ P(u), z ∈ S(v), K(yz)} ∪ {yz | y ∈ P(v), z ∈ S(u), K(yz)}

where K(·) is a predicate that returns FALSE if its argument contains a letter combination not found in WordNet and TRUE otherwise. Then, the full set of neologisms for the synonym sets C(w1) and C(w2) is generated by iterating over all pairs of words from these synonym sets:

N(C(w1), C(w2)) = ∪_{u ∈ C(w1), v ∈ C(w2)} N(u, v)
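The splitting rule lends itself to a short sketch. The version below follows our reading of the description (split after every vowel except the last, and between consecutive consonants other than sh, th, and ch lying strictly between the first and last vowels); the edge-case handling is guesswork, and the track and method examples in the text suggest the deployed system differs in some details.

```python
# A simplified, hedged sketch of Nehovah's prefix/suffix splitting rule.

VOWELS = set("aeiou")

def split_points(word):
    vowel_idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not vowel_idx:
        return []
    first_v, last_v = vowel_idx[0], vowel_idx[-1]
    points = set()
    for i in vowel_idx[:-1]:                  # after every vowel but the last
        points.add(i + 1)
    for i in range(first_v + 1, last_v):      # between consecutive consonants
        a, b = word[i], word[i + 1]
        if a not in VOWELS and b not in VOWELS and a + b not in ("sh", "th", "ch"):
            points.add(i + 1)
    return sorted(p for p in points if 0 < p < len(word))

def prefixes_suffixes(word):
    pre, suf = {word}, {word}                 # the whole word joins both sets
    for p in split_points(word):
        pre.add(word[:p])
        suf.add(word[p:])
    return pre, suf

pre, suf = prefixes_suffixes("method")
print(sorted(pre), sorted(suf))  # ['me', 'method'] ['method', 'thod']
```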
Scoring

Nehovah scores each neologism n ∈ N(C(w1),C(w2)) using four scoring criteria: word structure, concepts, uniqueness, and pop culture. Each scoring criterion can be assigned a relative weight, allowing the creation of different types of neologism.

Word Structure. The word structure score W(n) measures how well a neologism retains aspects of the word structure of one or both source words, as maintaining source word structure tends to produce catchier neologisms that better convey the meaning of the base words. For example, ginormous is a combination of giant and enormous created by replacing the first syllable of enormous with the first syllable of giant. Enough of enormous is left that the meaning is still apparent. Another example is Linsanity, which replaces the first syllable in insanity with the single-syllable word Lin (the last name of a professional basketball player). In this case, the overlap of Lin and insanity makes it easy to recognize the source words. To attempt to capture this kind of desirable structure, given base words u = y1z1 and v = y2z2, Nehovah calculates a raw structure score for a candidate neologism n = y1z2 as

S(n) = ℓ(y1, y2) + ℓ(z1, z2) + B(n, u, v)

where ℓ(y1, y2) is the length of the suffix common to y1 and y2, ℓ(z1, z2) is the length of the prefix common to z1 and z2, and

B(n, u, v) = max{δ(#(n), #(u)), δ(#(n), #(v))}

where #(x) returns the number of syllables in x and δ is the Kronecker delta function [B(n, u, v) equals 1 if neologism n maintains the same syllable count as either base word and 0 otherwise]. S(n) therefore quantifies catchiness by measuring base word overlap and syllable count conservation. Given this, the word structure score W(n) of neologism n is the normalized raw score, with normalization taken over the set of all candidate neologisms:

W(n) = S(n) / max_{ñ ∈ N(C(w1),C(w2))} S(ñ)

Concepts. One of the primary goals of Nehovah is to convey the concepts of the source words in the neologism. While word structure can aid in conveying a concept, Nehovah also explicitly measures concept clarity for a neologism by scoring how well the base concepts are communicated in its prefix and suffix. How clearly a concept is conveyed by the prefix or suffix of a base word obtained from WordNet is measured using MoreWords4, a tool for crossword puzzles and other word games. MoreWords uses the words from the Enable2k North American word list that is used in well-known word games. It contains 173,528 words and does not include any hyphenated words, abbreviations, acronyms, or proper nouns. Querying MoreWords with a prefix/suffix x returns the set of words Wx that have x as a prefix/suffix in MoreWords and the approximate number of times each word ũ ∈ Wx occurs per million words (FPM(ũ)). FPM(ũ) is estimated from studies on the British National Corpus5. Nehovah determines how apparent the concept is in a prefix/suffix by comparing the frequency of the word that the prefix/suffix is derived from with the frequencies of other words that begin/end with the same prefix/suffix. A distinctiveness score for a prefix/suffix x of base word u is calculated by first computing a distinctiveness ratio:

ρ(x, u) = FPM(u) / Σ_{ũ ∈ Wx} FPM(ũ)

4 www.morewords.com
5 http://www.natcorp.ox.ac.uk/

The distinctiveness score is then calculated using (an empirically determined) piecewise linear interpolation on the value of the distinctiveness ratio:

d(x, u) = 1,                 if ρ(x, u) ≥ 0.1
d(x, u) = 0.8 + 2ρ(x, u),    if 0.01 < ρ(x, u) < 0.1
d(x, u) = 80ρ(x, u),         if 0 ≤ ρ(x, u) ≤ 0.01

This score differentiates between prefixes/suffixes that do not convey the concept, that partially convey the concept, and that completely convey the concept.
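The piecewise distinctiveness score d(x, u) transcribes directly into code. The frequency values in the example are invented for illustration; only the thresholds and slopes come from the definition above.

```python
# The piecewise distinctiveness score, with the ratio
# rho(x, u) = FPM(u) / sum of FPM over all words sharing the affix x.

def distinctiveness(fpm_u, fpm_affix_words):
    rho = fpm_u / sum(fpm_affix_words)
    if rho >= 0.1:
        return 1.0                 # the affix unambiguously evokes the word
    if rho > 0.01:
        return 0.8 + 2 * rho       # partially conveys the concept
    return 80 * rho                # barely conveys the concept

# Invented frequencies: the base word accounts for half the affix's mass.
print(distinctiveness(50.0, [50.0, 30.0, 20.0]))  # rho = 0.5 -> 1.0
```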
Because many pop culture words are not contained in MoreWords, Nehovah measures how clearly a concept is conveyed by a pop culture base word obtained from TheTopTens as the normalized count of the number of times that the pop culture word u appears in the set of lists L(w) returned from TheTopTens for a given source word w:

γ(u, w) = ν(u, L(w)) / max_{ũ ∈ T(w)} ν(ũ, L(w))

where ν(u, L(w)) represents the number of times a base word u appears in L(w), and T(w) represents the set of unique pop culture words in L(w). Note that this distinctiveness score indicates the popularity of the concept for a pop culture reference in the neologism by comparing the prevalence of other pop culture words to the prevalence of the entire base word (rather than by considering just some prefix or suffix of the base word). Under the assumption that these distinctiveness scores correlate with conceptual content, given a source word w, a base word u ∈ C(w), and a prefix/suffix x of u, a concept score for the base word is computed as

c(x, u, w) = d(x, u), if u appears in WordNet; γ(u, w), otherwise

Finally, given a concept score for both a prefix y of base word u and a suffix z of base word v, the concept score C(n) of the created neologism n = yz is simply the average of the concept scores of the base words and their prefix/suffix:

C(n) = (c(y, u, w1) + c(z, v, w2)) / 2

Pop Culture. The pop culture score indicates if one or both of the base words are pop culture words, allowing the emphasis of pop culture references. The pop culture score P(n) for a neologism n created from base words u and v is given by

P(n) = 1, if both u and v are pop culture words
P(n) = 0.5, if exactly one of u and v is a pop culture word
P(n) = 0, otherwise

Uniqueness. A score for uniqueness should place greater value on words that are not commonly used (but still convey the source concept). For example, for the source word pants, the base word trousers is more common than the base word bloomers, although both convey the same concept. Uniqueness for a base word u ∈ C(w) is calculated using the frequency per million words score from MoreWords (FPM(u)) relative to all of the other synonymous words in the concept set:

μ(u, w) = 1 − FPM(u) / max_{ũ ∈ C(w)} FPM(ũ)

The uniqueness score U(n) for a neologism n formed from the base words u and v is simply the average of their uniqueness scores:

U(n) = (μ(u, w1) + μ(v, w2)) / 2

Table 2: A set of example neologisms generated by Nehovah with their base words and the source words that were provided to Nehovah.

Neologism        Base Words                 Source Words
Nehovah          neologism + Jehovah        neologism, creator
divinage         divine + coinage           neologism, creator
machinative      machine + creative         machine, creative
Spritependency   Sprite + dependency        soda, addiction
Pepsidiction     Pepsi + addiction          soda, addiction
pisome           pizza + awesome            awesome, pizza
pimazing         pie + amazing              awesome, pizza
iniquitivate     iniquity + cultivate       evil, school
immoralize       immorality + civilize      evil, school
coalesception    coalesce + conception      concept, blend
portmanception   portmanteau + conception   concept, blend

Combining Scores

The final score for a neologism is computed as a linear combination of the four attribute scores, weighted by user-selected coefficients (cf. the sliders in Figure 2):

score(n) = λW W(n) + λC C(n) + λU U(n) + λP P(n)
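The final weighted combination is straightforward to sketch. The dictionary keys and the equal-weight example below are our own naming, standing in for the four attribute scores and the slider coefficients λW, λC, λU, and λP.

```python
# A minimal sketch of the final score as a weighted linear combination
# of the four attribute scores (names are our own, not Nehovah's API).

def final_score(scores, weights):
    """scores/weights: dicts keyed by 'structure', 'concept',
    'uniqueness', and 'pop_culture'."""
    return sum(weights[k] * scores[k] for k in scores)

example = {"structure": 0.9, "concept": 0.7, "uniqueness": 0.4, "pop_culture": 0.5}
equal = {k: 0.25 for k in example}          # all sliders set to the same weight
print(final_score(example, equal))          # about 0.625
```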
Evaluation of Nehovah

We now examine Nehovah in the context of the creative tripod, which consists of skill, imagination, and appreciation (Colton 2008). Skill is the ability of a system to produce something useful. Imagination is the ability of the system to search the space of possibilities and produce something novel. Appreciation is the ability of the machine to self-assess and produce something of worth. We also evaluate Nehovah with respect to its accountability: the ability of the system to explain why it generated the artifact it generated.

Skill

Nehovah demonstrates skill by generating neologisms that convey the concepts in the base words and have proper word structure. First, proposed neologisms with invalid word structure are discarded. Next, Nehovah determines if a pop culture word is valid based on its presence in Wikipedia. Wikipedia is a dynamic source that does contain neologisms (Veale 2006), and consulting Wikipedia provides a safeguard against low-quality user-supplied content in TheTopTens. Finally, splitting the words only on their syllable boundaries aids in creating word fragments that convey meaning and are able to be blended in a way that forms a plausible word. The skill of any system is most easily demonstrated in the artifacts that it produces. Exhibit A for the Nehovah system is its own name, which is the direct result of providing the (originally anonymous) system with the source words neologism and creator. The name Nehovah is a mix of the words neologism and Jehovah, and it is readily apparent that Nehovah incorporates the word Jehovah; another candidate neologism was Neohovah, which conveys a bit more of the meaning of neologism but is not as structurally pleasing since an additional syllable is added. Other examples of neologisms created by Nehovah are shown in Table 2. As a further demonstration, consider the following arguably coherent sentence constructed from some of the neologisms from Table 2: Spritependency is a machinative neologism created through portmanception to describe someone who is addicted to Sprite. We also point out that the neologism immoralize is an actual word found in some dictionaries (it is not found in WordNet). According to the Merriam-Webster on-line dictionary, it means to make immoral6, which is what is conveyed by the neologism. In other words, the system (re)invented a real word, a nice demonstration of Boden's P-creativity.

6 http://www.merriam-webster.com/dictionary/immoralize

Accountability

In addition to producing a set of neologisms, Nehovah also includes the base words that were blended together to produce the neologism (see the expansion of the third neologism in the right-hand pane of Figure 2). Therefore, at some level Nehovah can explain how it created a neologism. The perceived creativity of the neologisms in Table 2 is likely increased with the available explanation of which base words were blended together as well as what the source words are. For example, portmanception is created from the source words concept and blend using portmanteau and conception as base words.
Using portmanteau in the place of blend and conception in the place of concept conveys similar meaning; revealing the connection between the base words and source words helps justify the quality and creativity of the neologism.

Imagination

A Google search for most of the generated neologisms shows that Nehovah produces novel artifacts. The hits for Nehovah contain references to this project and an individual's name. Most of the neologisms have no hits when searched for in Google, or the hits returned are names or screen names (divinage is a World of Warcraft user name). Nehovah explores all possible combinations of prefixes and suffixes derived from the base words. Further, Nehovah also considers the synonyms for all possible senses of each base word for each possible part of speech.

Table 3: The top five words returned from two lists from TheTopTens for the source word dog, demonstrating the range of synonyms that Nehovah uses as base words.

Best Dog Breeds      Best Hot Dog Toppings
Pitbull              Coney Sauce
Rottweiler           Mustard
Chihuahua            Stadium Mustard
Great Dane           Relish
Miniature Pinscher   Ketchup

Using all of the possible senses for all of the parts of speech for a source word, along with an ever-expanding set of free-form, user-defined (pop culture) lists, can create a potentially very large search space and produce unpredictable results. For example, if evil and school are used as the source words with the intended sense of school being an educational institution, then seeing a neologism such as Darth swim would likely be somewhat unexpected (the base words of the neologism are Darth Vader from the TheTopTens list The 10 Most Evil Villains in Video Games and swim, a hypernym of one of the senses of the verb school). This, however, demonstrates the imagination of Nehovah, since it takes into consideration other and unintended senses of a source word to produce more creative neologisms. Of course, the flip side of such imaginative creations is that unintended senses can cause problems if the main goal is to create a neologism that captures a specific sense of a source word. Thus, there is a tension between creating a rich concept set that includes all of the possible senses for a source word and generating neologisms that convey the concept of the intended sense. Using pop culture references allows Nehovah to demonstrate imagination in an unusual and contemporary fashion, by using social/popular connections between words to convey meaning. Most people who are familiar with the Star Wars series would recognize the word Darth as having an evil connotation. As with using all the senses for a base word, some of the words from TheTopTens do not capture the intended concept of the base word. For example, consider the top five entries from two of the TheTopTens lists returned for the word dog, shown in Table 3. The Best Dog Breeds list conveys the concept of dog to most users better than the Best Hot Dog Toppings list. An example set of neologisms is shown in Table 4 that shows the unintended use of Best Hot Dog Toppings versus Best Dog Breeds when blending the source words robot and dog. Despite being irrelevant for the animal dog, these examples demonstrate the imagination of Nehovah in generating neologisms. And, in fact, the neologism Terminaise could be a serendipitous discovery for an exciting new condiment if the intended sense of the word dog was hot dog.
Appreciation

Nehovah's appreciation is demonstrated by determining which neologisms are the best given a set of base words and which scoring criteria are weighted the highest.

Table 5: Highest rated 10 and lowest rated 10 neologisms generated by Nehovah using the source words dog and robot with all scoring attributes equally weighted. The higher rated neologisms tend to flow better and convey the concepts of the base words better than the lower rated neologisms.

Best 10 (neologism | base words, with their source lists | score):
rottweilers: Revenge of the Fallen | rottweiler (Top Ten Best Dog Breeds) + Transformers: Revenge of the Fallen (Top Ten Best Robot Movies of All Time) | 0.786
rottweilerminator 3 | rottweiler (Top Ten Best Dog Breeds) + Terminator 3 (Top Ten Best Robot Movies of All Time) | 0.786
automaton terrier | automaton + boston terrier (Top Ten Best Dog Breeds) | 0.762
automatian | automaton + dalmatian (Top Ten Best Dog Breeds) | 0.755
chihuahuaton | chihuahua (Top Ten Best Dog Breeds) + automaton | 0.754
automestic | automaton + domestic | 0.752
golden retrievers: Revenge of the Fallen | golden retriever (Top Ten Best Dog Breeds) + Transformers: Revenge of the Fallen (Top Ten Best Robot Movies of All Time) | 0.750
dobermansformers: Revenge of the Fallen | doberman (Top Ten Worst Dog Breeds) + Transformers: Revenge of the Fallen (Top Ten Best Robot Movies of All Time) | 0.714
doberminator 3 | doberman (Top Ten Worst Dog Breeds) + Terminator 3 (Top Ten Best Robot Movies of All Time) | 0.714
chihuahuanic attack | chihuahua (Top Ten Best Dog Breeds) + panic attack (Greatest Robot Wars Robots Of All Time) | 0.714

Worst 10 (neologism | base words, with their source lists | score):
panicpoodle | panic attack (Greatest Robot Wars Robots Of All Time) + poodle (Top Ten Best Dog Breeds) | 0.143
bulroadblock | bull terrier (Top 10 Guard Dog Breeds) + roadblock (Greatest Robot Wars Robots Of All Time) | 0.143
cheeatomic | cheese (Top Ten Best Hot Dog Toppings) + atomic (Greatest Robot Wars Robots Of All Time) | 0.143
labradorroadblock | labrador retriever (Top Ten Best Dog Breeds) + roadblock (Greatest Robot Wars Robots Of All Time) | 0.143
borderrobots | border collie (Top Ten Best Dog Breeds) + robots (Top Ten Best Robot Movies of All Time) | 0.143
bulrobots | bull terrier (Top 10 Guard Dog Breeds) + robots (Top Ten Best Robot Movies of All Time) | 0.143
borderroadblock | border collie (Top Ten Best Dog Breeds) + roadblock (Greatest Robot Wars Robots Of All Time) | 0.143
labradorrobots | labrador retriever (Top Ten Best Dog Breeds) + robots (Top Ten Best Robot Movies of All Time) | 0.143
atomustard | atomic (Greatest Robot Wars Robots Of All Time) + mustard (Top Ten Best Hot Dog Toppings) | 0.143
shetlandtornado | shetland sheepdog (Top 10 Smartest Dogs) + tornado (Greatest Robot Wars Robots Of All Time) | 0.143

Table 5 shows the highest rated 10 and lowest rated 10 neologisms created using the source words dog and robot, as scored with all attributes equally weighted. The source words dog and robot were chosen for this example because both source words have pop culture references and clearly demonstrate the effects of the different scoring attributes. Comparing the two sets of neologisms in Table 5, the highest rated 10 neologisms flow better and better capture the source concepts. The bottom 10 do not flow as well, and this often contributes to (further) obfuscation of the source concepts. For example, compare rottweilerminator and cheeatomic: the former better follows the word structure of both base words, and the concepts are more clearly conveyed. Each of Nehovah's scoring attributes can be weighted by a user to increase or decrease its relative importance. Table 6 shows a sampling of neologisms derived from blending the source words robot and dog when weighting is skewed completely toward one of the four scoring factors. Each sub-table gives a set of neologisms weighted exclusively for the factor titled above it.
For example, looking at the first sub-table (titled Pop Culture), for all neologisms both source words are from TheTopTens, although the word structures may be awkward and the concepts may not be apparent, e.g. alasdo from the source words alaskan malamute and tornado.

Table 4: A set of sample neologisms for the source words dog and robot using two different lists from TheTopTens for the source word dog.

Best Dog Breeds (neologism | base words):
dobermaton | doberman + automaton
rottweilerminator 3 | rottweiler + Terminator 3
dobermansformers | doberman + transformers

Best Hot Dog Toppings (neologism | base words):
sauerminator 3 | sauerkraut + Terminator 3
Terminaise | Terminator 3 + mayonnaise
mustardmaton | mustard + automaton

Neologisms in the list weighting only the Concept score tend to have prefixes and suffixes that are evocative of distinct base words, such as bot from the base word robot. When Word Structure is the sole factor, the created neologisms look the most like real words; e.g., Terman shepherd strongly overlaps Terminator with German shepherd and preserves the number of syllables in German shepherd. In the case of weighting solely for Uniqueness, the resulting neologisms and their base words are often quite unusual, sometimes at the expense of understandability, e.g. godiron from golem and andiron. As expected, weighting according to a single factor filters the neologisms, presenting only those that have a particular attribute, often at the expense of other factors. Overall, we tend to favor the word structure and concepts factors for creating the best neologisms. These help to convey the concepts contained in the base words and also produce more realistic-appearing words, as they have valid letter sequences and are similar to the base words. While favoring the concept and word structure factors, the pop culture and uniqueness factors can be used as a secondary bias towards certain types of base words to be blended together.

Conclusions and Future Work

In this paper, we have presented Nehovah, a system that generates neologisms from a set of user-provided source words by searching the space of synonyms and then blending two base words. We have argued for Nehovah's ability to demonstrate some necessary characteristics for creativity, including skill, imagination, appreciation and accountability. Future work includes incorporating a learning mechanism so that users can indicate which neologisms they prefer. Nehovah could then use this information to better score the neologisms. An interesting line of future work includes generating a definition for a neologism using the base words. This would involve solving at least two difficult problems. The first problem is generating the definitions. Candidate definition components could be found by searching Wikipedia, an on-line dictionary, and/or another source for definitions for each source word.
A potential definition would then be formed by blending candidate components in a way that both conveys the concept from each source word and is readable (i.e. correct grammar). The second problem is validation of the potential definition, which may be accomplished, for example, through a user study/game where Nehovah could learn to match definitions to neologisms based on users' votes.

Table 6: Sample of neologisms created from the base words dog and robot using weighting schemes skewed completely toward a single factor, demonstrating Nehovah's appreciation for each scoring measure. Each set of neologisms possesses the desired attribute, often at the expense of others, e.g., the neologisms weighted for uniqueness are difficult to interpret and those weighted for pop culture have poor structure.

Pop Culture (base words): labrador retriever + surrogates; alaskan malamute + tornado; lhasa apso + firestorm; ketchup + pussycat; ibizan hound + roadblock
Concepts (base words): support + mechanism; scoundrel + automaton; domestic + robot; support + robot; scoundrel + robot
Word Structure (base words): pomeranian + transformers; automaton + dalmatian; Terminator 3 + german shepherd; firestorm + domestic; Terminator 3 + doberman pinscher
Uniqueness (base words): wiener + golem; golem + familiaris; blighter + golem; golem + andiron; golem + firedog

Acknowledgements

We would like to thank Dylan Mills from TheTopTens for providing an API for Nehovah.

2014_23 !2014 Reading and Writing as a Creative Cycle: the Need for a Computational Model
Pablo Gervás (Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid, 28040 Madrid, Spain; pgervas@sip.ucm.es) and Carlos León (Facultad de Informática, Universidad Complutense de Madrid, 28040 Madrid, Spain; cleon@fdi.ucm.es)

Abstract
The field of computational narratology has produced many efforts aimed at generating narrative by computational means. In recent times, a number of such efforts have considered the task of modelling how a reader might consume the story. Whereas all these approaches are clearly different aspects of the task of generating narrative, so far the efforts to model them have occurred as separate and disjoint initiatives. There is an enormous potential for improvement if a way were found to combine results from these initiatives with one another. The present position paper provides a breakdown of the activity of creating stories into five stages that are conceptually different from a computational point of view and represent important aspects of the overall process as observed either in humans or in existing systems. These stages include a feedback loop that builds interpretations of an ongoing composition and provides feedback based on these to inform the composition process. This model provides a theoretical framework that can be employed first to understand how the various aspects of the task of generating narrative relate to one another, second to identify which of these aspects are being addressed by the different existing research efforts, and finally to point the way towards possible integrations of these aspects within progressively more complex systems.

Introduction
The field of computational narratology has been steadily growing over recent years. There have been many efforts aimed at analysing narrative in computational terms (Mani 2012) and at generating narrative by computational means (Gervás 2009). With respect to computational creativity, the latter is more immediately relevant.
Though it is possible to argue for a strong role for creativity in the understanding of narrative, this is less obvious than the role of creativity in the generation of narrative. This kind of argument has led over the years to many research efforts that focus on the generation of narrative to the detriment of its understanding. This is also supported by an argument of a different kind, related to the perceived difficulty of narrative understanding in computational terms and the lack of success of the efforts accumulated on that topic over the years. Yet it is also very clear to any seasoned reader or writer that the task of generating narrative is intrinsically bound to that of reading it. A writer writes to be read, and a writer aiming to succeed writes with the reactions of possible readers in mind. This point was originally argued in the field of narratology by authors such as Barthes (Barthes, Miller, and Howard 1975) and Eco (Eco 1984), and in the field of automated storytelling by Paul Bailey (Bailey 1997), but it has taken a long time for the research community to act upon it. In recent times, a number of research efforts arising from an initial focus on narrative generation have started to consider the task of modelling how a reader might consume the story, based on the plausible inferences that arise from a narrative discourse. From a technical perspective, these approaches are based on techniques used to obtain a plausible inference of causal and intentional relations in the discourse (Niehaus 2009; Cardona-Rivera et al. 2012; O'Neil 2013). These efforts arise from the need of generation processes to have access to some kind of feedback based on how the results of the construction process will be perceived by a potential reader. The pragmatic needs of research seem to require the implementation of at least some parts of this cycle between writing and reading that is intuitively evident to most people. The present paper provides a breakdown of the activity of creating stories into five stages that are conceptually different from a computational point of view and represent important aspects of the overall process as observed either in humans or in existing systems. A fundamental hypothesis of the proposed breakdown is that, even though it is intended as a model of the composing task, it includes two additional processes concerned with modelling the task of interpretation. These processes are aimed at estimating the impression that a composition will make on an assumed interpreter, and they provide a feedback loop to improve the results of composition. This extension provides the means both for including a model of the reader in the composition process and for explicitly representing evaluation features as part of the construction process. The proposed breakdown into five stages is analysed in terms of its relation to existing models of: creative endeavour from a computational point of view, the writing task from a cognitive perspective, and natural language generation as a set of tasks. The set of five stages is postulated as a possible model to understand how existing efforts in the field of story generation relate to one another and how future progress in the field might explore possible interactions between them. To this end, a number of existing systems are reviewed in the light of the model.

Previous Work
The set of existing theoretical models or frameworks that may have a bearing on the task of story creation are reviewed in the following order.
First, models of creative systems, then models of the writing task, and finally models of natural language generation.

Computational Models of Creativity
Wiggins (Wiggins 2006) takes up Boden's idea of creativity as search over conceptual spaces (Boden 2003) and presents a more detailed theoretical framework intended to allow detailed comparison, and hence better understanding, of systems which exhibit behaviour that would be called creative in humans. This framework describes an exploratory creative system in terms of a tuple of elements, which include elements for defining a conceptual space as a distinct subset of the universe of possible objects, the rules that define a particular subset of that universe as a conceptual space, the rules for traversing that conceptual space, and an evaluation function for attributing value to particular points of the conceptual space reached in this manner. The IDEA model (Colton, Charnley, and Pease 2011) assumes an (I)terative (D)evelopment-(E)xecution-(A)ppreciation cycle within which software is engineered and its behaviour is exposed to an audience. An important insight of this model is that the invention of measures of value is a fundamental part of the creative act. In the case of story generation this corresponds to developing models of reader response that can be used to provide feedback to the generation process.

Cognitive Accounts of Writing and Narrative Comprehension
Flower and Hayes (Flower and Hayes 1981) define a cognitive model of writing in terms of three basic processes: planning, translating these ideas into text, and reviewing the result with a view to improving it. These three processes are said to operate interactively, guided by a monitor that activates one or the other as needed. The planning process involves generating ideas, but also setting goals that can later be taken into account by all the other processes. The translating process involves putting ideas into words, and implies dealing with the restrictions and resources presented by the language to be employed. The reviewing process involves evaluating the text produced so far and revising it in accordance with the result of the evaluation. Flower and Hayes' model is oriented towards models of communicative composition (such as writing essays or functional texts), and it has little to say about narrative in particular. Nevertheless, a computational model of narrative would be better if it can be understood in terms compatible with this cognitive model. Sharples (Sharples 1999) presents a description of writing understood as a problem-solving process where the writer is both a creative thinker and a designer of text. He provides a description of how the typical writer alternates between the simple task of exploring the conceptual space defined by a given set of constraints and the more complex task of modifying such constraints to transform the conceptual space. Apparently the human mind is incapable of addressing these two tasks simultaneously. Sharples proposes a cyclic process moving through two different phases: engagement and reflection. During the engagement phase the constraints are taken as given and the conceptual space defined by them is simply explored, progressively generating new material. During the reflection phase, the generated material is revised and constraints may be transformed as a result of this revision.
Narrative comprehension involves progressive enrichment of the mental representation of a text beyond its surface form by adding information obtained via inference, until a situation model (a representation of the fragment of the world that the story is about) is constructed (van Dijk and Kintsch 1983). A very relevant reference in this field is the work of (Trabasso, van den Broek, and Suh 1989), who postulate comprehension as the construction of a causal network through the reader's provision of causal relations between the different events of a story. This network representation determines the overall unity and coherence of the story.

Natural Language Generation
The general process of text generation takes place in several stages, during which the conceptual input is progressively refined by adding information that will shape the final text (Reiter and Dale 2000). During the initial stages the concepts and messages that will appear in the final content are decided (content determination), these messages are organised into a specific order and structure (discourse planning), and particular ways of describing each concept where it appears in the discourse plan are selected (referring expression generation). This results in a version of the discourse plan where the contents, the structure of the discourse, and the level of detail of each concept are already fixed. Although the overall process includes a number of additional stages (aggregation, lexicalization and syntactic choice, collectively referred to as sentence planning, and surface realization), these will not be relevant for the purpose of the present paper, which remains focused at the level of discourse.

The ICTIVS Model
At its most abstract level, the task of composing a narrative must be considered in the broader context of an act of communication (see Figure 1). The communication takes place as an exchange of a linear sequence of text that encodes a large and complex set of data corresponding to a set of events that take place over a volume of space-time, possibly in simultaneous manner at more than one location. To convey this complexity as a linear sequence and recover it again at the other end of the communication process requires a process of condensing it first into a message and then expanding it again into a representation as close as possible to the original. There is a composer, in charge of composing a linear discourse from a conceptual source that may also have been produced by himself, and an interpreter, faced with the task of reconstructing a selected subset of the material in the conceptual source as an interpretation of the received narrative discourse.1

Figure 1: The traditional view of the communication process. Each big circle corresponds to an operation by one of the actors involved, whereas each small circle corresponds to the type of information conveyed from one to another. Note that the ideas_I recovered by the interpreter need not correspond faithfully to the ideas originally conceived by the composer.

The task of the composer involves four facets: the construction of the source material for the message as a conceptual representation, the selection of what subset of the conceptual source to convey, the linearization of that selection as a discourse, and the encoding of the message in a particular medium.
The task of the interpreter involves a number of tasks concerned with the process of interpretation of the story into a conceptual representation, and validation of the corresponding content with respect to the criteria of the interpreter. The main hypothesis defended in this paper is that the composer also has the responsibility of ensuring that the discourse she produces is optimized to help the interpreter construct exactly the interpretation she desires to convey. To this end, the composer may need to resort to local models of the processes applied by the interpreter, used to produce copies of the conceptual interpretation and the validation that an interpreter might obtain by applying them. In consequence, the models of the interpretation process considered in this paper are not strictly concerned with the tasks carried out by the interpreter, but rather with how the outcomes of these tasks might best be modelled relying as much as possible on the resources and capabilities already available to the composer. Based on these ideas, an abstract model covering these aspects of narrative has been created. It has been called ICTIVS (the name stands for INVENTION, COMPOSITION, TRANSMISSION, INTERPRETATION and VALIDATION of Stories).

1 In real life, the role of the composer is usually played by a writer and the role of interpreter by a reader, but in the present case a more generic formulation has been preferred for generality.

This model divides the communicative act of narration into five stages carried out by the composer as part of an iterative cycle. Figure 2 depicts this cycle as a refinement of the traditional view of the task of the composer, now extended with an explicit representation of the task of the interpreter. This model of the interpreter provides a feedback loop on the composition process that can be used for progressive refinement of the result. The ICTIVS model does not try to solve or study how each process is carried out from a social or psychological point of view; it rather identifies those stages that are important from the Artificial Intelligence point of view, and those that help to model human behaviour in narratives. During the INVENTION stage, the narrative content is created, based on incomplete knowledge or from scratch. Characters, narrative objectives, places and events (the ideas) all emerge and get related, thus creating a complex set of facts that constitute the source for the story. These facts could be understood as the log of a simulation run on the set of characters. As in real life, events produced in this way may have happened simultaneously in physically separated locations, and constitute more of a cloud than a linear sequence: a volume characterised by four-dimensional space-time coordinates. The COMPOSITION stage arranges all data from the previous stage (INVENTION) and outputs a discourse. Composing a discourse for the source content involves drawing a number of linear pathways through the volume of space-time produced by the invention stage.

Figure 2: The ICTIVS model. It constitutes a model of the composing task. The picture includes a separate representation of the interpreter to capture two important ideas: that the proposed refinement is intended as a duplication of the interpretation task within the composer, and that the ideas (ideas_C) and the judgement (judgement_C) obtained by the composer may be different from those developed by the interpreter (ideas_I and judgement_I), as a result of the fact that the procedures applied to obtain them are different (Interpretation_C ≠ Interpretation_I and Validation_C ≠ Validation_I).
This type of linear pathway is sometimes referred to as a narrative thread. All the narrative threads deemed relevant from a given input (in truth a selection of all available ones, or even a selection of fragments of the interesting parts of some of them) need to be combined together into a single linear discourse. As a result, this discourse is an ordered and filtered set of facts (properties, events, descriptions...) that are to be conveyed to the interpreter. Filtering involves considering the reader's common knowledge and inferential capabilities. Many concepts that the composer intends to convey may be omitted from the actual discourse if they can be considered to be known or obtainable via inference by the reader. It is also possible that the composer prefers to withhold particular items of information over particular stretches of the discourse, to create or enhance effects such as surprise, expectation, or suspense. Once a discourse has been composed, it can be rendered in a particular medium that can be consumed directly by the intended audience (whether a single interpreter or many). This stage has been called TRANSMISSION, as it involves the task of rendering the discourse in a given medium and making the medium available to the audience, but the part of the process we want to consider here is that of rendering, which involves constructive decisions and may be informed by reflection. The INTERPRETATION stage involves the reconstruction of the content of the message from the discourse for it. This process, when applied to a story received from an external source, constitutes the main task that an interpreter faces. Our stance in this paper is that an integral part of the task of the composer could be to apply a similar procedure to a recently composed discourse, with a view to obtaining feedback on how a hypothetical interpreter might view it. Whether from the discourse itself or from the medium produced to render it, the composer attempts to reconstruct the meaning as a user would, to extract feedback on how the result of his composition task satisfies his communication goals. Over the reconstruction of the content of a story interpreted from a discourse, interpreters (and composers simulating the reaction of an interpreter) develop judgments on the medium, the discourse or the content of the story. This set of operations we refer to as the VALIDATION stage. As with interpretation, we consider that a composer may rely on a version of this stage to obtain feedback on how his output might be received by an interpreter. The role of the INTERPRETATION stage is crucial even if the model is nominally restricted to the task of composition. According to the Flower and Hayes model of the writing task, linearization would occur as part of the translation subtask (converting ideas into text), followed by a number of cycles of reviewing and improving the result. The accumulated literature on modelling story generation indicates that this reviewing stage of discourse, based on an attempt at reconstructing the desired content from the discourse and a comparison between the resulting interpretation and the selected subset of the source material, is a fundamental ingredient of the broader context of the task of story generation. We therefore consider that a model of the task of story generation should include all of the five stages described in order to be considered complete.
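To make the control flow of the five stages concrete, the following sketch renders the ICTIVS cycle as a feedback loop. All five stage functions are hypothetical placeholders; the sketch reflects our reading of the model, not an implementation from the paper.

```python
# A schematic sketch of the ICTIVS cycle as a feedback loop. The five
# stage functions are placeholders supplied by the caller.

def ictivs_cycle(invent, compose, transmit, interpret, validate, max_iters=10):
    """Iterate until the composer's model of the reader validates the result."""
    content = invent()                    # INVENTION: create the source content
    rendering = None
    for _ in range(max_iters):
        discourse = compose(content)      # COMPOSITION: select and linearize
        rendering = transmit(discourse)   # TRANSMISSION: render in a medium
        ideas_c = interpret(rendering)    # INTERPRETATION: composer's reader model
        if validate(ideas_c):             # VALIDATION: judge the interpretation
            break
        content = invent()                # feedback: revise the invented content
    return rendering
```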
One may be tempted to ascribe creativity within this model only to the INVENTION stage, on the grounds that it is there that new content is put together by combining more basic elements. However, there is also room for creativity in the COMPOSITION stage (to come up with new solutions for encoding a given content, possibly fulfilling additional goals in terms of surprise or suspense, while still meeting the communicative constraints) or in the TRANSMISSION stage (to produce alternative novel and valuable renderings for a given discourse). During the INTERPRETATION stage a new instantiation of the narrative message is created. In some cases, the process of COMPOSITION reduces the content so drastically that the INTERPRETATION process requires some creative mechanisms to come up with enough material to make sense of the story. In those cases new ideas not considered by the writer may emerge during this stage. The resulting story is not necessarily equal to the story that the writer invented and transmitted. This point aligns very well with the observations of postmodern literary studies, arising from the work of (Barthes 1977), along the lines that a text does not acquire its ultimate value until it has been interpreted by a particular reader, and that the role of the reader in this process must be valued in terms comparable to those applied to the writer. The VALIDATION process is particularly interesting in terms of creativity. In line with the insights arising from the IDEA model of Colton et al., a fundamental part of the creative act may be the invention of new measures of value. This would correspond to applying creativity at the VALIDATION stage, and it is a feature that has received little attention in the past in computational creativity research. Finally, it is quite possible that creativity as perceived by external observers arises only as a result of a complex interaction between all these processes. This possibility strengthens the argument in favour of models of the composition task that capture all these aspects in a single framework.

The ICTIVS Model and Existing Related Frameworks
The ICTIVS model is compared to a number of existing frameworks for understanding related processes: of creativity, of the writing task, and of natural language generation.

ICTIVS and Models of Creativity
Processes in the INVENTION and COMPOSITION stages would correspond to what Wiggins in his framework defines as rules for traversing the conceptual space. These stages carry out the identification of new artifacts in the conceptual space of stories of the working domain. On the other hand, both the INTERPRETATION and the VALIDATION stages can be seen as ingredients of an evaluation function in Wiggins' formalization. Together they compose a process in which a story is received and judgments are formed. The TRANSMISSION stage is not explicitly addressed by Wiggins, as his model only considers the generation of creative artifacts. Although Colton et al.'s IDEA model is formulated in the context of the development of creative software, its description of the process as an (I)terative (D)evelopment-(E)xecution-(A)ppreciation cycle is applicable to the task of generating a story. Under this view, INVENTION would correspond to Development, COMPOSITION and TRANSMISSION would correspond to Execution, and INTERPRETATION and VALIDATION would correspond to Appreciation.
ICTIVS and Cognitive Models of Writing

From a cognitive point of view, the set of stages that constitute the ICTIVS model aligns reasonably well with the processes described by Flower and Hayes. In terms of the Flower and Hayes model, the INVENTION stage would constitute a specific operation of the planning process. The COMPOSITION stage might be considered partly within the planning process (as regards discourse planning decisions) and partly within the translating process (as regards sentence planning processes). The TRANSMISSION stage would fall directly within the translating process, including the particular restrictions and resources presented by the language to be employed, as Flower and Hayes phrase it. The INTERPRETATION and VALIDATION stages would correspond to the reviewing process of the Flower and Hayes model. The possibility of considering different paths through the various stages of the model would correspond to enriching the model with interaction between the various processes as controlled by a monitor, which is an integral part of the Flower and Hayes model. In terms of Sharples' description of the writing task, it would be simple to say that INVENTION and COMPOSITION correspond to the engagement phase, and that INTERPRETATION and VALIDATION correspond to the reflection phase. However, Sharples' analysis indicates that the process of writing is far from being a simple cycle over such stages, and involves coming and going between them over a period of time, before the actual stage of TRANSMISSION is ever contemplated. In fact, it would probably be fair to say that there might be specific phases of engagement associated with INVENTION, combined with phases of reflection over whatever representation is achieved at that stage, followed by iterations of INVENTION and COMPOSITION engagements (with interspersed phases of reflection as INTERPRETATION and VALIDATION of the resulting discourse), followed by iterations of INVENTION, COMPOSITION and TRANSMISSION engagement (also combined with phases of reflection as above). Such a complex process would match the idea of heavy interaction between planning, translating and reviewing (in Flower and Hayes' terms), and should be considered corroboration of the need for a monitor module to govern how these interactions take place. This monitor would also be in charge of deciding when the final product is ready to be transmitted to the addressee, or generally made public. The process of progressive enrichment of the mental representation of a text beyond its surface form by adding information obtained via inference, as described by van Dijk and Kintsch (1983), is the main component of the INTERPRETATION stage. This does indeed take place when a reader attempts to comprehend a given text. However, the ICTIVS model considers this stage also to be a fundamental part of the process of creation applied by the writer. Much in the way described by Colton et al. in their IDEA model, the process of creating a story is seen as an interactive cycle of production of a text (through processes of INVENTION, COMPOSITION and TRANSMISSION) followed by a process of appreciation (during INTERPRETATION and VALIDATION). The result of this appreciation process can then be fed back to the next iteration of the productive part of the cycle.
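The production-appreciation cycle just described can be caricatured, for illustration only, in a few lines of Python. Every stage below is a toy stand-in of our own (the paper intentionally specifies no representations or algorithms), with validation reduced to checking that all goal facts survive a round trip through composition, transmission and interpretation.

    def invent(material):                # INVENTION: produce candidate content
        return set(material)

    def compose(content):                # COMPOSITION: order and filter the content
        return sorted(content)

    def transmit(discourse):             # TRANSMISSION: render the discourse as text
        return ". ".join(discourse)

    def interpret(medium):               # INTERPRETATION: reconstruct content from text
        return set(medium.split(". "))

    def validate(story, goals):          # VALIDATION: judge the reconstructed story
        return goals <= story

    def ictivs_cycle(material, goals, max_rounds=10):
        content = invent(material)
        for _ in range(max_rounds):
            medium = transmit(compose(content))
            reconstructed = interpret(medium)      # the composer simulating a reader
            if validate(reconstructed, goals):     # feedback closes the cycle
                return medium
            content |= goals - reconstructed       # feed the verdict back to invention
        return medium

    print(ictivs_cycle({"the hero leaves"}, {"the hero leaves", "the hero returns"}))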
Although the cycle is described in full, going all the way to the production of text before entering an appreciation phase, it is perfectly possible (and extremely plausible, if considered in terms of how this task is addressed by humans) that appreciation in this sense may be applied much earlier in the cycle: for instance, once a process of INVENTION has taken place, whatever has been obtained (possibly a set of ideas represented conceptually, or a sketch of the fabula, in narratological terms) may be appreciated, and the resulting information can be fed back to further processes of INVENTION. As INVENTION does not include a step of selection and encoding of information (these tasks concern the COMPOSITION stage), no stage of INTERPRETATION is required as part of this cycle, and feedback may be obtained by direct VALIDATION. A similar internal loop may occur involving COMPOSITION, with the output of a COMPOSITION stage being submitted to appreciation even before entering a stage of TRANSMISSION. In this case, a process of INTERPRETATION may be required before VALIDATION can be applied. Given that Trabasso, van den Broek, and Suh (1989) postulate that a network of causal relations between the different events of a story is fundamental to determining the overall perception of its unity and coherence, it is very likely that VALIDATION of a story involves the identification of an appropriate network of this nature. When VALIDATION is applied directly to the result of an INVENTION stage (fabula), it may consist simply of ensuring that such causal relations are present in the story. When applied to a narrative discourse, an intermediate stage of INTERPRETATION may be required to elicit a representation of such a network from the discourse.

ICTIVS and Natural Language Generation

At first glance, with respect to the classic pipeline structure for natural language generation systems, the ICTIVS stage of INVENTION would correspond to the task of content determination, whereby a fabula is produced (content that may be told), with the discourse planning stage matching the COMPOSITION stage. However, there is a slight misalignment between the two models. The content determination stage of an NLG pipeline assumes all possible content to be present, and applies a selection process to establish what will be included in the communication under consideration. In contrast, the INVENTION stage is concerned with the actual production of the content to be considered. In view of this, both content determination and discourse planning, as understood in NLG terms, can be considered part of the COMPOSITION stage. In truth, all of the NLG pipeline could be considered part of the COMPOSITION stage, with possibly only surface realization being included in the TRANSMISSION stage.

Grounding the ICTIVS Model in Existing Story Generation Systems

The applicability of the proposed model can be illustrated by using it to analyse existing efforts in story generation, with a view to recasting their apparent diversity into a homogeneous framework of understanding, and to better illustrate how they relate to the more complex aspects of narrative generation and to one another. A number of existing systems are discussed below. The selection is not meant to be exhaustive, and it has been designed to include examples of systems that cover different stages of the ICTIVS model.
MEXICA (Perez y Perez 1999) was a computer model designed to study the creative process in writing in terms of the cycle of engagement and reflection (Sharples 1999). It was designed to generate short stories about the Mexicas (also, wrongly, known as Aztecs). MEXICA pioneered, in the realm of automated storytellers, the idea of a cycle of generation and evaluation, with the results of the evaluation being fed back to inform the generation process. In this case, the engagement cycle of MEXICA can be seen as a particular type of INVENTION process that directly produces a linear discourse. Over this discourse, the MEXICA system applies an instance of the VALIDATION stage, which is fed back into the generation process. In addition to this, MEXICA had a procedure for building, from a set of known stories, the knowledge structures called Story Contexts, which explicitly represented the emotional links and tensions between characters in the story. This process would correspond to an ICTIVS stage of INTERPRETATION. Finally, MEXICA provided a template-based procedure for rendering the final discourses as text. This would correspond to a stage of TRANSMISSION. There is very little in the operation of the system that might be considered an instance of COMPOSITION. For ease of exposition, the reviewed systems are grouped into sets based on the stage to which they devote most attention.

Mostly Inventors

The Virtual Storyteller (Theune et al. 2003) introduces a multi-agent approach to story creation where stories are created by cooperating intelligent agents. Characters are implemented as autonomous intelligent agents that can choose their own actions informed by their internal states (including goals and emotions) and their perception of the environment. Narrative is understood to emerge from the interaction of these characters with one another. There is a specific director agent who has basic knowledge about plot structure and exercises control over agents' actions by introducing new characters and objects, giving characters specific goals, or disallowing a character's intended action. There is also a specific narrator agent, in charge of translating the system's representation of states and events into natural language sentences. In terms of the ICTIVS model, most of the operation of the Virtual Storyteller would correspond to a stage of INVENTION, with very simple stages of COMPOSITION and TRANSMISSION encapsulated in the narrator agent. Fabulist (Riedl and Young 2010) was an architecture for automated story generation and presentation. The Fabulist architecture split the narrative generation process into three tiers: fabula generation, discourse generation, and media representation. The fabula generation process used a planning approach to narrative generation, and it would correspond to an ICTIVS stage of INVENTION. The discourse generation would correspond to an ICTIVS stage of COMPOSITION. The media representation would correspond to an ICTIVS stage of TRANSMISSION.

Inventors-Composers

MINSTREL (Turner 1992) was a computer program that told stories about King Arthur and his Knights of the Round Table. The program was started on a moral that was used as a seed to build the story. Story construction in MINSTREL operates as a two-stage process involving a planning stage and a problem-solving stage. At a high level of abstraction, the two processes described for MINSTREL seem to correspond to an amalgamation of the INVENTION and COMPOSITION stages.
BRUTUS (Bringsjord and Ferrucci 1999) was a program that wrote short stories about betrayal. The operation of BRUTUS involves three basic processes, carried out sequentially. First a thematic frame is instantiated. Then a simulation process is set in motion where characters attempt to achieve a set of pre-defined goals, thereby developing a plot. The process of converting the resulting plot into the final output is carried out by the application of a hierarchy of grammars (story grammars, paragraph grammars, sentence grammars) that define how the story is constructed as a sequence of paragraphs which are themselves sequences of sentences. Of these, the instantiation of the thematic frame and the simulation process would correspond to an ICTIVS stage of INVENTION, while the application of the hierarchy of grammars would blend together stages of COMPOSITION and TRANSMISSION.

Mostly Composers

A number of systems have been developed that address the task of generating a discourse for a given set of events (Leon, Hassan, and Gervas 2007; Gervas 2012; Gervas 2013). These systems receive as input a broad description of the set of events to consider and produce from it a conceptual representation of the discourse needed to tell them as a story. The main contributions of these systems correspond to implementations of an ICTIVS stage of COMPOSITION. Most of them include an additional stage of TRANSMISSION that renders the resulting discourses as text. In most cases these renderings are intended for ease of evaluation, and little effort is invested in optimising the quality of the resulting texts. In the nn system for interactive fiction (Montfort 2007), now evolved into the Curveship system (Montfort 2009), the user controls the main character of a story by introducing simple descriptions of what it should do, and the system responds with descriptions of the outcomes of the character's actions. Within nn, the Narrator module provides storytelling functionality, so that the user can ask to be told the story of the interaction so far. The Narrator module of nn was a pioneer among storytellers in that it addressed issues such as order of presentation in narrative and focalization, chronology, and appropriate treatment of tense depending on the relative ordering of speech time, reference time, and event time. In this case, the Narrator module of nn combines a very refined instance of a COMPOSITION stage, which deals with the issue of variation in the narrative form, and a much simpler instance of a TRANSMISSION module, which renders the resulting discourse as text.

Mostly Transmitters

STORYBOOK (Callaway and Lester 2002) produced multi-page stories in the Little Red Riding Hood domain by relying on elaborate natural language generation tasks. Callaway's system is a real-time narrative prose generator that takes as input an instance of the presentational ordering desired for the text and an instance of the sum of the factual content that constitutes the story, and intelligently combines the information found in the two with stylistic directives to produce narrative prose. In this sense, STORYBOOK can be said to be centred on the TRANSMISSION stage of the ICTIVS model. The process of devising the presentational ordering desired for the text from the sum of the factual content that constitutes the story would correspond to the COMPOSITION stage of the ICTIVS model.
The task of developing the sum of the factual content that constitutes the story, not actually addressed by STORYBOOK, would correspond to the INVENTION stage of the ICTIVS model.

Inventors-Validators

Stella (Leon and Gervas 2011; Leon and Gervas 2012) performs story generation by traversing a conceptual space of partial world states based on narrative aspects. World states are generated as the result of non-deterministic interaction between characters and their environment. This generation is narrative-agnostic, and an additional level built on top of the world evolution chooses the most promising states in terms of their narrative features. Stella makes use of objective curves representing these features and selects world states whose characteristics match the ones represented by these curves. Stella is an example of INVENTION based on VALIDATION of internal states.

Composers-Interpreters

A significant example is the INFER system (Niehaus 2009), a narrative discourse generation system that employs an explicit computational model of a reader's comprehension process during reading to select content from an event log, with a view to creating discourses that satisfy comprehension criteria.

Mostly Interpreters

An example is INDEXTER (Cardona-Rivera et al. 2012), a cognitive framework which predicts the salience of previously experienced events in memory based on the current event that the audience exposed to a narrative is experiencing. This system constitutes a model of the experience of the reader, and it involves a process of INTERPRETATION in the sense that it aims to model the online mental state of the audience experiencing the narrative. This requires progressive monitoring of the effect of each increment in the narrative on this model.

A Shortage of Validators

The VALIDATION stage of the ICTIVS model has not seen as many implementations over the years. There has been a significant research effort on the evaluation of results from story generators of various types, but these efforts have consisted mostly of evaluations carried out by humans over results produced by generation systems. These efforts include: evaluating the effects of text choices on reader satisfaction (Callaway and Lester 2001), evaluating plots in terms of their acceptability and their novelty as perceived by users (Peinado and Gervas 2006), and the development of specific frameworks for evaluating aspects of automatically generated narrative (Rowe et al. 2009). Some existing systems (Perez y Perez 1999; Cheong 2007; Bae and Young 2008; Niehaus 2009; Leon and Gervas 2010) did include a specific module for validating their output as it is constructed. Of these, different systems focused on specific aspects, such as emotional tensions (Perez y Perez 1999), suspense (Cheong 2007), surprise (Bae and Young 2008), comprehensibility (Niehaus 2009), or conformance with a user-given specification of the evolution of particular parameters over the story (Leon and Gervas 2012). All these systems involve some type of cycle of constructing a candidate story (sometimes a partial draft rather than a complete one) and applying some function to validate it before continuing. It is only in recent times that systems devoted specifically to validating properties of a narrative have been developed, such as the DRAMATIS model for evaluating suspense in narratives (O'Neil 2013), which includes a significant stage of interpretation to make validation possible.
Conclusions

The arguments presented in this paper suggest that the inclusion of explicit processes of interpretation and validation to inform and complement the task of constructing narratives is plausible in terms of existing models of the task in human cognition. They also show how existing efforts at modelling various aspects of the storytelling task have already addressed computational modelling of the various aspects that would be required to implement such inclusion. The proposed solution would achieve the integration, within the computational model of narrative construction, of both a model of the reader and specific procedures for the evaluation of candidate results. This would address long-standing requirements on the storytelling task (Bailey 1997) and more recently voiced requirements on the improvement of scientific rigour in the evaluation of creative systems (Jordanous 2011). However, it must be said that the ICTIVS model is not intended as a cognitively plausible model of the way humans deal with narratives. Instead, it is proposed as a conceptual framework that might help to understand the diversity of existing efforts in story generation, and how they relate to the more complex aspects of narrative generation and to one another. In this sense, the ICTIVS model is put forward as a rallying call for researchers in the fields of narrative modelling, story generation and computational creativity to start advancing along the difficult road of integrating existing views and development efforts. The ICTIVS model may contribute to this task in two different ways. First, by naming and clarifying some of the subprocesses involved, it may allow future research efforts to focus on the less well explored aspects of the described cycle, which should help to enrich our overall understanding of the phenomenon. Second, by providing a simple framework for analysing existing systems in terms of a set of common elementary operations, it can help identify parts of existing systems that it might be useful to reuse in future developments or to combine with other existing ones. To this end, a conscious effort has been made to formulate the ICTIVS model at a purely conceptual level. To ensure compatibility with the broad variety of representations employed in existing systems, no detail is given of what specific representations might be considered for the data exchanged between different phases. Progress along the lines of defining formal interfaces between the various stages is desirable in the long run, but it would require a thorough and detailed review of existing efforts in search of a consensus on possible representations for the various stages. The WHIM project, funded by the European Commission under call FP7-ICT-2013-10 with grant agreement number 611560, is a three-year project that sets out to explore technologies for ideation, with a particular focus on the role that narrative generation might play in evaluating the quality of ideas. Among its objectives, it includes an effort to provide a workable specification of narrative oriented towards generation. It is envisaged that this effort will contribute to clarifying some of the details that have been glossed over in the present paper. The effort invested so far in developing computational solutions aimed at achieving or improving computational generation of narrative has uncovered a number of different aspects to the basic phenomenon of telling a story.
Whereas all these approaches clearly address different aspects of the task of generating narrative, so far the efforts to model them have occurred as separate and disjoint initiatives. There is an enormous potential for improvement if a way were found to combine results from these initiatives with one another. The model presented in this paper provides a theoretical framework that can be employed, first, to understand how these various aspects of the task of generating narrative relate to one another; second, to identify which of these aspects are being addressed by the different frameworks; and, finally, to point the way towards possible integrations of these aspects within progressively more complex systems. Systems obtained in this way are more likely to be perceived as models of the human ability to generate stories. A set of important insights arise from the application of the model to a selection of existing systems:

1. There are several distinct computational processes involved in the generation of a story: invention of the material to be used, composition of the material as a valuable linear discourse, and transmission of this discourse using some medium.

2. Each one of these processes contributes features to the final story that may be evaluated separately: on the material to be used one may evaluate coherence or originality; on the discourse, issues such as comprehensibility, surprise, and suspense; on the final medium, grammaticality or fluency.

3. Some of the features arise only as an interaction between the processes, and some require an intermediate process of interpretation to bring to the fore this interaction between the underlying material and the discourse used to convey it.

As a result, efforts at computational modelling must take into account the various processes, the interaction between them, and the need for a validation stage as an integral part of the process. From the point of view of creativity, it is important to note that most existing efforts at story generation have focused on obtaining acceptable stories, with very little attention to the perceived creativity of the process. Even in cases such as (Turner 1992; Perez y Perez 1999) that declare an explicit interest in creativity, the actual implementation and evaluation process does not address issues that are considered fundamental in the emerging field of computational creativity, like novelty or sustained creativity. This is largely due to the inherent technical difficulties in achieving results that can be considered acceptable stories, let alone creative ones. Creativity in story generation may arise from any of the processes involved, and further creativity may arise from the interactions between them. Taking the argument above to the extreme, for story generators with an aspiration to be considered truly creative systems, the validation stage must include specific solutions for measuring creativity-related features beyond those that are elementary requirements of the story form. Finally, two important ideas arise from the interaction between the proposed model and considerations on creativity. The first is that creativity may be involved in many of the processes in this model, not just in that of inventing the content of a story. Composition and interpretation of stories may involve significant amounts of creativity. The creation of innovative procedures for evaluation or validation of stories may be considered a highly creative achievement.
The second is that a perception of creativity in a storytelling system may arise from the interaction between all these processes rather than being located in a particular one. This constitutes a strong argument in favour of attempting the implementation and study of models of storytelling along the lines of the proposed model.

Acknowledgments

This paper has been partially supported by the projects WHIM 611560 and PROSECCO 600653, funded by the European Commission, Framework Program 7, the ICT theme, and the Future and Emerging Technologies FET program.

2014_24 !2014

Social Mexica: A computer model for social norms in narratives

Iván Guerrero Román1, Rafael Pérez y Pérez2
1Posgrado en ciencia e ingeniería de la computación, Universidad Nacional Autónoma de México, México D.F.
2División de Ciencias de la Comunicación y Diseño, Universidad Autónoma Metropolitana, Cuajimalpa, México D.F.
cguerreror@uxmcc2.iimas.unam.mx, rperez@correo.cua.uam.mx

Abstract

Several models for automatic storytelling represent social norms by embedding social knowledge into their structures. In contrast, this model explicitly describes computational structures to represent knowledge related to social norms, mechanisms to identify when a social norm is broken within a narrative, and a set of constraints and filters to employ such social knowledge during the narrative generation process. An implementation of the model employing MEXICA, an automatic storyteller based on the Engagement-Reflection creativity model, as the source of story plots is presented. Lastly, the results of a survey are presented as a preliminary evaluation of the model.

Introduction

The study of automatic storytelling has served several purposes: e.g. to cast light on how human creativity works, to identify which cognitive processes are involved, and so on. However, studies about how social knowledge can be explicitly represented and employed during plot generation are mostly absent among current systems. A social norm is defined as a general expected behavior with social relevance inside a social group (Durkheim 1982; Sherif 1936); when the norm is broken, the group sanctions the person responsible for it (e.g. social rejection). We are interested in studying how social norms can be exploited in the context of plot generation. We have the following hypothesis: the rupture of a social norm allows the development of an interesting and novel narrative. Nevertheless, a system that breaks social norms action after action may produce incoherent and uninteresting narratives (Pérez y Pérez et al. 2011). In this way, social knowledge is relevant to the story generation process because it provides valuable information to ensure and evaluate aspects such as the coherence, novelty and interestingness of a narrative. The rupture of a social norm may increase the tension of a story, making it more interesting, but the abuse of this resource may affect the coherence and overall quality of the generated narratives. When a story hero breaks a social norm, the novelty may increase; nevertheless, if this strategy is employed several times, the result may be the opposite. Automatic storytellers, such as Daydreamer (Mueller 1990), MEXICA (Pérez y Pérez 1999), or Fabulist (Riedl 2004), include tacit social knowledge as part of their general structures. Sometimes, this knowledge is represented as action preconditions to prevent the inclusion of incoherent material. In other cases, this information is hardcoded.
However, none of these systems detect when a social norm has been broken, nor do they take advantage of this information during plot generation. The purpose of this work is to provide our plot generator, MEXICA, with the capacity to employ social knowledge. Thus, we have developed mechanisms to extract social norms from inspiring stories, to detect the rupture of social norms, and to take advantage of this information during plot generation to improve the interestingness of the story in progress.

Previous work

Thespian (Si 2005), Comme il Faut (McCoy 2010) and Mimesis (Harrell 2012) are examples of computer models that include social knowledge in their structures. In this section, the procedures employed by each of these systems to create narratives, and the structures they employ to represent social knowledge, are briefly reviewed. Thespian is a system for creating interactive narratives in a 3D world. One of the characters, handled by a human, travels through an environment interacting with the other available characters. Each character has goals to accomplish and known facts that make up its state. To fulfill a goal, dynamic functions, which alter the state of the characters, are employed. Thespian describes a model of social norms that guides the conversation between characters. The social norms described in this model serve the purpose of conducting a conversation; thus, a social norm is broken only when the expected conversation flow is broken. Comme il Faut is a playable computer model of social interactions that provides a set of characters with the ability to interact with one another inside a virtual world. Every game starts by defining the characters (traits, basic needs, relations with other characters) and the set of known facts inside the virtual world. Every character additionally has a set of goals to fulfill during the game. At the beginning, all the goals are weighted, and one of them is selected to start. A social interaction is then selected to satisfy the chosen goal. Every social interaction has a set of possible results linked to it. Once a social interaction is performed, one of these results is selected relying on the available information about the world and the characters involved in the interaction. Finally, a new goal from one of the characters is selected, and the process moves on until a predefined game goal is accomplished. This model contemplates social norms inside its knowledge structures in the form of rules (if a romantic relation exists between characters x and y, then x can start dating y). These rules are manually defined by the model designer, and their contexts are sometimes not flexible enough to cover different scenarios. Mimesis is a system for interactive narratives which explores the social phenomenon of discrimination by employing games and social networks. The system provides mechanisms to create characters based on the musical preferences of the player, which are retrieved from the information available in social networks. From this information, a set of attitudes is assigned to the character. The system further employs this information to retrieve social aggressions that are presented to the user as gestures of the character or as textual information. Although these systems consider the inclusion of social knowledge, their approaches still invite contention because of the lack of mechanisms to determine the rupture of social norms.
Additionally, mechanisms to automatically incorporate new social norms remain to be developed, and the constrained potential of these systems to use social knowledge during the story generation process can be improved as well.

Model description

This paper describes a computer model for representing social norms, detecting their rupture, and providing guidelines during plot generation to improve the interestingness of the story in progress. As mentioned earlier, a social norm is defined as a general expected behavior with social relevance inside a social group, and its rupture generates a sanction against the action performer. Of all the expected behaviors present inside a social group, some are irrelevant to the group. Breathing is an expected behavior, but it has no relevance inside a narrative. On the other hand, not preserving the life of a person is relevant to a social group because it jeopardizes the group's welfare. In this case a social norm arises to preserve the well-being of the group. The concept of welfare preservation has multiple interpretations depending on the social group. Some definitions include terms such as happiness, health and prosperity, all of them terms with a certain degree of subjectivity. In this work, the rupture of a social norm is delineated in terms of two premises. The first considers learning mechanisms to identify the relevant elements of scenarios where the rupture of a social norm occurs. The second is based upon the following premise: a social norm is broken when an action unjustifiably jeopardizes the welfare of a social group. On the grounds of previous studies of social knowledge (Echebarria 1993; Durkheim, Cosman and Cladis 2001), a mechanism to learn social norms is based on the recognition of the elements present when an action triggers a punishment from a social group against the action performer. The set of these detected elements shapes the context where the action occurred. The first mechanism to identify the rupture of a social norm is based on the detection and representation of such contexts, called social contexts, and their further identification inside a narrative. The second mechanism employs the concepts of welfare and justified action. To represent the welfare of a social group, the model can be configured with a set of behaviors considered as disturbances of that state. This element provides flexibility to the model and allows the user to determine when the welfare of a social group is threatened. The concept of justified action is built upon crime and social norms theory (Nieves 2010). These theories contend that the aggressor's rights lose relevance in contrast to the defender's rights. Based on this idea, the following premise is stated: within a story, an action that threatens the welfare of a social group is justified if, previously during the story, the action receiver had originated a welfare threat of equal or lower intensity against the action performer. There are different kinds of social norms employed inside narratives. Certain norms intend to preserve the cohesion of a social group (a social norm that upholds an initiation ritual serves this purpose); others preserve different values for a group. The scope of our model of social norms is bounded to those norms that can be represented with a social context and that intend to preserve the welfare of a social group. This model consists of three parts. The first, called the narrative model, presents the required elements to represent a narrative.
The second, called the social groups' representation, introduces the basic elements to provide the system with social groups. The last, called the social norms' model, comprises the components employed to identify, represent and employ social norms during the story generation process.

Narrative model

Our model obtains its knowledge structures from MEXICA (Pérez y Pérez and Sharples 2001; Pérez y Pérez 2007), an automatic storyteller. For this reason, this system is explained in the following section.

MEXICA

This storyteller represents the writing process as a succession of two cycles. During the first of them, called engagement, the writer focuses his efforts on producing novel related ideas guided by several constraints, and on transforming them into text. On the other hand, the reflection cycle presents a retrospective stage where the agent analyses the produced material, explores feasible modifications, transforms the text, and finally triggers new constraints that will be employed in future iterations of the process. MEXICA employs several knowledge structures to implement this creativity model: an actions' library, an inspiring set of stories, and a group of characters and locations available in the system (see Table 1 for the list of available characters). The actions' library serves as a repository for the basic building blocks of a story, the primitive actions. Each primitive action consists of an action name and the following sets: characters, preconditions and post conditions.

Tlatoani (T), Prince (P), Princess (Ps), Priest (Pt), Eagle and Jaguar Knights (EJ, JK), Fisherman (Fs), Virgin (V), Slave (S), Hunter (H), Lady (L), Enemy (E), Trader (Tr), Warrior (W), Farmer (F), Artist (A)

Table 1: Available characters in MEXICA.

The preconditions and post conditions are both sets of relations between characters. The available relations are of two types: emotional links and tensions. Emotional links represent affective reactions between characters. Each link consists of the following elements: type, valence and intensity. The type can be love or friendship between characters, the valence can be positive or negative, and the intensity is an integer in the range [0, 3]. Tensions represent conflicts between characters, and consist of a state (active, 'on', or inactive, 'off') and a type. A list of all the relations is shown in Table 2.

Emotional links    Tensions
Love               Actor dead (Ad)
Friendship         Life at risk (Lr)
                   Health at risk (Hr)
                   Prisoner (Pr)
                   Clashing emotions (Ce)
                   Love competition (Lc)
                   Potential danger (Pd)

Table 2: Available relations between characters in MEXICA.

Figure 3 shows graphical representations for each type of relation between characters. An emotional friendship relation (upper left) is represented by a continuous line with the valence and intensity at the top. An emotional love relation (lower left) is represented by a dotted line with the valence and intensity at the top. A tension between two characters (right) is represented by a saw tooth linking the two characters, with the abbreviation of the tension type. In MEXICA, a story is presented as an ordered sequence of actions. Each story has an associated knowledge structure, called the story-context, where all the known facts in the story are registered. Every time an action is performed, this story-context is updated. Another knowledge structure is the inspiring set of stories, which consists of multiple stories created by humans representing well-formed narratives.
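The emotional links and tensions described above map naturally onto simple record types. The following sketch is our own illustrative encoding (the field names are assumptions), but the value ranges follow the text: love or friendship links with positive or negative valence and intensity in [0, 3], and tensions with a type and an on/off state.

    from dataclasses import dataclass

    @dataclass
    class EmotionalLink:
        source: str            # e.g. "Princess"
        target: str            # e.g. "Warrior"
        kind: str              # "love" or "friendship"
        valence: int           # +1 (positive) or -1 (negative)
        intensity: int         # integer in the range [0, 3]

    @dataclass
    class Tension:
        source: str
        target: str
        kind: str              # e.g. "Lr" (life at risk) or "Ad" (actor dead)
        active: bool = True    # state: active ("on") or inactive ("off")

    # A tiny story context: the princess loves the warrior, whose life is at risk.
    story_context = [
        EmotionalLink("Princess", "Warrior", "love", +1, 3),
        Tension("Enemy", "Warrior", "Lr"),
    ]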
These stories are written in the same format as a regular story generated by MEXICA, as action sequences. Each inspiring story is analyzed to create additional computer structures called contextual structures. A contextual structure is a generalization of each story-context obtained by analyzing an inspiring story. It represents a situation that happened in the analyzed story. Each structure has an associated set of actions that can be performed if a similar situation occurs in a new story. The generalization process for a context consists in the replacement of each character with a variable. Every time a story context is generalized, the next action in the story is generalized as well, and added to the list of following actions of the generated contextual structure.

Story generation process

To create a story in MEXICA, an initial action is instantiated and added to a new story. Each engagement step begins by obtaining a list of feasible following actions. For this purpose, the context of the current story is generalized and compared against each of the available contextual structures. The similar structures are then filtered by a group of constraints activated during the reflective step. Then the first one is selected, and one of its following actions is instantiated and added to the story. A new engagement step begins until the maximum number of actions is reached. If no contextual structures remain after the filtering process, an impasse is declared and a reflection cycle begins. Each reflective step begins by determining the unsatisfied preconditions of each action in the story. When a precondition is not equivalent to a relation inside the story context, it is called unsatisfied. To solve this problem, a new action with an equivalent post condition is instantiated and added to the story just before the analyzed action. When every precondition of one action is satisfied, the next action in the story is analyzed. A story finishes when one of the following criteria is fulfilled: all the characters in the story are dead, a declared impasse could not be solved, or the maximum number of actions for a story is reached.
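A schematic rendering of one engagement step, following the description above, might look as follows. The toy generalization and matching routines are assumptions standing in for MEXICA's actual procedures, which operate over the richer structures of the previous sections.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class ContextualStructure:
        context: frozenset                 # generalized relations, e.g. ("attacked", "A", "B")
        next_actions: list = field(default_factory=list)

    def generalize(story_context):
        """Toy generalization: replace concrete characters with variables A, B, ..."""
        mapping, out = {}, set()
        for kind, x, y in story_context:
            for name in (x, y):
                mapping.setdefault(name, chr(ord("A") + len(mapping)))
            out.add((kind, mapping[x], mapping[y]))
        return frozenset(out)

    def engagement_step(story, story_context, structures, constraints):
        """Extend the story by one action, or return None to declare an impasse."""
        generalized = generalize(story_context)
        candidates = [cs for cs in structures if cs.context <= generalized]
        candidates = [cs for cs in candidates          # filters set during reflection
                      if all(ok(cs) for ok in constraints)]
        if not candidates:
            return None                                # impasse: a reflection cycle begins
        chosen = candidates[0]                         # the text says the first one is taken
        return story + [random.choice(chosen.next_actions)]

    # Example: a known situation ("A attacked B") suggests a following action.
    known = ContextualStructure(frozenset({("attacked", "A", "B")}), ["B ran away"])
    print(engagement_step([], [("attacked", "hunter", "jaguar knight")], [known], []))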
Social groups' representation

The original version of MEXICA does not contemplate structures that represent social groups. Their representation is relevant to the model because social groups constrain the scope of a social norm and establish relations between the characters that allow the system to identify ruptures. In this work, every group consists of an ordered set of hierarchies. A hierarchy is a set of characters, and has an associated numeric value, called its level, which is employed to prioritize it inside a group. Table 4 shows the basic groups inside the model. They are defined by the user in a text file, so new collections can be added to the implementation. The only constraint is to maintain at least two basic components: one for the gender structure and another for the social structure of the characters. These two groups are relevant for the system since social and gender relations are often important to determine whether a social rupture occurred.

Social Hierarchy   Level   Characters
Nobility           5       Tlatoani, Priest
High Society       4       Prince, Princess
Fighters           3       Eagle and Jaguar Knights, Warrior
Workers            2       Farmer, Fisherman, Artist, Lady, Virgin, Hunter, Trader
Low society        1       Enemy, Slave

Gender Hierarchy   Level   Characters
Male               2       Tlatoani, Priest, Prince, Eagle and Jaguar Knights, Farmer, Fisherman, Artist, Hunter, Enemy, Slave, Trader, Warrior
Female             1       Princess, Lady, Virgin

Table 4: Social groups inside the model.

Social norms' model

In this research we employ social relations, social actions and contextual structures to represent norms.

Social relations

A social relation represents the awareness of the rupture of a norm inside a story. Our system works with two types: emotions and tensions. Emotional links represent reactions between characters due to an action with social concern. Each one consists of the following elements: type, sign and intensity. The current implementation only includes one type, known as social acceptance between characters; the sign can be positive or negative; and the intensity is an integer in the range [0, 3]. Tensions represent conflicts due to a norm breakage, and they consist of a state (active or inactive) and a type. Table 5 displays the available relations.

Emotional links      Tensions
Social acceptance    Social disobedience (Sd)
                     Social burden (Sb)
                     Social threat (St)
                     Social clashing emotions (Sce)

Table 5: Additional social relations between characters for the model.

Social actions

Social actions (s-actions) are employed to emphasize the presence of a socially relevant action inside a story. For instance, the fragment of story presented in Table 6 shows an s-action (in bold) employed to highlight the presence of a social rupture.

The hunter hated the jaguar knight. The hunter attacked the jaguar knight. The jaguar knight ran away. The jaguar knight was a coward fighter.

Table 6: Fragment of a story presenting an s-action.

When an s-action is appended to a story, it serves to add social relations to the story context and to emphasize the rupture of a social norm; on the other hand, when s-actions are employed in an inspiring story, they serve as markers for social contexts where a rupture has occurred. These actions present evaluative clauses as part of their associated texts. These clauses can be employed to incorporate author values and valid norms into the story text. Each social action consists of an action name, a set of characters, a set of associated texts, a post condition, and its relations. The post condition of a social action consists of a social tension and its mode: insert, remove, or justify. The socially relevant character attribute can have one of the following values: Performer, Receiver, None, Both. The socially relevant relation attribute can have one of the following values: Gender, Social, None, Both. The socially relevant elements of these actions are employed during the story context generalization process to represent the elements of the story context that reflect the rupture of a social norm (see the following section for a detailed description of these elements).

Action name: acted against Mexicas' will with
Character variables: A, B
Post condition: insert social rejection towards A
Socially relevant character: Receiver
Socially relevant relation: None

Table 7: Example of a social action.

Table 7 presents a social action employed to emphasize the rupture of a social norm by a character when he acts against the Mexicas' customs.
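For concreteness, the social action of Table 7 can be written down as a record; the field names below are our assumptions, based on the attributes listed in the text.

    from dataclasses import dataclass

    @dataclass
    class SocialAction:
        name: str
        characters: tuple            # character variables, e.g. ("A", "B")
        post_condition: str          # a social tension plus its mode
        relevant_character: str      # "Performer", "Receiver", "None" or "Both"
        relevant_relation: str       # "Gender", "Social", "None" or "Both"

    # The social action of Table 7, transcribed as data.
    acted_against_will = SocialAction(
        name="acted against Mexicas' will with",
        characters=("A", "B"),
        post_condition="insert social rejection towards A",
        relevant_character="Receiver",
        relevant_relation="None",
    )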
The effect of this action is to attach a social rejection link towards the action performer from every character aware of the rupture.

Social contextual structures

Social contextual structures, which are similar to those employed by MEXICA during the engagement phase, are built to generalize social contexts. They consist of a social context and a reference to the social action that engendered it. Their generation process begins with the generation of the context. This is obtained by generalizing the story context when a social action is found inside an inspiring story. The process consists in the replacement of each character with a variable representing it, constrained by the socially relevant character attribute of the social action. When the socially relevant character attribute of a social action is set to 'Performer' or 'Receiver', that character is not generalized; if it is set to 'Both', neither of the characters is generalized; if it is set to 'None', both characters are generalized. When the socially relevant relation attribute of a social action is set to 'Gender' or 'Social', the distance between the hierarchies of the characters is stored. Once the social context has been obtained, every emotional link that does not involve both of the social action's characters is removed. In the same way, every tension that does not involve either of these characters is discarded. Tensions that had been removed from the story context, however, are retained as part of the context, marked as removed. Lastly, the detected social action is linked to the social contextual structure.

The artist was friend of the prince. The enemy had an accident. The artist realized the enemy had an accident. The artist cured the enemy. The artist acted against Mexicas' will with the enemy. ...

Table 8: Example of a partial story.

Table 8 presents a partial story formed by four actions and one social action (in bold). Once these actions have been added to the plot, the story-context in Figure 9 is created. In it, the tension Hr (health at risk) is marked with a slash to represent that it was removed from the story context by the action "The artist cured the enemy". Additionally, a social emotion (represented by alternating short and long line segments) from the prince towards the artist has been added due to the identification of a social rupture in the fourth action of the story. This rupture originated because the artist acted against the Mexicas' will by rewarding the enemy.

Figure 9: Story-context for the story in Table 8.

From this context, the social contextual structure in Figure 10 is obtained. Inside its context, character variables are represented by upper-case letters, and non-generalized characters are presented with the prefix 'c'. In our example, only one of the characters of the context was replaced by a variable (the artist), since the social action employed (see Table 7) marked the action receiver as the socially relevant character. Also, the relations from and to the prince were removed, since he was not part of the social action. This context represents a social rule condemning positive emotional links from the enemy, even if he is in danger.
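The generalization constraints just described can be condensed into a sketch like the one below; these generalized contexts are what the second rupture-detection mechanism, described next, searches for inside the story context. The relation encoding (tuples with a flag distinguishing emotional links from tensions) is a simplification we introduce for illustration, and the hierarchy-distance bookkeeping is omitted.

    def generalize_social_context(context, performer, receiver, relevant_character):
        """context: iterable of (kind, x, y, is_emotional) tuples."""
        keep_concrete = {
            "Performer": {performer},
            "Receiver": {receiver},
            "Both": {performer, receiver},
            "None": set(),
        }[relevant_character]
        pair = {performer, receiver}
        variables, out = {}, []

        def name_of(who):
            if who in keep_concrete:
                return "c" + who          # non-generalized: 'c' prefix, as in Figure 10
            return variables.setdefault(who, chr(ord("A") + len(variables)))

        for kind, x, y, is_emotional in context:
            if is_emotional and {x, y} != pair:
                continue                  # drop links not between both characters
            if not is_emotional and not ({x, y} & pair):
                continue                  # drop tensions involving neither character
            out.append((kind, name_of(x), name_of(y)))
        return out

    # Echoing Figure 10: the artist is generalized, the enemy is kept concrete,
    # and the prince's friendship link is dropped.
    print(generalize_social_context(
        [("Hr-removed", "enemy", "enemy", False), ("friendship", "artist", "prince", True)],
        performer="artist", receiver="enemy", relevant_character="Receiver"))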
Rupture of social norms

Our model presents two mechanisms to determine when a social relation is added to the story context. The first looks up specific relations between the characters inside the story context and, if they are present, a social relation is triggered. The second looks for social contexts inside the story context and appends the social relation linked to them. Regarding the first mechanism, a social emotional link with negative valence is triggered when a character breaks a norm. This link represents social rejection. The same link with positive valence appears when a character performs an action that removes a tension from the story context. These links go from each character that identifies the rupture towards the action performer. If several emotional links with the same valence but different intensities exist, only the one with the highest absolute intensity remains and the rest are removed. If several emotional links with different valences exist, the social clashing emotions tension is triggered, which represents ambivalent feelings towards a character. A tension of social disobedience is triggered when a character at a lower social level breaks a social norm against another character at a higher level. A tension of social burden represents malpractice by a character at a higher social level against another character. A tension of social threat identifies a character that has broken norms several times, or has broken an intense norm. The second mechanism is explained in detail in the following section.

Mechanisms to identify social ruptures

Two processes are proposed to identify when a social norm is broken inside a story. The first is based on the hypothesis presented above to identify a threat to the welfare of a social group. The second consists in the identification, inside the story context, of any learned social context. Regarding the first process, the tensions Lr, Hr, Pr and Ad, considered to alter the well-being of a social group, are introduced into the system. The first three are called tensions with moderate social relevance; the last is called a tension with intense social relevance. When a tension with social relevance is unjustifiably triggered inside a story, a social norm is considered to be broken. An action that triggered a moderate or intense tension is justified when, previously in the story, at least one of these two facts stands:

- Another tension was triggered against the action performer (such as in self-defense).
- Another tension was triggered by the action receiver against any character positively linked to the action performer (as in the case of a father defending his child).

A character is said to be positively linked to another character when, inside the story context, an emotional link with positive valence exists between them. A justified action is exemplified by the following actions: "The princess was sister of the prince. The tlatoani hated the prince and decided to attack him." The last action (the tlatoani decided to attack the prince) causes the prince's health to be at risk, which is a moderate tension. Since, previously in the story, no equivalent tension had been triggered, the action breaks a social norm. If the action "the princess attacked back the tlatoani, causing his death" is added to the example, it is justified. This is because, even though it originated an intense tension (a character died), this tension is justified by the previous action of the tlatoani and because the princess is positively linked to the prince.
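The two justification conditions listed above translate directly into a small predicate; the tuple-based encoding below is an assumption made for illustration, and the comparison of tension intensities (equal or lower, per the premise stated earlier) is omitted here.

    def is_justified(performer, receiver, prior_tensions, positive_links):
        """prior_tensions: (source, target) pairs triggered earlier in the story;
        positive_links: (a, b) pairs joined by a positive emotional link."""
        for source, target in prior_tensions:
            if target == performer:                       # condition 1: e.g. self-defense
                return True
            if source == receiver and (target, performer) in positive_links:
                return True                               # condition 2: defending a loved one
        return False

    # The example above: the princess attacks the tlatoani, who had earlier
    # attacked the prince, her brother (a positively linked character).
    prior = [("tlatoani", "prince")]
    links = {("prince", "princess"), ("princess", "prince")}
    print(is_justified("princess", "tlatoani", prior, links))  # True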
The second process to identify a threat to the welfare employs the stored contextual structures. It begins by analyzing the story context once an action is added to the story. If a social contextual structure whose context is included inside the story context is detected, the last action in the story is marked as socially relevant. If a relation that justifies the post condition of the social contextual structure is present inside the story context, the action is marked as justified; otherwise, the action is marked as unjustified. A relation justifies another if it is of the same type, its sign is equal, and its intensity is equal or lower. When an action is unjustified, the post condition of the social contextual structure which triggered this state is instantiated with the action's characters and added to the story context. This social link emphasizes the rupture of the social norm just detected. If the action is marked as justified or normal, no additional relations are added to the context.

Relevance of social norms

A story that presents low levels of tension usually focuses on introducing relations between characters or non-relevant actions, such as location changes. These stories frequently become boring due to the lack of remarkable actions. Table 11 presents an example of such a story generated by the model. The inclusion of tensions inside a story according to the Aristotelian tension curve gears the system towards the generation of interesting and coherent stories. Nevertheless, some of the knowledge structures generated by MEXICA, such as contextual structures, still lack information, such as social relations, which originates inconsistencies in the generated stories.

The artist went to Texcoco lake with the lady. The virgin followed the artist. The virgin admired and respected the artist. The artist went to Tlatelolco market. The lady found by accident the artist. The artist was brother of the lady.

Table 11: Story plot with low levels of tension generated by the model.

The jaguar knight went hunting with the tlatoani. The fisherman hated the tlatoani. The fisherman attacked the tlatoani. The tlatoani attacked the fisherman. The jaguar knight made prisoner the fisherman.

Table 12: Sample story generated by the implementation of the model.

In Figure 13, the story context on the left was generated without employing the model, after the third action of the story in Table 12. The story context on the right was generated employing the model in the same scenario. The contextual structure generated from the first context contains the same relations between the characters, but replaces them with the variables A and B. When this structure is employed for the generation of a new story, both characters are indistinguishable, since they have the same relations. The following action associated with the contextual structure is "C made prisoner A", but since either of the characters can be selected, in a story where this contextual structure is selected the tlatoani can be sent to jail. This example shows an important difference made by employing the model of social norms: the problem just introduced can be disentangled by the inclusion of a social relation towards the fisherman, who was the character who broke a social norm.

The fisherman was friend of the princess. The princess went to Texcoco lake with the fisherman. The princess had an accident. The artist realized that the princess had an accident. The artist did not cure the princess. The princess, in a sacrifice ritual, ended up with her life.

Table 14: Story with few social norms broken generated by the implementation of the model.

Testing the Model

We employed a questionnaire to cast light on how the model's implementation serves the purpose of generating more interesting narratives.
For this purpose, three stories were presented to a group of forty people with at least a bachelor's degree (in progress or concluded). The first story (presented in Table 11) was presented with the purpose of representing a scenario where no social norms were broken. The second story (presented in Table 14) proposes a scenario where a few social norms were broken, and the third story (presented in Table 15) provides a plot with multiple social norms broken.

The warrior had an accident. The tlatoani realized that the warrior had an accident. The tlatoani cured the warrior. The virgin mugged the tlatoani. The warrior killed the virgin. The warrior sacrificed himself.

Table 15: Story with multiple social norms broken generated by the implementation of the model.

The first questions (see Table 17) focused on the overall evaluation of the interestingness of each story. The range employed was from 1 (non-interesting) to 5 (very interesting). The average evaluations obtained were the following: 2.62 for story 1, 3.35 for story 2, and 3.43 for story 3. Figure 16 shows these results.

Figure 16: Results of the interestingness evaluation of the stories. The vertical axis represents the percentage of students that selected each option displayed on the horizontal axis.

In general, how interesting was the first story for you?
In general, how interesting was the second story for you?
In general, how interesting was the third story for you?

Table 17: Questions for the overall evaluation of interestingness.

A second group of questions (see Table 18) focused on the appreciation of social norm ruptures inside each story. Only 23% of the students identified an action that broke a social norm inside the first story, 81% identified at least one social rupture inside the second story, and 86% detected social ruptures inside the last story.

After reading the first story, which actions do you consider break a social norm?
After reading the second story, which actions do you consider break a social norm?
After reading the third story, which actions do you consider break a social norm?

Table 18: Questions for detecting social norm ruptures.

Figure 19 presents the percentages of students identifying an action breaking a social norm inside each story. The vertical axis shows this percentage, and the horizontal axis represents the number of the action where the social rupture was detected. For the first story, no significant percentages occurred for any action. For the second story, only the last two actions presented significant results. For the third story, the last three actions were identified as representative examples of social norm breakage.

Figure 19: Percentage of students identifying a social norm rupture in an action.

Lastly, an additional question (shown in Table 20) was designed to retrieve the factors contemplated by the respondents to determine their interestingness grading. The results obtained show that 56% of them recognized that breaking a social norm increases the interestingness of a story.

Which factors did you consider to evaluate the interestingness of a story?

Table 20: Question for determining the factors involved when evaluating the interestingness of a story.
Discussion and Conclusions

The values presented for the interestingness of the stories are consistent with the social norms hypothesis, which stated that the rupture of social norms may increase this value. Although the overall interestingness evaluations of the last two stories are similar, the percentage of highest ratings for the third story is significantly greater than for the second, indicating that the third story received the highest scores. According to the results presented, most of the students identified the rupture of social norms in the second and third stories, which is consistent with the purpose of the questionnaire. The implementation of our model was used to validate the model against the actions identified by the respondents. When running the system, no actions were identified as breaking social norms for the first story; the last two actions of the second story broke a social norm because they unjustifiably introduced tensions; and the last three actions of the third story broke social norms as well. The actions identified by the model are consistent with those found by the students in the survey.

We proposed a model to represent, employ and identify social norms in narratives. To identify when a social norm is broken inside a story, two processes are proposed as part of the model. The first is based on a hypothesis presented to identify a threat to the welfare of a social group. The second consists of the identification of any generalized social context inside the story context. The concept of unjustified actions has also been coined: when one of these actions is triggered inside a story, a social norm is considered to be broken. The procedure to identify justified actions is inspired by crime and social norm theories. An action that triggers a moderate or intense tension is considered justified when, previously in the story, another moderate or intense tension was triggered against the action performer, or against any character positively linked to the action performer, by the action receiver. A new kind of action, called social actions, is proposed. These emphasize the presence of a socially relevant action inside a story, and also serve as containers for evaluative clauses, which incorporate author values and valid norms within the scope of the story.

The implementation of the model has been presented to describe its operation. It introduced new computer structures to represent social knowledge and mechanisms to identify when a social norm has been broken within a narrative. The structures described to represent the social knowledge employed by the model are social relations between characters and social contextual structures. The latter structure is particularly interesting because it represents the generalization of contexts where the rupture of social norms was identified. In this way, it becomes feasible for the system to incorporate new social norms into its knowledge structures from the analysis of inspiring stories.

The results obtained from the survey, as well as those retrieved from the analysis of the model of social norms, appear to be aligned with the hypothesis relating social norms to the interestingness of a story. Additionally, a correspondence was detected when comparing the social norms identified by the model with the results from the survey. These results suggest that the information the model incorporates into the process of narrative generation is valuable.
Nevertheless, additional experimentation should still be performed to increase the accuracy of the model and to provide elements that can help in the processes involved in story generation and in the evaluation of the interestingness of the generated stories.

Acknowledgements

This research was sponsored by the National Council of Science and Technology in México (CONACYT), project number 181561.

2014_25 !2014

Creativity in Story Generation From the Ground Up: Non-deterministic Simulation driven by Narrative

Carlos León, Facultad de Informática, Universidad Complutense de Madrid, 28040 Madrid, Spain, cleon@fdi.ucm.es
Pablo Gervás, Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid, 28040 Madrid, Spain, pgervas@sip.ucm.es

Abstract. Creativity in narrative requires careful management of knowledge, but story generation systems focusing on creativity have typically circumvented this level of detail by using high-level descriptions of events and relations. While this has proven effective for plot generation, narrative generation can be drastically enriched with a grounded representation of actions based on low-level simulation. This level of detail and robust knowledge representation can form the basis for a conceptual space exploration driven by narrative knowledge, namely by guiding non-deterministic generation of successive simulation states composing a story. This paper presents an updated version of the story generation system STellA that implements this hybrid model, along with results and discussion on the relative benefits of the described approach.

Introduction

Story generation systems usually operate at a relatively abstract level, focusing on the plot and aggregating details that, if processed at a lower level of granularity, could enrich a story to the point that these details themselves could become sources of new narrative constructions and unexpected plot twists (Turner 1992; Pérez y Pérez 1999; Riedl and Young 2010). This lack of fine-grained detail is usually due to the technical restrictions that currently available knowledge representation models impose on the design of complete story generation systems. Classic knowledge representation methods have proven to set the same limits on the implementation of this kind of system as on many other applications, such as expert systems (Bell 1985) or ontologies (Rosati 2007), to name a few.

Lower-level world-modelling techniques, like simulation, have different features than relation-based knowledge representation. In this context, we consider simulation to be a process in which the whole world is modelled as a complete structure that evolves step by step according to a certain, fully defined set of rules. This definition is broad enough to cover a number of different approaches to knowledge representation in general and plot generation in particular. Simulation-based modelling, as one of these techniques, can provide a good way to represent the information needed for story generation while remaining relatively different from logic-based approaches. Indeed, simulation has been used to model narrative generation, but it has not been widely used to create explicit models of creativity in narrative. This is probably because the most evident use of simulation is the reproduction of the evolution of a static model in order to examine some results, which seemingly contradicts the need for unpredictability, novelty and freedom usually assumed to play a fundamental role in creativity.
The relatively small number of systems that use simulation to model creative processes contrasts with the undeniable success of simulation for gathering results and producing data from grounded models. Seen in the appropriate light, simulation becomes a powerful tool for generating a large number of artifacts, but only if the generative process is able to complement the robust generation of simulation-produced data with techniques that let the generation produce and explore a conceptual space. In fact, simulation has been applied to story generation in several systems, but these have not put the focus on creative generation (Meehan 1977; Theune et al. 2003; Aylett et al. 2005).

This context suggests that enhancing a process grounded in simulation with models already available in Computational Creativity is a promising method for producing grounded data while at the same time exploring a conceptual space. In particular, creative processes heavily influenced by knowledge representation and management, such as story generation, can benefit from the features that both fields offer. Generating a story is a complex process where details can make a huge difference, and simulation can provide this level of detail when used against a proper model. Together with this granularity, explicit means for traversing a conceptual space while trying to generate a story with certain properties can provide a useful pattern for story generation.

This hybrid system mixing simulation and creative exploration for story generation is described in this paper. The current system description is an updated version of the story generation system STellA (Story Telling Algorithm) (León and Gervás 2011) that mixes a non-constrained simulation-based production of world states and narrative actions as source material for a conceptual space exploration engine. The system controls and chooses simulations in a non-deterministically generated space of partial stories until the generation finds a satisfactory progression of simulations that are rendered as a story.

The previous design of STellA did not include a world simulation as a generative solution. Instead, knowledge was represented by means of logic facts and an elaborate set of domain rules. While this approach was carefully structured to permit incremental knowledge inclusion, the engineering effort for modelling the world became too big. We identified that an even more structured representation (a well-defined structure resembling the world model used in simulations) could alleviate the required engineering effort. This paper thus describes the modification of the main generation engine to allow for a simulation-based knowledge representation and world evolution. This includes the design of a new representation system and the creation of a narrative-driven conceptual space exploration based on rules (objectives and constraints) and narrative curves. The previous version of STellA included curves and rules, but the way in which they were used was fundamentally different.
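Before reviewing related work, the notion of simulation used throughout can be fixed with a tiny Python sketch: a complete world structure evolved step by step by a fully defined rule set. The state layout and the single rule below are inventions for exposition only, not STellA's representation:

    # Minimal illustration of simulation as defined above.
    def hunger_rule(state):
        # Every step, each character gets a little hungrier.
        for c in state["characters"].values():
            c["energy"] -= 1
        return state

    RULES = [hunger_rule]        # a certain, fully defined set of rules

    def step(state):
        # One simulation step: apply every rule to the whole world state.
        for rule in RULES:
            state = rule(state)
        return state

    world = {"characters": {"knight": {"energy": 100}}}
    for _ in range(10):          # evolve the world ten steps
        world = step(world)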
Of all existing efforts to build plots, the present review will focus on those that construct a fabula by means of a process of simulating the actions of a set of characters.

The first storytelling system for which there is a record is the Novel Writer system developed by Sheldon Klein (Klein et al. 1973). Novel Writer created murder stories within the context of a weekend party. It relied on a micro-simulation model where the behaviour of individual characters and events were governed by probabilistic rules that progressively changed the state of the simulated world (represented as a semantic network). The flow of the narrative arises from reports on the changing state of the world model. A description of the world in which the story was to take place was provided as input. The particular murderer and victim depended on the character traits specified as input (with an additional random ingredient). The motives arise as a function of the events during the course of the story. The set of rules is highly constraining, and allows for the construction of only one very specific type of story. Overall, Novel Writer operated in a very restricted setting (murder mystery at a weekend party, established in the initial specification of the initial state of the network), with no automated character creation (character traits were specified as input). The world representation allows for reasonably wide modelling of relations between characters. Causality is used by the system to drive the creation of the story (motives arise from events and lead to a murder, for instance) but is not represented explicitly (it is only implicit in the rules of the system). Personality characteristics are explicitly represented but marked as not to be described in output. This suggests that there is a process of selection of what to mention and what to omit, but the model of how to do this is hard-wired in the code.

TALESPIN (Meehan 1977), a system which told stories about the lives of simple woodland creatures, was based on planning: to create a story, a character is given a goal, and then a plan is developed to solve the goal. TALESPIN introduces character goals as triggers for action. Actions are no longer set off directly by satisfaction of their conditions; an initial goal is set, which is decomposed into subgoals and events. The system allows the possibility of having more than one problem-solving character in the story (and it introduced separate goal lists for each of them). The validity of a story is established in terms of: existence of a problem, degree of difficulty in solving the problem, and nature or level of the problem solved.

Lebowitz's UNIVERSE (Lebowitz 1985) modelled the generation of scripts for a succession of TV soap opera episodes (a large cast of characters play out multiple, simultaneous, overlapping stories that never end). UNIVERSE is the first storytelling system to devote special attention to the creation of characters. Complex data structures are presented to represent characters, and a simple algorithm is proposed to fill these in, partly in an automatic way, but the bulk of characterization is left for the user to do by hand. UNIVERSE is aimed at exploring extended story generation, a continuing serial rather than a story with a beginning and an end. It is in the first instance intended as a writer's aid, with additional hopes of later developing it into an autonomous storyteller.
UNIVERSE first addresses a question of procedure in making up a story over a fictional world: whether the world should be built first and then a plot to take place in it, or whether the plot should drive the construction of the world, with characters, locations and objects being created as needed. Lebowitz declares himself in favour of the first option, which is why UNIVERSE includes facilities for creating characters independently of plot, in contrast to Dehn (Dehn 1981), who favoured the second in her AUTHOR program (which was intended to simulate the author's mind as she makes up a story).

The actual story generation process of UNIVERSE (Lebowitz 1985) uses plan-like units (plot fragments) to generate plot outlines. Treatment of dialogue and low-level text generation are explicitly postponed to some later stage. Plot fragments provide narrative methods that achieve goals, but the goals considered here are not character goals but author goals. This is intended to allow the system to lead characters into undertaking actions that they would not have chosen to do as independent agents (to make the story interesting, usually by giving rise to melodramatic conflicts). The system keeps a precedence graph that records how the various pending author goals and plot fragments relate to each other and to events that have been told already. To plan the next stage of the plot, a goal with no missing preconditions is selected and expanded. Search is not depth first, so the system may switch from expanding goals related to one branch of the story to expanding goals for a totally different one. When selecting plot fragments or characters to use in expansion, priority is given to those that achieve extra goals from among those pending.

The line of work initiated by TALESPIN, based on modelling the behaviour of characters, has led to a specific branch of storytellers. Characters are implemented as autonomous intelligent agents that can choose their own actions informed by their internal states (including goals and emotions) and their perception of the environment. Narrative is understood to emerge from the interaction of these characters with one another. While this guarantees coherent plots, Dehn pointed out that a lack of author goals does not necessarily produce very interesting stories. However, it has been found very useful in the context of virtual environments, where the introduction of such agents injects a measure of narrative into an interactive setting. The Virtual Storyteller (Theune et al. 2003) introduces a multi-agent approach to story creation where a specific director agent is introduced to look after the plot. Each agent has its own knowledge base (representing what it knows about the world) and rules to govern its behaviour. In particular, the director agent has basic knowledge about plot structure (that it must have a beginning, a middle, and a happy end) and exercises control over agents' actions in one of three ways: environmental (introducing new characters and objects), motivational (giving characters specific goals), and proscriptive (disallowing a character's intended action). The director has no prescriptive control (it cannot force characters to perform specific actions). Theune et al. report the use of rules to measure issues such as surprise and impressiveness. In general, approaches to Interactive Storytelling involve some degree of simulation as conceived in this work (Aylett et al. 2005; Cavazza, Charles, and Mead 2002; Mateas and Stern 2005).
While every approach models the problem of story generation in a specific way, there exists some degree of similarity in the way they perform, namely by chaining sequential states that are driven or selected by an implicit or explicit model of plot quality.

Knowledge Representation in the Story Generation System: Simulation

Narratives are known to share a relatively high number of constructions with common sense knowledge, and all of its complexity (Schank and Abelson 1977). Elaborate narratives are as complex as common human knowledge, and thus their representation and processing is a long-term problem of Artificial Intelligence. As an example, we can borrow a famous scene from The Hobbit (Tolkien 1972) in which Bilbo Baggins, when trying to win the game of riddles against Gollum, asks himself "What have I got in my pocket?". While the scene may not seem very complex to human cognition, this seemingly simple event carries a huge amount of information that requires a fine-grained representation of characters (property, clothes, value of items), intentions (trying to escape), self-awareness (asking something of himself), emotions (fear), focus and concentration of characters (focusing on something relatively independent from the current context) and many other aspects that confer narrative quality and richness.

The complexity becomes a problem when trying to represent knowledge by classic means. Logic-based knowledge representation methods have been designed since the early years of Artificial Intelligence and, after the initial optimism (revived with the arrival of expert systems), the complexity of such systems became clear, to the point that it is widely accepted that knowledge-intensive systems are limited and their use is restricted to very well known domains (Bell 1985). Many different kinds of formalisms for knowledge representation have appeared over the years (Trentelman 2009; Sloman 1985), but the basic problems of knowledge representation are still present and relatively unsolved (Sowa 2000; Baral 2003). Logic-based knowledge representations have nonetheless been used in several story generation systems, but with very restricted domains (Pérez y Pérez 1999; Bringsjord and Ferrucci 1999). This has classically led to systems that perform well on their respective merits and contributions, but a large body of rich stories has not been produced so far.

In order to partially tackle this issue, the presented version of STellA follows the hypothesis that grounding knowledge representation as much as possible is decisive in allowing a story generation system to produce rich content. A rich representation complemented by conceptual space exploration guided by narrative is proposed as a solution for creative story generation. According to this hypothesis, making the simulation more complex could provide more complex worlds and interactions and therefore create a larger conceptual space traversable by the narrative-based driving engine. The system will hypothetically be able to generate many different stories and partially identify which ones are better according to a set of given objectives.

Grounding Knowledge for Storytelling

For the simulation engine to be able to produce states containing content suitable for narrative generation, an appropriate grounded representation and a corresponding set of rules for creating that information are needed. This is a new addition to STellA.
Grounding knowledge representation for story generation requires a low-level definition of concepts that are usually defined more abstractly by most other generation systems (Turner 1992; Pérez y Pérez 1999; Bringsjord and Ferrucci 1999). This requires additional effort from the start, since the usual constructions inherited from logic, such as in(knight, room), must be refined so as to represent data better suited for simulation. In the previous example, in order to represent exact position, the data would have to become position(knight) = (10, 20), assuming that (10, 20) is a valid coordinate inside the room. This is the kind of knowledge representation that the proposed system uses.

This approach requires a fixed representation in which every construction or relation is grounded, in the sense that the system includes mechanisms to process that construction internally. This grounding permits meta-representation of the world, which means that a mental state of the world, for instance, can be represented using the same formalism. The meta-representation STellA is provided with makes knowledge representation possible at two different levels: first, characters' reasoning uses a set of rules that manage incomplete knowledge (characters can ignore aspects of their surrounding context); then, the same set of rules is applied to the simulated world, in which there is no uncertain information, since the whole state is available. This implies a relatively reduced engineering effort compared with the maintenance of two different rule sets.

Domain rules are a decisive part of this model. Narrative generation is a knowledge-hungry process, and any domain model is by definition incomplete (given the requirements of narrative, completeness would imply modelling all human knowledge). This makes it almost impossible to recreate the needed amount of information in a single prototype, thus imposing the need to design a flexible, improvable system and to let it evolve over time to manage a richer set of knowledge constructions. In order to keep the rule set maintainable, rule coupling has been reduced to a minimum in terms of the structure of the rule set. Rules are organized in a linear way, meaning that no hierarchical topology is imposed on the design. This lets the maintainer include new rules without taking a big structure into account. Additionally, rules can be enabled or disabled at will without affecting the rest of the system, since no rule is dependent on any other by design. The semantic coupling between rules still exists, but it is kept to a minimum.

For this independence of rules to be possible, a domain-specific language for rules has been included as part of the generation engine. The rules can query the world state and output actions that represent changes in the story, as the next section explains. Querying only the current state limits the scope in which rules can act, which constrains rule creation and makes rules easier to produce. Rules cannot examine the story, only the current simulation; in this way, narrative processes are isolated. STellA offers a set of primitives for querying the current story so that these rules can be created without knowing the representation details.
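A minimal sketch of what such a grounded, query-friendly state might look like follows; the Python layout and the humans-style helper are illustrative assumptions made here, not STellA's actual interface:

    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        kind: str                  # e.g. "knight", "creature"
        position: tuple            # grounded exact coordinates, not in(x, room)
        energy: int = 100
        items: set = field(default_factory=set)

    @dataclass
    class State:
        entities: dict             # name -> Entity

    def humans(state: State) -> list:
        # A query primitive in the spirit of humans(story) in Figure 1.
        return [e for e in state.entities.values() if e.kind != "creature"]

    # The abstract fact in(knight, room) becomes a grounded coordinate:
    state = State(entities={"knight0": Entity("knight", (10, 20))})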
Figure 1 shows an example of an objective rule for creating the story, using the story-querying primitives that the current version of the system provides the user with.

    finished(story)
        hs = humans(story)
        hsd = inDungeon(story, hs)
        length(hsd) == 0

Figure 1: Example of an objective rule for the story generation process. A story must satisfy this rule to be valid.

Rules are able to cope with incomplete knowledge in the generation system, which is also a new addition in this updated version of STellA. The meta-representation of the world that characters have can be incomplete, and thus some properties of the internal representations can have the value uncertain. When characters reason to decide their next action, they use a simple unification mechanism to instantiate the uncertain value with potentially valid grounded values. For instance, a character who ignores whether an enemy is equipped with a weapon searches over the possibilities and acts according to the first plausible solution. More powerful inference techniques will be used in future versions.

Non-deterministic Generation of Narrative Actions

If the described simulation process generated only one single sequence of actions and corresponding states, the room for creativity would be marginal. According to most frameworks of computational and non-computational creativity, the creation or exploration of a conceptual space, trying to produce unexpected and valuable artifacts, is a decisive part of the creative process (Boden 1999; 2003). This update of STellA performs the exploration of the corresponding conceptual space generatively, that is, by iteratively creating new states for subsequent simulation. This has been modelled and implemented as a non-deterministic process in which a certain simulation step can yield not one but many steps. From a classical Artificial Intelligence perspective, the conceptual space generated by STellA is a tree rooted in the original state (the base state from which the generation happens). Each intermediate node of the conceptual tree contains a partial simulation state that, when processed, generates possibly many candidate states that can be subsequently expanded, in this way modelling non-determinism.

While state exploration works for expanding the conceptual space, connecting the simulation with the creation of a narrative structure requires a more detailed process. The grounded data coming from each generation step must be processed carefully because the state changes that a simulation step yields are heterogeneous from a narrative perspective. The changes happening from one simulation state to the next that are produced in the non-deterministic expansions are referred to as narrative actions, which are a new addition to STellA. During the development of the described system the number of these actions has grown as more different kinds were detected. It is important to note that the way in which simulation is implemented in STellA affects the kinds of actions that are produced and thus their identification, but the following list is likely to be applicable to other approaches as well:

Character perception actions define the parts of the simulation that are perceived by the characters. This includes perceiving the surrounding objects, being aware of health and position, updating or forgetting the position of an object that has moved, and so on. The generation of these actions is currently modelled as a non-deterministic process in which perceptions have a probability of happening. The algorithm then orders perceptions by probability, creating sets of perceived elements non-deterministically.
Perception actions are the link between the complete world existing in the simulation and the inner representation of it that every agent in the story (every character) has.

Deus ex actions are generated without any causal requirement. They must be consistent with the current state, but do not need to respond to any need of the character model. Deus ex actions model events that are too serendipitous to need a detailed model, like a character stumbling over a rock when running, or rain starting to fall. These actions are generated non-deterministically, and the probability of happening in their definition is used by the generator to order them by their chance of occurring and not by pure randomness. This has been designed so as to keep a complete model that does not depend on random numbers.

Character desires actions are the output of a reasoning process that emulates character decisions. These decisions include eating if the character is hungry, trying to escape an enemy or maybe attacking him or her. These actions confer a relative degree of believability (Riedl 2004). Character desires actions, which are generated in a non-deterministic way, have both an associated probability and a priority. This priority is used by the characters in the next step of the simulation to order desires and try to satisfy the most prioritized ones first.

Character intentions complete desires and perception so as to reproduce a classic agent-like narrative model (Bratman 1987). Intentions are generated according to perceptions (beliefs in the classic model) and desires, which means that the representation of the external world is not taken into account when creating intentions (only the character's internal representation). This allows for a simpler creation of rules, since less information must be taken into account. Character intentions actions are non-deterministic too and have an associated probability just like the other kinds of actions. Trying to go to some location that the character desires to be in, or trying to attack the enemy that the character desires to be dead, are examples of intentions. The difference between doing and trying to do is subtle but very influential in narrative generation, since it permits richer character interaction.

Physical world actions are non-deterministic and model the causality of physical events that, under certain conditions, will happen with a certain probability. Things that fall to the ground if nothing holds them, or an object that moves if it is pushed with enough force, are examples of physical world actions. This kind of action has the additional role of representing the success or failure of character intentions. In this way, a character can try an action and the physical state will decide whether the intention succeeded or not.

This division makes sense from the point of view of story generation. The focus and detail on character behaviour is clear and considered to be very important in narrative. This is complemented with serendipitous events and world physics in a broad sense. Probabilities are used to order actions in such a way that the main algorithm produces candidate updated versions of the current state of the simulation and gives priority to the most likely ones. Creativity can be explored by choosing less likely states, which is planned as part of the future enhancements of STellA. These five kinds of narrative actions are extracted from the simulation.
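Schematically, one simulation step can be pictured as bundling these five action sets together with the updated state. The following Python sketch is illustrative only; the generators are hard-coded stubs standing in for STellA's probability-ordered rule firing:

    from typing import NamedTuple

    class StepResult(NamedTuple):
        state: dict        # updated world state
        deus_ex: set       # e
        perceptions: set   # p
        desires: set       # d
        intentions: set    # i
        physical: set      # w

    def candidate_action_sets(state):
        # Stub: yields a single hard-coded candidate combination.
        yield (set(), {"knight sees creature"}, {"knight wants food"},
               {"knight tries to move"}, {"knight moves"})

    def apply_actions(state, e, p, d, i, w):
        return dict(state)  # stub: a real step would update positions, energy...

    def expand(state):
        # Non-deterministic expansion: every candidate combination of the
        # five action sets yields one possible successor.
        return [StepResult(apply_actions(state, *sets), *sets)
                for sets in candidate_action_sets(state)]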
Formally speaking, each step of the simulation non-deterministically yields a set of new states along with their corresponding actions. This can be described as a tuple

    (state, e, p, d, i, w)

where state is the current state of the simulation, e is the set of deus ex actions generated in that step, p is the set of character perception actions, d is the set of character desires actions, i is the set of character intentions actions, and w is the set of physical world actions. A fabula generated by STellA is then a list of tuples:

    [(state, e, p, d, i, w)]

The generation can be represented formally in terms of a generative function g that accepts a state and returns a non-deterministic set of tuples:

    g(state) = {(state_0, e_0, p_0, d_0, i_0, w_0),
                (state_1, e_1, p_1, d_1, i_1, w_1),
                ...
                (state_n, e_n, p_n, d_n, i_n, w_n)}

Having explained and formalized how to generate a conceptual space of stories from a grounded simulation, it is still necessary to complete the system by including a way to traverse this space and find valuable artifacts, namely valid stories.

Narrative Drives the Simulation: Curves, Objectives and Constraints

Simulation is a flexible and powerful tool for representing the state of a story and the transitions between states. However, producing a sequence of states that, when appropriately rendered, is acceptable as a narrative requires control over the generation. STellA uses three types of mechanisms to drive the simulation: objectives, constraints and narrative curves.

The generation process is fed with a set of objectives that the story must satisfy in order to be accepted as finished and valuable by the system. This version of the story generation system models objectives as a group of boolean functions receiving a story. The user can thus use these to create declarative definitions of the kind of story wanted. Objectives are used post-hoc: when a partial story is reached by the system, it is checked against the set of objectives, and all of them must accept the story as valid. Figure 1 shows an example.

Along with objectives, the system needs means to restrict the generation. Non-determinism in story generation is a powerful modelling tool, but unrestricted production of stories degenerates into a very big conceptual space whose full traversal is intractable (León and Gervás 2010). This holds not only from a computational perspective but also from the point of view of creativity in story generation: the set of stories that can be generated from any starting state is very large. This characteristic is inherent to the domain of story production and cannot be eluded. The computational generation can, however, filter out those intermediate states that are not promising and should not be explored, as humans seemingly do (Sharples 1999). The current model uses constraints to avoid exploring branches of the traversal process that are unpromising. The implementation of constraints is analogous to the implementation of objectives, as constraints are defined in terms of declarative rules using the same kind of formalism and query primitives. Constraints, however, are used during the expansion of new states to be simulated and forbid the exploration of those candidate states that do not satisfy them. Constraints therefore have a less strict definition than objectives.
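The contrast between the two kinds of rule can be sketched as two boolean functions over stories. This is a loose Python paraphrase of Figures 1 and 3, under the assumption (mine, for illustration) that each state is a dict carrying precomputed fields:

    # Objective: checked post-hoc over a story that is a candidate for
    # being finished -- a loose paraphrase of Figure 1.
    def objective_no_humans_left(story: list) -> bool:
        final_state = story[-1]
        return len(final_state["humans_in_dungeon"]) == 0

    # Constraint: checked during expansion to prune unpromising branches --
    # a loose paraphrase of the group-cohesion constraint of Figure 3.
    THRESHOLD = 5.0

    def constraint_group_together(partial_story: list) -> bool:
        distances = partial_story[-1]["pairwise_human_distances"]
        if not distances:
            return True
        return sum(distances) / len(distances) <= THRESHOLD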
In practical terms, constraints are usually less restrictive with regard to their scope: experience suggests that constraints are defined in terms of specific features that a story should not have, while objectives tend to describe general aspects of a narration. Figure 3, showing an example of a constraint, exemplifies this.

    promising(story)
        hs = humans(story)
        ∀ h_i ∈ hs : ∀ h_d ∈ hs − {h_i} :
            d_i = distance(story, h_i, h_d)
        av = average(d_0, d_1, ..., d_n)
        av <= threshold

Figure 3: Example of a constraint rule. A partial story not satisfying a constraint rule will not be accepted as promising and its corresponding state will not be explored.

STellA uses a generalized version of tension curves to drive story generation. The design of these curves as a way to drive plot generation has been studied in previous versions of STellA (León and Gervás 2011; León and Gervás 2012). The main objective underlying this method is to represent the evolution of a set of narrative properties of a story as curves. As the conceptual space is traversed to find a suitable story, this evolution is iteratively compared with a set of objective curves. This comparison informs the traversal at every step, and this information can be used as an additional source for deciding when a partial story is promising and whether a story is finished.

Previous versions of STellA also considered these methods for plot generation, but they were applied differently. Objectives and constraints were not as powerful as they are in this version, regarding both their expressive power and their scope. While the current version allows for evaluation of a complete story, previously only states were considered; additionally, full access to the world representation is now allowed. Curves now have a more general definition, since they are defined in terms of generic metrics (distances, average values and others), whereas previous versions needed more elaborate definitions. This has been made easier by the use of a simulation-based representation.

Algorithm 1 describes the overall generation algorithm. The non-determinism occurs, as previously described, when generating candidate sets of deus ex, character desires and character intentions actions. The generation algorithm iterates until a satisfying story is found, and filters out those exploratory branches that are unpromising according to the constraints imposed on the execution.

    Data: the current partial story [(state, e, p, d, i, w)], objective
          curves, objective function, constraint function
    Result: a set of candidate new tuples
    while current story is not finished according to curves and objectives do
        s  <- last state tuple from current story
        p  <- non-det perception for s, ordered by probability
        e  <- non-det deus ex for s, ordered by probability
        d  <- non-det desire for s, ordered by probability
        i  <- non-det intention for s, ordered by probability
        w  <- non-det physical world for s, ordered by probability
        s' <- apply (e, p, d, i, w) to s
        curves_s' <- compute current curves for s'
        new story <- current story + s'
        if curves_s' match the objective curves and new story satisfies
        constraints then
            foreach s' do explore generation from s'
        else
            reject s'
    end
    return current story

Algorithm 1: Story generation algorithm in STellA
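A compact Python paraphrase of Algorithm 1 might look as follows. This is a sketch under the assumption that the non-deterministic expansion, curve comparison and constraint check are supplied as callables (none of the names below are STellA's own), and it follows a single greedy branch, whereas the real algorithm keeps several branches open:

    def generate(story, expand, finished, satisfies_constraints, curves_match):
        # story: list of (state, e, p, d, i, w) tuples; expand(state)
        # yields probability-ordered candidate successor tuples.
        while not finished(story):
            state = story[-1][0]
            for candidate in expand(state):
                new_story = story + [candidate]
                if satisfies_constraints(new_story) and curves_match(new_story):
                    story = new_story     # keep exploring from this branch
                    break
            else:
                break                     # all candidates rejected: dead end
        return story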
Example Output

The described model has been implemented in three main modules: 1. the core engine for generating stories, containing the non-deterministic algorithms and basic narrative data structures; 2. the simulation engine, defining the basic data structures and rules for the simulation to happen; 3. the set of rules, both for generating actions and for defining story objectives.

The core engine (1) corresponds to the implementation of Algorithm 1, and the simulation engine (2) has been implemented according to the model previously described. A rule set (3) for an example prototype has been created for demonstration purposes. This rule set and the sample world place the action in a dungeon from which humans must escape. The simulated world is a two-dimensional grid in which every entity is placed in one single cell. The basic actions of characters are: move in eight directions, attack adjacent enemies, eat food, escape, protect themselves and others, take and drop objects, and apply objects to other entities (for healing an ally, for instance). Characters and creatures can sense their surroundings and use an A*-based pathfinder to go from one place to another. Characters lose energy from being injured and from doing things. The initial state includes 3 humans (located at one edge of the dungeon) and 5 creatures (located at the opposite edge, near the exit). Humans desire to escape, and creatures are hungry and will try to eat the humans. Food, shields and weapons are spread out over the dungeon (10 items in total). The layout of the dungeon and the location of objects have been randomized.

Three objective curves have been used to drive the generation in this example. These curves have simple definitions and try to capture the evolution of measurable aspects of the story that, in the current domain, match to some extent specific features of the narrative arc:

danger, the perceived danger in the story, computed as the mean distance between humans and creatures.

success, the level of success of the characters, computed as the difference between the number of humans that have escaped the dungeon and the number of humans that have died.

richness, an additional measurement to ensure that the generation is rich enough, computed as the number of different actions that happen in the story. Richness avoids monotonous stories in which characters just find their way to the exit without any conflict.

The input objective curves for the generation are monotonically increasing lines for danger, richness and success, forcing the generation to produce a story with an ending in which many things have happened (richness), the creatures surround the characters at the end (danger) and all characters escape (success). In order to keep the demonstration prototype simple, one single objective function has been used: no humans must remain in the dungeon (Figure 1). Analogously, the only constraint used for the example forbids states in which the group of humans splits up: the average distance between humans must be lower than a certain threshold (Figure 3).
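These three curve metrics reduce to simple functions of the story state. A rough Python sketch follows, with a toy state layout assumed here rather than STellA's:

    from math import dist  # Euclidean distance, Python 3.8+

    def danger(state) -> float:
        # Mean distance between humans and creatures (the paper's definition).
        pairs = [dist(h, c) for h in state["human_positions"]
                            for c in state["creature_positions"]]
        return sum(pairs) / len(pairs) if pairs else 0.0

    def success(state) -> int:
        # Humans that have escaped minus humans that have died.
        return state["escaped"] - state["dead"]

    def richness(story) -> int:
        # Number of distinct actions seen so far in the whole story,
        # where story is a list of (state, e, p, d, i, w) tuples.
        return len({a for (_state, *action_sets) in story
                      for acts in action_sets for a in acts})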
An example execution would start as follows. The generation starts as shown in Algorithm 1. First, the initial state is tested against the objective function, which is not satisfied because there are 3 humans in the dungeon. Perception actions are computed, and every cognitive entity (humans and creatures) updates its internal representation of the world with its surrounding area. Deus ex rules are processed and no action is triggered; then desire rules are examined. A human with low energy desires to get food with a high priority (escaping is postponed) and the other two still decide to escape. All creatures decide to look for food. When intention actions are generated, all characters decide to move to find what they desire, and this move is realized as a successful physical action because no obstacle limits their movement. After this step, the current values for the objective curves are computed and compared against the objective curves. The difference between the current and the objective curves is acceptable to the system (on the first step the resulting comparison is negligible according to the thresholds). This state is thus valid, and other new candidates from the initial state are similarly generated and filtered. Then one of these states is chosen (the current prototype chooses the one with the highest number of actions) and the generation continues until the system has found a satisfying story. Then the sequence of states and their corresponding actions is converted into a textual story.

The rendering of the generated fabula as a discourse has been carried out with simple, ad-hoc rules to improve the apparent result. Figure 5 shows an example. Some redundant, easy-to-infer events and states were filtered (Figure 4), and sequential order was used (that is, events are told in the same order as they occur). The focus of the current prototype has not been on the quality of the discourse, and only a simple method has been used. Better narrative discourse planning, however, will be tackled in future versions of STellA. Figure 6 shows a fragment of the rendered output. The fragment has been selected by hand, but the whole story has been taken as-is without any form of curation or human intervention. Figure 7 shows part of the underlying representation corresponding to the text in Figure 6. The example shown corresponds to the sentence "the knight was hungry".

The fragment chosen and shown in Figure 6 exemplifies the level of detail that STellA is able to achieve. Specific focus on some generated events can shed more light on what STellA is able to do. For instance, when the knight is injured by the attack of the red creature, a new set of possible next steps in the simulation is generated. In some of them, the barbarian is not aware of the event and thus there is no reaction. According to the rules, these have a low probability of happening because the barbarian and the knight are nearby. In some others, chosen first because of their higher probability, the barbarian detects the attack. Since there is a rule stating that humans defend themselves against the creatures, the barbarian could non-deterministically choose what to do: either defend or ignore the knight. The system performs a space search to choose the best option among these two, that is, it non-deterministically explores partial simulations from the current one and chooses the chain that fits the curves better. Since defending the knight maximizes the number of living heroes, that option is chosen. In this way, the simulation and the narrative-based conceptual space search produce rich, meaningful stories.

    [...]
    if event action is "pass" then filter event end
    if event action is "move" and character does not face enemy then filter event end
    if event action is "get tired" and character's energy > 100 then filter event end
    [...]

Figure 4: Simple event filtering for demonstration purposes. The current prototype includes ad-hoc rules for filtering redundant or excessively detailed events.
The grounded representation allows a fine level of granularity in the action, and the narrative information leads to relatively interesting scenes according to the formal metrics described in terms of narrative curves and the specific requirements encoded as objectives and constraints. Generating detailed interactions can provide rich content that an accurate discourse planner can aggregate where needed. However, this does not mean that any form of verbose or redundant generation can be easily fixed by a discourse planner. The content generator should be able to provide reasonably meaningful and useful content, letting the discourse planner decide what is relevant for each kind of discourse.

Discussion

The empirical evidence gathered during development suggests that the initial effort needed for grounding knowledge pays off soon. While more research and comparable measurements are needed to make any strong claim, the relative effort of including rules in the system decreases as the system evolves.

As previously detailed, many simulation-based story generation systems have already been created. STellA contributes to the field by focusing on creativity and the exploration of a conceptual space. More specifically, several studied story generation systems perform a guided simulation in which some sort of general objectives (be they author or character goals) are pursued and fulfilled in a valid story (Lebowitz 1985; Dehn 1981; Theune et al. 2003). While the conjunction of goals and simulation links these systems with the presented version of STellA, the approach taken here is conceptually different: the simulation happens with no narrative information and is left to progress non-deterministically, thus producing a growing tree of plausible states. Narrative is only included as an external process in which these successive simulations are selected as partial artifacts in the conceptual space. This puts a clear division between content generation with robust grounded generation and detailed filtering based on narrative rules. This somewhat resembles the engagement and reflection model described by Sharples (Sharples 1999) and implemented in MEXICA (Pérez y Pérez 1999), in the sense that a model of creativity receives the focus. Other story generation systems rely on the underlying narrative-like features of logging the simulation of character actions and put little or no effort into making an explicit narrative model (Klein et al. 1973; Meehan 1977). This clearly contrasts with the approach taken by STellA, which specifically focuses on using narrative to control which simulations are plausible according to the current objectives. STellA explicitly addresses creativity both as a model and as an objective.

    [...]
    if kindOf(entity) = knight then print "the knight " end
    if energy(entity) < 1500 then print "was hungry" end
    if energy(entity) < 1500 then
        print "blocked "
        print attackerOf(entity)
        print " with "
        print objectDefense(entity)
    end
    [...]

Figure 5: Example rule for discourse and textual generation in STellA. The current version produces simple text for demonstration purposes.
From a theoretical point of view, and according to the framework described by Boden (Boden 2003) and formalized by Wiggins (Wiggins 2006), the non-deterministic simulation process would generate the conceptual space, and the mechanisms described to select and filter states would match the definition of the traversal function. The evaluation function would be composed of a mix of the curves and the objective function.

The current prototype, however, does not reach any high form of narrative creativity. The kind of story generation that STellA tries to achieve necessarily implies a complex management of knowledge and narrative structures. Before trying to create highly valuable stories, the detailed development line tries to build a robust framework that can be further improved with more knowledge. The preliminary results show that the world representation can be made richer by simulation, and that a creative process can be modelled by non-deterministic generation and explicit filtering and identification of valuable artifacts.

[...] the knight was hungry. the barbarian was injured. the knight desired to protect the barbarian. the green creature wanted to eat the barbarian. the green creature tried to attack the barbarian. the knight blocked the green creature with the shield. the red creature tried to attack the knight. the red creature succeeded when trying to attack the knight. the knight was injured. the barbarian desired to protect the knight. the barbarian used the healing potion on the knight. the barbarian desired to attack the green creature. the knight desired to protect the barbarian. the green creature tried to attack the barbarian. the knight failed to block the green creature with the shield. the green creature succeeded when trying to attack the barbarian. the barbarian died. the knight took the sword. the knight desired to attack the green creature. the knight tried to attack the green creature. the knight succeeded when trying to attack the green creature. the green creature died. [...]

Figure 6: Fragment of a resulting story generated by STellA after the narrative-driven simulation process.

    knight0 : {
        position : (5, 51),
        energy : 1288,
        desire : { desire : escape, agent : knight0 },
        items : {shield0},
        kindOf : knight,
        strength : 100,
        speed : 3,
        sight : 7,
        weight : 90,
        known : { knight0 : {...}, creature0 : {...},
                  wall26 : {...}, wall27 : {...}, [...] }
    }

Figure 7: Fragment of the underlying representation corresponding to the text in Figure 6.

Conclusions and Future Work

Simulation is a powerful tool for modelling interactions and can produce grounded information. This information, when properly identified, can be used to drive story generation, if enriched with narrative knowledge, and to generate a conceptual space of stories. This paper has described the development of an updated version of STellA, a story generation system implementing this model, which mixes simulation and conceptual space exploration driven by narrative constructions. An example output generated by the current implementation is described, and the relative benefits and drawbacks of the proposed solution are discussed. The system will continue to be developed according to the discussed assumptions, namely that generating successive story states by simulating relations between characters, and constructing a conceptual space by using narrative information, is a plausible method for generating rich stories that can be deemed creative by unbiased observers (Colton and Wiggins 2012).
Thorough work, however, is still to be done for the system to fully support these assumptions: the simulation must support richer constructions, and the generation process based on narrative must be improved with more general information about narrative, probably with general models borrowed from narratology. Studying how driven non-determinism and probabilities can lead to better results in terms of novelty is a key aspect of the future improvements of STellA. The future work contemplates producing and evaluating stories that include unlikely events in such a way that novelty and quality are ensured to some measurable extent.

Acknowledgments

This paper has been partially supported by the projects WHIM 611560 and PROSECCO 600653, funded by the European Commission, Framework Program 7, the ICT theme, and the Future and Emerging Technologies (FET) program.

2014_26 !2014

Baseline Methods for Automated Fictional Ideation

Maria Teresa Llano, Rose Hepworth, Simon Colton, Jeremy Gow and John Charnley, Computational Creativity Group, Department of Computing, Goldsmiths, University of London
Nada Lavrač, Martin Žnidaršič and Matic Perovšek, Department of Knowledge Technologies, Jožef Stefan Institute
Mark Granroth-Wilding and Stephen Clark, Computer Laboratory, University of Cambridge

Abstract. The invention of fictional ideas (ideation) is often a central process in the creative production of artefacts such as poems, music and paintings, but has barely been studied in the Computational Creativity community. We present here three baseline approaches for automated fictional ideation, using methods which invert and alter facts from the ConceptNet and ReVerb databases, and perform bisociative discovery. For each method, we present a curation analysis, by calculating the proportion of ideas which pass a typicality evaluation. We further evaluate one ideation approach through a crowd-sourcing experiment in which participants were asked to rank ideas. The results from this study, and the baseline methods and methodologies presented here, constitute a firm basis on which to build more sophisticated models for automated ideation with evaluative capacity.

Introduction

Ideation is a portmanteau word used to describe the process of generating a novel idea of value. Fictional ideation therefore describes the production of ideas which are not meant to represent or describe a current truth about the world, but rather something that is, in part or entirely, imaginary. As such, their purposes include unearthing new truths and serving as the basis for cultural creations like stories, advertisements, poems, paintings, games and other artefacts. Automated techniques for the derivation of new concepts have been important in Artificial Intelligence approaches, most notably machine learning. However, the projects employing such techniques have almost exclusively been applied to finding concepts which somehow characterise reality, rather than some fictional universe. While some concepts may be purported as factual, i.e., supported by sufficient evidence, others may only be hypothesised to be true. In either case, however, the point of the exercise is to learn more about the real world through analysis of real data, rather than to invent fictions for cultural consumption. A major sub-field of Computational Creativity research involves designing software that exhibits behaviours perceived as creative by unbiased observers (Colton and Wiggins 2012).
However, in the majority of the generative systems developed so far within Computational Creativity research, no idea generation is undertaken explicitly. An exception to this was (Pereira 2007), who implemented a system based on the psychological theory of Conceptual Blending put forward by Fauconnier and Turner (2008). By blending two theories about different subject material, novel concepts which exist in neither domain emerge from the approach. Using blending to reason about such fictional ideas was harnessed for various creative purposes, including natural language generation (Pereira and Gervás 2003), sound design (Martins et al. 2004), and the invention of character models for video games (Pereira and Cardoso 2003). Similarly, the ISAAC system (Moorman and Ram 1996) implements a theory of creative understanding based on the use of an ontology to represent the dimensions of concepts. By altering the dimensions of existing concepts within the ontology, for instance considering a temporal object as a physical one, the system is able to create novel concepts. In addition, in some projects, especially ones with application to natural language generation such as neologism production (Veale 2006), which are communicative in nature, it is entirely possible to extract ideas from the artefacts produced. However, it is fair to say that such software is not performing ideation to produce artefacts, but is rather producing artefacts that can be interpreted by the reader via new ideas. The work in (Goel 2013) shows the use of creative analogies in which problems of environmental sustainability are addressed by creating designs inspired by the way things work in nature. For instance, birds' beaks inspired the design of trains with noise reduction. Although ideation here is being used for inspiration and not to create literal representations, this work shows the potential of using creative analogies for fictional ideation.

As part of the WHIM project1 (an acronym for the What-if Machine), we are undertaking the first large-scale study of how software can invent, evaluate and express fictional ideas. In the next section, we present three straightforward approaches to fictional ideation which manipulate material from internet sources. These will act as our baseline against which more sophisticated ideation methods will be tested as the project progresses. In order to draw that baseline, we conducted a curation analysis of the ideas produced by each method, whereby we calculated the proportion of ideas which were typical in the sense of being both understandable and largely fictional, with details given below. We also present here a baseline methodology for estimating the true value of the ideas produced by our systems. To do this, we conducted a crowd-sourcing exercise involving 135 participants, where people were exposed to ideas in a controlled way, with the aim of evaluating components of ideas that could be used to predict overall value. A good fictional idea distorts the world view around it in useful ways, and these distortions can be exploited to spark new ideas, to interrogate consequences and to tell stories. A central hypothesis of the WHIM project is that the narrative potential of an idea can be estimated automatically, and used as a reliable estimate of the idea's worth.

1 www.whim-project.eu

Figure 1: Ideation flowcharts using ConceptNet.

Hence the crowd-sourcing study had narrative potential as a focal point, and we tested an automated approach which estimates whether an idea has much narrative potential, or little. As discussed below, we found that, in general, people ranked those ideas that were assessed as having much potential higher than those assessed as having little. We present further statistical analysis of the results, which enables us to conclude by describing future directions for the WHIM project.

Baseline Ideation Methods

We investigate here three methods which use data mined from the internet for generating What-if style fictional ideas. In the next section, we analyse the results from each method.

Fictional Ideation using ConceptNet

ConceptNet2 is a semantic network of common sense knowledge produced by sophisticated web mining techniques at the MIT Media Lab (Liu and Singh 2004). Mined knowledge is represented as facts, which comprise relations between concepts in a network-like structure, e.g., [camel, IsA, animal, 7.0], [animal, CapableOf, hear sound, 2.0]. Currently, ConceptNet has 49 relations, including UsedFor, IsA, AtLocation, Desires, etc., and each fact is given a score, from 0.5 upwards, which estimates the likelihood of the relation being true, based on the amount of evidence mined. We have studied fictional ideation by inverting the world view modelled by ConceptNet, i.e., facts are transformed by negating their relations. For example, this can be done by introducing an action which was not previously possible, e.g., people can't fly becomes What if people could fly?, or by stopping an action or desire which was previously common, e.g., people need to eat becomes What if people no longer needed to eat?, etc. We investigated various inversion methods such as these, carried out using the FloWr flowcharting system described in (Charnley, Colton, and Llano 2014).

2 conceptnet5.media.mit.edu

Working in a story-generation context, we took inspiration from the opening line of Franz Kafka's 1915 novella The Metamorphosis: "One morning, as Gregor Samsa was waking up from anxious dreams, he discovered that in his bed he had been changed into a monstrous verminous bug." In Figure 1, we present five flowcharts we used to generate ideas by inverting and combining ConceptNet facts about people, animals, vegetables and materials. Flowchart A finds instances of animals by searching ConceptNet for facts [X, IsA, animal]. These are then rendered in the TemplateCombiner process as questions of the form: What if there was a person who was half man and half X? Flowchart B employs ConceptNet similarly, then uses a WordListCategoriser process to remove outliers such as [my husband, IsA, animal]. Then, for a given animal A, facts of the form [A, CapableOf, B] are identified and rendered as: What if there was a person who was half man and half X, who could Y? Switching the CapableOf relation to NotCapableOf enabled us to produce ideas suggesting a person who became an animal, but retained some human qualities. We augmented this by using the LocatedNear relation (not shown in Figure 1) to add a geographical context to the situation, producing ideas such as What if a woman awoke in the sky to find she had transformed into a bird, but she could still speak? We found that these ideas had much resonance with the premise of The Metamorphosis.
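Stripped of the FloWr machinery, the core inversion step can be sketched in a few lines of Python. The toy facts and rendering templates below are assumptions made here for illustration, not the actual flowchart code:

    # Toy facts in ConceptNet's [concept, relation, concept, score] shape.
    FACTS = [
        ("people", "CapableOf", "eat", 4.0),
        ("bug", "CapableOf", "fly", 3.5),
    ]

    # Hypothetical rendering templates, one per negated relation.
    NEGATED = {
        "CapableOf": "What if {x} could no longer {y}?",
        "Desires": "What if {x} no longer wanted to {y}?",
    }

    for x, rel, y, _score in FACTS:
        print(NEGATED[rel].format(x=x, y=y))
    # -> What if people could no longer eat?
    # -> What if bug could no longer fly?   (surface fixes such as articles
    #    and agreement are left to later rendering stages)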
Hence the crowd-sourcing study had narrative potential as a focal point, and we tested an automated approach which estimates whether an idea has much narrative potential, or little. As discussed below, we found that, in general, people ranked those ideas that were assessed as having much potential higher than those assessed as having little. We present further statistical analysis of the results, which enables us to conclude by describing future directions for the WHIM project. Baseline Ideation Methods We investigate here three methods which use data mined from the internet for generating What-if style fictional ideas. In the next section, we analyse the results from each method. Fictional Ideation using ConceptNet ConceptNet (conceptnet5.media.mit.edu) is a semantic network of common sense knowledge produced by sophisticated web mining techniques at the MIT Media Lab (Liu and Singh 2004). Mined knowledge is represented as facts, which comprise relations between concepts in a network-like structure, e.g., [camel, IsA, animal, 7.0], [animal, CapableOf, hear sound, 2.0]. Currently, ConceptNet has 49 relations, including UsedFor, IsA, AtLocation, Desires, etc., and each fact is given a score, from 0.5 upwards, which estimates the likelihood of the relation being true, based on the amount of evidence mined. We have studied fictional ideation by inverting the world view modelled by ConceptNet, i.e., facts are transformed by negating their relations. For example, this can be done by introducing an action which was not previously possible, e.g., 'people can't fly' becomes 'What if people could fly?', or by stopping an action or desire which was previously common, e.g., 'people need to eat' becomes 'What if people no longer needed to eat?', etc. We investigated various inversion methods such as these, carried out using the FloWr flowcharting system described in (Charnley, Colton, and Llano 2014). Working in a story-generation context, we took inspiration from the opening line of Franz Kafka's 1915 novella The Metamorphosis: 'One morning, as Gregor Samsa was waking up from anxious dreams, he discovered that in his bed he had been changed into a monstrous verminous bug.' In figure 1, we present five flowcharts we used to generate ideas by inverting and combining ConceptNet facts about people, animals, vegetables and materials. Flowchart A finds instances of animals by searching ConceptNet for facts [X, IsA, animal]. These are then rendered in the TemplateCombiner process as questions of the form: 'What if there was a person who was half man and half X?' Flowchart B employs ConceptNet similarly, then uses a WordListCategoriser process to remove outliers such as [my husband, IsA, animal]. Then, for a given animal, A, facts of the form [A, CapableOf, B] are identified and rendered as: 'What if there was a person who was half man and half X, who could Y?' Switching the CapableOf relation to Not-CapableOf enabled us to produce ideas suggesting a person who became an animal, but retained some human qualities. We augmented this by using the LocatedNear relation (not shown in figure 1) to add a geographical context to the situation, producing ideas such as 'What if a woman awoke in the sky to find she had transformed into a bird, but she could still speak?' We found that these ideas had much resonance with the premise of The Metamorphosis.
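To make the template-based inversion concrete, here is a minimal Python sketch in the spirit of flowcharts A and B. It is only a sketch under stated assumptions: the fact tuples are a toy in-memory stand-in for ConceptNet, the ANIMAL_WORDS set is a crude stand-in for the WordListCategoriser, and the function names are hypothetical rather than the actual FloWr processes.

# A minimal sketch of template-based fact inversion in the style of
# flowcharts A and B. The fact list is a toy stand-in for ConceptNet;
# the real system queries ConceptNet and runs inside FloWr.
FACTS = [
    ("eagle", "IsA", "animal", 3.5),
    ("eagle", "CapableOf", "fly", 2.0),
    ("cat", "IsA", "animal", 5.0),
    ("cat", "CapableOf", "catch mice", 1.5),
    ("my husband", "IsA", "animal", 0.5),  # outlier a word-list filter would drop
]

ANIMAL_WORDS = {"eagle", "cat", "dog", "fish", "bird"}  # crude WordListCategoriser stand-in

def flowchart_a(threshold=1.0):
    """Render What-ifs from [X, IsA, animal] facts above a score threshold."""
    for x, rel, y, score in FACTS:
        if rel == "IsA" and y == "animal" and score >= threshold and x in ANIMAL_WORDS:
            yield f"What if there was a person who was half man and half {x}?"

def flowchart_b(threshold=1.0):
    """Extend flowchart A with a CapableOf fact about the same animal."""
    for x, rel, y, score in FACTS:
        if rel == "IsA" and y == "animal" and x in ANIMAL_WORDS:
            for x2, rel2, y2, score2 in FACTS:
                if x2 == x and rel2 == "CapableOf" and score2 >= threshold:
                    yield (f"What if there was a person who was half man "
                           f"and half {x}, who could {y2}?")

for idea in list(flowchart_a()) + list(flowchart_b()):
    print(idea)

Raising the score threshold trades yield for quality, which is exactly the effect the curation analysis below measures.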
Taking our lead next from the surrealist artworks of Dalí, Magritte and colleagues, in flowchart C we looked at bizarre visual juxtapositions. ConceptNet is used here to find an occupation, a vegetable and a location related to some animal, and the flowchart produces ideas such as: 'What if there was a banker underwater with a potato for a face?' Similarly, in flowchart D, we produced ideas for paintings by finding materials, M, using facts of the form [X, IsA, thing] and [X, MadeOf, M], then finding organisms, O, with pairings of [X, IsA, live thing] and [O, IsA, X] facts. This led to ideas such as painting a dolphin made of gold, a reptile made of wood, and a flower made out of cotton. In the baseline evaluation section below, we describe the raw yield of flowcharts A to D, and the proportion of the results which were both understandable and mostly fictional. As mentioned above, we are particularly interested in estimating the narrative potential of an idea, by which we mean the likelihood that the idea could be used in multiple, interesting and engaging plots for stories. As a baseline method for estimating such potential, we investigated a technique consisting of building inference chains of ConceptNet facts whose starting point is the fact that is inverted in the idea. To illustrate the approach, from the seed idea 'What if there was a little bug who couldn't fly?', the following chain of relations can be obtained through ConceptNet: [bug, CapableOf, fly], [fly, HasA, wing], [wing, IsA, arm], [arm, PartOf, person], [person, Desires, muscle], [muscle, UsedFor, move and jump]. Here, one can imagine a bug who can't fly, but instead uses his muscle-bound, human-like arms for locomotion. Our hypothesis is that, while each chain might be rather poor and difficult to interpret as a narrative, the volume and average length of such chains can indicate the potential of the idea. We implemented a ConceptNetChainSorter process to take a given idea and develop chains up to a specified length with no loops or repetitions. Flowchart E uses this process to order the facts from ConceptNet in terms of the sum of the lengths of the chains produced. Hence facts with many chains are ranked higher than facts with fewer, and longer rather than shorter chains will also push a fact up the rankings. Often there are no chains for a fact, and if there are, the number depends on the nature of the objects being related, and the relation. Looking at facts [X, R, Y], where [X, IsA, animal] is a ConceptNet fact, for each R we found the following percentages of facts had non-trivial chains: CapableOf 20%, Desires 50%, HasA 63%, HasProperty 28%, IsA 48%, LocatedNear 100%. Fictional Ideation using ReVerb The University of Washington's ReVerb project (Fader, Soderland, and Etzioni 2011) extracts binary relationships between entities from text, like the ConceptNet relations described above. Output produced by running the system over a large corpus of web texts (ClueWeb09, ~1 billion web pages) is publicly available, and we use it here to generate fictional ideas. Lin, Mausam, and Etzioni (2012) have linked the first argument (LHS concept) of a subset of ReVerb extractions with identifiers of entities in Freebase (Bollacker et al. 2008). This provides a means of unifying the various names by which a particular entity might be referred to (cow, cattle, etc.) and disambiguating entities that have the same name. In the ideation method described here, we use this dataset, and the input to the process is a Freebase ID. The relations vary in generality, as well as reliability.
For example, some relations express a particular one-off event during which the entities interacted (Tony Blair converted to Catholicism), while others express general properties of the entities (cows eat grass). Both types of relations may be of interest for building world views for ideation, and we do not attempt to distinguish them currently. Using facts from ReVerb, we can generate fictional ideas by substituting one of the arguments for an alternative entity. For example, the extractions relating to cattle include [Cattle, were bred for, meat]. Looking at other facts that use the same relation (be bred for), with different LHS entities, we find things that are bred for speed, suggesting a possible fictional fact: [Cattle, were bred for, speed]. The following are desirable properties of such alterations: 1. They should be fictional (ruling out alterations such as [Cattle, were bred for, meat] → [Cattle, were bred for, milk], which is simply true). 2. They should make sense (ruling out [Cattle, were bred for, meat] → [Cattle, were bred for, rule of thumb]). 3. They should have a substantial effect on the narratives that could be generated (ruling out [Cattle, were bred for, meat] → [Cattle, were bred for, hamburgers], which changes very little). Establishing whether this last desideratum holds is a hard task which we leave for now to future work. Given an extraction [X, r, Y], we wish to generate a fictional [X, r, Y′]. The following requirements might serve to approximate the first two desiderata above: [X, r, Y′′] is common for some Y′′, i.e., r is a common type of fact to state about X; [X′, r, Y′] is common, i.e., Y′ is commonly seen as the second argument of r (with different first arguments); and [X, r, Y′] is rarely or never seen, i.e., this is likely not a fact we are already aware of. As we cannot rely on the dataset to contain all relevant facts, we impose a strong version of this last requirement: [X, r, Y′] must be completely unattested. As an example, the following alteration is well supported by these criteria: [Michael Jackson, was still the king of, pop] → [Michael Jackson, was still the king of, Kong]. The initial fact is chosen because Michael Jackson is frequently said to have been the king of things (popular music, music video, etc.), satisfying the first requirement. Kong is chosen as an alternative second argument because Kong ranks highly among things that people are described as being still king of (high-scorers in the game Donkey Kong are described as such), satisfying the second requirement. Finally, we have never seen Michael Jackson described as being still the king of Kong. The first two requirements given above can be expressed, and combined, as conditional probabilities. P(r|X) represents the probability of the relation given the first argument (the input). This will be high for the relations most often seen with X as the first argument (the most common things to say about X). P(Y′|r) will likewise be high for the most common second arguments of the relation in question, regardless of which X they have been seen with. To eliminate attested facts, we exclude any Y′ seen at all in [X, r, Y′]. For each of the top 100 facts about X found in the ReVerb extractions, all alterations Y′ with a non-zero P(Y′|r) are ranked according to P(r|X) · P(Y′|r).
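The ranking scheme translates directly into counting. Below is a minimal Python sketch, assuming a toy list of extraction triples in place of the real ReVerb dataset; the function and variable names are hypothetical.

# A sketch of the alteration-ranking scheme: estimate P(r|X) and P(Y'|r)
# from counts over extraction triples, drop attested facts, and rank the
# remaining [X, r, Y'] candidates by the product of the two probabilities.
from collections import Counter

TRIPLES = [
    ("cattle", "were bred for", "meat"),
    ("cattle", "were bred for", "milk"),
    ("horses", "were bred for", "speed"),
    ("dogs", "were bred for", "herding"),
    ("cattle", "eat", "grass"),
]

rel_given_x = Counter((x, r) for x, r, y in TRIPLES)   # counts for P(r|X)
y_given_rel = Counter((r, y) for x, r, y in TRIPLES)   # counts for P(Y'|r)
x_count = Counter(x for x, r, y in TRIPLES)
rel_count = Counter(r for x, r, y in TRIPLES)
attested = set(TRIPLES)

def alterations(x, top_n=5):
    """Rank unattested [x, r, Y'] facts by P(r|x) * P(Y'|r)."""
    scored = []
    for (x2, r), n_xr in rel_given_x.items():
        if x2 != x:
            continue
        p_r_given_x = n_xr / x_count[x]
        for (r2, y), n_ry in y_given_rel.items():
            if r2 == r and (x, r, y) not in attested:
                scored.append((p_r_given_x * n_ry / rel_count[r], (x, r, y)))
    return sorted(scored, reverse=True)[:top_n]

for score, fact in alterations("cattle"):
    print(f"{score:.3f}", fact)  # e.g. [cattle, were bred for, speed]

On this toy data, 'speed' and 'herding' surface as candidate second arguments because they are common for the relation but unattested for cattle.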
Below are some examples of the alterations the system performs, with an analysis of the proportion of usable alterations given in the next section. The following are the top five alterations for the entity cattle, showing the fact in its extracted form, then the system's alteration, which could be rendered as a What-if style idea: 1. Cattle evolved to eat grass → Cattle evolved to eat meat. 2. Cattle occupy a unique role in human history → Cattle occupy a unique role in Israelite history. 3. Cattle occupy a unique role in human history → Cattle occupy a unique role in modern distributed systems. 4. Cattle occupy a unique role in human history → Cattle occupy a unique role in society. 5. Cattle were bred for meat → Cattle were bred for speed. Similarly, the top five for Scotland are: 1. Scotland is steeped in history → Scotland is steeped in tradition. 2. Scotland is a part of the United Kingdom → Scotland is a part of life. 3. Scotland is in Britain → Scotland is in trouble. 4. Scotland is in Britain → Scotland is in order. 5. Scotland is in Britain → Scotland is in progress. In other tests, we produced ideas that express fictional histories, a mainstay of creative writing, for instance: What if John F. Kennedy had been elected Pope? Fictional Ideation using Bisociative Discovery Koestler (1964) stated that different types of invention all share a common pattern, to which he gave the term bisociation. According to Koestler, bisociative thinking occurs when a problem, idea, event or situation is perceived simultaneously in two or more matrices of thought or domains. When two matrices of thought interact with each other, the result is either their fusion in a novel intellectual synthesis, or their confrontation in a new aesthetic experience. The developers of the CrossBee system (Juršič et al. 2012) followed Koestler's ideas by exploring a specific form of bisociation: finding terms that appear in documents which represent bisociative links between concepts of different domains, with a term ranking method based on the voting of an ensemble of heuristics. We have extended this methodology with a banded matrices approach, described in (Perovšek et al. 2013), which is used in a new CrossBee heuristic for evaluating terms according to their bridging term (b-term) potential. The output from CrossBee is a ranked list of potential domain-bridging terms. Inspecting the top-ranked b-terms should result in a higher probability of finding observations that lead to the discovery of new links between different domains. Here, the creative act is to find the links which cross two or more different domains, leading out of the original matrix of thought. In the simplified ideation scenario addressed here, we used CrossBee for b-term ranking on documents from two domains to discover bridging terms, with the aim of combining statements from the two domains. The first domain consists of 154,959 What-if sentences retrieved from Twitter with the query 'what if', assisted by the Gama System PerceptionAnalytics platform. The tweets were filtered through the following steps, reducing their number to 65,811 (a sketch of this pipeline follows below): all non-ASCII characters were deleted; repeated letters were truncated, so that any character repeating consecutively more than twice in a word was ignored after the second repetition (for example, the word 'cooooool' would be truncated to 'cool', but 'looooooove' would likewise become 'loove'); all characters were transformed to lower case; non-English tweets were removed; vulgar words were removed by comparison with a list of such words (urbanoalvarez.es/blog/2008/04/04/bad-words-list/); from each item, only the sub-string starting with the term 'what if' and ending with a period, question mark or exclamation mark was considered; items shorter than 9 characters were removed; and exact duplicates were removed.
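The filtering steps above map naturally onto a small text-cleaning function. The following Python sketch implements them under stated assumptions: language identification and the bad-words list are stubbed out as parameters, since the real pipeline relied on an external word list and platform tooling.

# A sketch of the tweet-cleaning pipeline described above. Language
# identification and the bad-words list are stubbed out here.
import re

def clean_whatifs(tweets, bad_words=frozenset(), is_english=lambda t: True):
    seen, result = set(), []
    for t in tweets:
        t = t.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
        t = re.sub(r"(.)\1{2,}", r"\1\1", t)             # cooooool -> cool, looooooove -> loove
        t = t.lower()                                    # lowercase everything
        if not is_english(t):                            # non-English filter (stub)
            continue
        if any(w in t.split() for w in bad_words):       # vulgarity filter
            continue
        m = re.search(r"what if.*?[.?!]", t)             # keep "what if ... [.?!]" substring
        if not m:
            continue
        s = m.group(0)
        if len(s) < 9 or s in seen:                      # length and exact-duplicate filters
            continue
        seen.add(s)
        result.append(s)
    return result

print(clean_whatifs(["What if cows coooould flyyyyy?!?"]))  # ['what if cows coould flyy?']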
The second dataset is a collection of 86 moral statements from Aesop's fables, which was created by crawling the Aesop's Fables Online Collection. Each What-if sentence and each moral statement was treated as a separate document, and all documents were further preprocessed using standard text mining techniques. We then applied our methodology to the data from the two domains to estimate the b-term potential of common terms. We used this indicator for ranking (a) single What-if sentences and (b) bisociatively linked What-if sentences and moral statements. Inspection of the What-if sentences obtained from tweets revealed that a great number of them make very little sense in general or are related to very specific contexts. Aesop's morals, on the other hand, tend to be very general in nature. By composing sentences from these two domains using the terms with the best b-term potential indicator value, we hoped to produce a ranking mechanism that favours generally meaningful fictional ideas and that might be useful for ranking individual What-if sentences. We used the mechanism to rank both single sentences and compound pairs, to test the hypothesis that using the b-term potential as a ranking coefficient can estimate which What-if sentences will be evaluated more favourably by people, both as individual sentences and in bisociatively combined sentence pairs. The effectiveness of b-term potential as a ranking tool for single What-if sentences was evaluated as follows: we randomly shuffled the 10 best-ranked sentences and 10 random What-if sentences. The collection of 20 sentences was then independently assessed by 6 human evaluators, who used scores from 1 (bad) to 5 (very good) in answering the question: 'How good (generally interesting) do you find the following idea?' The top 10 b-term ranked What-ifs received an average score of 2.92, whereas the randomly chosen ones scored 2.80 on average. Application of an unpaired t-test suggests that the difference between these two scores is not significant (p=0.6736). The best-ranked What-if, according to the b-term potential, was: 'What if i called myself the pope then charged into the vatican and demanded a duel to the death with an old man?' This was also the sentence that achieved the best average score from the human evaluators. The impact of b-term potential ranking on compound sentence pairs was evaluated similarly. To do this, we took the top 4 What-ifs and the top 4 moral statements that contained the strongest b-term. By combining them, we created a collection of 16 pairs of sentences. This collection was compared to two other collections: (i) a collection of 16 pairs of sentences (What-if + moral) that shared a b-term regardless of its strength, and (ii) a collection of 16 randomly paired What-if and moral sentences. Our hypothesis was that the top-ranked collection would score higher on average than the one with randomly ranked b-terms, and significantly better than the one which was randomly put together, ignoring b-terms.
The pairs were randomly shuffled and independently assessed by 6 human evaluators answering the question: 'How good do you find the combination of the two sentences?', scored again from 1 to 5. Surprisingly, the top-ranked collection was scored significantly (p=0.0076) lower than the randomly ranked one, with an average score of 2.43, compared to 2.96. Also, in an independent comparison, it scored lower than the randomly paired sentences, having an average score of 2.70 compared to 2.78, although this was not significantly lower (p=0.6677). The compound sentence pair with the best b-term rank was: 'What if i called myself the pope then charged into the vatican and demanded a duel to the death with an old man? Every man should be content to mind his own business.' However, this sentence pair was ranked only 8th best among the 32 manually evaluated compound pairs. Given the encouraging result of the ranking mechanism for single What-if sentences, and the bad performance on its target compound data, the usefulness of the bisociative discovery methods for ideation and idea assessment cannot be confirmed. Hence, we plan further implementation and experimentation. In particular, we will enlarge the dataset of moral statements, to strengthen the bisociation approach. Curation Analyses Recall that we plan to use the above ideation methods as a baseline against which to compare more sophisticated approaches as the WHIM project progresses. Colton and Wiggins (2012) introduce the term curation coefficient as an informal reading of the typicality, novelty and quality measures put forward in (Ritchie 2007). In essence, this involves a project team member examining the output from their generative software, and calculating the proportion that they would be happy to present to others. For our purposes here, we used slightly lower criteria: we took all the ideas from each method, or a sample when there were too many, and recorded how many were suitable for assessment, i.e., the proportion of ideas that were both understandable and fictional, without any judgement of quality. In figure 1, we presented flowcharts A to D for generating fictional ideas using ConceptNet. Facts in ConceptNet are scored for truth likelihood, and flowchart A is parametrised by a threshold, T1, for the minimum score that ConceptNet facts must achieve to be used. Flowchart B uses ConceptNet twice, hence has thresholds T1 and T2. Flowcharts C and D were not parametrised, and used a fixed ConceptNet threshold of 1. Table 1 shows the number of ideas (yield) that each flowchart (FC) produced, with various threshold settings. The table also shows the curation coefficient (C-Coeff), i.e., the proportion of understandable and (largely) fictional ideas.

Table 1: Curation analysis: ConceptNet approach.
FC       Example                                               T1  T2  Yield  C-Coeff(%)
A        He was half man, half bird                            1   -   97     72
A                                                              3   -   21     90
A                                                              5   -   14     93
B        He was half man, half fish, who could live in a lake  5   1   453    78
B                                                              5   2   94     88
B                                                              5   5   27     100
B        He was a cat, but he could still write                5   1   48     88
B                                                              5   3   7      100
C        Composer in a nest with turnip for a face             -   -   272    56
D        Dolphin that is made out of gold                      -   -   871    76
Average                                                                190.4  84.1

Table 2: Curation analysis: ReVerb approach.
Criteria        Yield  C-Coeff(%)
Fictional       500    90.9
Understandable  500    94.6
Non-duplicate   500    73.6
Overall         500    59.1

Table 3: Curation analysis: bisociative discovery approach.
Evaluation                Yield  C-Coeff(%)
What-if + moral (b-term)  32     28.1
What-if + moral (random)  16     6.25
We see that the yield reduces as higher thresholds T1 and T2 are imposed, but the curation coefficient increases, because fewer spurious or nonsensical facts are inverted for the ideas. In one case for flowchart B, by setting T1 and T2 to 5, we were able to produce a set of 27 ideas with a 100% curation coefficient. We noted an average yield of 190.4 and an average curation coefficient of 84.1%. We generated 500 ideas with the ReVerb approach, using as seed queries the top six names from an online list of the most famous people of all time (www.whoismorefamous.com). There were three issues with the ideas: (i) some happened to be true facts, or very close to a true fact (e.g., What if John Kennedy was elected vice president?); (ii) some happened to be nonsensical (e.g., What if Elvis Presley is inducted into St?); and (iii) some were an exact or very close duplicate of one already seen in the output (e.g., What if Leonardo da Vinci was born in New York? and What if Leonardo was born in New York?). In table 2, we report the curation coefficients with each of these three issues in mind, and an overall coefficient for the ideas which have none of these issues. We see that each issue reduced the curation coefficient, which was 59.1% overall. For the bisociative discovery approach, we performed an analysis of the ideas that combine a What-if sentence with a moral statement, since these are automatically generated, rather than just mined from Twitter. We compared the 32 sentence-pair ideas where there was a shared b-term with the 16 randomly concatenated pairs of sentences. Table 3 shows the results of the curation analysis for the ideas from the bisociative discovery approach. We found that the ideas generated by the bisociative discovery method were entirely understandable, as they were concatenations of two already understandable sentences. However, the results were often non-fictional, because the method doesn't explicitly attempt to distort reality. This explains the low curation coefficient of 28.1% for the b-term method, but it is important that it significantly outperformed the random approach. With the ConceptNet and ReVerb approaches, data-mined notions of reality were inverted and altered respectively, hence the ideas were largely fictional. With respect to nonsensical ideas, for the ConceptNet-based ideas, we learned that control over quality could be exerted, at the expense of yield, through the usage of the ConceptNet thresholds. For the ReVerb results, completely nonsensical ideas were rare, since we used only arguments that are well attested with the relation. Errors were generally due to the open-domain information extraction method used to compile the original facts. With the ReVerb approach, many of the (almost) true ideas occur because of substitutions for similar arguments, e.g., substituting president with vice-president. The system cannot recognise that the two are similar, and consequently the output contains a high proportion of almost exact duplicates: often almost the same thing is substituted many times over. This suggests that the results could be improved by incorporating a measure of semantic similarity which prefers dissimilar substitutions. Alternatively, the data integration technique from (Yao, Riedel, and McCallum 2012) could be used by the system to rule out ideas that, although not seen explicitly before, are highly probable repeats, given the observed facts.
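The dissimilarity filter suggested above could be prototyped in a few lines. The sketch below uses string similarity from Python's difflib as a crude stand-in for a real semantic similarity measure (such as distributional similarity); the function name and threshold are hypothetical.

# A sketch of the suggested improvement: keep an alteration only if the new
# argument differs enough from the original. String similarity is a crude
# stand-in for a real semantic measure.
from difflib import SequenceMatcher

def dissimilar_enough(original_y, new_y, max_sim=0.6):
    """Reject substitutions whose new argument is too close to the original."""
    return SequenceMatcher(None, original_y, new_y).ratio() <= max_sim

print(dissimilar_enough("president", "vice president"))  # False: near-duplicate idea
print(dissimilar_enough("president", "astronaut"))       # True: a more distorting swap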
A Crowd-Sourcing Evaluation Ultimately, the fictional ideas we want to automatically produce will be for general consumption. Hence a large part of the WHIM project will involve crowd-sourcing responses to fictional ideas and using machine learning techniques to derive an audience model that can predict whether generated ideas are going to be of value. To study a baseline methodology for this, and to get a first tranche of feedback from the general public, we focused on the ConceptNet approach within the context of anthropomorphised animal characters which could feasibly appear in a Disney animated film. This context was chosen because Disney movies are familiar to most people and somewhat formulaic, hence we could be reasonably confident that when we surveyed people, our questions would be interpreted appropriately. During a pilot study reported in (Llano et al. 2014), we focused on ideas generated by the CapableOf relation in the second ConceptNet node of flowchart B in figure 1, i.e., we studied ideas of the type: 'What if there was a little X, who couldn't Y?' With an online survey of four questions, we asked 10 English-speaking participants to rank the same list of 15 such Disney characters, in terms of (a) general impression, (b) emotional response provoked, (c) narrative potential: the number and quality of potential plot lines imaginable for the character, and (d) how surprising they found the character to be. Our aim was to measure the influence of emotional provocation, narrative potential and surprise on general impression. Recall that we wrote routines to produce chains of ConceptNet facts. The 15 Disney characters in the survey comprised 5 from ideas with no chains, 5 from ideas with multiple chains, and 5 ideas where the RHS of a ConceptNet fact was replaced with a randomly chosen verb. This pilot study showed that ConceptNet ideas were ranked much higher than the random ones for three questions, with average ranks of 5.21 vs. 10.98 for general impression, 6.08 vs. 11.5 for emotional provocation and 5.00 vs. 11.32 for narrative potential. Within the ConceptNet examples, those with chains were ranked slightly higher than those without: average ranks of 4.78 vs. 5.21 for general impression, 3.42 vs. 6.08 for emotional response and 4.68 vs. 5.00 for narrative potential. However, when assessing levels of surprise, the random ideas were ranked as best, with an average rank of 4.48 vs. 8.18 for ConceptNet ideas with no chains, and 8.44 for those with chains. On reflection, we determined that this resulted from an inconsistent interpretation of the word 'surprising'. We also found in the pilot study that there was a strong positive correlation r between general impression and both emotional response (r=0.81) and narrative potential (r=0.87), confirming that both these elements are key components of participants' general impressions of value. However, we found a strong negative correlation between general impression and surprise (r=-0.77), suggesting that more surprising ideas aren't generally well received. Building on and learning from the pilot study, we undertook a larger scale experiment. For this, we used three sets of Disney characters generated using ConceptNet facts with the CapableOf (CO) relation as before, in addition to the Desires (D) relation ('What if there was a little X who was afraid of Y?') and the LocatedNear (LN) relation ('What if there was a little X who couldn't find the Y?').
In order to evaluate participants' preferences, we designed four surveys: one per relation, and a fourth that mixed Disney characters from the three relations. In order to prevent bias or fatigue, each participant completed only one of the surveys. Each survey consisted of four questions that asked participants to rank the Disney characters in order of: their general impression (GI) of each character's viability; the degree of emotional response (ER) they felt upon reading and interpreting the idea of the character; the quantity and quality of the plot lines, i.e., narrative potential (NP), that they felt might be written about each; and to what level each character met their expectation (LE) of a Disney character. This last question replaced the final question from the pilot study. The relation-focused surveys had a set of 14 ideas: eight ConceptNet non-chaining (NC) ideas (i.e., with only one associated chain) and six ConceptNet chaining (CC) ideas (i.e., with multiple associated chains); random ideas were not evaluated, as they scored significantly worse in the pilot study. The mixed survey used a set of 15 CC-ideas, five per relation. These ideas were chosen by sampling systematically at equal intervals in terms of chaining score.

Figure 2: Crowd-sourcing experiment results for four surveys: CapableOf (CO), Desires (D), LocatedNear (LN) and Mixed.

(a) Average participant rankings for the three relation-focused surveys by type of idea: Non-Chaining (NC) and ConceptNet Chaining (CC).
Q    CO NC  CO CC  D NC   D CC   LN NC  LN CC  Avg NC  Avg CC
GI   7.41   7.62   7.76   7.15   8.05   6.77   7.74    7.18
ER   7.88   7.00   8.03   6.80   7.85   7.03   7.92    6.94
NP   7.85   7.04   8.03   6.80   7.95   6.90   7.94    6.91
LE   7.95   6.90   8.15   6.63   8.01   6.81   8.04    6.78

(b) Average participant rankings for the Mixed survey by inverted relation.
Q    CO     D      LN
GI   7.48   7.70   8.81
ER   6.55   8.44   9.01
NP   7.86   7.48   8.66
LE   7.24   8.46   8.30

(c) Average rank correlation (τ) between all the questions of the four surveys: General Impression (GI), Emotional Response (ER), Narrative Potential (NP) and Level of Expectation (LE).
GI&ER 0.34   GI&NP 0.36   GI&LE 0.31   ER&NP 0.35   ER&LE 0.32   NP&LE 0.37

(d) Rank correlation (τ) between average participant rankings and chaining rankings.
Q    CO     D      LN     Mixed   Avg
GI   0.09   0.25   0.27   -0.24   0.09
ER   0.17   0.25   0.26   0.26    0.23
NP   0.22   0.22   0.21   0.23    0.22
LE   0.14   0.27   0.22   0.08    0.17

(e) Rank correlation (τ) between average participant rankings and ConceptNet relation scores: for each survey, the IsA fact score, the inverted-relation fact score (CO/D/LN/Rel) and their combination (CB).
     CapableOf           Desires             LocatedNear          Mixed               Avg
Q    IsA   CO    CB      IsA   D     CB      IsA    LN     CB     IsA   Rel   CB      IsA   Rel   CB
GI   0.25  0.22  0.03    0.42  0.10  0.44    -0.17  0.21   -0.03  0.20  0.40  0.26    0.17  0.23  0.17
ER   0.19  0.25  0.39    0.17  0.49  0.46    0.34   -0.03  0.02   0.27  0.39  0.18    0.24  0.27  0.26
NP   0.31  -0.02 0.11    0.40  0.46  0.10    -0.17  -0.07  0.17   0.31  0.23  0.29    0.21  0.15  0.16
LE   0.18  0.07  0.44    0.51  0.07  0.44    -0.07  0.27   0.06   0.22  0.26  0.31    0.21  0.16  0.31
Moreover, 64 participants were female, 70 were male and 1 person preferred not to specify their gender. This shows an almost even participation from both genders. The participants were between 18 and 74 years old; more specifically, 12 were in the age range between 18 and 24 years old, 74 in the range 25-34, 33 in the range 35-44, 7 in the range 45-54, 7 in the range 55-64 and 2 in the range 65-74. The highest concentration is seen in participants between 25 and 34 years old; however, most age ranges were represented in the surveys. After completing the surveys we asked the participants to select their level of confidence, between very low, low, medium, high and very high, when answering each question. Table 4 shows that most of the participants answered each question with a medium level of confidence or higher. This increases the confidence we have in the results. Figure 2(a) shows the average rankings given for each class of ideas in the relation-focused surveys. As suggested in the pilot study, in general, the CC-ideas are ranked around Table 4: Percentage of participants who answered each question with a medium level of confidence or higher. Percentage of Participants Question CO D LN Mixed GI ER NP LE 97 97 78 85 90 90 82.5 80 94 88.5 83 80 96 92.5 85 78 1 position higher than the NC-ideas. This supports the hypothesis that the ConceptNet chaining evaluation technique provides a reliable measure of value for fictional ideation using ConceptNet. Using a Friedman test comparing the mean ranks for CC and NC ideas in each response, we found that the difference between their ranks is highly significant overall (p<0.001). This effect remained significant across all question and survey subgroups. Figure 2(b), which presents the results from the fourth survey, shows that, in general, the CO-ideas were ranked highest, followed by the D-ideas and then the LN-ideas. A Friedman test showed these differences to be highly signi.cant overall (p=0.001). Our interpretation is that participants considered that, in some cases, the D-ideas and LN-ideas failed with respect to the feasibility of the fictional characters they portrayed, therefore, they were ranked lower. More specifically, respondents suggested that they felt apathy towards anthropomorphisations such as a little goat who is afraid of eating (D-idea), which threatened fundamental aspects of animals lives, as well as ideas such as a little oyster who couldnt find the half shell (LN-idea), which were found difficult to interpret. On the contrary, participants pointed out that some of the CO-ideas were reminiscent of existing cartoons, placing them into a higher rank, e.g., a little bird who couldnt learn to .y (which resembles the plot of the animated film Rio). These type of participant judgements played an important role when ranking the ideas, resulting in a clear overall preference for the CO-ideas. We also wanted to confirm the pilot study suggestion that emotional response, narrative potential and level of expectation are key components of participants general impression of value. We used a Kendall rank correlation coefficient (.) for this analysis. Figure 2(c) shows the average corre lation results between all the components, showing a positive correlation between all the surveyed components. However, a Friedman rank sum test indicated that the particular differences between correlation values are not significant (p=0.2438), i.e., all question pairs were similarly correlated. 
Figure 2(d) shows the correlation between the chaining scores and the overall rankings of the participants. We see that weak positive correlations were found between most of the aspects evaluated in the four surveys and the chaining scores. These results confirm that, as suggested in the pilot study, the chaining technique can be used as a measure to evaluate fictional ideas, and we plan to investigate the value of generating other semantic chains to increase the effectiveness of this technique. Figure 2(d) also shows that a weak negative correlation exists between participants' general impression and the chaining scores for the mixed survey. This suggests that participants found it more difficult to decide on the rankings when the rendering of the ideas was mixed. Finally, two facts are used for each idea generated with ConceptNet: facts that tagged words as animals with the IsA relation, and facts to be inverted, which use the CapableOf, Desires and LocatedNear relations. Figure 2(e) shows the results of calculating the correlation between the average participants' rankings and each ConceptNet fact score, as well as the combination of both (CB). We see that, except for the LN-survey, most of the results show a weak positive correlation. This supports the finding from the pilot study that the values people project onto ideas are somewhat in line with the score assigned by ConceptNet to the underlying facts. Moreover, the highest correlations appear in the D-survey with the IsA relation. We believe that people tend to rank ideas associated with more common animals, such as the dogs or cats used in multiple ideas of the D-survey, higher than ideas involving relatively uncommon animals, such as the ponies, moles or oxen used in the LN-survey. The correlations between the participants' rankings and the chaining and ConceptNet scores (Figures 2(d) and 2(e)) led us to believe that these scores could be used to predict people's preferences when ranking fictional ideas. To test this hypothesis, we used the Weka machine learning framework (Hall et al. 2009). We provided Weka with the scores of: ConceptNet chaining, ConceptNet strength for the IsA relation, ConceptNet strength for the inverted relations, word frequencies for the LHS and RHS of inverted facts, and semantic similarity between the LHS and RHS of inverted facts, obtained using the DISCO system (www.linguatools.de/disco/disco_en.html). We classified each idea into good (top 5), bad (bottom 5) or medium (middle 5) based on the average participants' rankings. We tested a variety of decision tree, rule-based and other learning mechanisms, with the results given in Table 5, along with the name of the learning method which produced the best classifier. We found that the RandomTree approach consistently performed well, but was only the best method for two aspects of evaluation. We used Weka to perform a paired t-test, which showed that the predictors are significantly better (with up to 95% confidence) than the majority class classifier (MCC), which simply assigns the largest class as a prediction.

Table 5: Predictive accuracy for general impression, emotional response, narrative potential and level of expectation. Note that the MCC value was the same for all evaluated aspects, i.e., GI, ER, NP and LE.
             MCC    GI     ER        NP      LE
Method       ZeroR  Ridor  RandTree  NBTree  RandTree
Accuracy(%)  35.08  49.12  56.14     43.85   54.38
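The paper's experiment used Weka; a comparable setup can be sketched in Python with scikit-learn, as below. The feature matrix here is randomly generated and purely illustrative; in the real experiment each row would hold the six idea-level scores listed above and each label would come from the participants' rankings.

# A comparable setup sketched with scikit-learn rather than Weka, on
# hypothetical feature vectors (chaining score, IsA score, inverted-relation
# score, LHS/RHS word frequencies, LHS-RHS semantic similarity).
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier  # 'most_frequent' mirrors ZeroR / the MCC
from sklearn.model_selection import cross_val_score
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((60, 6))            # 6 idea-level scores per idea (toy data)
y = rng.integers(0, 3, 60)         # good / medium / bad (toy labels)

for name, clf in [("MCC baseline", DummyClassifier(strategy="most_frequent")),
                  ("Decision tree", DecisionTreeClassifier(random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.2%}")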
Conclusions and Future Work While essential to the simulation of creative behaviour in software, fictional ideation has barely been studied in Computational Creativity research. Within the WHIM project, we have implemented three approaches to automated fictional ideation which act as a baseline against which to compare future ideation methods. We presented baseline methodologies for assessment, in the form of a curation analysis and a crowd-sourcing study where participants ranked fictional ideas. The curation analysis showed that when guided by a strong context, such as Disney characterisations, automated ideation methods work well, but they degrade when the context becomes weaker. The crowd-sourcing study showed that an inference chaining technique, inspired by the hypothesis that ideas can be evaluated through narratives involving them, provides a reliable measure of value with which to assess the quality of fictional ideas. Also, we found positive correlations between the rankings of general impression and each of emotional response, narrative potential and expectation, showing that these are key elements of participants' general impression of fictional ideas. Finally, we demonstrated that machine learning techniques can be used to predict how people react to a fictional idea along these axes, albeit with only around 50% predictive accuracy. The baselines presented here provide a firm foundation on which to build more intelligent ideation methods. We plan to improve open information extraction techniques for web mining, and to investigate ideation techniques involving metaphor and joke generation methods and the subversion of category expectations. Also, we plan to use extrapolation to explore scenarios that arise from a fictional idea. For instance, from the seed idea 'What if there was an elevator with a million buttons?' we could extrapolate the distance the elevator can reach and come up with a scenario in which elevators can reach as high as space. Identifying that the current distance reached by elevators is significantly lower than the distance to space is crucial in order to select this idea as an interesting scenario. Using quantitative information can help achieve this goal. The Visuo system (Gagne and Davies 2013) uses semantic similarity to estimate quantitative information for input descriptions of scenes, by transferring quantitative knowledge to concepts from distributions of familiar concepts in memory. We will explore the use of Visuo in the production of scenarios from a fictional idea. The generation and assessment of narratives will be a key factor, enabling the system to curate its output. We will derive a theory of idea-centric narratives and implement methods for generating them and for assessing ideas in terms of the quality and quantity of narratives they appear in. Our ConceptNet chaining technique shows much promise. Based on the correlation found between general impression and emotional response, we plan to improve the predictive power of the technique using sentiment analysis, as in (Liu, Lieberman, and Selker 2003), where the affect of a concept is assessed through a chaining process. The final major aspects will be to experiment with rendering methods where obfuscation and affect are used to increase audience appreciation of an idea, and the machine learning of a detailed audience model which will influence the entire ideation process.
The WHIM project is primarily an engineering effort to build a What-if Machine as a web service and interactive engine, which generates fictional ideas and provides motivations and consequences for each idea, potential narratives involving it, and related renderings such as poems, jokes, neologisms and short stories. The first version of the What-if Machine is available online, and uses Flowchart E from figure 1. Users can parametrise the method for exploration, or simply click the 'I'm feeling lucky' button. This online implementation will be used to gather feedback for audience modelling, and will hopefully help promote fictional ideation as a major new area for Computational Creativity research. Acknowledgements We would like to thank the members of the Computational Creativity Group at Goldsmiths for their feedback, Jasmina Smailović for preprocessing the tweets used in the bisociative approach, the participants of the crowd-sourcing study for their time, and the anonymous reviewers for their constructive comments. This research was funded by the Slovene Research Agency and supported through EC funding for the project WHIM 611560 by FP7, the ICT theme, and the Future Emerging Technologies FET programme. 2014_27 !2014 The Three Layers Evaluation Model for Computer-Generated Plots Rafael Pérez y Pérez, Departamento de Tecnologías de la Información, Universidad Autónoma Metropolitana, Cuajimalpa, Av. Vasco de Quiroga 4871, Col. Santa Fe, Cuajimalpa, México D.F., C.P. 05300, rperez@correo.cua.uam.mx, www.rafaelperezyperez.com Abstract This paper describes a model for evaluating a computer-generated plot. The main motivation of this project is to provide MEXICA, our plot generator, with the capacity to evaluate its own outputs, as well as to assess narratives generated by other agents that can be employed to enrich its knowledge base. We present a description of our computer model as well as an explanation of our first prototype. Then, we show the results of assessing three computer-generated narratives. The outcome suggests that we are heading in the right direction, although much more work is required. Introduction The engagement-reflection (ER) computer model of writing (Pérez y Pérez and Sharples 2001) represents creativity as a constant interplay between the generation of ideas and their evaluation. As a core characteristic, these processes strongly interact and influence each other. Thus, from the ER perspective, assessment is an integral part of the creative process. In the same way, evaluation plays an essential role after the creative process has ended: i.e., following a particular criterion, it provides elements to establish the value of an agent's output. In this way, we can distinguish two different goals for the same process: 1) to contribute to the development of a story in progress; 2) to estimate whether the system's output might be classified as creative. The work reported in this paper concentrates on the latter. From now onwards, we refer to a computer agent that is capable of assessing a product as an evaluator. The main motivation of this project is to provide MEXICA, our plot generator, with the capacity to evaluate its own outputs as well as to assess narratives generated by other agents that can be employed to enrich its knowledge base. We can summarise it as follows: MEXICA = plot generator + evaluator. What are the elements that need to be considered in a computer model of evaluation? In this work we present three. The following lines describe each of them.
1) A creative process generates at least two types of outputs: a final product (e.g. a solution to a problem, a poem, a story, a piece of music) and novel knowledge that expands the expertise of the creator. It is not possible to think of creativity without these two elements. Sometimes, authors engage in creative tasks with the main purpose of expanding their expertise in particular topics. For example, Picasso developed several sketches in preparation for painting El Guernica. Based on these observations, we claim that computerised creativity (c-creativity) occurs when, as a result of the creative process, an agent generates knowledge that does not explicitly exist in its original knowledge base and which plays an important role in the produced output (Pérez y Pérez and Sharples 2004); such novel knowledge becomes available within the agent's knowledge base for the generation of more original outputs (Pérez y Pérez, under revision). That is, an essential aim of creativity is the generation of expertise and experience that is useful for the creative process itself. We believe that the same principle can be applied during the assessment of a narrative. A computer model of evaluation must consider whether the evaluator, as a result of the assessment process, incorporates new knowledge structures into its knowledge base. This idea seems to echo the thoughts of some writers about the importance of reading. For instance, David Lodge claims that reading other authors is the best way to learn about the world and about the technical abilities required for writing (Lodge 1996). Thus, a good narrative allows discovering new perspectives on a given situation, new features that had not been seen before, novel ways of understanding a situation. In other words, it generates new knowledge in the reader. 2) The second aspect to be considered is related to the concept of story. Different authors agree that a story is defined as a sequence of actions that follows the classical Aristotelian structure: setup, conflict, complication, climax and resolution (e.g. see Bremond 1996; Clayton 1996, pp. 13-15). Usually, conflict is described as obstacles that oppose a more satisfactory state or desire. During complication, the difficulties introduced by the conflict grow, incrementing the tension produced in the reader, until the climax is reached. Then, all conflicts are sorted out, releasing the accumulated tensions. In other words, if one follows the Aristotelian concept of a story, a narrative must produce in the reader increments and decrements of dramatic tension. Thus, a computer model of plot evaluation must be able to recognise whether the events that comprise a narrative satisfy the Aristotelian requirements. In order to achieve this goal, one needs an agent capable of representing affective responses. (It is worth pointing out that, although in this work we adopt the Aristotelian view, there are other valid options for representing narratives.) 3) The third aspect considers that an agent must be able to determine whether the sequence of actions that comprises a story satisfies common sense knowledge. In sum, a computer model of plot evaluation requires a story to be evaluated, and an agent capable of transforming the sequence of actions that comprises the story into internal representations that allow detecting novel knowledge structures (cognitive changes), checking its coherence (common sense knowledge), and representing increments and decrements of the dramatic tension of the tale (affective responses).
In the same way, it is necessary to determine how these components influence each other. This type of model requires an agent's knowledge base that represents the experience of the evaluator: a structure is novel when it does not previously exist in this knowledge base, and the information necessary to evaluate coherence and the story's tension resides within this repository. Thus, different agents with different knowledge and beliefs should produce different evaluations of the same product. Even the same agent, if its knowledge base is modified, might produce different evaluations of the same product. The following lines describe a computer model for plot evaluation that subscribes to these ideas. It is built on top of the results we obtained from previous research on this topic. Related Work Ritchie (2007) suggests criteria for evaluating the products of a creative process (the process is not taken into consideration); such criteria use existing evaluations of typicality (and atypicality) and value to construct more complex criteria. Colton (2008) considers that skill, imagination and appreciation are characteristics that a computer model needs to be perceived to have (see also Pease et al. 2001). Jordanous (2012) employs a group of human experts to develop criteria for the evaluation of a computer-generated product, including characteristics like Spontaneity and Subconscious Processing, Value, Intention and Emotional Involvement, and so on; note, however, that her criteria need not be measured by human experts, as quantitative or automated tests could also be used. All these are interesting ideas, although some are too general and difficult to implement (e.g. see Pereira et al. 2005). Some work has been done on the evaluation of plot generation. Peinado et al. (2010) have also worked on the evaluation of stories, although their work was oriented towards assessing novelty. I am not aware of any model of plot generation that includes the characteristics of the present work. Our Plot Generator Our research in the generation and evaluation of narratives is based on the MEXICA agent (Pérez y Pérez and Sharples 2001; Pérez y Pérez 2007). We claim that, as a result of engagement-reflection cycles, our storyteller produces plots that are novel, coherent and interesting. MEXICA employs a dictionary of story-actions and a set of Previous Stories, both defined by the user as text files, to construct its knowledge base. Story-actions have an associated set of preconditions and postconditions that represent common sense knowledge. For example, the precondition of the action 'character A heals character B' is that B is injured or ill; otherwise, the action does not make sense. In MEXICA, a story is defined as a sequence of actions that follows this format: character performing the action, description of the action, object of the action (another character); for instance, 'the jaguar knight attacked the enemy'. The format allows some variations, e.g. only one character performing an action; for instance, 'the princes went to the forest'. We refer to this way of organising a narrative as MEXICA's format (a small sketch of this representation follows below).
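As a concrete illustration, here is a minimal Python sketch of MEXICA-format actions, (actor, verb, object), with simple precondition checking. The action names, state tags and example story are hypothetical, following the description above; the real MEXICA dictionary is richer.

# A sketch of MEXICA-format actions with simple precondition checking.
PRECONDITIONS = {"heals": {"injured", "ill"}}   # verb -> required object states

def check_story(actions, initial_states):
    """Verify each action's preconditions against evolving character states."""
    states = dict(initial_states)
    for actor, verb, obj in actions:
        required = PRECONDITIONS.get(verb)
        if required and states.get(obj) not in required:
            return False, (actor, verb, obj)     # incoherent: precondition unmet
        if verb == "attacks":
            states[obj] = "injured"              # postcondition updates the state
        if verb == "heals":
            states[obj] = "healthy"
    return True, None

story = [("jaguar knight", "attacks", "enemy"), ("princess", "heals", "enemy")]
print(check_story(story, {"enemy": "healthy"}))  # (True, None): preconditions hold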
The Previous Stories represent well-constructed narratives and provide information about how the story-world works. They represent the experience and knowledge of the agent. Any new story generated by MEXICA can be added to the Previous Stories. The Contextual Structures are the main representation of knowledge within the system. They associate emotional links and tensions between characters with logical actions to perform. For instance, a Contextual Structure might register that when a character A is in love with a character B (an emotional link between two characters), something logical to do is for A to buy flowers for B, or for A to serenade B, and so on. Contextual Structures are built from the set of Previous Stories; later, they are employed to generate new outputs during plot generation. Employing the same process, knowledge structures can be built from any new story created by the system or by any other agent (as long as the story follows MEXICA's format). Tensions represent conflicts between characters. When the number of conflicts grows, the value of the tension rises; when the number of conflicts decreases, the value of the tension goes down; when the tension is equal to zero, all conflicts have been solved. Thus, the storyteller keeps a record of the dramatic tension in the story. The following are examples of situations that trigger tensions: when the life of a character is at risk; when the health of a character is at risk; when a character is made a prisoner; and so on. Every tension is assigned a value, and each time an action is performed by a character the system calculates and records the value of all active tensions. With this information the storyteller is able to graph the curve of tension of the story. Such a curve is referred to as the Tensional Representation.
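To make the Tensional Representation concrete, here is a minimal Python sketch: each action may trigger or resolve tensions, every active tension contributes its value, and the running total after each action forms the tension curve. The particular tension types, values and example actions are hypothetical, chosen to echo the examples above.

# A sketch of the Tensional Representation: the running sum of the values
# of all active tensions, recorded after every story action.
TENSION_VALUES = {"life at risk": 30, "health at risk": 20, "prisoner": 15}

def tension_curve(actions):
    """actions: list of (description, tensions_triggered, tensions_resolved)."""
    active, curve = set(), []
    for _desc, triggered, resolved in actions:
        active |= set(triggered)
        active -= set(resolved)
        curve.append(sum(TENSION_VALUES[t] for t in active))
    return curve

story = [
    ("Jaguar knight attacked Virgin", ["health at risk"], []),
    ("Jaguar knight wounded Virgin",  ["life at risk"], []),
    ("Jaguar knight cured Virgin",    [], ["life at risk", "health at risk"]),
]
print(tension_curve(story))  # [20, 50, 0]: the tension rises to a climax, then releases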
[Figure 1. The Interpretation Process transforms a sequence of actions in a text format into a set of knowledge structures and affective reactions (dramatic tension). The figure shows a narrative represented as text (labelled Story 1: Virgin disliked Jaguar knight; Virgin laughed at Jaguar knight; Jaguar knight attacked Virgin; Virgin fought Jaguar knight; Jaguar knight wounded Virgin; Jaguar knight ran away; Jaguar knight went back to Texcoco Lake; Jaguar knight did not cure Virgin) being interpreted into knowledge structures and a tension graph.]

Once the interpretation has been performed, the agent has the necessary information to analyse the attributes of the story under assessment. Based on our previous work, we have selected a set of eight features, known as the story-characteristics, which are useful for evaluating a plot: opening, closure, climax, reintroducing complications, satisfaction of preconditions, repetition of sequences of actions, and two types of novel knowledge structures. Typically, they have a value ranging from zero to one, where one is the most desirable value. They represent knowledge structures and affective reactions. Details of the story-characteristics are given some lines ahead. In order to implement the model for assessing novelty it was necessary to choose a set of story-characteristics that were associated with the production of original plots; the same applies for the models for the evaluation of interestingness and coherence. Some story-characteristics are used in more than one of those systems. For instance, sorting out all the problems that characters have at the end of the story (correct closure of the narrative) is important for both the model of coherence and the model of interestingness; the generation of unusual situations (new knowledge structures) is important for the model of interestingness and the model of novelty; and so on.

It is possible to employ the three models mentioned above to obtain a global evaluation of a story. That is, given a plot, we can run the system that evaluates novelty, then the system that evaluates interestingness and lastly the system that evaluates coherence; finally, we can calculate the average result. However, this procedure has some flaws. As mentioned earlier, some story-characteristics are employed in more than one model. As a result, they might be overrepresented in the overall calculation, distorting the final value. In the same way, story-characteristics might be linked in ways that individual models cannot represent. For instance, one story might get a high score in novelty but a low score in coherence. However, it does not make sense to claim that a story is very original when it is unintelligible. A famous example of a similar situation is the sentence 'Colorless green ideas sleep furiously' (Chomsky, 1957); this sentence does not seem to mean anything coherent, but it sounds like an English sentence. Thus, it seems sensible to have one model for a general evaluation, where all story-characteristics can interact, rather than three individual ones.

Some of the story-characteristics, although useful, are not essential for a good plot. So, if they are present they help to enhance the story; if not, the story can still be a good narrative. We refer to such characteristics as Enhancers. For instance, if the problems of a character seem to be solved and out of the blue new conflicts arise (reintroducing complications), the plot might be considered as more exciting. This characteristic is not required to develop a good plot, but its presence helps. So, Enhancers add extra points to the evaluation. The use of Enhancers might be conditioned on the good results of other characteristics. For instance, if a given story is unoriginal, it does not make sense to consider it more interesting only because there is a reintroduction of complications. Following the same logic, the model contemplates the use of Debasers, i.e. story-characteristics that, when they are missing, decrement the global evaluation of a plot by some points.

In our previous models the relationships of the story-characteristics were defined by expressions like the following:

E = C1·W1 + C2·W2 + C3·W3 + ... + Cn·Wn

where E represents the result of the evaluation, C one of the characteristics to be assessed, and W its weight. However, this expression lacks flexibility. For example, it is not possible to represent conditioned Enhancers or Debasers. In the same way, some characteristics might play a more relevant role during one stage of the assessment than during others. For example, a story must be lucid; otherwise, it is not worth evaluating the plot. So, at this point those characteristics associated with coherence have a high priority for the evaluation process. However, once this requirement is satisfied, other characteristics start to take precedence. To illustrate this situation, the reader can picture a logical story that is boring, i.e. one that lacks increments and decrements of tension. In this case, those characteristics associated with interestingness become more relevant for the evaluation process. As a result, the global assessment would probably produce a low value even if the coherence is pretty good. The model also considers what we refer to as the compensation effect. In the overall evaluation, highly rated characteristics might compensate for those with lower grades by adjusting their weights. For example, picture a story that shows exceptionally original situations; even if the plot suffers from some coherence problems, the overall rate might still be pretty high.

Description of the Story-Characteristics

The following lines describe the story-characteristics that I employ in this work and how to calculate their values.

Opening: We consider that a story has a correct opening when at the beginning there are no active dramatic tensions in the tale and then the tension starts to grow. If at the beginning of the story the value of the tension is zero, then Opening is set to one; if at the beginning of the story the value of the tension is equal to the main peak (the climax), then Opening is set to zero; otherwise, Opening is set to a proportional value between zero and one.

Opening = 1 − (Tension at the first action / Peak)

Closure: We consider that a story has a correct closure if all the dramatic tensions in the story are solved when the last action is performed. That is, following Pérez y Pérez and Sharples, 'a story should display an overall integrity and closure, for example with a problem posed in an early part of the text being resolved by the conclusion' (Pérez y Pérez and Sharples 2004). If at the end of the story the value of the tension is equal to the main peak (the climax), then Closure is set to zero; if at the end of the story the value of the tension is equal to zero (all problems are solved), then Closure is set to one; otherwise, Closure is set to a proportional value between zero and one.

Closure = 1 − (Tension at the last action / Peak)
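Read as code, the two definitions above amount to normalising the tension at the story's first and last actions by the climax value. A minimal sketch in Python (our illustration, not the paper's implementation; tensions is the per-action record of active tension values described earlier):

    def opening(tensions):
        # Opening = 1 - (tension at the first action / peak)
        peak = max(tensions)
        return 1.0 if peak == 0 else 1.0 - tensions[0] / peak

    def closure(tensions):
        # Closure = 1 - (tension at the last action / peak)
        peak = max(tensions)
        return 1.0 if peak == 0 else 1.0 - tensions[-1] / peak

When the peak is zero the story has no active tensions at all; how MEXICA scores that degenerate case is not stated in the paper, so the fallback value here is an assumption.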
Climax: All stories should include a climax. In the graphic of tensions the climax is represented by the highest peak. However, a story with an incipient peak is not the same as a story with a clearly elevated crest. In order to evaluate the peak, MEXICA calculates the average value of the climaxes of all Previous Stories and employs it as a reference. Thus, if the peak's value is equal to or greater than the reference, then Climax is set to one; if there is no peak, then Climax is set to zero; otherwise, it is set to a proportional value between zero and one.

Climax = Current climax / Reference value of climax; if Climax > 1 then Climax = 1

Reintroducing Complications: We refer to the situation where a narrative has a resolution and then tensions start to rise again as reintroducing complications. In this work, we appreciate narratives that seem to end and then new problems for the characters emerge, i.e. where all tensions are solved and then they rise again. This formula can be observed in several kinds of narratives, such as films, television series and novels. MEXICA calculates the average number of complications that are reintroduced in the Previous Stories and employs it as a reference. Thus, if the number of times that the current story reintroduces complications is equal to or greater than the reference, then Reintroducing Complications is set to one; if there is no reintroduction of complications, then Reintroducing Complications is set to zero; otherwise, it is set to a proportional value between zero and one.

Novel Contextual Structures: In this work a new story generates new knowledge when it generates structures that did not exist previously in the knowledge base of the system and that can be employed to build novel narratives. Each action within a plot has the potential of introducing an unknown context for the agent. So, if all actions that comprise the story under evaluation generate unknown contexts, then Novel Contextual Structures is set to one; if none of the actions produce an unknown context, then Novel Contextual Structures is set to zero; otherwise, Novel Contextual Structures is set to a proportional value between zero and one.

Original Value: Besides calculating the number of novel contextual structures, it is necessary to determine how original they are with respect to the information that already exists in the knowledge base. With this purpose we define a parameter known as the Limit of Similitude (LS) that represents the maximum percentage of alikeness allowed between two knowledge structures. If the percentage of similitude between a given Contextual Structure and all structures in the knowledge base is below LS, we refer to such a Contextual Structure as original. In this way, we can distinguish between merely novel situations and really original ones. Thus, the Original Value is equal to the ratio between the total number of original structures and the total number of contexts produced by the tale.

Preconditions: All actions have associated preconditions that represent common-sense knowledge. If the preconditions of all story actions are fulfilled, then Preconditions is set to one; if none of the preconditions of the story actions are fulfilled, then Preconditions is set to zero; otherwise, it is set to a proportional value between zero and one.
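Several of the characteristics above share one normalisation pattern: divide the story's value by a reference derived from the Previous Stories and cap the result at one. A sketch of that shared shape (a hypothetical helper; the paper defines the behaviour, not this code):

    def normalised(value, reference):
        # e.g. Climax = current climax / average climax of the Previous Stories;
        # the same shape serves Reintroducing Complications.
        if reference <= 0:
            return 0.0
        return min(value / reference, 1.0)

Original Value then follows directly as a ratio: the number of Contextual Structures whose similitude to every existing structure stays below the Limit of Similitude, divided by the total number of contexts the tale produces.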
Repetition of Sequences: There are some attributes that contribute to a lack of coherence in a plot. The repetition of sequences of actions performed by the same characters illustrates this situation. We include this feature to show some of the problems that computer-generated narratives might suffer from. Thus, in this implementation, Repetition of Sequences is set to one when there are no repetitions; otherwise, it is set to zero.

The Three Layers

The model described in this paper represents evaluation as a process organized in three layers (see Figure 2). Layer-0 includes those characteristics that a plot must satisfy in order to be considered for evaluation. These characteristics do not add points to the evaluation; they are requirements that need to be satisfied in order to proceed to evaluate the plot. Otherwise, the process is ended. They are known as the required-characteristics. Layer-1 includes what I refer to as the core-characteristics. They are the backbone of the evaluation process and represent those essential features that form a plot. Layer-2 includes what I refer to as the Enhancers and the Debasers. Enhancers are characteristics that add extra points to the result obtained from the previous layer. Debasers represent features that decrement the result obtained from Layer-1. Their use might be conditioned on the result of other story-characteristics.

[Figure 2. The three-layers evaluation model.]

A story-characteristic can be employed in more than one layer. Action preconditions illustrate this situation: it is not worth evaluating an unintelligible story (Preconditions in Layer-0); however, a mainly sound story with a few inconsistencies might only be penalized with some negative points (Preconditions in Layer-2). The following lines provide details about the implementation.

Layer-0: In the current implementation, the number of Fulfilled Preconditions and the number of Novel Contextual Structures are selected as the required-characteristics. If most actions within a story have unfulfilled preconditions, or the story under evaluation is too similar to any of the previous stories, then the system considers that it is not worth evaluating the plot. The user provides the minimum rates that the story-characteristics Fulfilled Preconditions and Novel Contextual Structures must reach to continue with the evaluation process.

Layer-1: In the current implementation, the following elements have been selected as the core-characteristics: Climax, Closure and Novel Contextual Structures. All of them have been assigned the same weight. These characteristics have been chosen because a narrative without a climax is not a story; Closure is important to keep the coherence and interestingness of the tale; and novelty is an essential feature of any story. The result of the evaluation in Layer-1 is the average value of the three core-characteristics.

Layer-2: In the current implementation, Preconditions and Repeated Sequences have been chosen as Debasers. They represent features that we take for granted; however, if they are missing within a narrative we immediately notice them. Thus, if they have a value lower than a reference provided by the user, the result of the evaluation obtained in Layer-1 is decremented by n units, where n is a parameter defined by the user.

IF Preconditions < Reference-Preconditions THEN Decrement-Result-Evaluation-1
IF Repetition-Sequences < Reference-RS THEN Decrement-Result-Evaluation-1
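The layered control flow described so far can be sketched as follows (Python; the dictionary keys, default parameter values and the return value for rejected stories are illustrative, not from the paper):

    def evaluate(c, min_prec=0.7, min_novel=0.35, ref_prec=0.7, n=0.1):
        # c maps characteristic names to their values in [0, 1].
        # Layer-0: the required-characteristics gate the whole process;
        # note that Preconditions reappears in Layer-2 with its own reference.
        if c["preconditions"] < min_prec or c["novel_contexts"] < min_novel:
            return None  # not worth evaluating the plot
        # Layer-1: plain average of the equally weighted core-characteristics
        result = (c["climax"] + c["closure"] + c["novel_contexts"]) / 3
        # Layer-2: Debasers decrement the Layer-1 result by n units
        if c["preconditions"] < ref_prec:
            result -= n
        if c["repetition"] < 1.0:  # any repeated sequence of actions
            result -= n
        return result

The Enhancers, described next, would be applied at the end of Layer-2 under their own conditions.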
The following characteristics have been chosen as Enhancers: Opening, Reintroducing Complications and Original Value. Thus, if they have a value higher than a reference provided by the user, then the result of the evaluation obtained in Layer-1 is incremented by m units, where m is a parameter defined by the user. Enhancers are only employed when there is no repetition of sequences of actions and when the evaluation in Layer-1 and the Closure reach a minimum value defined by the user.

IF (Repetition-Sequences = 1) AND (Result-Layer-1 > Reference-L1) AND (Closure > Reference-Closure) THEN
BEGIN
  IF Opening > Reference-Opening THEN Increment-Result-Evaluation-1;
  IF Reintroducing-Complications > Reference-RC THEN Increment-Result-Evaluation-2;
  IF Original-Value > Reference-OV THEN Increment-Result-Evaluation-3;
END

As a final step, the evaluator generates a report to explain the criteria employed during the process of evaluation. The report is divided into four sections: section one includes a general comment about the whole narrative; section two provides observations about the story's coherence; section three incorporates notes about the story's interestingness; and section four offers comments about the narrative's novelty. The report is generated by matching the value of some of the story-characteristics with predefined texts. In general, there are at least five possible options that can be employed for each such story-characteristic.

IF Value-Story-Characteristic > 0.9 THEN Employ-Text-1
ELSE IF Value-Story-Characteristic > 0.8 THEN Employ-Text-2
ELSE IF Value-Story-Characteristic > 0.7 THEN Employ-Text-3
ELSE IF Value-Story-Characteristic > 0.6 THEN Employ-Text-4
ELSE Employ-Text-5;

The following lines describe the way each section is built.

Section one. The system employs the final result of the evaluation process (the output of Layer-2) to select the right text.

Section two. The coherence section includes three types of comments: one associated with the satisfaction of preconditions, one related to the right closure, and the last one connected to the repetition of sequences of actions. The first two types of comments are always printed; the last type of comment is omitted when the tale does not include repeated sequences of actions. Thus, the system employs the story-characteristics Preconditions, Closure and Repetition of Sequences to generate the text.

Section three. The interestingness section includes five types of comments, each one related to one of the following story-characteristics: Opening, Climax, Reintroducing Complications, Closure and Original Value. The first two comments are always included in the report, while the last three comments are only printed when some requirements are satisfied. The next lines explain the conditions that need to be satisfied in order to incorporate the last three remarks into the report. If the story-characteristic Climax ≥ 0.7 then the system adds comments about the closure. This makes sense because the climax represents the conflicts in the story and the closure indicates how those conflicts are sorted out. If the story-characteristic Closure ≥ 0.7 then comments regarding the original value are inserted in the report. That is, the system only includes comments about singular features of the plot when it has an adequate ending; in the current implementation originality loses importance when the story has a bad finale. If the story-characteristic Closure ≥ 0.7 and the Reintroduction of Complications ≥ 0.75 then the system inserts some comments about the reintroduction of complications in the report. In this case, besides considering the closure, the system requires that the story includes a clear instance of the reintroduction of complications. Otherwise, there is no point in making comments about this feature. All these parameters can be modified by the user.

Section four. The novelty section includes comments about the originality of the story. The system selects the appropriate text depending on the value of the story-characteristic Novel Contextual Structures.
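The threshold-to-text matching for the report can be pictured as a simple cascade, mirroring the IF/ELSE chain above (a sketch; pick_text and the texts list are hypothetical names, and the predefined texts themselves belong to the system):

    def pick_text(value, texts):
        # texts[0] is the most positive comment, texts[4] the most negative
        for i, threshold in enumerate((0.9, 0.8, 0.7, 0.6)):
            if value > threshold:
                return texts[i]
        return texts[4]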
Testing the Model

To test the model we evaluated three stories: two generated by MEXICA and one generated by another storyteller. In Layer-0 we established the following conditions to continue with the evaluation process: Preconditions > 0.7 and Novel Contextual Structures > 0.35. In Layer-2 we established the following requirements for the Debasers:

IF Preconditions < 0.7 THEN Decrement-Result-Evaluation-in-2-points;
IF Repetition-Sequences < Reference-RS THEN Decrement-Result-Evaluation-in-3-points;

In Layer-2 we established the following requirements for the Enhancers:

IF (Repetition-Sequences = 1) AND (Result-Layer-1 ≥ 0.7) AND (Closure > 0.75) THEN
BEGIN
  IF Opening = 1 THEN Increment-Result-Evaluation-in-0.5-points;
  IF Reintroducing-Complications > 0.8 THEN Increment-Result-Evaluation-in-1-point;
  IF Original-Value > 0.5 THEN Increment-Result-Evaluation-in-1.5-points;
END

The values of the parameters are the result of several tests we have performed.

Story 1. This story was developed by MEXICA-impro and reported in (Pérez y Pérez et al. 2010).

Jaguar knight is introduced in the story
Princess is introduced in the story
Hunter is introduced in the story
Hunter tried to hug and kiss Jaguar knight
Jaguar knight decided to exile Hunter
Hunter went back to Texcoco Lake
Hunter wounded Jaguar knight
Princess cured Jaguar knight
Enemy kidnapped Princess
Enemy got intensely jealous of Princess
Enemy attacked Princess
Jaguar knight looked for and found Enemy
Jaguar knight had an accident
Enemy decided to sacrifice Jaguar knight
Hunter found by accident Jaguar knight
Hunter killed Jaguar knight
Hunter committed suicide

The following lines show the values of the story-characteristics: Preconditions: 1; Opening: 1; Closure: 0.6; Climax: 1; Novel Contextual Structures: 0.71; Original Value: 0.71; Repeated Sequences: 1; Reintroducing Complications: 0; Result-Layer-1: 0.77.

Figure 3 shows the graphic of tension of story 1. Because the Closure did not reach the value of 0.75, the evaluator decided not to employ the Enhancers.

[Figure 3. Tensional Representation of story 1.]

The following lines produced by the agent provide the reasons for the final result:

EVALUATION OF THE STORY This is a good effort. With more practice you will be able to create nice plots. Here are some comments about your work that I hope will be a useful feedback.
COHERENCE The story is very logical; all actions are nicely integrated and form a coherent unit. It requires that all complications that characters faced are sorted out by the end of the last part. You need to pay more attention to this aspect.
INTERESTINGNESS The text has a good introduction. The story reaches a nice climax with a good amount of tension. This is an important characteristic of a good narrative. Great! Sadly, the bad closure damages the interestingness of a story.
NOVELTY The plot is kind of inventive.
My evaluation of your story is ->77/100
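As a cross-check (our arithmetic, not spelled out in the paper), the reported Result-Layer-1 for story 1 is just the average of the three core-characteristics:

    Result-Layer-1 = (Climax + Closure + Novel Contextual Structures) / 3
                   = (1 + 0.6 + 0.71) / 3 ≈ 0.77

Since Closure (0.6) is below the 0.75 threshold, the Enhancers are skipped, and neither Debaser fires (Preconditions = 1 and Repeated Sequences = 1), so the final score is 77/100.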
Story 2. This story was produced by MEXICA for this paper.

Virgin disliked Jaguar knight
Virgin laughed at Jaguar knight
Jaguar knight attacked Virgin
Virgin fought Jaguar knight
Jaguar knight wounded Virgin
Jaguar knight ran away
Jaguar knight went back to Texcoco Lake
Jaguar knight did not cure Virgin
Tlatoani was an inhabitant of the Great Tenochtitlán
Tlatoani and Jaguar knight were rivals
Tlatoani fought Jaguar knight
Jaguar knight ran away
Jaguar knight went back to Texcoco Lake
Jaguar knight did not cure Virgin

The following lines show the values of the story-characteristics: Preconditions: 1; Opening: 0.8; Closure: 0.28; Climax: 1; Novel Contextual Structures: 0.86; Original Value: 0.86; Repeated Sequences: 0; Reintroducing Complications: 1; Result-Layer-1: 0.71.

Figure 4 shows the graphic of tension of story 2. The story has a really bad Closure; however, the good Climax and the relatively good result for Novel Contextual Structures push up the result in Layer-1. However, repeated sequences are heavily punished (the succession of actions 6, 7 and 8 is repeated at the end of the tale) and therefore the evaluator decrements the final result by 3 points.

[Figure 4. Tensional Representation of story 2.]

The following lines show the report explaining the evaluation process:

EVALUATION OF THE STORY Sorry, but this story is not good. Here are some comments about your work that I hope will be a useful feedback.
COHERENCE The story is very logical; all actions are nicely integrated and form a coherent unit. Unfortunately, there are several loose ends that need to be worked out (it reminds me of the really bad end of the TV show Lost). As a result the plot lacks an adequate conclusion, an important characteristic of a good narrative. You are repeating sequences of actions; as a consequence the plot is confusing!
INTERESTINGNESS The plot starts with some tension. The story reaches a nice climax with a good amount of tension. This is an important characteristic of a good narrative. Great! Sadly, the bad closure damages the interestingness of a story.
NOVELTY I find this story pretty original! I love it!
My evaluation of your story is ->41/100

Notice the last sentence of the report. Because the Original Value got a high rate, the evaluator includes this sentence. It is necessary to correct this problem.
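The repetition that is punished here (actions 6, 7 and 8 recurring at the end of story 2) is easy to detect mechanically. A minimal sketch, assuming a fixed window of three actions (the paper does not state the window size MEXICA uses):

    def has_repeated_sequence(actions, k=3):
        # actions: list of (character, action, object) tuples
        seen = set()
        for i in range(len(actions) - k + 1):
            window = tuple(actions[i:i + k])
            if window in seen:
                return True
            seen.add(window)
        return False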
Story 3. This story was produced by MINSTREL (Turner 1993, p. 622). The original tale narrates the story of a knight, known as Lancelot, who was hot-tempered. Andrea was a lady of the court, and one day she went to the woods to pick berries. By accident, Lancelot found Andrea in the woods and he fell in love with her. Some time later, Lancelot found Andrea in the woods again, and he saw that she was kissing another knight, known as Frederik. So, Lancelot thought Andrea was in love with Frederik and got really jealous; he killed Frederik. Andrea told Lancelot that Frederik was her brother. Lancelot hated himself and became a hermit; Frederik was buried in the woods and Andrea became a nun.

In the following lines we show the same narrative as a MEXICA plot:

Lady and Eagle Knight were brothers
Lady went to Chapultepec Forest
Jaguar knight found by accident Lady
Jaguar knight was very impressed by Lady
Jaguar knight fell in love Lady
Lady went to Tlatelolco Market with Eagle Knight
Jaguar knight found by accident Lady
Jaguar knight got intensely jealous of Eagle knight
Jaguar knight attacked Eagle knight
Jaguar knight killed Eagle knight
Jaguar knight realised that Lady and Eagle Knight were brothers
Jaguar knight hated Jaguar Knight
Jaguar knight exiled Jaguar knight

We transformed this narrative by trying to find actions in MEXICA's dictionary similar to those described in the original tale. The following lines show the values of the story-characteristics: Preconditions: 1; Opening: 1; Closure: 0.75; Climax: 0.8; Novel Contextual Structures: 0.54; Original Value (surprise): 0.54; Repeated Sequences: 1; Reintroducing Complications: 0; Result-Layer-1: 0.70.

Figure 5 shows the graphic of tension of story 3. In this case, it is possible to employ the Enhancers, and as a result the evaluation reaches the value 0.9. This happens because the opening and the original value contribute two points.

[Figure 5. Tensional Representation of story 3.]

The following lines show the report explaining the evaluation process:

EVALUATION OF THE STORY This is a good story. Great! Soon you will become a real writer. Here are some comments about your work that I hope will be a useful feedback.
COHERENCE The story is very logical; all actions are nicely integrated and form a coherent unit. At the end there are still some tensions that are not solved; it would help to the coherence and interest of the narrative if characters worked them out by the conclusion. I recommend you to avoid repeating actions (e.g. Jaguar knight found by accident the Lady).
INTERESTINGNESS The text has a good introduction. The climax of the story is good, although for my taste I would prefer a little extra tension. A better end would contribute to have a more interesting tale. There are surprising events that make the story appealing. I enjoyed that!
NOVELTY The plot is kind of inventive.
My evaluation of your story is ->90/100

Discussion and Conclusions

This paper reports a computer model for plot evaluation. The model is based on the idea that affective reactions and the generation of new knowledge are important characteristics of plot evaluation. It requires a story and a process that allows a sequence of actions to be transformed into structures that the agent can manage. In this way, it is possible to evaluate any story produced by any agent, as long as the narrative fulfils the constraints of the format. I refer to the process of transforming a sequence of actions into structures that represent knowledge and affective reactions as Interpretation. This work shows the importance of interpretation and its role during evaluation. If a group of agents share similar interpretations, and similar knowledge structures and beliefs (knowledge bases), they will probably produce similar evaluations. Otherwise, they will generate different outputs, maybe even contradictory ones. The three layers provide a flexible way to work with the story-characteristics. They allow giving different weights to some features during one stage of the assessment than during others; employing what we refer to as the compensation effect; conditioning the use of the Enhancers and Debasers; and so on.
The work reported in this paper is based on an Aristotelian view of what a story is. Under this framework, the model proposes a way to understand how the evaluation process might work. However, it is well known that there are other valid approaches to building, and therefore to assessing, interesting narratives. Unfortunately, it is not yet possible to develop a model that comprises all of them. Evaluation is a very complex task and we are far from understanding it. So, it makes sense to develop achievable programs and then start to build on top of them. Hopefully, in a few years we will be able to incorporate different approaches into our system. In the current model there are several aspects that need to be revised. For instance, it is necessary to represent features like suspense, flashbacks, and so on. Similarly, it is necessary to incorporate mechanisms that allow the system to manipulate in more creative ways the structures that are already represented; e.g. we would like to provide the evaluator with the capacity of explicitly leaving unsolved conflicts as part of an interesting closure within a narrative (when this resource is properly employed it has very positive effects on the reader). So, there is much work left to be done.

Some colleagues seem to be concerned about some characteristics of this work. Their main objection has to do with the fact that 'the implementation of the used metrics is based on features certainly not present in all plot generation systems' (anonymous reviewer). There is a misunderstanding here. Our model evaluates plots; we do not necessarily care about the characteristics of the storyteller. That is, the system assesses the features present in the narrative, not in the program that generated it. So, we do not see a problem here. Nevertheless, this research has clearly been developed around our storyteller. The main goal of this project is to provide MEXICA with the capacity of evaluating its own outputs. As explained earlier, the system can also evaluate a plot produced by any other agent as long as it is represented as text with the following format: character performing the action, description of the action, object of the action (another character). (It is also necessary that all story actions employed in the plot are declared in the dictionary of the system.) That is the scope of our model. It is necessary to consider that some plot-generators might produce outputs in the MEXICA format that include features that cannot be interpreted by our system and therefore cannot be included as part of the assessment (e.g. suspense). So, in these cases the evaluation performed by our model might be considered incomplete.

Can this model be employed in other domains? We believe that the answer is yes. The model requires a product to be evaluated and a way to interpret such a product, i.e. a mechanism to perceive its relevant characteristics. The three layers provide a flexible method to organise and analyse such characteristics. As a result of the evaluation process the agent incorporates new structures into its knowledge base and represents affective responses. We believe that all these essential features of our model apply in other areas like, for instance, visual composition. Hopefully, this document will encourage some researchers to test the model in novel areas.

Acknowledgements

This research was sponsored by the National Council of Science and Technology in México (CONACYT), project number 181561.
2014_28 !2014 Poetic Machine: Computational Creativity for Automatic Poetry Generation in Bengali

Amitava Das, Department of Computer Science and Engineering, University of North Texas, Denton, Texas, USA, amitava.santu@gmail.com
Björn Gambäck, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway, gamback@idi.ntnu.no

Abstract. The paper reports an initial study on computational poetry generation for Bengali. Bengali is a morpho-syntactically rich language and partially phonemic. The poetry generation task has been defined as follow-up rhythmic sequence generation based on user input. The design process involves rhythm understanding from the given input and follow-up rhyme generation by leveraging syllable/phonetic mapping and natural language generation techniques. A syllabification engine based on grapheme-to-phoneme mapping has been developed in order to understand the given input rhyme. A Support Vector Machine-based classifier then predicts the follow-up syllable/phonetic pattern for the generation, and candidate words are chosen automatically, based on the syllable pattern. The final rhythmic poetical follow-up sentence is generated through n-gram matching with weight-based aggregation. The quality of the automatically generated rhymes has been evaluated according to three criteria: poeticness, grammaticality, and meaningfulness.

Introduction

Cognitive abilities can be divided into three broad categories: intelligence, aesthetics, and creativity. Suppose someone has read a sonnet by Shakespeare and is asked the following questions:

Do you understand the meaning of this sonnet? If the reader says yes, s/he has used her/his intelligence together with knowledge of the English language and world knowledge to understand it.

Do you like this sonnet? Whatever the answer is, the reader is using a subjective model of liking, and this is what is called aesthetic appreciation or sentiment.

Can you add two more lines to this sonnet? Now the reader has to write some poetry and has to use her/his creative ability to do it.

Artificial Intelligence is by now a research field that has matured over six to seven decades. The majority of the research efforts until now have concentrated on the understanding of natural phenomena. During the latest two decades, we have witnessed a huge rise of research attention towards affect understanding, that is, the second level of cognition. However, there have so far been rather few attempts at making machines truly creative. The paradigm of computational creativity is actually still in its infancy, and most of those efforts that have been carried out have concentrated on music or art. Still, computer systems have already made some novel and creative contributions in the fields of mathematical number theory (Colton 2005; Colton, Bundy, and Walsh 2000) and chess opening theory (Kaufman 2012). In this paper, in contrast, we look at computational linguistic creativity, and in particular poetry generation. Computational linguistic creativity has only in the last few years received more wide-spread interest from language technology researchers. A book on linguistic creativity was recently written by Veale (2012), and in particular the research group at Helsinki University is very active in this domain (Toivanen et al. 2012; Gross et al. 2012; Toivanen, Toivonen, and Valitutti 2013; Toivanen, Järvisalo, and Toivonen 2013). Some other interesting research attempts have also been made (e.g., Levy 2001; Colton, Goodwin, and Veale 2012), but the approaches still vary widely.
The field of automatic poetry generation was pioneered by Bailey (1974), although Funkhouser (2009) quotes work going back to the 1950s. These systems were written by actual poets who were keen to explore the potential of using computers in writing poetry, and were not fully autonomous. Thereafter, Gervás and his colleagues were the first to discuss sophisticated approaches to automatic poetry generation (Gervás 2000; 2001a; 2001b; 2002a; 2002b; Díaz-Agudo, Gervás, and González-Calero 2002; Gervás et al. 2007). Gervás' work established the possibility of automatic poetry generation and has in the last decade been followed by a moderate number of attempts at linguistic creativity and in particular at automatic poetry generation.

The system developed by Manurung (2004) uses a grammar-driven formulation to generate metrically constrained poetry out of a given topic. In addition to its scientific novelty, the work defined the fundamental evaluation criteria of automatic poetry generation: meaningfulness, grammaticality, and poeticness. A complete poetry generation system must generate texts that adhere to all three of these properties. An alternative approach to evaluation would be to adopt the criteria specified by Ritchie (2001; 2007) for assessing the novelty and quality of creative systems in general based on their output.

All these previous efforts were points of inspiration for the present work, but as we are unable to conclude which method performs best, we decided to propose a new architecture following the rules and practices of Bengali poems and writings. There is no previous similar work in Bengali, nor on other Indian languages, except attempts at automatic analysis and generation of Sanskrit Vedas (Mishra 2010) and at automatic Tamil lyric generation (Ramakrishnan A, Kuppan, and Devi 2009; Ramakrishnan A and Devi 2010).

The basic strategy adopted here is not to try to make the system create poetry on its own, but rather in collaboration with the user; and not a complete poem, but rather one poetry line at a time. The user enters a line of poetry and the system generates a matching, rhyming line. This task in turn involves two subtasks: rhyme understanding and rhyme generation. Rhyme understanding entails parsing the input line to understand its poetic structure. Rhyme generation is based on the usage of a Bengali syllabification engine and a Support Vector Machine (SVM) based classifier for predicting the structure of the output sentence and candidate word generation, combined with bigram pruning and weighted aggregation for the selection of the actual words to be used in the generated rhyming line.

The rest of the paper is laid out as follows: To give an understanding of the background, we first discuss the Bengali language as such and the different rhythms and metres that are used in Bengali poems. Thereafter the discussion turns to the chosen methods for poetry line understanding and generation, starting by giving details of a corpus of poems collected for rhyme understanding, and then in turn describing the rhyme understanding and the rhyme generation tasks, and their respective subparts. Finally, an evaluation of the poetry generation model is given, in terms of the three dimensions poeticness, grammaticality, and meaningfulness.

Bengali and Bengali Poetry

Bengali (ethnonym: Bangla) is the seventh largest language worldwide in terms of speakers. It originates from Sanskrit and belongs to the modern Indo-Aryan language family. Bengali is the second largest language in India and the national language of Bangladesh.
Bengali poetry has had a vibrant history since the 10th century, and modern Bengali poetry inherited its basic ground from Sanskrit. The first non-European Nobel Literature Laureate, Rabindranath Tagore (1861-1941), known mainly for his poems, was the pioneer who laid the firm basis of modern Bengali poetry.

Bengali Orthography and Syllable Patterns

Bengali, just as all modern Indo-Aryan languages derived from Sanskrit, is partially phonemic. That is, its pronunciation style depends not only on orthographic information, but also on Part-of-Speech (POS) information and semantics. Partially phonemic languages use writing systems that are in between strictly phonemic and non-phonemic. Bengali and many other modern Indo-Aryan languages still use Sanskrit orthography, although the sounds and the pronunciation rules have changed to varying degrees.

The modern Bengali script contains characters (known as akṣara) for seven vowels (/i/, /u/, /e/, /o/, /a/, /O/, /a/), four semi-vowels (/j/, /w/, /e̯/, /o̯/), and thirty consonants. Many diphthongs are possible, although they must always contain one semi-vowel, but only two of the diphthongs are represented directly in the script (i.e., have their own akṣara): /oi/ and /ou/. All vowels can be nasalized, and vowel deletion (e.g., schwa deletion) is common, particularly in word-medial and word-final positions.

A phonetic group of Bengali consonants is called a borgo. As we shall see below, these groups are particularly important in poetic rhymes. There are five basic borgos in Bengali and four separate pronunciation groups, as shown in Table 1, where each consonant is displayed together with its pronunciation in the International Phonetic Alphabet (IPA). Many consonant sounds can be either unaspirated or aspirated. The first five borgos are named according to their first character. In each borgo, the first consonant takes the least stress when pronounced and the last takes the highest stress. The first member is thus called less-stressed (alpo-prāṇ), the second to fourth members are called high-stressed (maha-prāṇ), and the fifth and last is a nasal (nasik).

Following the classification of Sarkar (1986), Bengali has 16 canonical syllable patterns, but CV (consonant-vowel) syllables constitute 54% of the whole language (Dan 1992). Patterns such as CVC, V, VC, VV, CVV, CCV, and CCVC are also reasonably frequent. For more detailed recent overviews of Bengali phonetics, we refer the reader to, for example, Sircar and Nag (2014), Barman (2011) or Kar (2009), and just take the examples below of Bengali orthography, originally devised by Chatterji (1926), to illustrate how it has deviated from the strictly phonemic orthography of Sanskrit.

Consonant clusters are often pronounced as geminates irrespective of the second consonant. Thus: bAkya /bakko/, bakSha /bOkkho/, bismaYa /biSSOe/.

Single grapheme for multiple phonemes: The vowel [e] is pronounced as either /e/ or /a/. The ambiguity cannot be resolved by the phonological context alone, as the etymology is often the underlying reason. For example: eka /ak/, but megha /megh/.
[Table 1: Bengali borgo-phonetic groups. The table lists the five basic borgos (the k-, c-, ṭ-, t- and p-borgos, each named after its first consonant, with their unaspirated, aspirated and nasal members) and the four separate pronunciation groups (the internal, warm, scolding and parasitic sounds), each consonant given with its IPA pronunciation. The Bengali characters could not be reproduced here.]

[a] is pronounced as /o/ word-medially or word-finally in specific contexts: nagara /nOgor/, bakra /bOkro/.

Vowel harmony or vowel height assimilation: [a] and [e] are pronounced as /o/ resp. /e/ if followed by a high vowel (/u/ or /i/): patha /pOth/, but pathika /pothik/; ekaTA /aka/, but ekaTu /eku/.

Schwa deletion: [a] is deleted from word-final or word-medial open syllables under specific conditions dependent on phonotactic constraints and etymology. For example: AmarA /amra/, darbAra /dOrbar/.

Metres and Rhythms in Bengali

Bengali poetry has three basic and common metres: akṣara-vṛtta, matra-vṛtta, and svara-vṛtta. The first two were inherited from Sanskrit, while the third is more genuinely Bengali. However, before Tagore popularized it, the svara-vṛtta was used mainly for nursery rhymes and not really recognised as a serious poetic metre. The matra-vṛtta and svara-vṛtta metres are based on the length of the vowels. The akṣara-vṛtta metre is, in contrast, in Sanskrit based on the number of letters in a line (akṣara is the Sanskrit letter); however, in Bengali poetry the number of syllables is counted rather than the number of letters. The letters a, i and u are counted as being of one unit (matra) each, that is, a short vowel (mora), while e, ai, o, and au are counted as being two units each, that is, a long vowel (macron). Furthermore, at the end of a line a short vowel may be counted as a long one.

The concepts of open and closed syllables are also central to Sanskrit prosody and poetry: open syllables are those ending with a vowel sound, while those ending without vowels are called closed. In Bengali, a syllable is considered as being one or two units long depending on its position in a line, rather than on whether it is open or closed. If a line begins with a closed syllable, the syllable is counted as one unit, but if it occurs at the end of a line it is counted as two units. In the matra-vṛtta metre, the position of closed syllables does not matter; they are always counted as two units. In a similar fashion, in the svara-vṛtta each vowel (svara) is counted as one unit, regardless of whether the syllables are open or closed.

There are three types of rhymes in Sanskrit poetry, depending on whether the rhyme is on the first syllable of each line (adiprasa), on the second syllable (dviteeyakshara prasa), or on the final syllable of the line (antyaprasa). The most important rhyme for our purposes is antyaprasa, which is known as tail-rhyme or end-alliteration in English, and as anto-mil in Bengali poetry.

There are many overviews and in-depth analyses of the metres and rhythms of Bengali poetry written in Bengali, but fairly few available in English; the reader is referred to Arif (2012), or the writings of Aurobindo (2004), which give a more poetic angle. Here, we will concentrate on poems written in matra-vṛtta metre with anto-mil rhyme, as these poems are relatively easy to understand and generate.
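Under the matra-vṛtta rules just described, unit counting reduces to a per-syllable lookup: closed syllables always count as two units, while open syllables count as one unit for a short vowel and two for a long one. A sketch under those assumptions (the (vowel_length, is_closed) representation is ours; actual syllabification is handled by the G2P engine described below):

    def matra_count(syllables):
        # syllables: list of (vowel_length, is_closed) pairs,
        # e.g. [("short", False), ("long", True)]
        total = 0
        for vowel_length, is_closed in syllables:
            if is_closed:
                total += 2                     # closed: always two units
            else:
                total += 1 if vowel_length == "short" else 2
        return total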
The Poetry Generation Model

The previous efforts on investigating computer poetic creativity vary widely in terms of the poetry generation approaches. Some have used document corpus-based models (Manurung 2004; Toivanen et al. 2012), while others have used constraint-programming based models (Toivanen, Järvisalo, and Toivonen 2013) or genetic programming based models (Manurung, Ritchie, and Thompson 2012). In contrast, we choose a conversation follow-up model highly inspired by the Bengali movie Hirak Rajar Deshe (Kingdom of Diamonds, 1980) by the Oscar-winning director Satyajit Ray (the son of Sukumar Ray, the poet whose writings form the basis of our rhyme understanding corpus, as further discussed below). In Satyajit Ray's movie, the entire conversation was in rhythm. For example:

(1) Era yata beśi paṛe [they as-much more read] 'The more they read'
(2) Tata beśi jane [that-much more know] 'The more they learn'
(3) Tata kama mane [that-much less obey] 'The less they obey'

For the present task, the follow-up model means that the system automatically generates a follow-up rhythmic line based on the user's one-line poetry input. For example, if the given sentence is:

(4) Ei duniyyara sakala bhala [this world everything good] 'All is well in the world'

the machine could generate a follow-up line such as:

(5) Asala bhala nakala bhala [real good fake good] 'Real is good, even fake is also good'

There are two essential modules for effective follow-up poetry generation in Bengali: rhyme structure understanding of the given user input and matching rhyme generation. The development of those modules is discussed in turn in the next two sections.

Rhyme Understanding

The initial step involves understanding the rhyme in an input line given by the user. The actual rhyme understanding module consists of syllable identification followed by borgo identification and open/closed syllable identification. Firstly, however, it is necessary to collect a corpus in order to understand the rhythm and metre structures of Bengali poems.

Corpus Acquisition

To collect the corpus, several dedicated Bengali poem sites (poems are called Kobita in Bengali)1 were chosen. For the present task, we chose mainly poems written for children, as they are mostly written in matra-vṛtta metre and with anto-mil (tail) rhyme, which is relatively easy to start with for the task of automatic poetry generation. The poems chosen were mainly written by Sukumar Ray (1889-1923), as the rhyme structure of those poems is fairly easy to grasp. A few of Tagore's poems, in particular those written for children, were also collected. Corpus size statistics are reported in Table 2. This corpus was later used to train a classifier to predict follow-up rhyme syllables. Therefore, from the collected poems only those pairs of lines were extracted that had both matra-vṛtta metre and anto-mil rhythm.

1 http://www.bangla-kobita.com/

Table 2: Bengali poem corpus size statistics
  Sentences:     3567
  Words:         9336
  Unique tokens: 7245

Syllabification

Syllabification processes depend directly on the pronunciation patterns of a language. In Bengali poetry, open and closed syllables have been used deliberately to continue or stop rhythmic matras (units), as described in the section above on Bengali poetry. These are important features for syllabification. In order to implement a syllabification engine, we developed a grapheme-to-phoneme (G2P) converter following the methods discussed by Basu et al. (2009). The consonant and vowel IPA patterns were inherited from that work, while the orthographic and contextual rules were rebuilt. An open-source Bengali shallow-parser based POS tagger2 was used for the task. With the help of this list, the syllabification engine marks every input word according to its borgo.
If a word starts with a vowel, the system marks it as a v group. Only the rules mentioned in the paper by Basu et al. have been included, whereas a few things that are not clearly described in the paper remain unattended, for example, some orthographic and exception rules. An example of syllabification output is given in Table 3, where the input is the first line of Sukumar Ray's poem Cloud Whims, Meghera kheyyala.

Table 3: Sample syllabification output
  Input      Syllables    Count  Open/Closed  Borgo
  akaśera    aka-śe-ra    3      o            v
  mayyadane  mayya-dane   2      c            p
  batasera   bata-se-ra   3      o            p
  bhare      bhare        1      c            p
  (English: 'In the sky, with the air')

Borgo Identification

For open syllabic words, identification of the borgo class of the final character is quite important. In case no rhythmic follow-up word is available for the last word in the given sentence, an alternative approach is to choose a word that ends with a consonant belonging to the same borgo. This helps in keeping the rhythm alive. For example, in the following sequence (also from Sukumar Ray's poem Cloud Whims), the final word of the first line and the final word of the second line end with consonants belonging to the same borgo:

(6) Buṛo buṛo dhaṛi megha ḍhipi hayye uṭhe [old old inveterate cloud mound becomes] 'The very old inveterate cloud looks like a hill'
(7) Śuyye base sabha kare saradina juṛe [laid sitting meeting all day fellows] 'They were meeting all the day with the gathered friends'

2 http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php

Rhyme Generation

The automatic rhyme generation engine consists of several parts. First, an SVM-based classifier predicts syllable sequence patterns. Then, a set of candidate output words are selected from preprocessed syllable-marked word lists. In order to preserve the rhythm in the generated sentence, a few other parameters are checked, such as borgo classes, anto-mil, and whether the syllables are open or closed. Finally, bigrams are used to prune the list of candidate words, and weighted sentence aggregation is used to generate the actual system output. These steps are described in detail in turn below.

Syllabic Sequence Prediction

A machine-learning classifier was trained for the syllabic rhyme sequence prediction. The Weka-based Support Vector Machine (SVM) implementation (Hall et al. 2009) was chosen as the basis for the classifier. The collected poetry corpus described above was used here for training and testing. The training corpus was split into rhythmic pairs of sentences, where the first line represents the user-provided input whereas the second line is the one that has to be generated by the system. The input features for the syllabic sequence prediction are: the syllable count sequence of the given line, the open/closed syllable pattern sequence of the given line, and the borgo group marking sequence of the given line. The output labels for the training and testing phases are the syllable counts of each word. For simplicity, only those pairs of sentences were chosen where the number of words is the same in both lines. The overall task has been designed as sequential syllable count prediction, but there are tricky trade-offs for the initial position and the last position. The common rhythmic pattern in Bengali poems is anto-mil (tail-rhyme), so it is necessary to take care of the last word's syllables separately. Therefore, three different ML engines have been trained: one for the initial position, one for the final position, and one for the other, intermediate positions. Feature engineering has been kept the same for each design, whereas different settings have been adopted for the intermediate positions.
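The classifier input can thus be pictured as three parallel sequences read off the user's line (a sketch; the exact Weka feature encoding is not given in the paper):

    def sequence_features(words):
        # words: per-word output of the syllabification engine,
        # e.g. {"syllables": 3, "open": True, "borgo": "p"}
        return {
            "syllable_counts": [w["syllables"] for w in words],
            "open_closed": ["o" if w["open"] else "c" for w in words],
            "borgos": [w["borgo"] for w in words],
        }

The three position-specific models trained on these features then emit the syllable count that each word of the follow-up line should have.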
Word Selection

A relatively large word collection was used for the word selection task. The collection consists of the created poem corpus and an additional news corpus.3 For rhythmic coherence, all words are kept in their inflected forms. In practice, stemming changes the syllable count of a word and may therefore affect the rhythm of the rhythmic sequence. All word forms are pre-processed and labelled with their syllable counts using the G2P syllabification module. For the word selection, the following strategies have been applied serially, in the same sequential order as they are described here, in order to narrow down the search space.

Syllable-wise: All words with similar syllabic patterns are extracted from the word list.

Closed Syllable / Open Syllable: Depending on the word in the previous line at the corresponding position, either open or closed syllabic words are chosen. The rest of the words are discarded.

Semantic Relevance: Semantic relevance is essential to keep the generated rhyme meaningful. There is neither any WordNet publicly available for Bengali nor any relational semantic network like ConceptNet. Therefore the English ConceptNet (Havasi, Speer, and Alonso 2007) and an English-Bengali dictionary (Biswas 2000) were used to measure the semantic relevance of the automatically chosen words. Before the semantic relevance judgement, each Bengali word from the given input is stemmed using the morphological analyser packaged with the Bengali shallow parser. After stemming, those words are translated to English by dictionary look-up. The translated English words are then checked in ConceptNet and all the semantically related words are extracted. Now, if a selected word co-occurs with the given word in the ConceptNet-extracted list, then it is considered relevant. Otherwise it is discarded. For the ConceptNet search, only nouns and verbs are considered.

3 http://www.anandabazar.com/

For example (the same line as in Table 3), if the given line is:

(8) Akaśera mayyadane batasera bhare [sky field air filled] 'The sky is filled with the air from the fields'

the words that will be searched for in ConceptNet are sky, field, and air. The extracted word list will then definitely contain words such as cloud, which was used by Sukumar Ray in the original poem (again Meghera kheyyala or Cloud Whims):

(9) Choṭa baṛa sada kalo kata megha care [small large white black many clouds grazing] 'Many large and small, black and white clouds are grazing'

Borgo-wise: Borgo-wise similarity is checked, and for the last position word only words ending in the same borgo class are kept. The other words are checked for first-letter borgo-similarity, and the non-matching ones are discarded.

Anto-mil: For anto-mil or tail-rhyme matching, an edit distance (Levenshtein 1966) based measure has been adopted. If the minimum edit distance is ≤ 2, then a word is considered homophonic and kept. This strategy only works for the final word position. The remaining members are excluded.
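Leaving aside the ConceptNet relevance check, the filtering strategies chain together as successive cuts on the candidate list. A sketch (the edit-distance threshold of 2 is from the paper; the field names and the exact combination logic are our assumptions):

    def levenshtein(a, b):
        # standard dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def filter_candidates(cands, target, last_position=False):
        # cands: words pre-labelled with syllable pattern, open/closed, borgo
        out = [w for w in cands if w["syllables"] == target["syllables"]]
        out = [w for w in out if w["open"] == target["open"]]
        out = [w for w in out if w["borgo"] == target["borgo"]]
        if last_position:
            # anto-mil: keep only near-homophones of the reference word
            out = [w for w in out
                   if levenshtein(w["word"], target["word"]) <= 2]
        return out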
Pruning and Grammaticality

The methods described so far are able to produce a word list for each word position of the input. Appropriate pruning and natural language techniques are required to generate grammatically correct rhythmic sequences from these word options. N-gram (bigram) matching followed by aggregation is used for the final sentence generation. The n-grams have been generated using the same word collection as described above, that is, the poem corpus plus the news corpus. The system computes weights (frequency / total number of unique n-grams in the corpus) for each pair of n-grams. For example, suppose that the total number of generated word candidates for the first position word is n1 and for the second position word it is n2. Then n1 × n2 valid comparisons have to be carried out. The possible candidates will be:

Σ_{i=0}^{n1} w_i^1   Σ_{i=0}^{n2} w_i^2    (10)

where the sums intend to represent the relevance of using one term after another to create a meaningful word sequence. Suppose the targeted sentence has m words. The process will then be continued for each successive bigram pair, that is, for w^1w^2, w^2w^3, w^3w^4, w^4w^5, ..., w^(m-1)w^m. Finally, the best possible combination is chosen by maximizing the total weighted path as a multiplication function (that is, by maximizing over the dot product of all the possible n-gram sequences). The process is illustrated in Figure 1.

[Figure 1: Word sequence selection by n-gram pruning.]
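This maximisation is a best-path search over one word choice per position; with the small pruned candidate lists, even a brute-force product is feasible. A sketch (a Viterbi-style dynamic program would scale better, but the brute force mirrors the description; the weight table mapping a word pair to its corpus frequency ratio is assumed to be precomputed):

    from itertools import product

    def best_sentence(candidates, bigram_weight):
        # candidates: one list of word options per sentence position
        # bigram_weight: dict mapping (w1, w2) -> weight in the corpus
        best, best_score = None, -1.0
        for words in product(*candidates):
            score = 1.0
            for w1, w2 in zip(words, words[1:]):
                score *= bigram_weight.get((w1, w2), 0.0)
            if score > best_score:
                best, best_score = list(words), score
        return best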
Experiments and Performance

The generated system has been evaluated in two ways: through a set of in-depth studies by three dedicated expert evaluators, and in more free-form studies by ten randomly selected evaluators. As discussed in the introduction, three major criteria for the quality assessment of automatic poetry generation have been used previously: poeticness, grammaticality, and meaningfulness (Manurung 2004). The same evaluation measures have been applied to the present task. The evaluation process is manual and each of the three dimensions is assessed on a 3-point scale:

Poeticness: (3) Rhythmic (2) Partially Rhythmic (1) Not Rhythmic
Grammaticality: (3) Grammatically Correct (2) Partially Grammatically Correct (1) Not Correct
Meaningfulness: (3) Meaningful (2) Partially Meaningful (1) Not Meaningful

The evaluation results are reported in Table 4, where the scores assigned by the three in-depth evaluators are reported separately, while the randomly selected evaluators have been grouped according to whether they were asked to give short (not more than five words) input lines or could give input of unrestricted length.

Table 4: Evaluation of the Bengali poetry generator
                   Dedicated experts      Randomly chosen
                   #1    #2    #3         ≤ 5 words  unrestricted
  Poeticness       2.4   1.2   2.1        2.3        1.9
  Grammaticality   1.7   1.0   1.4        1.8        0.6
  Meaningfulness   1.5   0.9   1.1        1.6        0.8

The whole assessment process is elaborated on below, including explanations for the scores given by the different evaluators.

In-Depth Evaluation

Three dedicated expert evaluators were chosen for an in-depth evaluation. One of them is a Bengali literature student, the second a Bengali journalist, and the third a technical undergraduate student. Each of them was asked to test the system performance on 100 input sentences, chosen by themselves.

Evaluator 1: Literature Student

The Bengali literature student was instructed to collect 100 simple poem lines from various poets whose poems were not included in our training set. Through discussion with the evaluator, we decided to choose lines from Satyendranath Dutta's (1882-1922) poems, since he is known for his rhyme sense and renowned as the wizard of rhymes in Bengali literature. Also, his creations are very easy to understand. We started with the famous The Song of the Palanquin, Palkir Gan. Following are some examples of the output the system produced. The second lines in the examples were generated by the system, while the first lines were given to the system as input:

(11) 'Palanquin moves!' / 'Trot pace'
(12) 'Stunned village' / 'Cloggy doors'

The output in Example 11 is surprisingly good. Actually, the same line has been used as a follow-up to this input line in one of the paragraphs of the original poem. The output in Example 12 is also good in terms of poeticness, but is less meaningful, while the first output is fabulous on all the evaluation criteria: poeticness, meaningfulness and grammaticality. However, we obviously also got many bad output sequences.

Evaluator 2: Journalist

The journalist evaluator was requested to judge the system's performance on news line input and was instructed to choose short sentences with a prior assessment of having a possible poetic sequence. He chose lines from the Bartaman newspaper.4

4 http://bartamanpatrika.com/
The approach taken here is novel and based on interaction with the user, who enters a line of poetry which the system then aims to understand in order to generate a corresponding text line, adhering to the rules and metres of Bengali poetry and rhyming with the input. This basic system has many drawbacks and limitations, especially in the understanding of wide varieties of rhythms and in terms of grammaticality. The rhyme generation utilises a Bengali syllabification engine and an SVM-based classifier for predicting the structure of the output sentence and for the candidate word generation, which is based on a notion of semantic relevance in terms of proximity mappings derived from ConceptNet translations. The final selection of the actual poetic words is presently done through bigram pruning and aggregation.

Using the notion of semantic relevance is a computationally cheap way to automatically create meaningful rhymes, although poetry written by humans obviously does not always contain semantically related words. However, this is initial work and using ConceptNet is a straightforward approach; even though conceptual similarity is hardly the ultimate way to measure word relevance for poems, it is probably one of the easiest ways. In the future, we aim to involve further natural language generation techniques to create more meaningful poetry.

Acknowledgments

Many thanks go to the evaluators for all their efforts, assistance and comments. We would furthermore like to thank the anonymous reviewers for several comments that helped to substantially improve the paper. We are grateful to the late Satyajit Ray (1921-1992) for directing the movie Kingdom of Diamonds (Hirak Rajar Deshe), which originally inspired our approach. A very special token of appreciation goes to the three Bengali poets who in the early years of the previous century wrote the verses that were used in the building, training and evaluation of our system: Sukumar Ray, Rabindranath Tagore and Satyendranath Dutta.

2014_29 !2014 Coming Good and Breaking Bad: Generating Transformative Character Arcs For Use in Compelling Stories Tony Veale School of Computer Science and Informatics University College Dublin, Belfield D4, Ireland. Tony.Veale@UCD.ie

Abstract. Stories move us emotionally by physically moving their protagonists, from place to place or from state to state. The most psychologically compelling stories are stories of change, in which characters learn and evolve as they fulfil their dreams or become what they most despise. Character-driven stories must do more than maneuver their protagonists as game pieces on a board; they must move them along arcs that transform their inner qualities. This paper presents the Flux Capacitor, a generator of transformative character arcs that are both intuitive and dramatically interesting. These arcs, which define a conceptual start-point and end-point for a character in a narrative, may be translated into short story pitches or used as inputs to an existing story-generator. A corpus-based means of constructing novel arcs is presented, as are criteria for selecting and filtering arcs for well-formedness, plausibility and interestingness. Characters can thus be computationally modeled as dynamic blends that unfold along a narrative trajectory.

Metamorphosis

As Gregor Samsa awoke one morning from uneasy dreams, he found himself transformed in his bed into a monstrous vermin.
So starts Franz Kafka's novella of transformation, Metamorphosis, in which the author explores issues of otherness and guilt by exploiting a character's horrific (if unexplained) change into an insect. Authors from Ovid to Kafka demonstrate the value of transformation (physical, spiritual and metaphorical) as a tool of character development, just as storytellers from Homer to Kubrick demonstrate the value of journeys as support-structures for narratives of becoming and change. Even narratives that are primarily plot-focused or action-centric can often be succinctly summarized by listing key character transformations. Consider Gladiator, an Oscar-winning action film from 2000. The main villain of that piece, Emperor Commodus, summarizes the plot with three successive transformations: "The general who became a slave. The slave who became a gladiator. The gladiator who defied an emperor." Note how the third transformation is implicit, for the gladiator Maximus has transformed himself into a potential leader of Rome itself.

Kafka presents his driving transformation as a fait accompli in the very first line of his story, while in Ovid's Metamorphoses, characters are transformed by Gods into trees or animals with magical immediacy. Most narrative transformations occur gradually, however, with a story charting the course of a character's development from a start-state S to a target-state T. In this respect the television drama Breaking Bad offers an exemplary model of the slow-burn transformation. We first meet the show's main character, Walter White, in his guise as a put-upon high-school chemistry teacher. Chemistry, he tells us, is the study of change. Though Walter has a brilliant mind, he lives a dull suburban life of quiet desperation, until a diagnosis of lung cancer provides a catalyst to look anew at his life's choices. Walter decides to use his chemistry skills to cook and sell the drug Crystal Meth, and recruits former student Jesse as a drug-savvy partner. Over 62 episodes, the show charts the slow transformation of Walter from dedicated teacher to ruthless drug baron. As the show's writer/creator Vince Gilligan put it, "I wanted to turn my lead character from Mr. Chips into Scarface."

Walter's progress is neither smooth nor monotonic. He becomes an unstable, dynamic blend of his start and end states. Though he commits unspeakable crimes, he never entirely ceases to be a caring parent, husband or teacher. As viewers we witness a true conceptual integration of his two worlds: Walter brings the qualities of a drug baron to his family relationships, just as he brings the qualities of a husband and father-figure to his illicit business dealings. To fully appreciate this nuanced character transformation, we must understand it as more than a monotonic journey between two states: characters must unfold as evolving blends of the states that they move between, so they can exhibit emergent qualities that arise from no single state.

This paper presents a CC system, the Flux Capacitor, for generating hypothetical character arcs for use in story generation. The Flux Capacitor is not itself a story-generation system, but a stand-alone system that suggests what-if arcs that may underpin interesting narratives. Though it is a trivial matter to randomly generate arcs between any two conceptual perspectives (say, between teacher and drug-baron, or terrorist and politician), the Flux Capacitor generates arcs that are well-formed, well-motivated, intuitive and of dramatic interest.
It does so by using a rich knowledge-representation of our stereotypical perspectives on humans, knowing e.g. what qualities are exhibited by teachers or criminals. It uses corpus analysis both to acquire a stock of valid start- and end-states and to model the most natural direction of change. It further uses a robust model of conceptual blending to understand the emergent qualities that may arise during a transformation. The Flux Capacitor builds on a body of related work which will be discussed in the next section. The means by which novel transformative arcs are formulated is then presented, before a model of property-level blending and proposition-level analogy/disanalogy is described. The Flux Capacitor does more than generate a list of possible character arcs: it provides to a third-party story generator a conceptual rationale for each transformation, so a story-teller may properly appreciate the ramifications of a given arc. In effect this rationale is a pitch for a story. Before drawing our final conclusions, we describe how such a pitch can be constructed from a blending analysis.

Related Work and Ideas

What is a hero without a quest? And what is a quest that does not transform its hero in profound ways? The scholar Joseph Campbell has argued that our most steadfast myths persist because they each instantiate, in their own way, a profoundly affecting narrative structure that Campbell calls the monomyth. Campbell (1973) sees the monomyth as a productive schema for the generation of heroic stories that, at their root, follow this core pattern either literally or figuratively: "A hero ventures forth from the world of common day into a region of supernatural wonder: fabulous forces are encountered and a decisive victory is won: the hero comes back from this mysterious adventure with the power to bestow boons on his fellow man." Many ancient tales subconsciously instantiate this schema, while many modern stories, such as George Lucas's Star Wars, are consciously written so as to employ Campbell's monomyth schema as a narrative deep-structure.

A comparable schematic analysis of the heroic quest is provided by Propp's Morphology of the Folk Tale (1968). Like Campbell, Propp identifies an inventory of recurring classes (of character and event) that make up a traditional Russian folk tale, though Propp's analysis can be applied to many different kinds of heroic tale. Transformative elements in Propp's inventory include Receipt of Magical Agent, which newly empowers a hero, Transfiguration, in which a hero is rewarded through change, and Wedding, through which a hero's social status is elevated. Propp also anticipates that a truly transformed hero may not be recognized on returning home (Unrecognized Arrival) and may have to undergo a test of identity (Recognition). The basic morphemes of Propp's model can be used either to analyze or to generate stories, in the latter case by using a variant of Fritz Zwicky's Morphological Analysis (1969). Propp's morphemes have thus been used in the service of automated game design (Fairclough and Cunningham, 2004) as well as creative story generation (Gervás, 2013).

Campbell's monomyth and Propp's morphology can each be subsumed under a more abstract mental structure, the Source-Path-Goal (SPG) schema analyzed by Johnson (1987). Johnson argues that any purposeful action along a path, from going to the shops to undertaking a quest, activates an instance of the SPG schema in the mind.
In cinema the SPG is most obviously activated by road movies, in which (to quote the marketing campaign for Dances With Wolves) "a hero goes in search of America and finds himself". Such movies use the SPG to align the literal with the figurative, so that a hero starts from a state that is both geographic and psychological, and reaches an end-point that is similarly dual-natured. The SPG schema is also evident in comic-book tales in which an everyman is transformed into a superheroic form that permits some driving goal (revenge, justice) to be achieved. Forceville (2006) has additionally used the SPG to uncover the transformative-quest structure of less overtly heroic film genres, such as documentaries and autobiographical films. Storytelling is a purposeful activity with a beginning (Source), middle (Path) and end (Goal) that typically shapes the events of a narrative into a purposeful activity on the part of one or more characters. Computer systems that generate stories, as described in e.g. Meehan (1981), Turner (1994), Pérez y Pérez & Sharples (2001), Riedl & Young (2004) and Gervás (2013), are thus, implicitly, automated instantiators of the Source-Path-Goal schema. This is especially so of story systems, like that of Riedl & Young, that employ an explicitly plan-based approach to generation. These authors use a planner that is anchored in a model of the beliefs and internal states of the story's characters, so as to construct narrative plans that call for believable, well-motivated actions from these characters. The use of a planner also ensures that these actions create the appearance of an intentional SPG path that is viewed as plausible and coherent by the story's audience.

Outside the realm of myths and fairy-tales, the deepest transformations are to the beliefs and internal states of a character, though such profound changes may be reflected in outward appearances too, such as via a change of garb, residence, place of work, or choice of tools. Consider the case of a prostitute who becomes a nun, or the altogether rarer case of a nun who breaks bad in the other direction. Such transformations are dramatically interesting because they create oppositions at the levels of properties and of propositions. Though frame-level symmetries are present, since each kind of person follows a particular vocation in a particular place of work while wearing a particular kind of clothing, the specific frame-fillers are very different. We can imagine a tabloid headline screaming "Nun burns habit, buys thong" or "Nun flees convent, joins bordello". Analogies and disanalogies between the start- and end-states of a transformation provide fodder for the evolving blends that need to be constructed to ferry a protagonist between these two states in a narrative.

Conceptual blending is a knowledge-hungry process par excellence (see Fauconnier and Turner, 1998, 2002). However, Veale (2012a) presents a computational variant of conceptual blending, called the conceptual mash-up, that is robust and scalable. Propositional knowledge is milked from various Web sources, such as query completions from Web search engines, and, using corpus evidence, this knowledge is mapped to more than one concept. Veale (2012b) also presents a robust method for mining stereotypical properties from Web similes, such as "as chaste as a nun" and "as sleazy as a prostitute".
Used here, these representations allow the Flux Capacitor to analyze the blending potential of a transformative arc, and so construct a conceptual rationale as to why a given arc has the potential to underpin an interesting narrative.

Opposites Attract

At its most reductive, a transformative character arc is an unlabeled directed edge S→T that takes a character from a conceptual starting-state S to a conceptual end-point T, where S and T are different lexicalized perspectives on a character (such as e.g. S=activist and T=terrorist). To be a truly transformative arc, as opposed to an arbitrarily random pairing of S and T states, an arc should induce a dramatic change of qualities. Superficially, this change may be reflected in a reversal of affective polarity from S to T. Thus, if S is viewed as a positive state overall, such as activist, saint or defender, and T is predominantly seen as a negative state, such as terrorist, prostitute or tyrant, then a character will break bad by following this arc. Conversely, if S is most often seen as a negative state, and T is typically seen as a positive state, then a character will come good by following this arc. Naturally, our overall affective view of a concept will be a function of our property-level perception of all its stereotypical qualities. If S typically evokes a preponderance of positive qualities then it will be viewed as a positive state overall. Likewise, if S typically evokes a preponderance of negative qualities then it will be viewed as a negative state overall. A means of mapping from property-level representations to overall +/- affective polarity scores is presented in Veale (2011).

Stories thrive on conflict and surprise, and surprising transformations arise when the pairing of S and T gives rise to a clash of opposing properties. Consider again the case of the prostitute (=S) who becomes a nun (=T). The transformation S→T at the conceptual level implies the property-level oppositions dirty-pure, immoral-moral, promiscuous-chaste and sleazy-respected, affording an opportunity for a truly dramatic Proppian transfiguration. Generalizing, we say that a character arc S→T implies a direct opposition at the property level if S and T each exhibit properties that can produce antonymous pairs. We thus use WordNet (Fellbaum 1998) as a comprehensive source of antonymy relationships (such as pure-dirty), which we apply to any putative arc S→T to determine whether the arc involves a dramatic conflict of properties. This property-level analysis allows the Flux Capacitor to identify nuanced transformations that allow a character to come good while also breaking bad. Consider the arc beggar→king. A character following this arc may come good in many ways, by going from lowly→lordly, poor→lofty, broke→wealthy, impoverished→privileged and ragged→regal. Yet such an arc may induce negative effects too, changing a character from humble→arrogant, humble→haughty and humble→unapproachable. Perhaps a beggar that becomes a king may come to rue his change of station, while a king that becomes a beggar may derive some small comfort from his fall from grace?
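A minimal sketch of the antonymy test just described, using NLTK's WordNet interface; the toy property sets are illustrative stand-ins for the system's stereotype database, not its actual contents.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus

def antonymous_pairs(props_s, props_t):
    """Property pairs (p, q), p stereotypical of S and q of T,
    that WordNet records as antonyms (e.g. pure/dirty)."""
    pairs = set()
    for p in props_s:
        for synset in wn.synsets(p, pos=wn.ADJ):
            for lemma in synset.lemmas():
                for antonym in lemma.antonyms():
                    if antonym.name() in props_t:
                        pairs.add((p, antonym.name()))
    return pairs

def directly_opposed(props_s, props_t):
    """An arc S -> T shows a direct property-level opposition if any
    property on one side is antonymous with one on the other side."""
    return bool(antonymous_pairs(props_s, props_t) or
                antonymous_pairs(props_t, props_s))

# Toy stand-ins for stereotypical property sets:
prostitute = {"dirty", "immoral", "promiscuous", "sleazy"}
nun = {"pure", "moral", "chaste", "respected"}
print(directly_opposed(prostitute, nun))  # True, via e.g. immoral/moral
```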
Yet S and T need not conflict directly at the property level to yield an opposition-rich transformation. The clash of properties may be indirect, if S relates to a concept S′ in the same way that T relates to T′, and if a clash of opposing properties can be observed between S′ and T′. For instance, scientists and priests do not directly oppose one another, but a property-level clash can be found in the stereotypical representations of science and religion, since science is stereotypically rational while religion is often seen as irrational. Since scientists practice science while priests practice religion, a character that goes from being a scientist to being a priest will, in a leap of faith, reject rational science and embrace irrational religion instead.

A gifted storyteller can surely make any transformation, no matter how random or illogical, seem interesting. Such is the art of improvisational comedy, after all. However, rather than abdicate its responsibility for making an arc interesting to a subsequent story-telling component, the Flux Capacitor applies its own filtering criteria to find the arcs it considers to have dramatic potential. An arc S→T is generated only if S and T possess opposing qualities, or if S and T are indirectly opposed by virtue of being analogously related to a concept pair S′ and T′ that do. We now turn to how S and T are found in the first place.

Charging the Capacitor

We often speak of children in terms of what they may one day become, but speak of adults in terms of what they have already become. Some concepts are more naturally thought of as start-states in a transformation, while others are more naturally viewed as end-states. Beyond the clear-cut cases, most concepts sit on a continuum of suitability for use on either side of a transformation. To determine the suitability of a given concept C as either a start-state or an end-state, we can simply look to a large text corpus. The frequency of the 2-gram "C+s become" in a corpus such as the Google n-grams (Brants and Franz, 2006) will indicate how often C is viewed as a start-state, while the frequency of the 2-gram "become C+s" will indicate C's suitability as an end-state. Since the n-gram frequency of "become terrorists" (7180) is almost 7 times greater than the frequency of "terrorists become" (1166), terrorist is far more suited to the role of end-point than to start-point. The Flux Capacitor limits its choice of start-states to any stereotype S for which the Google n-grams contain the bigram "S+s become". Similarly, it limits its choice of end-states to any stereotype T for which Google provides the bigram "become T+s". Within these constraints, the Google n-grams suggest 1,213 person-concepts to use as start-states, and 1,529 to use as their ultimate end-states.

The Google n-grams contain a small number (< 500) of well-established transformations between person-types that can be found via the pattern S-turned-T. Examples include friend-turned-foe, bodybuilder-turned-actor and actor-turned-politician. Though some turns have dramatic value (like bully-turned-Buddhist), most are well-trodden paths with little to offer a creative system. Nonetheless, the Google n-grams are a valuable source of inspiration for the generation of novel transformations that combine complementary ideas. For the n-grams can tell us whether two ideas have a history of working well together, either in harmony or as part of an antagonistic double-act. Consider the 3-gram pattern "X+s and Y+s", which matches all instances of coordinated bare plurals in the Google n-grams. Examples include angels and demons, nuns and prostitutes and scientists and priests.
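The scan for such coordinations can be sketched as follows; the singularization-by-stripping-s step and the container names are simplifying assumptions, as a real implementation would need proper lemmatization and the actual n-gram resource.

```python
import re

COORDINATION = re.compile(r"^([a-z]+)s and ([a-z]+)s$")  # "X+s and Y+s"

def mine_state_pairs(three_grams, start_states, end_states, opposed):
    """Harvest candidate state-pairs from coordinated bare plurals.

    three_grams:  iterable of 3-gram strings, e.g. "nuns and prostitutes"
    start_states: concepts attested in "X+s become" bigrams
    end_states:   concepts attested in "become Y+s" bigrams
    opposed:      predicate for the direct/indirect clash of qualities
    """
    pairs = set()
    for gram in three_grams:
        match = COORDINATION.match(gram)
        if not match:
            continue
        x, y = match.groups()  # crude singularization: strip the final 's'
        if x in start_states and y in end_states and opposed(x, y):
            pairs.add((x, y))
    return pairs
```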
While these attested coordinations often bring together opposing concepts, they are concepts drawn from the same domains or semantic fields, and thus seem fitted to each other. So while a transformation linking two such conflicting states may strike one as a surprising turn of events, it will also likely strike one as a fitting turn of events. By mining the Google 3-grams for instances of this pattern that connect a valid start-state to a valid end-state, where these states also exhibit either a direct or indirect conflict of qualities, the Flux Capacitor harvests a large collection of potential state-pairs for its own transformative character arcs. The question of which state can best serve as the start-state, and which should serve as the end-state, is decided afterwards.

Coordinations are a rich source of explicit contrasts between conceptual states, but other n-grams are an even richer source of implicit contrasts. Consider the 3-gram "army of dreamers". The typical member of an army is a soldier, not a dreamer, as borne out by the system's own propositional world-knowledge. This 3-gram thus implies a clash of soldiers and dreamers, which in turn implies the property-level conflicts disciplined-undisciplined and fit-lethargic. Generalizing, we mine all Google 3-grams that match the pattern "X of Y+s", such as church of heretics, army of cowards and religion of sinners, to identify any cases where the stated member (sinner, coward, etc.) contrasts with a known stereotypical member of the group. A large pool of contrasting concept pairs is mined in this way from the Google n-grams, to be used to form each side of a transformative character arc.

But what trajectory should each transformation follow? Which concept will serve as the start-point S of an arc, and which as its end-point T? We infer the most natural direction for an arc by again looking to corpus data. For a pair of contrasting concepts X and Y, we calculate a score for the arc X→Y as the sum of the n-gram frequencies for "X+s become" and "become Y+s". Likewise, we calculate the score for the arc Y→X as the sum of the n-gram frequencies for "Y+s become" and "become X+s". We then choose the arc/direction with the greater score. Consider, for example, the pair militant and politician, which share, in the world-view of the Flux Capacitor, this implicit contrast: militants launch celebrated rebellions, whilst politicians launch hated wars. Corpus data suggest that politician is more suited to be the end-state of an arc than its start-state, perhaps because politicians must be elected, and election is an obvious goal-state in the SPG schema. In contrast, militant is slightly more comfortable in the role of start-state than end-state, no doubt because militants fight so as to initiate some future change. Thus, the arc militant→politician is favored over its inverse, politician→militant, and so only the former is generated.
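A minimal sketch of this direction test, with the n-gram table mocked as a dictionary; only the terrorist counts are the paper's, the other frequencies are invented for illustration.

```python
def orient_arc(x, y, ngram_freq, plural=lambda w: w + "s"):
    """Orient a contrasting pair (x, y) in its most natural direction.

    ngram_freq is assumed to map 2-gram strings to Google n-gram counts.
    """
    xs, ys = plural(x), plural(y)
    # Suitability of x as start-state plus suitability of y as end-state.
    x_to_y = ngram_freq.get(f"{xs} become", 0) + ngram_freq.get(f"become {ys}", 0)
    y_to_x = ngram_freq.get(f"{ys} become", 0) + ngram_freq.get(f"become {xs}", 0)
    return (x, y) if x_to_y >= y_to_x else (y, x)

freqs = {"terrorists become": 1166, "become terrorists": 7180,  # per the paper
         "activists become": 2500, "become activists": 900}     # invented
print(orient_arc("activist", "terrorist", freqs))  # ('activist', 'terrorist')
```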
Blended States

In character-led stories, key transformations often unfold gradually through a build-up of incremental changes. So as characters follow their trajectory along an arc that takes them ever closer to their final state, they will exhibit more of the qualities we stereotypically associate with the end-point of their arc and fewer of the properties we associate with their starting point. In effect, a changing character becomes a dynamic blend of the starting-point and end-point concepts that define its narrative trajectory. The theory of conceptual integration networks, also known as conceptual blending (see Fauconnier & Turner, 1998, 2002), offers a principle-driven framework for the interpretation of any blend, while Veale (1997) further explores the workings of character blends that gradually unfold during a narrative. A character blend (a character that moves between two states and thus assumes a mix of the properties and behaviors associated with each) can be modeled computationally at the level of properties and of propositions. To model the former, we explore the space of complex properties that integrate nuances from each of the inputs, while to model the latter we draw on Markman and Gentner's (1993) theory of alignable differences.

Consider a proposition-level blend in the shocking case of our nun-turned-prostitute. The alignable differences in this example concern the propositions associated with nuns and with prostitutes that can be aligned by virtue of positing exactly the same relationship for each subject, but with different values for their objects. For instance, nuns work and reside in convents or cloisters, under the supervision of a mother superior, while prostitutes work and reside in bordellos under the supervision of madams and pimps. So as this transformation is effected, convents and cloisters will give way to bordellos, while mother superiors will lose out to pimps and madams, just as wimples and habits will transition into an altogether racier style of dress. It is a simple matter to connect propositions with alignable differences such as these, to produce a structural blend that is part analogy and part disanalogy.

The Flux Capacitor is also sensitive to the reversals of status and power that accompany a given transformation. By attending to the relationships that link a subject A to an object B, and the relationships that reciprocally link B as a subject to A as an object, it learns how to recognize situations where a protagonist's social inter-relationships are dramatically reversed in a blend. Thus, for instance, it observes a fundamental tension between the verbs obey and control, between ruling and being led, and between governing and electing. In the case of a king-turned-slave, then, it perceives an interesting reversal of power, where a once-mighty king goes from being served by respectful followers to being led by haughty and arrogant rulers, just as he may go from appointing fawning servants to being managed by dominant and exalted masters. The scale of each reversal is emphasized by highlighting the most pointed contrasts between the blended states; thus, it also suggests that our deposed king goes from being served by honorable knights to being led by depraved rulers. While these new rulers need not be depraved, it heightens the dramatic potential of the blend to assume that they are.

At the property level, we strive to understand how a property A associated with a start-state S, and a property B associated with an end-state T, might yield an emergent property AB that arises from a character's transformation from S into T. Might our nun-turned-prostitute retain a residual sense of piety, even if such piety were to be unjustified or even immoral? The Google 2-grams inform us that the phrase "immoral piety" denotes an attested state (with a Web frequency of at least 49). Since nuns are typically pious and so practice piety, while prostitutes are typically seen as immoral, "immoral piety" denotes the kind of nuanced state that may arise as one state gives way to the other.
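A sketch of this bridging-state lookup follows; the adjective-to-noun map and the frequency table are assumed inputs (only the "immoral piety" count of 49 comes from the paper).

```python
def emergent_states(props_s, props_t, bigram_freq, nominalize, min_freq=40):
    """Attested nuanced states that bridge a transformation S -> T.

    props_s, props_t: stereotypical adjectives of the two states
    bigram_freq:      assumed dict of Google 2-gram counts
    nominalize:       assumed adjective -> abstract-noun map
    """
    states = []
    for a in props_s:
        noun = nominalize.get(a)       # e.g. "pious" -> "piety"
        if noun is None:
            continue
        for b in props_t:
            phrase = f"{b} {noun}"     # e.g. "immoral piety"
            if bigram_freq.get(phrase, 0) >= min_freq:
                states.append((phrase, bigram_freq[phrase]))
    return sorted(states, key=lambda pair: -pair[1])

freqs = {"immoral piety": 49}  # the paper's reported Web frequency
print(emergent_states({"pious"}, {"immoral"}, freqs, {"pious": "piety"}))
```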
The Google n-grams also suggest, in this vein, that a nun-turned-prostitute might be a moral prostitute, a compassionate prostitute, a religious prostitute or, at least, a spiritual prostitute, one that commits pure or virtuous sins despite practicing a sleazy morality and a dirty faith. Likewise, when intellectuals become zealots, attested 2-grams that bridge both states include inspired rant, misguided superiority, uncompromising critique, extreme logic, intellectual obsession, scholarly zeal and even educated stupidity. The Google n-grams attest to the validity of a great many complex states that can be surprising and revealing. By seeking out nuanced states that bridge the properties of the conflicting concepts in a character arc, the Flux Capacitor can tap into the vast, collective imagination of readers and writers as exercised for other, past narratives.

Hold The Presses

These blend interpretations serve to advertise the merits of a given character transformation: the richer the blend, in terms of aligned propositions and nuanced properties, the richer the narrative it should yield when turned over to a dedicated story-generation system. In many ways, then, these blend interpretations are the computational version of a Hollywood story pitch, in which a screenwriter sells his or her vision of a story to the studio that will make it. Like a Hollywood studio, which can only afford to make a small number of films per year, a story-generation system will need some narratological basis to judge which story ideas to further refine and which to reject outright. The Flux Capacitor is not a story-generation system, but a creator of high-concept story ideas. Yet to better sell these ideas, it uses natural-language generation techniques to convert its blend analyses into simple pitches. Consider the following pitch, in which each mapping in the blend for nun→prostitute has been realized as its own sentence:

Nun condemns chastity, wallows in wickedness
Nun criticizes convents, bounces into brothels
Nun chucks crucifixes, gropes for garters
Nun fatigued by fidelity, veers toward vices
Nun hates habits, stockpiles stilettos
Nun mistreated by mother superiors, pulled to pimps
Nun skips out of spectacles, loves latex
Nun vents about veils, crazy for corsets
Nun vents about virginity, seduced by shamelessness
Nun whines about wimples, grabs garters
Nun goes from being managed by abbesses and mother superiors to being controlled by pimps
Nun goes from carrying beads to carrying infections
Does strict chastity struggle with wild promiscuity?
How long can outer purity suppress inner filth?
Nun goes from being unflinchingly faithful to being increasingly unfaithful
Nun goes from living in cloisters and convents to working in brothels and bawdy houses
Can inner morality be transformed into naked sin?
Nun goes from practicing chastity to practicing vices
How long can a superficial respectability suppress pervasive sleaze?
Nun goes from wearing habits and crucifixes to wearing corsets and fishnets
Nun goes from wearing veils and spectacles to wearing latex and stilettos
Nun goes from wearing wimples to wearing hotpants

Note the simple structure of each sentence in the pitch. Wherever possible, a tabloid-headline style is employed, using alliteration (as in "condemns chastity, wallows in wickedness") to make each stage of a transformation seem more compelling.
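The alliterative pairing can be approximated with a first-letter test, as in the sketch below; the verb list and function names are invented for illustration, and a fuller treatment would compare phonemes rather than letters.

```python
def alliterates(verb, noun):
    """Crude alliteration test on initial letters (not phonemes)."""
    return verb[0].lower() == noun[0].lower()

def pick_verb(verbs, obj):
    """Prefer a verb that alliterates with its object, else the first."""
    return next((v for v in verbs if alliterates(v, obj)), verbs[0])

# "condemns chastity" wins over the non-alliterating alternatives:
print(pick_verb(["rejects", "condemns", "chucks"], "chastity"))
```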
Such devices, though simple, embody a strategy that psychologists call the Keats heuristic, for the use of even the most rudimentary rhymes has been empirically shown to heighten the perceived truthfulness of a statement (see McGlone and Tofighbakhsh, 2000). Conversely, character transformations can also be used to craft rhetorical questions and figurative allusions for automated poetry. The Stereotrope system of Veale (2013) thus generates rhetorical questions such as "how does a selfish wolf become a devoted zealot?", "how does a devoted zealot become a selfish bully?" and "how does a mindless zealot become a considerate lover?" to allude to unknown protagonists whose identity must ultimately be determined by the reader.

Transformative Possibilities

The Flux Capacitor uses the corpus-based techniques of the previous sections to construct 63,016 unique character-transformation arcs, using a combination of the Google n-grams, a large database of stereotypical properties, and a propositional model of world-knowledge. Each arc links character states that conflict either directly or indirectly, where each gives rise to its own blending interpretation. Some arcs simply demand too much from an audience. Novel character arcs may be provocative, but they should rarely be jarring. Arcs that strain credulity, or require an element of cod science to work at all, are best avoided. While it is not possible to predict every faultline along which a narrative may rupture, it is worth considering the most obvious problem-cases here, as these allow us to draw broad generalizations about the quality of our arcs.

The first problem-case concerns gender. Though there exist famous and dramatically successful exceptions to this rule, such as Virginia Woolf's Orlando, characters rarely change their gender during a transformation. Of the valid start/end states used by the Flux Capacitor, 84 are manually annotated as male, such as pope and hunk, while 72 are annotated as female, such as geisha and nun. All other states are assumed to be compatible with both male and female characters. In all, 9,915 of the 63,016 arcs that are generated involve one or more gender-marked states. Of these, only 7% involve a problematic mix of genders (e.g. pope→mother). Though a creative story-teller might make lemonade from these lemons (e.g. as in the tale of Pope Joan, who passed as a man until made pregnant), the Flux Capacitor simply filters these arcs from its output.

The second problem-case concerns age. Once again, though Hollywood may occasionally find a cod-science reason to reverse time's arrow, characters rarely transform into people younger than themselves. Not wishing to paint a story-teller into a corner, where it must appeal to a dust-blown plot device such as time travel, body swapping or family curses to get out, the Flux Capacitor aims to avoid generating such arcs altogether. So of its valid start/end states, 52 are manually tagged for age to reflect our strong stereotypical expectations. Elders such as grandmother, pensioner and archbishop are assigned a timepoint of 60 years, while youths such as student, rookie and newcomer are given a timepoint of 18. Younger states, such as baby, toddler, child, kid, preteen and schoolgirl, are assigned lower timepoints still, while those states unmarked for age are all assumed to have a default timepoint of 30. In all, 7,892 arcs are generated for which one or more states is explicitly marked for age. Now, if our corpus-based approach to determining the trajectory of an arc is valid, we should expect most of these 7,892 arcs to flow in the expected younger→older direction. In fact, 76% of arcs do flow in the right direction. The remaining 24% are not simply discarded, however. Rather, these arcs are inverted, turning e.g. mentor→student into student→mentor.
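These two filters reduce to a few lines; the annotation excerpts below echo the paper's examples, with unlisted states defaulting to unmarked gender and the default timepoint of 30.

```python
GENDER = {"pope": "m", "hunk": "m", "geisha": "f", "nun": "f",
          "mother": "f"}                                # excerpt of 84 + 72 states
AGE = {"grandmother": 60, "pensioner": 60, "archbishop": 60,
       "student": 18, "rookie": 18, "newcomer": 18}     # excerpt of 52 states
DEFAULT_AGE = 30  # states unmarked for age

def filter_and_orient(arcs):
    """Drop mixed-gender arcs; invert arcs that would reverse time's arrow."""
    kept = []
    for s, t in arcs:
        gs, gt = GENDER.get(s), GENDER.get(t)
        if gs and gt and gs != gt:
            continue                  # e.g. pope -> mother is filtered out
        if AGE.get(s, DEFAULT_AGE) > AGE.get(t, DEFAULT_AGE):
            s, t = t, s               # e.g. mentor -> student is inverted
        kept.append((s, t))
    return kept

print(filter_and_orient([("pope", "mother"), ("mentor", "student")]))
# [('student', 'mentor')]
```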
The ultimate test of a character transformation is the quality of the narrative that can be constructed around it. We cannot evaluate the quality of these narratives until they have been woven by a subsequent story-generation system, acting as a user of the Flux Capacitor's outputs. Nonetheless, the diversity of the Flux Capacitor's outputs (63,016 well-formed arcs, bridging 1,213 start-states to 1,529 end-states in interesting ways that pair conflicting concepts which also exhibit corpus-attested affinities) is a reason to be optimistic about the quality of the many as-yet-unwritten stories that may employ these arcs.

Back to the Future

Georges Braque, who co-developed Cubism with Pablo Picasso, was less than impressed with the arc of Picasso's career, noting late in life that "Pablo used to be a good painter, but now he's just a genius". If character arcs induce change, such changes are just as likely to remove a desirable quality as to add it. For Braque, to go from noted painter to certified genius was to follow a downward arc, for Picasso was now to be feted more for his politics, his lifestyle and his women than for any of his painterly gifts. Braque's view of Picasso's career is witty because it runs against expectation: to become a genius is often seen as the highest of achievements and not a vulgar booby prize. As we strive to make the Flux Capacitor generate arcs that seem interesting yet plausible, we must remember that it is not just a transformation per se that can be original, but the manner in which we choose to interpret it, not to mention the way we ultimately use it in a story. Creativity requires more than generative capability, and a generative system is merely generative if it can perform neither deep interpretation nor critical assessment nor insightful filtering of its own outputs. Though the Flux Capacitor is just one part of a story-generation pipeline, it is not a mere generator of character arcs. It operates in a large space of possible transformations, sampling this space carefully to identify those transformations that change a character in dramatically interesting ways, into something that is at once both incongruous and fitting.

The property transfers that accompany a transformation may serve as causes or as effects. That is, some property shifts initiate a change while others naturally follow on as consequences of these root causes.
Consider the case of a king-turned-slave, in which the Flux Capacitor identifies the following wealth of property conflicts and shifts: worshipped→contemptible, revered→contemptible, lofty→inferior, lofty→subservient, lofty→submissive, anointed→cursed, powerful→powerless, powerful→contemptible, powerful→frightened, powerful→scared, powerful→inferior, magisterial→powerless, learned→illiterate, learned→uneducated, commanding→cowering, commanding→subservient, commanding→passive, commanding→powerless, commanding→submissive, rich→powerless, rich→malnourished, rich→miserable, merry→miserable, merry→unfortunate, crusading→frightened, august→contemptible, celebrated→contemptible, honored→contemptible, regal→powerless, spoiled→whipped, spoiled→abused, spoiled→overworked, spoiled→exhausted, spoiled→malnourished, spoiled→overburdened, spoiled→exploited, comfortable→miserable, contented→unhappy, contented→miserable, delighted→unhappy, leading→submissive, leading→subservient, ruling→submissive, ruling→subservient, lordly→inferior, pampered→whipped, pampered→abused, pampered→overworked, pampered→exhausted, pampered→malnourished, pampered→overburdened, pampered→exploited, prestigious→inferior, prestigious→subservient, prestigious→submissive, reigning→submissive, reigning→subservient, royal→inferior, royal→subservient, exalted→inferior, exalted→subservient, deified→powerless, beloved→cursed, beloved→miserable, beloved→contemptible, beloved→condemned, magnificent→powerless, magnificent→miserable, magnificent→contemptible, honorable→contemptible, great→powerless, dominant→subservient, dominant→dependent, dominant→inferior, dominant→submissive, mighty→powerless, mighty→low-level, mighty→contemptible, mighty→scared, fortunate→unfortunate, fortunate→cursed, fortunate→unhappy, fortunate→miserable, consecrated→cursed, worthy→miserable, adored→contemptible, happy→unhappy, happy→miserable, happy→unfortunate, venerated→contemptible, grand→powerless.

Dramatic changes are very often precipitated by external actions, and some states expressed as past-participles are easily imagined as both the primary cause and direct effect of a transformation. Thus, the property cursed may serve as both cause and effect of the dramatic humbling of a king, when perhaps cursed by a witch, demon or other entity, as suggested by attested n-grams (e.g. "cursed by a witch"). Further n-gram analysis will also suggest that one who is cursed may also be condemned and abused, while one who is abused is more likely to be hungry and dependent. Or perhaps our king is first defeated, since the Google 3-grams suggest that defeat leads one to become powerless, that being powerless leads to being oppressed, and that oppression leads one to being tortured, miserable and unhappy. The next stage of the Flux Capacitor's development will thus focus on imposing a plausible causal ordering on the properties that undergo change in a transformation, to provide more conceptual insight to any story-generation system that exploits its character arcs. A story-generation system may then use a Proppian or Campbellian analysis to impose narrative structure on any such character arc. For a transformed character effectively undertakes a journey, whether or not this journey takes place entirely within one's mind or social circumstances. By better understanding how the arrow of causality may impose a narrative ordering on the property-changes in a story, a system can better impose the morphology of a folk-tale or a monomyth on any generated character arc.
This system may ask which property changes conform to what Propp deemed a Transfiguration, and which can best underpin the role of a Magical Agent in a story. Does a character return, or attempt to return, from the end-state of a transformation, and which actions or events can make such a Return possible? What property changes make a character difficult to Recognize post-facto, and which initial properties of a character continue to shine through? We do not see the Flux Capacitor as a disinterested sub-contractor in the story-telling process, but as an active collaborator that works hand-in-glove with a full story-generator to help weave surprising yet plausible stories. As it thus evolves from being a simple provider of arcs to being a co-creator of stories in its own right, we expect that its usefulness as a sub-contractor to existing story-generation systems will yield insights into the additional features and functionalities it should eventually provide.

Out of the Mouths of Bots

To showcase the utility of the Flux Capacitor as a sub-contractor in the generation of creative outputs, we use the system as a key generative module in the operation of a creative Twitterbot. Twitterbots, like bots in general, are typically simple generative systems that autonomously perform useful, well-defined (if provocative) services. A Twitterbot is an automated generator of tweets, short micro-blog messages that are distributed via the social media platform Twitter. Most twitterbots, like most bots, are far from creative, and exploit mere generation to send superficially well-formed texts into the twittersphere, so in most cases the conceit behind a particular twitterbot is more interesting than the content generated by the bot. Twitter is the ideal midwife for pushing the products of true computational creativity, such as metaphors, jokes, aphorisms and story pitches, into the world. A new twitterbot named MetaphorIsMyBusiness (handle: @MetaphorMagnet) thus employs the Flux Capacitor to generate a novel, well-formed, creative metaphor or story pitch every hour or so. As such, @MetaphorMagnet's outputs are the product of a complex reasoning process that combines a large knowledge-base of stereotypical norms with real usage data from the Google n-grams. Though encouraged by the quality of the bot's outputs, we continue to expand its expressive range, to give the twitterbot its own unique voice and identifiable aesthetic. Outputs such as "What is an accountant but a timid visionary? What is a visionary but a bold accountant?" show how @MetaphorMagnet frames the conceits of the Flux Capacitor as thought-provoking metaphors, to lend the bot a distinctly hard-boiled persona. Ongoing work with the bot aims to further develop this sardonic voice.

There are many practical advantages to packaging creative generation systems as Web services, but there are just as many advantages to packaging these services as twitterbots. For one, the panoply of mostly random bots on Twitter that make little or no use of world knowledge or of true computational creativity, such as the playfully subversive @metaphorminute bot, provide a competitive baseline against which to evaluate the creativity and value of the insights that are pushed out into the world by theory-driven and knowledge-driven twitterbots like @MetaphorMagnet.
For another, the willingness of human Twitter users to follow such accounts regardless of their provenance, and to favorite or retweet the best outputs from these accounts, provides an empirical framework for estimating (and promoting) the quality of the back-end Web services in each case. Finally, such bots may reap some social value in their own right, as sources of occasional insight, wit or profundity, or even of useful metaphors or story ideas that are subsequently valued, adopted, and re-worked by human speakers.

2014_3 !2014 A Four Strategy Model of Creative Parameter Space Interaction Robert Tubb and Simon Dixon Centre for Digital Music Queen Mary University of London London, E1 4NS, UK r.h.tubb@qmul.ac.uk simon.dixon@eecs.qmul.ac.uk

Abstract. This paper proposes a new theoretical model for the design of creativity-enhancing interfaces. The combination of user and content-creation software is looked at as a creative system, and we tackle the question of how best to design the interface to utilise the abilities of both the computer and the brain. This model has been developed in the context of music technology, but may apply to any situation in which a large number of feature parameters must be adjusted to achieve a creative result. The model of creativity inspiring this approach is Wiggins' Creative Systems Framework. Two further theories from cognitive psychology motivate the model: the notion of creativity being composed of divergent and convergent thought processes, and the dual-process theory of implicit vs. explicit thought. These two axes are combined to describe four different solution-space traversal strategies. The majority of computer interfaces provide separate parameters, altered sequentially. This theory predicts that these one-to-one mappings encourage a particular navigation strategy (Explicit-Convergent) and as such may inhibit certain aspects of creativity.

Introduction

Although enhancing creativity is often the implied goal, researchers in music technology seem wary of attacking the question of what manner of tools may augment the creativity of the musician. This is perhaps understandable: being one of the most mysterious products of our immensely complex brains, creativity is a great challenge to research. Individuals can vary enormously in how they go about being creative, and results from cognitive neuroscience are still rather contradictory (Dietrich and Kanso 2010). Therefore theoretical guidelines are scarce, and measuring success is difficult. This paper attempts to tie together some findings of cognitive psychology, computational creativity and digital musical instrument (DMI) research, to propose a simple four-strategy model of creative interaction: a model that may explain many of the subjective experiences of computer musicians, and assist the design of creativity-enhancing interfaces.

Creative Cognition

Guilford (1967) characterised the creative process as a combination of convergent and divergent thinking. Divergent production is the generation of many provisional candidate solutions to a problem, whereas convergence is the narrowing of the options to find the most appropriate solution. Most modern theories have similar processes present in some form, sometimes referred to by different names such as Generative and Evaluative. Campbell (1960) and Simonton (1999) have considered creativity as a Darwinian process, and propose a process of idea variation and selection.
Another interesting process model of creativity is the incubation-illumination model (Wallas 1926). Illumination is more or less synonymous with insight. Insight problems are a tool that psychologists have used to study this phenomenon. These are puzzles that no amount of step-by-step reasoning can solve. They often involve setting up some functional fixedness (commonly known as a mental block). Insight occurs when the problem is suddenly seen from a different angle. One claim is that conceptual combination processes can yield insight, but are beneath the level of consciousness. The special process model holds that these problems require completely different brain processes from logical or verbal problems (Schooler, Ohlsson, and Brooks 1993).

Wiggins' Creative Systems Framework (CSF) (Wiggins 2006) is a more formal descendant of Boden's theories of artificial creativity (Boden 1992). It describes creativity in terms of the exploration of conceptual space. It consists of the universe of all possible concepts U, an existing conceptual space (for example domain knowledge) C, rules (constraints) R that define this conceptual space, a set of techniques T to traverse the space, and an evaluation method E: a way to assign value to a location c that yields a fitness function. Exploratory creativity is said to proceed as follows: if traversal takes us outside the space of existing concepts, this results in an aberration. If the aberration proves valuable according to E, then the new point is included in the domain, and the conceptual space is extended. Wiggins claims that transformational creativity (a fundamental shift in the rules of the domain) can be viewed as no different from exploratory creativity, but on a meta-level. This is to say that a transformation of conceptual space can be achieved by exploring the conceptual space of conceptual spaces. Later we attempt to adapt this model to apply to a parameter space, to propose what creativity might mean in the (very reduced) case of adjusting continuous controls of a sound synthesis engine.

Dual Process Models of Cognition

The formal definition of intuition states that it is the ability to acquire knowledge without the use of reason. This is a rather negative definition, and inspires the question: what mechanisms are present in the brain apart from reason? A more positive approach to nailing down intuitiveness is to make use of the dual-process theory of reasoning (Evans 2003; Kahneman 2011). The dual-process hypothesis is that two systems of different capabilities are present in the brain. The first (System 1) is fast, parallel and associative, but can suffer from inflexibility and bias. The second (System 2) is more rational and analytical, but is slower, requires intentional effort, and has limited working memory. In this paper we shall use the more illustrative terms Implicit and Explicit to refer to System 1 and 2 respectively. Table 1 lists descriptions of the two systems, taken from Stanovich and West (2000).

System 1 / Implicit                        | System 2 / Explicit
associative                                | rule-based
holistic                                   | analytic
automatic                                  | controlled
relatively undemanding                     | demanding
fast acquisition by biology + experience   | slow acquisition by cultural and formal tuition
evolved first                              | evolved recently
short term reactions                       | long term planning
parallel                                   | serial
large associative memory                   | limited working memory

Table 1: Contrasts between implicit and explicit thinking (Stanovich and West 2000)
This portrayal is often used by social psychologists to explain why many decisions that humans take (under, for example, time constraints) seem to be irrational (De Martino et al. 2006). The theory, however, is also relevant to a great deal of other human behaviour, including problem solving, human-computer interaction, and surely creativity. It should be noted that both these systems are extremely broad, high-level categorisations. Implicit processing, for instance, encompasses a whole host of perceptual, motor, linguistic and emotional systems. For this reason Stanovich (2009) proposes that the implicit system should be called TASS (The Autonomous Set of Subsystems), and also suggests that the explicit system breaks down into two subsystems: the reflective and the algorithmic.

How might the two processes relate to creativity? Holistic thinking has historically been associated with the right brain, and also with creativity. However, whilst left/right asymmetries can be dramatic (McGilchrist 2009), creativity is unlikely to be an exclusively right-brain phenomenon (Dietrich 2007). One might also conflate divergent thinking with the fast-unconscious system, and convergent thinking with the slow-conscious. However, tacit thinking is mostly quick-access default behaviour, and can be stubbornly inflexible: exactly the opposite of novel idea generation. It is also clear that explicit thinking can create wildly divergent ideas. That is, by asking new questions, intentionally avoiding the obvious by imposing constraints, or redesigning the creative process itself, a point in the solution space may be reached that is very distant from existing concepts (Joyce 2009). This nonetheless relies on a conscious, symbolic, and often systematic approach. Therefore a particularly important aspect of the explicit system's abilities is reflection, or meta-cognition: the ability to inspect one's own thoughts (Buchanan 2001). In Pearce and Wiggins' cognitive model of the composition process, at least three out of the five processes relate to reflective abilities (Pearce and Wiggins 2002). So associating artistic creativity with intuitive thinking misses the fact that transformations can result from using analytical symbolic thought to intentionally change the rules, strategies and even value systems of the creative domain. Next we shall investigate the ramifications of both fast and slow systems being able to conduct both divergent and convergent strategies, and try to define them in terms of solution-space traversal mechanisms. This model then prompts consideration of how the interface may help or hinder creative work.

Creative Interaction with Synthesis Parameters

The CSF terminology becomes useful for asking what creativity might mean when navigating a finite, continuous parameter space, such as that provided by a music synthesiser. Whilst the complete CSF is not yet rigorously applied, the main components map well onto the various elements of the human-computer system. As the musician is interacting with the parameter space, and is constrained by it, it is ostensibly a space of viable compositions C_param, and the interface provides T_param: the mechanisms to navigate the space. Obviously there are cultural and emotional associations that sounds may possess that are not represented in this very reduced domain. Parameters such as pitch, filter cut-off frequency, and amplitude envelopes only represent the lowest levels in the hierarchical conceptual space of music.
Nevertheless, for this work we assume that the higher-level concepts mainly influence E. By assuming that the evaluation of the fitness of a given point in parameter space is carried out by the user, difficult questions such as the cultural associations of particular sounds can be side-stepped. The interface designer can assume some complex fitness function is being optimised, without needing to know its exact form (though interesting work has been done both tracking users' paths through solution space and obtaining value ratings (Jennings, Simonton, and Palmer 2011)). However, this does not mean that the navigation of solution space is entirely carried out within the brain. The constraints and affordances (Norman 1999) of the tools, notations and abstractions used for composition have a significant effect on the finished product (Mooney 2011; Magnusson 2010). For example, the following situations may arise:

1. The composer will sometimes have an idea in mind, and will therefore need to optimise parameters such that the idea is realised.
2. The composer will, at other times, not have anything specific in mind, and is looking to engage in an exploratory process that may produce inspiration.

These two scenarios map very well to notions of convergent and divergent thinking. In the first case the creative act has already occurred in the brain of the composer, and all that is necessary is an interface that enables the user to adjust parameters such that the data converges to the idea. Such would be the case in live performance of a score: the piece exists, but should be realised accurately, and according to the performer's expressive intent. This is of course a great design challenge. But the second scenario is just as important: the composer embarks on an interactive journey, and unpredictability is a key ingredient. Accidents and surprise are often seen as key components of the creative process (Kronengold 2005; Fiebrink et al. 2010). Therefore it would appear that some of the divergent thought can be outsourced to the technology. These technological flukes are analogous to the aberrations in the CSF. Thus the design of the instrument affects creativity, not just in the surface sense that different instruments have varying timbres, but in the deeper sense that the interface frames and guides the process, similar to the way language guides thought, or the way unconscious priming may change behaviour. A previous experiment has shown that divergent and convergent stages can be best served by different types of interfaces (Tubb and Dixon 2014). Divergent and convergent modes seem also to have a different relationship to E. Many musicians and sound designers intentionally put themselves into states of mind where they temporarily suspend criticism (it is unlikely that they turn off all judgement; it could be that they switch to fast, gut-feeling assessments (Implicit) rather than more demanding analytical, art-theoretical evaluations or a theory of other minds (Explicit)). This implies that it is useful to disengage evaluation, in order that local minima in that fitness function may be escaped.

The mapping of physical controllers to sound synthesis parameters has been an active research topic for at least twenty years (Winkler 1995; Wanderley and Depalle 2004). Mapping has a significant effect not only on what sounds are easy or difficult to create, but also on the subjective experience of the user. The principal distinctions between types of mappings are as follows (Hunt, Wanderley, and Kirk 2000). One-to-many: one control dimension is mapped to many synthesis parameters. Many-to-one: many control parameters affect one synth parameter. Many-to-many: a combination of the above.
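One way to see these mapping types on a common footing is as a single matrix from control dimensions to synthesis parameters, as in this illustrative sketch (the coefficient values are arbitrary):

```python
import numpy as np

# Rows are synthesis parameters, columns are control dimensions; a
# one-to-one mapping would be the identity matrix.
one_to_many = np.array([[0.8],     # a single fader drives brightness,
                        [0.5],     # loudness,
                        [0.3]])    # and vibrato depth together

many_to_many = np.array([[0.7, 0.2, 0.1],   # every control nudges
                         [0.1, 0.9, 0.3],   # every parameter to some
                         [0.4, 0.1, 0.6]])  # degree

def map_controls(mapping, controls):
    """Turn a control gesture into synth parameters, clipped to [0, 1]."""
    return np.clip(mapping @ np.asarray(controls), 0.0, 1.0)

print(map_controls(one_to_many, [0.5]))             # 1 control -> 3 params
print(map_controls(many_to_many, [0.5, 0.2, 0.9]))  # 3 controls -> 3 params
```

In this picture a many-to-one mapping is simply a row with several nonzero entries, and a dense matrix couples the dimensions so that they cannot be manipulated separately.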
Research has shown (Hunt and Kirk 2000) that complex many-to-many mappings appear to be more effective for expressive performance, and may lead to greater performance improvements with practice. This seems to imply that if a mapping is multi-dimensional, and confounds the user's attempts to analyse and manipulate the dimensions separately, then implicit learning cognitive systems are employed. Dimensions that are amenable to being bound together perceptually are termed integral (Jacob et al. 1994). For example, colour space is formed of three integral dimensions, whereas colour and position are mutually separable. Timbre space is largely integral, so one may question the approach of providing its dimensions separately. Practice of a complex controller is less like carrying out a series of commands, and more like learning to ride a bike. Eventually this leads to increased processing bandwidth in the action-perception loop. Hunt also suggests that implicit learning frees up explicit resources to work on other things.

1 It is unlikely that musicians turn off all judgement. It could be that they switch to fast, gut-feeling assessments (implicit), rather than more demanding evaluations using analytical, art-theoretical criteria or a theory of other minds (explicit).

Figure 1: A cognitive model of altering synthesis parameters to match a desired goal. With the use of complex multidimensional controllers (the upper action-perception loop), implicit processes are hypothesised to compute mappings from multidimensional feature sets to motor movements beneath conscious awareness.

A tentative cognitive model of how the implicit and explicit systems navigate parameters is shown in figure 1. This applies to the case when the composer has a specific target in mind, although there is always the possibility that a chance discovery will produce an aberration and an alternative target may be suggested. On the left is the technology: the sound parameters and synthesis engine. Two interfaces are shown: the lower a unidimensional interface (a slider or WIMP interface), the upper a multi-dimensional one (a physical controller with a complex mapping). If the multi-dimensional interface is well learned, then automatic, holistic processing can process in parallel a large number of features that must otherwise be sequentially adjusted, whilst the goal and its features are held in working memory. The drawbacks of the fast action-perception cycle are, firstly, that to become accurate it requires large amounts of practice, and secondly, that it will be poor at adapting to unencountered target sounds or interface mappings. It is worth noting that these drawbacks only apply in the convergent case. For divergence, even an unlearned multidimensional interface may be beneficial (Tubb and Dixon 2014).

A Four-Strategy Model of Creative Interaction

This theory details how a simple two-stage model of creativity (divergence vs. convergence) and dual process theory (implicit vs. explicit) can be combined to inform the design of creative composition interfaces. It is worth setting out the exact scope of this model. It is not intended to be a model of separate systems within the brain. It is not intended to have any predictive power outside the domain of interaction with a parameter space, though it may prove useful in other areas, and we speculatively propose how these four strategies may interact to produce insight. Furthermore, important cultural, personality and emotional considerations have been ignored.
It addresses only what Boden (1992) terms P-creativity, rather than the H-creativity found in culturally significant achievements. Specifically, it is intended to be a categorisation of parameter search strategies, a summary of how those strategies work together (or not) to create novelty and value, and an account of how parameters should be mapped to gestures to assist each of these processes. This design methodology should prevent the designer forcing the user into the wrong creative problem-solving strategy at the wrong time.

Divergent and Convergent Solution-Space Traversal

First of all we attempt to define divergent and convergent processes with reference to the CSF (Wiggins 2006). Convergent processes are traversal mechanisms that improve the fitness of solutions. These could operate over a series of discrete options (for example, selecting the best sound from a number of candidates) or over a continuum (for example, finding the best setting for a synthesis parameter). Convergence requires both a fitness evaluation E, and some prediction of what change will increase value, which yields a parameter traversal strategy T. E is therefore actively employed in guiding T. This is analogous to a gradient descent algorithm (such algorithms are said to converge on a solution). So whilst some models of creativity postulate generative and evaluative stages, where convergence is just evaluation and selection, in our model convergence can still change the solution (i.e. incremental improvement rather than just evaluation or selection; cf. the honing theory of creativity (Gabora 2005)). A second method of convergence is more analytical: one where E can be broken down into smaller individual success criteria, each of which requires a non-creative solution.

Divergent processes are different in that they set aside questions of improving any fitness value, and generate candidate solutions distant from the current ones, e.g. creating lots of more or less randomly scattered points. E may still operate in the background in order to spot promising new ideas, but is disengaged from directly determining T, in order to prevent it revisiting unoriginal ideas. An alternative divergent approach can be carried out on the meta-level: deliberately transforming the fitness function or the constraints.

Convergence by itself will rarely produce novelty, as multiple runs will settle in the same local minimum. Divergence by itself will produce useless noise. It is the careful blending of these processes that yields progress. Examples abound from machine learning that combine both divergent and convergent behaviours, such as random forests, genetic algorithms and particle swarm optimisation. Balancing the two tendencies is also known as the exploration-exploitation trade-off (Barto 1998). Often such algorithms progressively reduce the diversity component as the search progresses. So by defining divergence and convergence in this way, we see that by strategically connecting and disconnecting judgements of fitness from the parameter navigation strategy, the musician can produce both novelty and value.

Figure 2: The four quadrants of implicit vs. explicit thinking (left/right) and divergent vs. convergent thinking (top/bottom). Examples of useful information transfer are shown in green. Examples of detrimental interference effects are shown in red.
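These definitions can be made concrete with a toy parameter-space search. The sketch below is our illustration under stated assumptions, not an algorithm from the CSF literature: E is a black-box fitness function, convergent moves are small perturbations kept only when E improves, divergent moves choose the next candidate with E disengaged, and, as in the machine-learning examples above, the diversity component is progressively reduced.

    import random

    def search(evaluate, dims=4, steps=2000):
        """Blend divergent jumps (E disengaged from T) with convergent
        hill-climbing (E guiding T), shrinking the divergent share over time."""
        best = [random.random() for _ in range(dims)]
        best_fit = evaluate(best)
        for step in range(steps):
            if random.random() < 1.0 - step / steps:
                # Divergent move: a distant random candidate; E plays no part
                # in choosing it.
                cand = [random.random() for _ in range(dims)]
            else:
                # Convergent move: a small perturbation of the current best.
                cand = [min(1.0, max(0.0, x + random.gauss(0, 0.05)))
                        for x in best]
            fit = evaluate(cand)      # E still watches in the background...
            if fit > best_fit:        # ...spotting promising new points.
                best, best_fit = cand, fit
        return best, best_fit

    # Toy E: closeness to a hypothetical preferred synthesiser setting.
    target = [0.2, 0.9, 0.5, 0.7]
    print(search(lambda p: -sum((a - b) ** 2 for a, b in zip(p, target))))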
The Four Quadrant Model

The central hypothesis in this section is that both fast and slow brain systems may conduct convergent or divergent searches. This results in four distinct parameter space traversal strategies. Figure 2 shows the four categories: divergent-implicit (exploratory), divergent-explicit (reflective), convergent-implicit (tacit) and convergent-explicit (analytic). These may be strategies carried out within the brain (conceptual space traversal), or actual manipulations of the controls of an instrument (parameter space traversal). Below, each quadrant is described in more detail, both in terms of cognitive processes and in terms of interfaces that may augment them.

Exploratory (implicit-divergent) refers to stochastic, associative, combinatorial or transformational processes that can quickly generate a large number of points across a solution space. Examples may be the unconscious process of conceptual recombination, techniques such as brainstorming, or simple playfulness. Computers effectively generate random, transformed and recombined data, so exploration is easily augmented.

Tacit (implicit-convergent) is intended to refer to those instinctive or learned techniques that quickly produce a valuable, but probably unoriginal, local solution to a problem. These could be instinctive, or learned well enough to become automatic. The appropriate interface is a well-learned, complex, multi-dimensional, space-multiplexed interface such as a traditional musical instrument, but it could also be an interaction metaphor, such as a physical model, that makes use of an instinctive understanding of the physical world.

Analytic (explicit-convergent) processes break a problem down into separate components, and solve them in a sequential way. In the solution space they proceed in a city-block fashion, one dimension at a time. An analytic interface is one such as a DAW2 that provides individual parameters as knobs and sliders, and sequential, time-multiplexed input devices such as the mouse. These tend to rely on the perceptual aspects of the parameters themselves being fairly independent and separable. The great advantage of this mode is that complex problems can be broken down into simpler parts. With well-defined goals and predictably behaved parameters, accurate location of desired solutions can be achieved in linear time, despite the exponential increase in the size of the space.

Reflective (explicit-divergent) refers to meta-cognitive analytical methods that can take existing conceptual spaces and infer new ones: proposing entirely new problem spaces by asking questions or generating hypotheses. One mechanism is that the analytic system transforms the solution space, the constraints and/or the fitness function, deliberately forcing converged points out of their local solution-finding complacency3. Other reflective strategies may be the use of metaphor and analogy. For truly transformational creativity this meta-exploration ability is essential. A reflective musical interface might be one that offers the ability to create new musical abstractions, for example a musical programming language (Blackwell and Collins 2005; Bresson, Agon, and Assayag 2011).

The final component to add to this model regards the evaluation process. Judgement too can be divided into implicit and explicit manifestations. Implicit judgement is fast and affective ('I like this' or 'I don't like that'). Explicit judgement is more demanding, but it is more of a sighted process, i.e. one also providing the value-function gradient ('I like this because...' or 'I don't like that, it needs the following...'). All four quadrants play a part in creativity.

2 The Digital Audio Workstation: effectively a software reconstruction of an entire recording studio.
3 A useful analogy would be tipping the surface of a tilt maze in order to extract a ball from a hole, and help its progress to the final goal.
Take the incubation-illumination model as a (highly speculative) illustration, purely in the cognitive domain. Preparation is the process of asking a new question, or finding a new problem (reflective), and attempting to solve it consciously via the (methodical) solutions of the past. Applying methods based on past rules and concepts leads to repeated failure, but this process is both activating concepts in the subconscious for recombination (a process known as priming), and tacitly learning how to quickly select a solution (constructing a neural fitness landscape that will function as an unconscious solution recogniser). At some point one of the many divergent (exploratory) subconscious combinations will be implicitly recognised, and then miraculously provided to the conscious mind4. In this way implicit parallelism can be set to work exploring large regions of a complex solution space.

Insight may be an example of when these strategies gel; however, there may also be inhibition effects (some are shown as red arrows in figure 2) when they work against one another. Probably the single most important inhibition effect is that explicit processing is serial, with limited working memory. Therefore if it is fully engaged with analytic processing, e.g. dealing with many separate musical parameters, there will be fewer resources available for meta-cognition and high-level reasoning. Tasks such as critical listening have been shown to suffer under interface-induced higher cognitive load (Mycroft, Reiss, and Stockman 2013). Other inter-quadrant interference effects include explicit monitoring, also known as analysis paralysis: a phenomenon where, if an attempt is made to consciously control an automatic action, performance suffers (Masters 1992; Wan and Huon 2005). Habit naturally inhibits exploration: an automatic action will tend to be repetitive and inflexible (Barrett 1998). One final inhibition effect is that analytic thought involves narrowed attention: users may be less open to peripheral cues and remote associations emerging from exploratory processes (Ansburg and Hill 2003). This prediction seems to align with many users' reports of using computers to make music: the fact that they can get hung up on details, lose perspective and miss the big picture of what they are attempting to express. Evaluation of one's own work requires taking a step back to get a perspective of structure at longer time scales (Nash and Blackwell 2012). Lack of perspective can be a problem when manipulating complex interfaces: 'Participants voiced strong feelings that computer-music systems encouraged endless experimentation and fine-tuning of the minutiae of sound design, in conflict with pushing forward and working on higher-level compositional decisions and creating finished works.' (Duignan, Noble, and Biddle 2010) Unfortunately the reflective attention-monitoring system may itself be inhibited, thereby preventing the realisation that perspective has been lost. So, in summary, there seems to be a high risk that explicit-convergent interfaces may inhibit high-level transformational creativity.

4 Wiggins proposes that the criterion for admission into consciousness is not only the certainty of the idea as a good solution, but also an information-theoretic measure of surprise: implying that novelty generation is practically hard-wired into the threshold between implicit and explicit thought (Wiggins 2012).
Discussion

The principal application of the above framework is to generate a number of guidelines by which to design and evaluate creative interfaces. Some of these will already correspond with those put forward within the HCI and DMI literature; some may be novel. However, we propose one underlying principle: just as the dimensional structure of the interface (how the parameters are presented and mapped) must match the perceptual nature of the task (Jacob et al. 1994), so also the structure of the interface must be able to match the current creative strategy of the artist. The computer interface should follow the human thought process as closely as possible, not only in terms of the steps required to render a final product, but also in terms of the different geometries of the search strategies employed to discover that final product. Therefore the interface must support exploratory, reflective, tacit and analytic modes. We propose that the incubation-illumination cycle outlined in the previous section is already somewhat mirrored in creative technological interaction. However, to date this has not been specifically designed for, so there is surely room for improvement. Technologies exist that augment each individual quadrant, but what is principally lacking is easy transitions between strategies. For example, switching from instrumental play to computer-based editing to designing one's own musical abstractions is currently quite demanding, and generally stalls any creative flow. How could all four modes be provided without merely increasing the cognitive load? How, specifically, are these twelve possible5 transitions to be carried out? This is our topic of further research.

Almost all user interfaces for creative software provide parameters such that features are edited in a separate, serial fashion. These interfaces are used to create music, animation, industrial design, architecture and computer games. They find their way into almost every aspect of 21st-century digital culture. If this interaction paradigm really does change the way that people are creative, this seemingly innocent and logical arrangement may already have had significant consequences for the quality of artistic innovations. Will new multidimensional interaction devices encourage a different approach? Currently, this is just a speculative model, albeit one informed by and retrodicting other research and experiences in the electronic music community. Further work will attempt to find evidence for the efficacy of this approach via experiments, interaction data analysis and interviews regarding artists' own strategies of using computers to be creative.

5 Enumeration of all of these is beyond the scope of this paper. However, one illustrative example would be to start by improvising with a complex tacit interface, but then abstract major themes (perhaps automatically) from that improvisation. These themes would then be gathered in a reduced space, to be explored, recombined and performed using the same multi-dimensional interface. Themes in the explorations of this new space could again be extracted, producing a recurrent exploratory/reflective process that also leverages tacit skill.
2014_30 !2014 A Model of Runaway Evolution of Creative Domains Oliver Bown Design Lab, University of Sydney, NSW, 2006, Australia oliver.bown@sydney.edu.au

Abstract. Creative domains such as art and music have distinct properties, not only in terms of the structure of the artefacts produced but also in terms of their cultural dynamics and relation to adaptive functions. A number of theories have examined the possibility of functionless cultural domains emerging through a runaway evolutionary process. This includes models in which engaging in creative domains is actually counterproductive at the individual level, but is sustained as a behaviour through an evolutionary mechanism. I present a multi-agent model that examines such an evolutionary mechanism, derived from these theories.

Introduction

The study of computational creativity involves both general theory and domain-specific theoretical and experimental studies. Domains such as music, visual art and humour have very different properties owing mainly to the ontological and structural nature of the artefacts produced. But we also know that these domains have different socio-cultural natures. For example, Hargreaves and North (1999) and Huron (2006) discuss social functions and contextual factors that appear to be specific to music, and may not have any relevance to art or humour (although they could). A major contribution to computational creativity therefore involves the computational modelling of specific domains, as in the classic examples described in Miranda, Kirby, and Todd (2003), and more abstract notions of creative domain dynamics, as studied by Saunders and Gero (2001) and Sosa and Gero (2003), drawing on the theoretical formulation of Csikszentmihalyi (1990). The specific analysis of creative domains (their origins, dynamics and relations to individual motivations) makes a critical contribution to computational creativity by framing how we should understand the evaluation of automated creative agents acting in those domains. This paper follows the latter work but looks at the more fundamental evolutionary question of the emergence of creative domains, i.e., how humans came to exhibit behaviour in specific realms such as art and music, either through genetic or cultural evolutionary processes.

The approach used here follows the epistemological method, established in multi-agent modelling fields such as artificial life (Di Paolo, Noble, and Bullock, 2000) and computational social science (Conte et al., 2012), of attempting to reveal novel mechanisms through the study of the emergent qualitative outcomes of local interactions in computer simulations. The model presented in this paper is based on theories of the evolution of music and takes the form of a minimal abstract model of biological evolution. However, it does not directly look at modelling music, but at a proposed model of underlying social interactions that would allow a runaway evolutionary process to take place. Theoretically this is grounded in the ideas of cultural evolution provided by Boyd and Richerson (1985) and Laland, Odling-Smee, and Feldman (2000). In the language of Laland, Odling-Smee, and Feldman (2000), the model is an experimental study of the construction of cultural niches, which remains generic for the sake of simplicity, but could later be developed into a specific model of the construction of a music niche, or applied comparatively to different creative activities. A niche is defined here as a site of fitness acquisition for an individual.
Niches can be pre-existing, as in the use of trees by birds, or constructed, as in the alteration of an ecosystem by a beaver building a dam. The model can be interpreted as a general model of runaway evolution of creative domains. In my conclusion, I discuss the applicability of this model, and more generally this type of modelling, to developing a richer understanding of creative domains that may inform computational creativity. This and similar models provide candidate properties of creative domains that directly inform the way we view the analysis and evaluation of individual creative systems within specific domain contexts.

Runaway Theories of the Origins of Musical Behaviour

The origins of music are mysterious and highly contested. In The Descent of Man (1883), Darwin introduced the principle of sexual selection and suggested that various aspects of human appearance and behaviour, including music, may be sexually selected. The theory of sexual selection states, in modern genetic terms, that since reproductive achievement is key to the perseverance of genetic lineages, genetic adaptations that increase one's attractiveness to potential mates will prosper. The theory of sexual selection was developed considerably by Fisher (1915), who proposed that a runaway selection of arbitrary traits could occur if male traits and female preferences coevolved (since females typically have the greater investment in reproduction, they are typically the choosier sex). The question of whether sexually selected traits can be fully arbitrary has been the subject of much debate. As part of a general principle that underlies the contemporary study of honest signalling theory, Zahavi (1975) proposed that female preference is likely to be guided towards traits that are actually an external (visible or audible) indicator of some positive quality. Thus when male traits and female preferences coevolve, it is those pairings that lead to stronger, fitter males that persevere. For example, the quality of a bowerbird's nest indicates the ability of the bowerbird in foraging.

More recently, Miller (2000) has revived the argument that music, amongst other aspects of human appearance and behaviour, is sexually selected. Miller presents musical ability as an indicator trait of general intelligence and health. The theory continues to attract attention but competes with a number of other theories about the origins of musical behaviour. Two strong competing theories are that music serves some cooperative function (Brown, 2007), and that music has no function at all, instead being a cultural innovation that exploits human aesthetic preferences (Pinker, 1998). Both runaway sexual selection and this cultural exploitation theory fit well with an apparent lack of function in music. Although evidence does exist to support social functions in music that would support the cooperative view, this view has also struggled to gain traction due to uncertainty surrounding plausible mechanisms for the evolution of altruism (Fisher, 1958). The sexual selection view has also been criticised because of a lack of typically sexually dimorphic traits in humans with respect to music, and the prevalence of music in situations that appear to have nothing to do with courting, such as at funerals and heavy-metal concerts (Huron, 2001). However, runaway evolutionary processes are not limited to sexual selection. Zahavi's (1975) examples of honest signalling, for example, extend to other coevolutionary situations.
Boyd and Richerson (1985) propose a runaway cultural evolutionary process based on a set of heuristics describing how individuals adopt cultural traits, based on frequency and status. They hypothesise that people are more likely to adopt a cultural trait the more other people adopt that trait, and the higher the status of those people. They also propose that minimal discrimination is applied to the choice of traits to adopt, on the basis that false positive assumptions are more acceptable than false negative assumptions. In this way potentially arbitrary traits exhibited by high-status individuals can easily and rapidly become adopted. Blackmore (1999) develops similar principles through the theory of memetics, and suggests that various aspects of culture, even language, might be understood as having emerged as parasites, exploiting human behaviour to become established. These views align with Pinker's view of music as a functionless cultural innovation. Such theories also raise the possibility of a coevolution between genes and culture, which has been explored by a number of theorists, most notably Laland, Odling-Smee, and Feldman (2000). Their extensive theoretical and empirical review suggests that sexual selection and Boyd and Richerson's runaway cultural evolution are just instances of a more general tendency for runaway evolutionary processes to occur between environments and organisms, and that there may be other ways in which runaway evolution could occur in cultural systems. Here the term environment includes culture, and culture is viewed as a site with great potential to exhibit runaway evolutionary processes.

A Model of Runaway Evolution

Little research has been done into how specific cultural forms such as music might be explained by runaway evolution. In this paper, I present a model that provides a very simple mechanism whereby runaway selection of arbitrary cultural domains can become established. The model is predicated on the broad question underpinning runaway evolutionary processes: under what circumstances will populations of individuals evolve to exhibit traits or engage in behaviour that has no net advantage? Models such as those of runaway sexual selection present such circumstances and show how they are viable. Whilst peacock tails are a burden to peacocks as far as flying or escaping predators are concerned, they give the individual peacock with the better tail a reproductive advantage and thus a net fitness gain. The peacock's tail is understood in terms of the niche created by the peahen's evolved sexual preference, and vice versa. By analogy, in the present case, the goal is to examine examples of cultural behaviours where a similar emergent cultural niche could be established. In our case, we choose to examine a scenario that is underpinned not by sexual selection but by economics. Primate social organisation is sufficiently complex to lend support to the idea that human evolution has been guided by very simple but significant forms of economic interaction. In particular, simple forms of transferrable wealth might have had the capacity to influence fitness dynamics, stimulating the emergence of new cultural niches through positive feedback. Transferrable and cumulative wealth has the capacity to influence evolutionary fitness by allowing one person to effectively take fitness from another person, and, on a macroscopic scale, for societies to develop systems by which to organise their collective wealth, in effect providing some top-down determination of fitness.
Under such circumstances, the nature of that social system would have a significant influence on an individual's choice of fitness strategies, and this might ultimately have an influence on culturally evolved behaviour, and possibly even a genetic influence. Note that transferrable wealth could mean something such as rights to land, which is not achieved technologically but merely requires a simple concept of ownership or title, although in the present case wealth is also considered cumulative, which might entail something being harvested, or simple things such as clothing being made. Given their simplicity, these factors plausibly predate the creative domains under consideration.

But what has this got to do with creative domains such as art and music? A number of recent studies have looked at how creative success is organised at a social level, suggesting that there is inherent positive feedback in the way that we allocate reward for creative achievements. Salganik, Dodds, and Watts (2006), for example, show that music ratings are directly influenced by one's perception of how others rated the music, not just in the long term but at the moment of making the evaluation. The result is a winner-takes-all outcome, where a piece of music that is rated highly by others is more likely to be highly rated in the future, as long as people are aware of the already-high regard given to the work. Rather than creative works being directly appraised in terms of their content, they are appraised as social artefacts, subject to social processes that transcend the creative content itself. If this is true, then one potential effect of individuals engaging in creative domains is to create winner-takes-all redistributions of some social entity, most broadly described as prestige, that may be assumed to relate in some way to wealth. Accepting the assumption that any given creative domain has no other fitness-enhancing function, in evolutionary terms it can then be understood as a time and effort commitment that needs to be explained. The present model looks to reduce such a scenario to its simplest abstraction and consider the evolutionary effects (whether genetic or cultural). In particular, it asks whether it is possible that the creative domain acts to reinforce itself over time, thus providing an evolutionary explanation in the form of niche construction. For this to be demonstrated, a population must be shown to transition from not engaging in the creative behaviour to engaging in it. This occurs when those who engage in the creative behaviour are more successful over evolutionary time than those who do not. The model presented here looks at how this can happen despite the net average benefit for engaging in creative domains being lower than for avoiding them.

Model Design1

The model has a very specific purpose, which is to show how an arbitrary activity can emerge amongst a population of rational, selfish agents. Underlying the model, a simple economic system is implemented in which wealth is tied to evolutionary fitness. Agents with higher wealth have a greater chance of survival and are therefore driven by natural selection to maximise wealth. The purpose of the model is to demonstrate evolutionary scenarios in which emergent social conditions favour acting in an apparently irrational way, by engaging in an arbitrary functionless behaviour: a game. The functionless behaviour in turn provides the conditions for runaway evolution.
Note that evolution here can refer to the evolution of genes or of culturally (vertically) acquired traits, interchangeably. Thus the model works as either a biological or a cultural evolutionary model. For the purposes of this paper I refer to genes in the model, but these can be replaced by memes that are vertically transmitted. The model consists of a fixed population of N agents. Evolutionary competition is implemented through tournament selection. Each agent has the following genetic variables:

Tendency to play the game (Gi): the probability that an agent will choose to play the game in a given round. At each time step, each agent is identified either as a gamer or a non-gamer.
Competence (Ci): the game is predominantly random, but there is a bias towards agents with a higher competence.
Taxation vote (Ti): all agents vote on a level of taxation that non-gamers should pay into the game; the tax at each round is the average of these Ti.

Each agent also has a wealth variable, W, which is modified through transactions as described in the sequence below. The following sequence is run at each time step:

1. All agents accumulate a fixed pay, p.
2. A globally imposed non-gamer tax, t, is calculated as the average of all agents' taxation votes, Ti.
3. All agents are asked if they wish to play the game in the current round, resulting in a number n of gamers. The tendency to play the game, Gi, is treated as a probability that determines this choice.
4. All gaming agents pay a fixed cost, c, whilst all non-gaming agents pay the non-gamer tax, t. Non-gaming agents also receive the fixed non-gamer bonus, b.
5. The game winner is determined as follows: two different agents are randomly chosen from the list of gamers. The agent with the greater competence, Ci, of these two candidates wins. In the case of equal competence, a random agent is the winner. The winner receives all of the bids, n × c, and all of the tax, (N − n) × t.
6. A fixed number m of reproductive tournaments are run as follows: two different agents are randomly chosen from the population. The agent with the greater wealth is the winner. In the case of equal wealth, a random agent is the winner. The loser is replaced by a child (a mutated copy) of the winner. The parent gives a fixed proportion, w, of its wealth to its child.
7. All agents' wealth is depreciated by a wealth depreciation coefficient, d (0 ≤ d ≤ 1). Each agent's wealth is scaled by this number.

Children's Gi, Ci and Ti values, the genetic variables, are copies of the parent's with a Gaussian mutation with a standard deviation of 0.001. Gi and Ti values are constrained between 0 and 1. Ci values are constrained only with a lower bound of 0. Unless otherwise stated, initial values for all agents are Gi = 0, Ci = 0.5 and Ti = 0. The model variables used in the studies detailed in this paper took the values specified in Table 1. Starting from an initial value of zero, an increase in the mean tendency to play the game, G, is then interpreted as a scenario in which game-playing behaviour has become established in the population. The model is designed to reveal the conditions that are required for this to arise.

1 All code for the software model can be found at https://www.dropbox.com/s/48oy1v32lx0utp0/LotteryMain.java. The variables described in this paper differ from those in the code, which are based on the scenario of a lottery game.

Table 1: Values used in experiments. All values are fixed except the experimental variable d.

Var  Description                                 Value
p    Pay for all agents at each time step        1
c    Cost of bid paid by gamers                  1
b    Bonus paid to non-gamers                    1
m    Reproduction tournaments per iteration      10
w    Proportion of wealth paid to children       0.2
d    Depreciation of wealth at each time step    0-0.999
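The original implementation is the Java program cited in footnote 1; the condensed Python sketch below is our reconstruction of the published sequence, not the author's code. Variable names follow the paper and defaults follow Table 1; details the text leaves open (tie-breaking, what happens when fewer than two agents play) are resolved arbitrarily and marked as assumptions.

    import random

    def time_step(agents, p=1.0, c=1.0, b=1.0, m=10, w=0.2, d=0.999):
        """One iteration of the published sequence. Each agent is a dict with
        genes G (tendency to play), C (competence), T (taxation vote) and
        wealth W."""
        N = len(agents)
        t = sum(a["T"] for a in agents) / N              # 2. non-gamer tax
        for a in agents:
            a["W"] += p                                  # 1. fixed pay
            a["gamer"] = random.random() < a["G"]        # 3. choose to play
            a["W"] += -c if a["gamer"] else b - t        # 4. cost, or bonus - tax
        gamers = [a for a in agents if a["gamer"]]
        if len(gamers) >= 2:                             # (assumption: no game
            x, y = random.sample(gamers, 2)              #  with fewer players)
            winner = max((x, y), key=lambda a: a["C"])   # 5. competence contest
            winner["W"] += len(gamers) * c + (N - len(gamers)) * t
        for _ in range(m):                               # 6. reproduction
            i, j = random.sample(range(N), 2)
            win, lose = (i, j) if agents[i]["W"] >= agents[j]["W"] else (j, i)
            child = {g: agents[win][g] + random.gauss(0, 0.001) for g in "GCT"}
            child["G"] = min(1.0, max(0.0, child["G"]))  # constraints as stated
            child["T"] = min(1.0, max(0.0, child["T"]))
            child["C"] = max(0.0, child["C"])
            child["W"] = w * agents[win]["W"]            # parent endows child
            agents[win]["W"] -= child["W"]
            agents[lose] = child
        for a in agents:
            a["W"] *= d                                  # 7. wealth depreciation

A typical run would initialise agents = [{"G": 0.0, "C": 0.5, "T": 0.0, "W": 0.0} for _ in range(N)] and then call time_step repeatedly, tracking the mean G over time.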
G is subject to dynamic selection pressures and can also drift if no strong selection is observed. Through propagation through a population, the range of drifting G values can appear to have low variance, so low variance is not considered sufficient to indicate strong selection. Instead, constraint of the variable to a specific range over a long period of time and over multiple runs is used: if G sits consistently above 0.8, it is concluded that a game behaviour has emerged in the population. We assume that individuals are equally able to generate transferrable wealth at a fixed rate, p, per time step. For the game, players put a unit of their wealth, c, into a pot, and one individual, chosen at random, wins the entire pot. In addition, we assume that game-playing has a fixed time cost. This is implemented as a further payment, b, to non-gamers. The relative values of p, c, and b therefore define a space of possible model parameters, with differing outcomes with regard to how G evolves.

Results

The wealth depreciation coefficient (d) was compared across four values: 0, 0.9, 0.99 and 0.999. In the first case, wealth is transitory: acquired at the beginning of each time step, then either spent in the game or kept, and then used to compete in tournaments. For values of d approaching 1, wealth becomes increasingly cumulative. This has two implications. Firstly, wealth reaches higher levels, since with a constant income the stable-state wealth value is greater; greater wealth takes longer to accumulate and means that individual gains are ultimately less relevant. Secondly, the gains of short-term successes stick around longer and are more likely to transform into reproductive success. These can also be transferred to children.

Figure 1 shows model outcomes for the d values 0.9, 0.99 and 0.999. Each graph shows the average of the tendency to play the game genetic variable, G, in the population over time, with 20 runs of the model superimposed on each graph. G tends toward its upper bound in models with d = 0.999, whereas it does not drift far from zero in models with low d (d = 0 and d = 0.9). Even for d = 0.999 there is the potential for G to drift down as well as up, indicating that population-wide game behaviour under favourable circumstances is not as strong an evolutionarily stable state as game-avoidance under unfavourable circumstances. These results show that the durable, transferrable forms of wealth discovered by humans create a situation conducive to the formation of game-playing.

Figure 1: Evolutionary runs with wealth depreciation coefficient, d, values of (from top to bottom) 0.9, 0.99 and 0.999. Each graph shows the mean tendency to play the game genetic variable, G, evolving over 100 million time-steps, repeated over 20 runs of the model. The taxation vote is allowed to evolve genetically.

Figure 2 shows a typical instance of the model for d = 0.999 and evolvable taxation vote T, with G in red and the mean taxation vote genetic variable, T, in green. Both values are attracted towards their upper bound of 1, with T more inclined to drift. It may be a reasonable assumption that these variables are positively mutually reinforcing, though this has yet to be tested.
In order to understand the specific economic pressures on individuals, a simplified study was conducted with the taxation vote set to a fixed value. To further clarify the model, the accumulated tax was not passed to the game winner, as described above, but was instead discarded. This makes it easier to measure the average expected incomes of individuals in the non-gaming and gaming categories, since average incomes are no longer frequency-dependent (as compared to the standard model, where tax channels wealth from non-gamers to the winning gamer).

Figure 2: An example evolutionary run with wealth depreciation coefficient, d, of 0.999. The graph shows the mean tendency to play the game genetic variable, G, in red, and the mean taxation vote genetic variable, T, in green, evolving over 100 million time-steps for a typical run.

In this simplified model, non-gamers gain (p + b − T) units of wealth at each time step. Gamers neither gain the benefit b nor pay the tax T. Since the game is zero-sum, their average income is simply p. Non-gamers all receive the average income, whereas gamers' real incomes are skewed according to the outcomes of individual games. Figure 3 shows the emergence of game playing (situations where G tends towards 1) for different values of T, under these conditions. For T = 0.4 game playing begins to emerge. The transition from non-game to game takes the form of a sudden phase shift with an erratic onset, and no transitions occur in the opposite direction, implying that game-playing is evolutionarily stable in the population once established. With T = 0.6 game playing consistently emerges. In the latter case, the average non-gamer income is (p + b − T) = (2 + 1 − 0.6) = 2.4, whereas the average gamer income is p = 2. Therefore even when the non-gamer group is fitter on average, the gamer group comes to dominate. This shows a minimum requirement for game playing to emerge. By comparison, the graph at the bottom of Figure 1 shows that this result is robust if T is allowed to vary genetically, even when initial values for G and T are zero.
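The income comparison can be set out explicitly. The following is our gloss on the arithmetic above (using the values p = 2 and b = 1 quoted for this study), not notation from the paper:

    % Expected per-step incomes in the simplified study, where the
    % collected tax is discarded rather than paid to the game winner:
    \[
      \mathbb{E}[\text{income}_{\text{non-gamer}}] = p + b - T,
      \qquad
      \mathbb{E}[\text{income}_{\text{gamer}}] = p
      \quad \text{(the game itself is zero-sum)}.
    \]
    % With p = 2, b = 1, T = 0.6: non-gamer 2 + 1 - 0.6 = 2.4, gamer 2.
    % Selection nevertheless favours gamers, plausibly because reproduction
    % is decided by wealth tournaments, which reward the winner's skewed
    % gains rather than the group's average income.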
Figure 3: Evolutionary runs with the wealth depreciation coefficient, d, fixed at 0.999. Each graph shows the mean tendency to play the game genetic variable, G, evolving over 100 million time-steps, repeated over 20 runs of the model. In this case the taxation vote is fixed at 0.4 (top) and 0.6 (bottom). Furthermore, in these instances taxes are not passed on to the game winner but are simply discarded.

Figure 4 shows the mean competence genetic variable, C, increasing steadily without limit for the same run as the graph in Figure 2. C exhibited this increase consistently across all runs, even with d = 0. By the model design there can be no circumstances under which lower C is advantageous, and there is always the occasional accidental game that selects in favour of higher C. The purpose of modelling C is not to show that it increases, which is inevitable and obvious, but to show that it has no impact on the emergence of the game behaviour, despite undermining its fairness. We can say that random success is sufficient for the game to emerge, and may enable the initial adoption of the behaviour, but that it is not strictly required. What matters is that the game is robust once established, and creates a stable scenario in which C is driven to evolve. In this model, C is just a numerical variable that is driven to evolve indefinitely towards higher values, but in its place more complex models could explore the potential for the game-playing niche to drive a runaway arms-race of game-playing skill, with each winner passing on the greatest skill traits to the next generations.

Discussion

Summary of Results

To summarise the key results, the model shows how a population can evolve an apparently economically irrational behaviour that drives inequality. A greater durability of wealth increases the tendency for game playing to occur, even if the net benefit to the average individual is lower. The emergence of evolutionarily stable game-playing behaviour creates a selective pressure driving a constant and rapid increase in game-playing ability, but as the population evolves together towards greater competence, the game itself is sustained. As discussed, the properties of this system resemble a set of hypothesised properties of creative domains, satisfying a niche-construction view of their emergence. The results therefore reveal a hypothesised emergent cultural niche which, to all intents and purposes, is functionless, but which provides a site for individual fitness acquisition (albeit achieved by lottery), and drives a runaway competitive coevolution amongst the population towards greater competency in this domain.

Figure 4: A simulation run (d = 0.999) showing the mean competence genetic variable, C, evolving over time. In all cases, including d = 0, C increased without limit.

On Randomness

The choice to base the model on a lottery-like game was not discussed in the theoretical background, but it too is grounded in a well-founded evolutionary concept. Given the evidence for winner-takes-all processes in human artistic domains, the possibility that randomness is a significant part of the process is actually something that should be seriously considered. A possible role of randomness in structuring social systems, proposed by Wilson (1994), supports a functional role for randomness. Along with heredity and meritocracy, Wilson (1994) shows that chance can and does play a role in the construction of socially structured systems. The clearest and most striking example of this is the determination of gender: a stochastic process occurring in development that leads to a prominent social distinction, underpinned by physiological divergence. Looking at the abstract properties of our biological system of gender, Wilson (1994) argues that there may be any number of other behavioural traits determined through a similar process: genetically determined phenotypic variations derived from a common genotype, allocated stochastically. They are, by this definition, not environmentally determined, and are therefore strictly chance allocations, not local adaptations. It is through a stochastic process that a given distribution of possible behaviours emerges, just as in the case of gender, where we end up with a roughly 50/50 split. Wilson proposes variation along a boldness-shyness personality scale as a candidate example. Assume that boldness and shyness are both proven to be optimal behavioural strategies in different social contexts (in the context of art we could map these onto traits such as creativity and conformity). Typically we think of phenotypic plasticity as the only approach to arriving at good context-dependent behavioural strategies such as these. A plasticity-based view of these traits is that an individual would learn from cues in their environment to be either shy or bold.
An equally plausible explanation, Wilson argues, is that the trait is randomly assigned by a stochastic developmental process. Assuming that, to some extent, individuals can find roles that suit their phenotypes (i.e., there are places in the social system where both shy and bold individuals can each thrive better than the other), and that an appropriate range of roles is available, then all individuals can emerge well-adapted. Thus a social structure that demands a mix of traits can coevolve with this kind of stochastic allocation of traits. The principle of self-organisation can explain the resulting assignment of roles. This explanation is also satisfying because the genetic mechanism for stochastically switching between two evolved behavioural variations is arguably simpler than the psychological mechanism required to work out which behavioural strategy is successful in a given, novel context. In addition, the precise source of randomness might lie at a number of different stages other than in the genetics. For example, boldness-shyness development could be triggered by events that are effectively random, i.e., there is nothing in the content of the trigger that conveys relevant information about the environment. In the case of creative domains, as suggested by the present model, creative success could be allocated randomly, with the effect that creatively successful individuals act to reinforce the existence of the creative domain for future generations. This is only to say that random allocation of creative success may be sufficient for the creative domain to work. In reality, creative success may also depend on non-random processes, as with our competence variable.

An important clarification of this principle is that it is not necessary for every individual to do equally well out of the situation for it to be evolutionarily viable: a principle well established by sociobiologists, as in the respective reproductive fitness of different individual ants in a colony. Instead, the process can produce clear inequalities. This parallels the principle of kin selection: kin-directed altruism is able to evolve in proportion to the degree of relatedness between kin, based on the fact that altruism between close kin is as good a way for genes to persevere as individual selfishness. Kin selection is widely believed to be the most robust mechanism by which cooperative behaviour emerges in nature (Maynard Smith and Szathmary, 1995).

Conclusion

The model of runaway evolution presented in this paper simply provides a mechanism whereby a pattern of behaviour resembling human creative domains can emerge. The provision of a mechanism does not in any way help prove the theory that music and other creative domains emerged through runaway evolution, but it enables predictions derived from the mechanism provided. The simulation model can be tested against studies of the nature of creative success over multiple generations, taking into account the relationship between creative success, core economic motivations, overall fitness and other contextual factors. In particular, the model predicts that the motivation to engage in creative domains is irrational in the short term but evolutionarily stable in the long term. We can test this by looking at the immediate payoff to art practitioners of varying levels of success.
The model predicts that this payoff would be poor in the short term, but that this apparently irrational behaviour could be explained by a process of reinforcement occurring at the social level, whereby creatively successful individuals effectively assert the status quo. Such factors provide a wider context for thinking about the evaluation of artificial creative systems. Evaluation as presently conducted on an individual, case-by-case basis (system by system, or output by output) may need to be revised to take into account a more complex understanding of the relationship between long-term creative dynamics and short-term creative success. Rather than building one virtual Mozart or virtual Picasso, we may need to deploy millions of them in virtual communities in order to truly understand creative success.

2014_31 !2014 Computational Creativity: A Philosophical Approach, and an Approach to Philosophy Stephen McGregor, Geraint Wiggins and Matthew Purver School of Electronic Engineering and Computer Science Queen Mary University of London s.e.mcgregor@qmul.ac.uk, geraint.wiggins@qmul.ac.uk, m.purver@qmul.ac.uk

Abstract. This paper seeks to situate computational creativity in relation to philosophy and in particular philosophy of mind. The goal is to investigate issues relevant both to how computational creativity can be used to explore philosophical questions and to how philosophical positions, whether they are accepted as accurate or not, can be used as a tool for evaluating computational creativity. First, the possibility of symbol-manipulating machines acting as creative agents will be examined in terms of its ramifications for historic and contemporary theories of mind. Next, a philosophically motivated mechanism for evaluating creative systems will be proposed, based on the idea that an intimation of dualism, with its inherent mental representations, is a thing that typical observers seek when evaluating creativity. Two computational frameworks that might adequately satisfy this evaluative mechanism will then be described, though the implementation of such systems in a creative context is left for future work. Finally, the kind of audience required for the type of evaluation proposed will be briefly discussed.

Introduction

In quotidian interactions, either on a personal or a social level, computers are such familiar devices that their operations are taken for granted as having the same kind of relatively universal grounding that humans engaging in interpersonal exchanges of information employ. When computers become either the platform for or the object of philosophical enquiries, though, it becomes necessary to talk about them as information-processing systems or as symbol-manipulating machines (per Newell and Simon, 1990): in this sense, the operations which computers perform must be seen as transpiring in an abstract space, defined by a system of information grounded somehow relative to an observer. This quality of computation immediately introduces a problematic element of subjectivity to the assessment of a purely informational system's ability to generate meaning, and an ambiguity arises over whether such a system can really autonomously produce output which has been invested with semantic content.
It is due to precisely this key feature of computational systems, their dependence on an observer for operational coherence, that computers have become an element in various philosophical discussions, often in the form of reductiones ad absurdum: exercises aimed at problematising both reductionist and internalist accounts of mental phenomena. Putnam (1988) in particular has argued for the computational significance of the internal states of a rock, while Searle (1990) constructed his famous Chinese room argument to demonstrate the absence of intentionality in machines which merely manipulate symbols, a stance subsequently used as a platform for questioning the very basis of cognitive experience. In these examples, computers come out as the foils for arguments about the intractable difficulty of defining or even talking about human consciousness. Rather than treating computers as the theoretical objects of thought experiments, this paper will argue, as Sloman (1978) did several years ago, that computers should be considered essential tools for doing good philosophy, and that in particular the question of whether computers can be autonomously creative is philosophically valid.

This paper's first objective is to place the field of computational creativity within the context of the philosophy of mind, and in particular to consider how the field might be used as a vehicle for empirically exploring the problem of dualism, which has been characteristically at the centre of questions regarding the mind and consciousness in modern Western philosophy. To this end, a strong counterargument to the traditional mode of dualism, which holds that the mind and physical matter occupy two mutually irreducible spaces, can be found in considering ways in which symbol-manipulating machines might be able to autonomously produce informational artefacts that are new and valuable, and that furthermore bear some sort of meaning relevant to the way in which the creative system itself operates. If a computational system can produce new, valuable artefacts in a way that is deemed suitably creative, and yet these systems are themselves reducible to manipulations of symbols grounded in the workings of a physical machine, there seems to be no case to make for the idea that the act of generating new meaning in the world transpires in some intangible mental domain.

The second objective of this paper is to propose a new mechanism for evaluating creative systems, motivated by insights into the way that humans view themselves. Taking the intransigence of the mind/body problem as a starting point, it will be suggested that it is precisely the kind of representational internal states that dualists have attributed to the immaterial space of the mind that should be sought in the operations of creative agents. While a positive assessment of the creativity of an informational system would clearly negate the premise of a mental space separate from physical reality, it is argued herein that precisely this negation serves as a good basis for using the mere impression of such states in a system as an ersatz device for evaluating the real presence of creative behaviour. To this end, two topical computational frameworks, vector space models and deep belief networks, will be put forward as candidates for future work in various domains of computational creativity, with the view that these approaches to computation have the potential to build conceptual structures which might be considered by some observers as corresponding to the type of mental representations attributed to humans.
Computational Creativity and the Demise of Dualism

Descartes' (1911) theory of a mind/matter divide, and the notion of internal mental representations which in particular have characterised the type of introspective reports of the mental space described by philosophers of this bent, have been at the centre of the development of modern Western philosophy, with subsequent canonical philosophers routinely name-checking Descartes. The dualism inherent in the mind/matter world view has, however, fallen so severely into disrepute with latter-day theorists of mind that a cognitive scientist recently felt comfortable in asserting that in the field today, 'even the word Cartesian is often used as a term of abuse' (Rowlands, 2010, p 12). Indeed, in their immensely public debate over the nature of consciousness, Dennett and Searle (1995) resort to mutual accusations of existential partitioning, with both thinkers avowing their own faithfulness to what they perceive as the fundamental, indeed explicitly monist, type of data on which an analysis of existence should be based and upon which any theory of consciousness must supervene. So the great feuds of contemporary philosophers have been characterised not by a debate over the extent of the merits and faults of dualism, but rather by quarrelling about the precise way in which this dead idea should be autopsied. Whether from the material perspective of reductionist science or from the subjective vantage point of emergent intentionality, the idea that the mind inhabits some physically irresolvable realm has been rejected.

This rejection has done little, however, to mitigate the deep issues which characterise the problem of cognition. Furthermore, where strong dualism has been largely vanquished from the philosophical vanguard, it seems even more clear that blunt behaviourism has been thoroughly rebuffed: the idea that cognition can be discussed in terms of simply observed bodily reactions is considered philosophically infeasible (see Boden, 2006, for an overview). The mind evidently experiences the world not as raw stimulus data, but as an array of semantically loaded entities that interact on various levels and according to various rules. The consequent problem of what constitutes perceptual cognition has been characterised as the binding problem, by which the mind must somehow perform the trick of corralling multifarious sensory stimuli into a unified experience of reality consisting of discernible, describable things which exist on various levels of abstraction. With this in mind, certain radical views are open to misinterpretation as harbingers of a Cartesian resurrection. For instance, Chalmers (1996) describes a nuanced functionalism by which an agent is conscious by merit of the processes that it performs on a certain level of abstraction, regardless of the physical mechanisms of those processes, and Pattee (2008) posits that language and physics should be viewed as two intertwined but mutually irreducible phenomena. Humans are somehow engaging in the act of meaning, in the sense that Wittgenstein indicated when he wrote that 'only the act of meaning can anticipate reality' (Wittgenstein, 1967, p 76): it is the characteristically human ability to see a world full of meaningful things rather than just a world full of data.
It is not clear, however, how the binding problem is solved, and how the multifarious world is transformed from material input into expressions which are likewise fundamentally material through a cognitive process which is somehow perceptive and expressive. The human ability to perpetually perform this trick is the subject of the dispute between Searle and Dennett, and is the object of what Chalmers has characterised as the hard problem of consciousness. The answers to these questions remain arguably as opaque as they were in Descartes' world. It would seem that computational creativity should, in principle, be the darling of any effort to empirically vanquish any remnant of dualism: to show that a physically grounded symbol manipulating machine is capable of participating as an agent in meaning-making interactions could only illustrate the fallacy of the supposition that such things occur in some kind of non-material space. Wiggins (2012) has recently argued that creativity is in fact the substrate of consciousness, with the capacity for an agent to imagine the world as being different than it is serving as the basis for cognitive action in an environment. In this scenario, an information theoretical process corresponds to Wittgenstein's act of meaning, with statistical computations of perceptual data emerging as semantically gravid expectations of what will happen in an environment. Creativity itself becomes precisely Wittgenstein's act of producing new meaning, of building new ways of perceiving and anticipating the world on different levels of abstraction. Notwithstanding the resilient arguments from Searle (1990) that purely informational symbol manipulating systems cannot have intentionality at the root of their machinations, it would seem that just the ability for an algorithmic machine to be creative would at least prove that the basis of consciousness can be in the material world of physics. So the argument here is not that, in being creative, in participating in the act of meaning, computers have some chance of becoming conscious. Even with this caveat, though, there is an inherent causal ambiguity in the stance that computers can be creative: it is not clear that the idea that machines can autonomously generate meaningful output is a priori necessarily sound. In fact, short of imposing emergent phenomenological properties on hardware, acceptance that a computer can be creative implies a de facto rejection of dualism, on the grounds that the machine cannot imaginably be partly located in some immaterial mental space. A tautology emerges by which a positive result for computational creativity is dependent on precisely the reductionist premise that it will hopefully be used to prove. Rejection of dualism and of the corollary representations which inhabit a placeless mental space, on the other hand, does not necessarily entail an acceptance of the idea that computers can participate in the same kind of creative meaning-making as humans. Indeed, a notable trend in contemporary cognitive science is a move away from the idea that symbolic approaches to the mind can have anything to do with cognition at all, as characterised by the work of Noë (2004), Rowlands (2010), and Chemero (2009).
This unfolding movement in the theory of mind traces its roots back to the enactivism of Varela, Thompson, and Rosch (1991) and to the ecologically situated psychology of Gibson (1979): these traditions seek to embed the thinking organism in a physical environment from which the processes underlying consciousness cannot be isolated. In terms of building creative machines, this bodily, environmental approach seems to indicate something more like robotics than the traditional conception of computational creativity as involving the algorithmic construction and traversal of abstract informational state spaces per Boden (1990). Hence, if computational creativity is to be used as a tool for talking empirically about philosophical questions, the burden of proof shifts onto demonstrating somehow or another that information processing systems can behave creatively in the first place. If this can be done, then it seems likely that an analysis of the specific types of systems which generate creative output might yield some interesting philosophical insights into the nature of cognition. But as will be illustrated in the next section, the evaluation of computational creativity is not by any means a straightforward issue. Evaluating Symbol Manipulating Systems The problem of evaluation is a significant aspect of Boden's (1990) classic treatment of computational creativity, where it is argued that in order for computer generated artefacts to be considered as creative output, the program that generated them must likewise be judged as somehow creative in its procedures. In Wiggins' (2006a) subsequent formalisation of Boden's model, the creative agent itself is bestowed with an evaluative function which it uses to assess its own output, effectively building a sense of creative value into the agent's procedure. Ritchie (2001) has likewise formally described the operation of creative systems in terms of an inspiring set of known good artefacts of a certain type: this set becomes both the basis for the way the system will structure its own output and the index beyond which the system must extend itself in order to be considered creative, in a process which involves sequences of self-evaluation moving from a basic set of possible items, through a consideration of the inspiring set, to the output of artefacts which are hopefully both new and valuable. In more recent work, Ritchie (2007) considers the merits of the view that the creativity of a system should only be considered in terms of its output. Part of Ritchie's reasoning is that human creators are generally only judged on the basis of what they do, not how they do it. On the surface this might seem to be in line with Wiggins' definition of computational creativity in terms of 'behaviour exhibited by natural and artificial systems, which would be deemed creative if exhibited by humans' (Wiggins, 2006b, p 210). In fact, though, it would be a mistake to take behaviour here in the Skinnerian sense of observable responses to stimuli; what is really in question in terms of behaviour is the way in which the agent goes about making the artefact. And finally, Gervás (2010) has proposed a model for creative output that involves cycles of production and reflection on the work in progress.
This is again ostensibly in a similar vein to Ritchie's chain of evaluation of different stages in an overall creative process, but Gervás, in support of the significance of procedure, actually specifically suggests that it is perhaps misguided to try to build systems to appear operationally like creative humans, reasoning that there are a multitude of engineering solutions for a given objective, and blind imitation is rarely the best approach. From these stances a range of approaches to evaluation emerge, aligned along two main axes: on the one hand, there is the problem of whether or not the system should be considered in terms of its internal workings, and on the other hand, there is the question of whether or not the system should attempt to be humanlike. But establishing what exactly counts as a creative process in the first place has proven extremely difficult. Where human creators are easily forgiven for keeping their methodologies secret (indeed, the mysteriousness of creativity is enshrined in humans through terms such as 'genius'), such vagary is deemed unacceptable in a computer. The problem at least partially lies in the question of precisely where the act of meaning occurs: can a computational system really make meaning, or is it the observer who gives meaning to output which is merely the result of informational shuffling? In particular, a problem arises in terms of defining what counts as internal with regards to an information processing system. Given that the operation of a symbol manipulating machine is based on an interpretation of symbols which is fundamentally relative to a subjective observer (Putnam, 1988), the idea of a computational system being anything other than observable seems to fall apart, in which case everything that the actual system does can only be construed as output. If this is the case there is at least an argument to be made from a philosophical perspective for Ritchie's (2007) view that changes in the system's process must themselves be viewed as output in order to be assessed. One practical approach to resolving these issues of evaluation has been formalised by Colton, Charnley, and Pease (2011), who, through their FACE model, propose a four step process for generating creative artefacts, or, in their terminology, expressions of concepts. Crucially, this process involves the establishment of framing information that potentially contextualises or justifies corresponding generative acts. The FACE model is complemented by the IDEA model, a framework specifically designed for the evaluation of creativity, both in terms of artefacts and actions. In an explication of the theories behind these models, Pease and Colton (2011a) are motivated by an appreciation of the tensions that arise between creators and observers in the course of creative generation and evaluation, and seek to place the generation of new meaning in this dynamic relationship. By grounding the context of meaningful expression in public information, the hope is that the problem of trying to conceive of mechanical systems with internal states might be resolved. The FACE model has been implemented by Colton, Goodwin, and Veale (2012): the output of the system developed by these authors offers, in conjunction with new poetry, a narrative alleging motivations on the part of the system in the course of poetic production. Furthermore, this narrative is grounded in an analysis of sentiments and concepts found in an external source, namely, in newspaper articles from a chosen date.
This reflexive procedural commentary is specifically motivated by the view that observers do take into account creative process when evaluating an artefact, a stance which is also expressed by Colton (2008) in earlier theoretical work. By ascribing a phenomenology, however implausible, of intentions and emotions to the computational agent, the system generates a secondary level of artifice wherein the artefact is the result of some process of conceptualisation, representation, and execution. The hope is that humans will associate a capacity for creativity with the impression of intentionality. What seems to be happening here is the simulation of precisely those properties of internal mental states that, as discussed in the previous section, have been attacked by contemporary cognitive scientists and philosophers of mind. Despite this, the stance taken in this paper is that this type of simulation is, broadly, the correct approach to take towards the evaluation of creativity (an evaluative act which, looking at it from the other end of the equation, might as easily be described as persuasion on the part of the agent). However, the stance here is also that mere mimicry of phenomenology is not ultimately a compelling argument for the creativity of a system. Rather, what is needed is a system that legitimately instantiates mechanisms with some similar properties to those that result in the appearance of mental states in cognitive agents. In their zeal for non-representationalist, anti-dualist theories of mind, the contemporary mode of environmentally oriented approaches to cognition have arguably overshot the philosophical mark: not only do they reject the Cartesian stance; their rejection is so thorough that they neglect to properly consider why the mind/body divide has preoccupied Western thinkers for so long in the first place. But the appeal of the idea of an inner life of the mind is powerful on a collective level, running so deep in society that it has been instantiated in the form of intellectual property law, whereby authorship is ascribed on the basis of an ill-defined 'creative spark' (Feist, 1991). Indeed, in a legal sense, and therefore also to some degree on the scale of society, ownership of expressions is construed in terms of the distinction between creation and discovery (ibid). Elsewhere, McGregor (2014) has proposed that intellectual property law itself might be considered as one viable mechanism for the evaluation of creativity, and that something in the creative process or artefact might be offered up to appease the law's requirement for a distinguishing creative aspect. This is a problematic stance for the prevalent model of computational creativity, which, again per Boden (1990), involves a combinatorial exploration of a well-defined state space, where the artefacts of such an exploration must be construed as discovered rather than generated. If the computational agent is to be presented as creative on a social level, then, it would seem the only course of action is to somehow trick the public into thinking of certain informational manipulations as being somehow inherently mental. The idea of trickery isn't totally new to the field.
In particular, where the theoretical work of Colton (2008), like that of Gervás (2010), plainly states that the computational agent should be straightforward about its own nature, the practical implementation of Colton, Goodwin, and Veale (2012) develops an agent that sets about selling its own product with an appeal to intentionality which might almost be described as deceptive. Similarly, albeit in a different domain, Leymarie and Tresset (2012) have designed an ingeniously conceived robotic portrait artist that is programmed to simulate behaviours the programmers have determined sitters and onlookers expect to witness in human artists: the robot enacts a roving quality to its video-camera eye, accompanied by built-in pauses which create the illusion that the device is contemplating its work. The deception here, though, is transparent, and is committed with the good faith of honest artistry: it is unlikely that many observers believe these processes, which in the cases in question involve prefigured semantic networks and sentiment analysis or else an encoded parroting of creative behaviour, actually build up any kind of intentionality prior to the production of the output. Even a philosophically disengaged observer should not be expected to accept that phenomenology and intentionality can arise simply through the application of preconceived frequentist methods of data interpretation, or through the robotic rehearsal of a choreographed sequence of stereotypical gestures. So it is proposed here that observers look for familiar processes when analysing creativity, but that this familiarity should be on the level of the impression of developed internal mental states rather than just superficial expressions; it is further proposed that the right approach to building creative agents is therefore to construct systems which present the appearance of some kind of internal representations which are developed and manipulated in the course of searching for new, interesting artefacts in any given domain. The claim will be that, in such systems, while the base level of artefacts output for a target domain may be considered simply discovered within the search space chosen by the agent, creativity happens in terms of shaping the search space in the first place, not in terms of the subsequent traversal of that space, an idea which lines up nicely with Boden's (1990) notion of high-level transformational creativity. Of course, this attempt to move creativity up a level, so to speak, suggests a secondary search space for new search spaces. Ritchie (2007) touched on this idea when he suggested that the creative process itself should be considered an abstract artefact of the system, but what emerges is an infinite regression of spaces of spaces which immediately calls to mind the parallel homunculus problem in the philosophy of mind. This well-travelled argument against representationalist theories of mind questions the basis for a secondary observation of mental representations by some internal observer (a homunculus, so to speak), an evidently necessary and likewise confounding condition for mind/matter dualism (see Dennett, 1991). And this is precisely the point: entertaining this approach to the evaluation of computational creativity, namely, the consideration of an agent as being composed of a recursive hierarchy of creative search spaces, results in the same kind of untenable scenarios which characterise the dualist world view.
The Cartesian outlook begs the question of who or what is perceiving internal mental states, and, more pointedly, suggests that these internal observations must likewise yield to some form of dualism, setting off a concatenation of ever deepening layers of internal states with no explanation of how this chain could terminate. In the same way, suggesting that a system becomes truly creative when it actually changes the parameters for discovering new and valuable artefacts necessitates a secondary search space with some sort of overview of the primary space from which it might seek the appropriate transformations; this secondary space, however, immediately becomes subject to the same criteria for transformation as the primary space, and an infinite regression rapidly develops. By this untenable mechanism of an infinite hierarchy of spaces in a finite system, a deeper operational correspondence between dualist theories of mind and transformational theories of computational creativity emerges. When the external impression of phenomenology is constructed by merely using information processing systems to analyse input and then combine indexical terms into the semblances of intentions and emotions, the impression of creativity can only be ephemeral. When a system actually reveals that it is operating in a way which establishes the kind of conceptual structures and recursive levels of abstraction associated with what is popularly, if erroneously, considered to be the dualist nature of cognition, on the other hand, the system has much more of a chance of being considered autonomously creative. The question, then, is what exactly qualifies as a semblance of dualist operation in a symbol manipulating system. Implementing Representations The next section of this paper will briefly consider two emerging computational models in terms of their potential as operational frameworks for computationally creative systems. Both vector space models and deep belief networks have been developed for the purpose of computing with high-level conceptual structures, and each system has been at least somewhat successful in its applications to specific informational domains. The question addressed here is whether the operation of either of these systems is such that they might be considered to produce the same kind of structures which observers imagine correspond to the mental representations attributed to the immaterial minds of humans under a dualist world view. The hope is that these systems might show some promise in generating computationally graspable conceptual structures which can play a part in the act of meaning: more than just arrangements of data, these conceptual entities would stand for processes to be performed on data, abstract actions in the symbolic world of the computer, realised only through observation. The development of these systems for creative, generative purposes, however, is left for the future. Vector Space Models Initially developed as a mechanism for document indexing (Salton, Wong, and Yang, 1975), vector space models are built of high dimensional spaces whose dimensions correspond to the relational terms associated with a linguistic object: the object is described on the basis of the frequency with which each of the dimensional terms occurs in its context, and thus can be represented by a vector in the space. The idea is that similarity between two objects represented in such a space can be interpreted from the cosine of the angle between their corresponding vectors.
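To make this concrete, the following is a minimal sketch of the basic vector space model, with a toy vocabulary and corpus invented purely for illustration:

```python
# A minimal sketch of the classic vector space model: documents become
# term-frequency vectors, and similarity is the cosine of the angle
# between them. The vocabulary and documents are illustrative only.
import numpy as np
from collections import Counter

vocabulary = ["mind", "matter", "symbol", "machine", "meaning"]

def to_vector(text):
    """Represent a document as term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return np.array([counts[term] for term in vocabulary], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

doc1 = to_vector("the mind gives meaning to matter")
doc2 = to_vector("the machine gives meaning to the symbol")
print(cosine_similarity(doc1, doc2))  # 1.0 = same direction, 0.0 = orthogonal
```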
In more recent work, vector space models have been applied to more basic problems of meaningfulness through distributional models of language, where words are represented in terms of their context, and in particular through vectors representing either the frequency or the probability with which they occurred in the context of other words. This approach has been used to attack problems such as word disambiguation (Schütze, 1998) and compositional semantics (Mitchell and Lapata, 2008; Coecke, Sadrzadeh, and Clark, 2011). The compositional approach in particular has revealed the utility of the mathematical nature of the vector space models. As illustrated in Grefenstette and Sadrzadeh's (2011) implementation of Coecke, Sadrzadeh, and Clark's (2011) framework, the properties of these kinds of high dimensional representations allow for the composition of new representations through the use of Kronecker products, a technique which, by virtue of its non-commutativity, produces different spaces even for different combinations of the same words (a desirable outcome, given that word order can make a significant contribution to meaning in a sentence). This feature of vector space models allows for the construction of increasingly complex spaces as words are incrementally built into phrases and then sentences. The result is a system containing a vocabulary, so to speak, of highly modular compositional elements: the spaces of words can be easily concatenated into larger meaningful elements on the level of sentences, which become spaces themselves through the mathematical operations which can be performed on these types of structures. In terms of computational creativity, what emerges from the perhaps somewhat complicated mathematics of vector space models is a mechanism for possibly representing what Davidson has described as 'meanings as entities' (Davidson, 2001, p 116): the raw data of language become objects that can interact in ways that might produce valuable, surprising new semantic combinations. This approach to the composition of conceptual structures abstracts the problem of semantics away from the level of data processing, and likewise away from ungainly interventions of word associations and semantic ontologies that leave an observer wondering if the real creativity hasn't been imposed on the system through a preconceived framework. Instead, by generating and manipulating representations with operations that seem far removed from the logic of mental states or the syntax of the source language, a vector space is effectively promoted to the same level as the meaning-rich kind of encounter that humans have with the world, and seems thereby to manifest some of the same mysteriousness associated with that way of being. Rather than relying on an externally grounded observation to give a system of symbols meaning, the objects that populate vector spaces can interact in ways native to their abstract mathematical domain, and in so doing instantiate entities that at least can be construed as conceptual representations analogous to the internal imagery of the Cartesian mental space. While the use of vector space models for creative purposes remains unexplored, the indication from the work done in text analysis gives grounds for proposing that this could be a good method for likewise compositionally building linguistic artefacts which meet the constraints of a creative search space.
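The non-commutativity that makes the Kronecker product attractive for capturing word order is easy to see in a toy example; the two-dimensional 'word vectors' below are made up purely for illustration:

```python
# A toy illustration of composing word vectors with the Kronecker
# product: composing the same two vectors in different orders yields
# different representations, mirroring the contribution of word order.
import numpy as np

dog = np.array([1.0, 2.0])    # hypothetical distributional vector
bites = np.array([3.0, 0.5])  # hypothetical distributional vector

ab = np.kron(dog, bites)
ba = np.kron(bites, dog)

print(ab)                   # [3.  0.5 6.  1. ]
print(ba)                   # [3.  6.  0.5 1. ]
print(np.allclose(ab, ba))  # False: order changes the composed representation
```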
And, importantly in terms of the subject of this paper, there seems to be good reason to hope that these conceptual structures might stand a chance of convincing a sceptical observer that a system employing them creatively could be utilising something similar to the types of internal representations which have been associated with the human use of language and the human mode of thought, per the likes of Chomsky and Halle (1968) and Fodor and Pylyshyn (1988). Certain other systems have, in fact, taken a generative approach to vector spaces. The latent Dirichlet allocation model (Blei, Ng, and Jordan, 2003) is in particular a topic modelling technique that discovers topics within a range of documents and then builds a probability distribution for words across these topics. Latent Dirichlet allocation is generative in the sense that it picks potential words based on a probability distribution over a topic: the distribution of topics across a potential document suggests likelihoods for the words which might occur in that document, albeit without the word ordering critical to a meaningful use of language. This is not necessarily an ideal strategy for modelling creative behaviour, however, as, in addition to the absence of compositionality, generative models tend to predict output that is highly likely but, conversely, not very surprising. In the context of generating meaningful and unexpected new language, the compositional approach discussed above seems to hold more promise for finding the semantically loaded output expected from a creative agent. Deep Belief Networks Where vector space models have proved particularly powerful for language, deep belief networks have been used effectively for work in the domains of both computational linguistics and computer vision. Deep belief networks were proposed by Hinton, Osindero, and Teh (2006) as high parameter frameworks that would learn to identify handwritten numerals by developing a model for generating the same artefacts. In this case the generative quality of deep belief networks does specifically point to a creative application, in that the network learns to match new, noisy percepts with semantically tagged representations by first learning to produce those representations in an initial stage of development. Across its many levels of processing, the network purportedly develops different layers of feature detection, and these features (for instance, lines, contours, or, eventually, at a high level, concepts) arguably convey the impression of the internal states corresponding to the mental perception of properties in the world. The idea is that densely connected networks consisting of a large number of artificial neurons rising over several diminishing layers in a pyramid-type structure can be efficiently and effectively trained if they are constructed with the right kind of architecture. The keys to this architecture are a special mechanism at the low level related to Ackley, Hinton, and Sejnowski's (1985) earlier work on Boltzmann Machines (another type of neural net that utilises a stochastic mechanism), as well as the simplicity with which the connecting weights between neurons are updated. With their highly interconnected structure, deep belief networks might be seen as the next phase in the historic cycle of interest in connectionist approaches to computing; the new element in this latest manifestation is the stacking of several operational layers where parameters are established in a layer-by-layer fashion.
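A rough sketch of that layer-by-layer construction follows: a restricted Boltzmann machine trained with one step of contrastive divergence, then stacked so that each layer's hidden activations become the data for the next. The dimensions, learning rate, and omission of bias terms are simplifying assumptions for illustration, not the exact procedure of Hinton, Osindero, and Teh (2006):

```python
# A minimal sketch of the restricted Boltzmann machine (RBM) building
# block behind deep belief networks, trained with one step of
# contrastive divergence (CD-1). Bias terms are omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=10):
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        # Positive phase: infer hidden activations from the data.
        h_prob = sigmoid(data @ W)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: reconstruct the visible layer and re-infer.
        v_recon = sigmoid(h_sample @ W.T)
        h_recon = sigmoid(v_recon @ W)
        # CD-1 weight update.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

# Greedy layer-wise stacking: each layer's hidden activations become
# the training data for the next, building more abstract features.
data = (rng.random((100, 16)) < 0.5).astype(float)
W1 = train_rbm(data, n_hidden=8)
W2 = train_rbm(sigmoid(data @ W1), n_hidden=4)
```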
The operational key to deep belief networks is the idea that, by allowing a single neuron on a higher level to represent the clusters of neurons which feed into it from the level below, an exponential reduction of computational space can be realised (Bengio, 2009). In this way, these networks establish elevating levels of abstraction that might be construed as internal representations. Indeed, in precisely this sense, deep belief networks seem to relate to the idea of the act of meaning by which potentially diffuse visual data are resolved into higher level percepts with some semantic value. The argument put forward here is that this on the one hand instantiates the approach to cognition through the creative reconstruction of anticipated events in the world endorsed by Wiggins (2012), and, on the other hand, creates structures which might be recognised by an observer as something similar to internal mental states. Going back to some of the original literature from the first wave of neural networks, the structure of the human brain was clearly a primary motivation in the effort to compute using weighted networks of nodes (McCulloch and Pitts, 1990). Deep belief networks have inherited this property, and have taken inspiration from another aspect of neuroscience: the multiple layers in a deep belief network specifically resemble the hierarchical structure of the visual cortex in the human brain (Bengio, 2009). Indeed, Serre et al. (2005) have done work towards isolating the ways in which different levels of the primate visual cortex build up different aspects of representations of raw visual stimuli, ultimately resulting in the high level perceptions of parametrically bound entities which seeing, thinking agents experience in the world. In the same way, deep belief networks seek to use increasingly complex clusterings of input data to form higher levels of representation within their architecture. Coupled with the fact that these systems are fundamentally generative, such networks seem like an excellent candidate for consideration as visually creative agents with a convincing impression of internal representations, and probably warrant exploration in other domains, as well. Conclusion Dualism was born of a simple thought experiment: Descartes (1911) imagined himself plagued by a demon fixated on deceiving him, and in response strove to strip away from his experience of reality everything which could possibly be considered illusory. He was left with the certainty of his own irreducible mental existence, but maintained that this existence must also be involved in some sort of likewise irreducible physical reality. Since even before Descartes' time, various similar imaginative exercises have characterised the development of Western philosophy, from Plato's (1892) cave to Wittgenstein's (1967) beetle. Notable recent thought experiments seeking insight into the mind have included Putnam's (1996) twin earth, Davidson's (1987) swampman, and Chalmers' (1996) philosophical zombies; and, as mentioned earlier, the computer has played a part in some other recent enquiries, though generally as a device for demonstrating the absurdity of certain views of cognition that can be reduced to mere data shifting. The purpose of pointing out this tradition of thought experiments is to highlight the role which the peculiar act of introspection has played in the development of modern Western philosophy.
The preoccupation with intentionality and phenomenology has grown out of an intellectual culture of examining the self, and the willingness which humans have to accept the creativity, and indeed the very meaningfulness, of the expressions of other humans seems to stem from the recognition of a similarly calibrated other-self. What has been proposed in this paper is that the external alienation of an encounter with a computational system can be replaced with a look into the exposed operations of the system, and, in this exposure, there may be some hope of acceptance that the symbol manipulating machine is behaving in a way which is creative, in the operational sense of behaviour described by Wiggins (2006b). The idea that information processing systems should be investigated for indications of familiar processes in order to be considered creative is not new. Gervás (2010) has argued that hardware which operates in a highly parallel manner should be taken more seriously as a candidate for instantiating creative agency, as this type of procedure to some degree mirrors the evident dispersion of activity in the human brain. Perhaps even more fundamentally, Pease, Winterstein, and Colton (2001) call for a criterion of procedural complexity intended to measure the extent of the creative search space and the difficulty of the agent's traversal of this space. It is not clear, however, why such mechanisms would not simply become another aspect of the agent's output, adjunct to the creative artefact itself. What is called for in this paper is a probing of the machine (an extrospection, so to speak) for the representational type of processes that society at large seems to deem, in the tradition of Descartes, should count as cognitive and potentially creative. It is for the observer to seek out and identify the structures which form these representations rather than for the system to simply present them either through a statement of intentionality or an exposure of process. What form these representational structures would take remains to be defined, though two possible candidates have been proposed here. A further area for enquiry is the question of what kind of observer would be able to recognise these structures in the first place: is some combination of expertise in philosophy and computer science necessary in order for a computationally creative agent to be recognised as such? Ideas along these lines have been proposed by Pease and Colton (2011b) and Boden (2014), both of whom suggest that computational creativity may be best judged by an audience with a degree of knowledge about how computers work. On the one hand, the idea of expert criticism informing the public as to the value of creativity has long been common in various domains such as art, literature, and film, and some degree of expertise is probably necessary to achieve recognition of the relatively complex frameworks discussed earlier in this paper. On the other hand, relying on computer scientists for assurances of the legitimacy of creative agents risks further alienating an audience already confronted with a very new and different mode of creation, and, indeed, of creator. So even the proposal for a solution to the problems laid out in this paper seems to open the door on another potential debate. Such is the nature of philosophy.
Nonetheless, this paper has sought to show that computational creativity as a field is an appropriate platform for engaging in discussions about not only aesthetics but also cognition and theories of mind, and has at least presented an avenue for further philosophical investigation. Acknowledgements This research has been supported by EPSRC grant EP/L50483X/1. 2014_32 !2014 Is it Time for Computational Creativity to Grow Up and start being Irresponsible? Colin G. Johnson School of Computing University of Kent Canterbury, Kent, UK C.G.Johnson@kent.ac.uk Abstract A recent definition of computational creativity has emphasised that computational creativity systems should take on certain responsibilities for generating creative behaviour. This paper examines the notion of responsibilities in that definition, and looks at a number of aspects of the creative act and its context that might play a role in that responsibility, with an emphasis on artistic and musical creativity. This problematises the seemingly simple distinction between systems that have responsibilities for creative activity and those which support or provide tools for creativity. The paper concludes with a discussion of an alternative approach to the subject, which argues that the responsibility for creative action is typically diffused through a complex human/computer system, and that a systems thinking approach to locating computational creativity might ask better questions than one that tries to pin creative responsibility to a particular agent. Introduction A recent paper by Colton and Wiggins (2012, p21) gives a succinct definition of computational creativity as 'the philosophy, science and engineering of computational systems which, by taking on particular responsibilities, exhibit behaviours that unbiased observers would deem to be creative'. Compared to earlier attempts to define this area, this definition is notable because it does not define computational creativity with regard to human creativity. By contrast, earlier definitions have been grounded in comparisons with human creative behaviour. For example, Ritchie (2007, p69) grounds his list of criteria for attributing creativity to a computer program thus: 'A central assumption here is that any formal definition of creativity must be based on its ordinary usage; that is, it must be natural and it must be based on human behaviour.' Furthermore, an earlier overview by Colton, de Mantaras and Stock (2009, p11) begins with the statement that, 'At its heart, computational creativity is the study of building software that exhibits behavior that would be deemed creative in humans.' In this paper I will explore a specific phrase in the definition, 'taking on particular responsibilities', which is the main difference from previous definitions. I would like to explore where these particular responsibilities might sit in the creative process, and how the use of computers might change our idea of where that responsibility might sit. In particular, my focus will be on artistic and musical creativity, though there may be implications for other creative areas. Who/what is responsible for a particular creative artistic act? We can argue that there are a number of things that share this responsibility (here we frame these in the context of a human artist):
- The artist themselves, their actions and patterns of behaviour.
- The artist's motivation to create the work.
- The background knowledge that the artist has acquired through life, which reflects their general cultural background and specific things that they have encountered or learned.
- The context in which they are making the work.
- The materials that they are using to make the work. In particular, the resistance, grit and grain offered by some materials, which can provide new material that can be serendipitously exploited by the artist.
It is commonly the first of these that is seen as taking on responsibility for the artistic creation. However, when we try to pin down why this is so, we might start by arguing that had the artist not decided to carry out that particular behaviour, to decide not to create that particular work, then the work would not exist. But, the same argument can be applied to the other items on the list: had the artist not had the relevant background knowledge, or had the material worked in a different way, and so on, the work would not have been capable of being created. We can, of course, take this argument to ludicrous extremes: part of the responsibility for the art being the artist's own existence, etc. Indeed, this is not just an intellectual exercise; determining the responsibility (or credit) for a creative act is important for legal arguments concerning intellectual property rights. McGregor (2014) has recently argued that the legal arguments around creativity might provide a framework for considering computational creativity; along similar lines, Koza (2010) has argued for the use of patentability as a criterion to determine when an AI system is creating artefacts that require human-competitive levels of intelligence. We might stop at a proximate cause as being the primary point at which the responsibility lies. But, what is the proximate cause? We might argue that the immediate actions of making the art are the artist's behaviours, in putting pencil to paper in a particular way. But, even at such a proximate level, we can see that that activity interacts closely with the artist's motivation, and that during the time-span of creating even a single, simple piece of work there might be a complex interaction between motivation and action. So, where is the computer in all of this? I have argued elsewhere (Johnson, 2012) that computational creativity research has focused too much on the role of the individual creator, favouring the view of the creative romantic hero over forms of creativity that are based on collaboration or the mediation of interaction. In this paper I would like to argue further that the nature of computer-grounded artistic creativity makes assigning this responsibility even harder than it would be for traditional artforms. The remainder of this paper splits into three sections. The first is concerned with the role of materials, and in particular whether computational artistic and musical materials present a particular challenge for making the distinction between passive tools/materials and active agents to which creative responsibility can be ascribed. The second is an examination of context and background, and considers, through examples of search-based art and semantic mass, whether these can be considered to have any responsibility for the creative action. Finally, a concluding section examines whether a better way to examine this is through a systems thinking view, rather than a view based on the notion of responsibility.
Materials What is an artistic material, or an artistic tool, and how does it differ from something that plays a collaborative role in artistic creation? For traditional artworks, the distinction is clear. For example, an artist uses a tool such as a pencil to create their art, a musician uses an instrument to create a piece of music. Part of an artistic training is to learn to master such tools; to learn how to realise artistic intents through the coordinated use of perception, thought and the manipulation of tools. But, even at the level of simple physical tools, there is some level of interaction between the tool and artistic creation: part of the study of a particular artistic medium is learning its constraints, and learning how to make adaptations when a particular intentional action does not realise the intended aim. Certain computational artistic media and tools blur this distinction between passive tools and media that are manipulated by an artist or musician, and participants that take an active (creative?) role in the artistic creation. Rowe (1993) has discussed a continuum of interactive computer music systems, ranging from simple action-response systems where a performer makes a physical gesture and generates a consistent sound, to systems which listen and make sound as an autonomous and equal participant in a musical interaction. An example of the latter might be the Voyager system (Lewis, 2000); these ideas have been taken further by Paine (2002). We would probably consider something towards the latter end of the continuum to sit comfortably within definitions of computational creativity such as that discussed at the beginning of this paper. Whilst a system of that kind might always perform within an interactive context with other (human or computer) performers, it is responsible for holding up its own end in the music being produced, and creating music that is sensitive to the current situation. There are multiple responsible agents involved, and nothing playing the role of mere tool. But, when we look at systems towards the middle of that continuum, the allocation of creative responsibility becomes murkier. For example, consider the LIES system by Sanfilippo (2012). This consists of a number of acoustic feedback loops, which initially create sounds by creating positive feedback cycles that can start from tiny fluctuations in the performance environment. These are modified by a large number of digital filters and feedback networks. The performer interacts with this system by adjusting the parameters of the various filters and the intensity of the feedback system. What is responsible for the final creative output in this system? The interaction between human and machine is complex and at times incomprehensible to the human; the performance mode is one where the human sometimes tries to control the sound being generated to bring it into line with a desired sound (sometimes successfully, sometimes not), sometimes just lets the sound unfold without interference, and sometimes explores the effect of parameter changes with, depending on context, a greater or lesser understanding of the likely effects. Certainly, the system generates a decent amount of the creative material here, with the human sometimes (importantly, not all of the time) being unable to shape the system's outputs in any comprehensible way. The systems view of this creativity is articulated well by the creator of the system:
'. . . the human and machine are considered as inseparable: two autonomous entities which, unavoidably, will influence each other, creating a unique meta-system made up of these two elements. The human and the machine establish a dialectics, a talking through the other, with no attempts of subordination, creating a performance which is the result of their cooperation, where, thus, the performer creates together with the machine.' (Sanfilippo, 2012). Is this a computational creativity system? All of the sounds are coming from the system (but, the same is true for a piano). The human would not be able to make the work without the machine (but, the same is true of an artist without a pencil or whatever). Nonetheless, the computer/electronic system seems to be playing a stronger creative role in this interaction than that. Perhaps part of this is that the human is sometimes reacting to the outputs of the computer system as much as they are trying to shape it. Contexts and Background Knowledge In the list towards the beginning of the paper, we identified the background knowledge of the creator, and the context in which they were working, as other things that could form part of what is responsible for a particular creative action or outcome. Where is this background knowledge in a computational creativity system? In many cases it has been included as a part of the basic architecture of the system: for example, in Cohen's AARON system (Cohen, 1995), its figurative works are generated from parameterised algorithms that describe the basic figurative structures that are used to create the work. Other work draws on internet search algorithms as a way of accessing a background of knowledge (Johnson, 2013). Can a way of accessing information, enabled by a technology, become part of the creative responsibility that a computer system provides to a creative activity or outcome? To what extent does the choice to use a complex, unpredictable computational technique in creating a work of art mean that that artwork has had a creative contribution from the computational system? Let us consider a specific example. The image in Figure 1 is created by using the well-known Google image search functionality to search for images related to the word 'secure' (filtered for images of a certain colour palette). If I choose to exhibit this as an artwork, where does the responsibility for the creative decisions sit? With me, alone? But I have hardly done anything! With the Google information retrieval system? With the people who have provided images for the system? One role that computer systems (not just individual computers, but networked collections of computers with an associated infrastructure of information gathering and information retrieval) might play is to facilitate whole new areas of creativity. For example, the existence of vast online collections of images, together with technology of ever-growing sophistication to search and group such images by their meanings, facilitates a way of creating artworks that we might describe as 'semantic mass', where large collections of related information are gathered together and displayed. Consider an example such as Jennifer Mills's work What's in a Name? (Figure 2), consisting of a large number of postcard-size paintings, each of which represents a person with the name Jennifer Mills, gained from a search on Facebook.
Is this work an artist's reflection on the ready ability to track down all of these people using the computer system, or is this a piece of collaborative creativity between the artist and that system? Even if it is not, does the system bear any responsibility for the artwork, any more than the paintbrush used to create the work? Manovich (2002) has made related observations, that a technology can, by facilitating a change in the speed or scale of a process, create something which an observer might see as a genuinely new system. This can be seen by contrasting the Mills piece with comedian Dave Gorman's pre-websearch project Are You Dave Gorman?, where he tracked down a large number of people with the same name as him (Gorman, 2001). The Gorman project is focused on the labour of making the connections; the Mills piece on its effortlessness. A number of artists and musicians have chosen to deliberately divest themselves of the responsibility of making creative choices in their art. Perhaps the best known of these is John Cage, who created musical/theatrical works based on chance processes or on transcoding (Manovich, 2002) non-musical objects. An example of the latter is Atlas Eclipticalis, where star-charts were transcribed onto music staff paper, with stars representing notes, and the resulting music performed. By refusing the composer's traditional responsibility to decide (at a detailed level) where the notes go on a page, where has responsibility for the artwork gone? Perhaps an argument can be made that the responsibility has been abstracted to a higher level: that the details of the notes don't matter, but the choice of star maps, rather than any other printed material, is where the creator has chosen to vest his responsibility. A version of this argument has been made by Xenakis (1992), presenting a form of music in which the composer manipulates large-scale parameters of generative algorithms, rather than details. There is a connection too, to the ideas of Goldsmith (2011), who has discussed the idea of ostensive creativity, i.e. a means of being creative by pointing at material in the world, or organising it in a way that makes us see it afresh. Internet search based art can be seen as a form of this. But, who is doing the pointing? Again, we are drawn back to a system view of the idea of creative responsibility. All of these components have some bearing on the final creative activity, and it is their interactions that lead to creativity happening, rather than one playing a responsible role and the others a supporting role. Conclusions At first it seems easy to distinguish between a system that is a tool that can be used in the aid of creative action, and one that takes on the responsibility for the creative act itself. However, when we look at complex, resistant artistic materials, systems containing complex interactions between humans and computers, and the kinds of human creativity and relational creativity that depend irreducibly on computers or networks of computers, then the distinction between a responsible creative agent, a creativity support system, and the more complex kind of tool becomes rather blurred. It is easy to understand why the idea of responsibility finds its way into a definition of computational creativity. There is always a sneaking suspicion in a system involving interaction between humans and computers that all of the creativity is coming from the human (even when that human demonstrates surprise at the output from the system!).
There is also the desire to distinguish creative systems from mere tools. It is fairly clear that this can be done, up to a point, but the point at which tools slip over from being passive to being an active player in the creative process is a rather vague one. Indeed, it is precisely because computers can be used to build complex, interactive, indeterminate systems that this distinction starts to become more problematic. Indeed, it is perhaps naive to assume that even in traditional non-computational artistic and musical creativity a simple distinction can be drawn between individuals responsible for their creative action and the tools and concepts that they make use of. After all, reams of pages are written that attempt to explain why a particular artistic action was done by contextualising it in the political, economic and social situation in which it is created.
Figure 1: Google image search result for the word 'secure'.
Figure 2: Extract from What's in a Name?, Jennifer Mills, 2009-11.
One alternative approach would be to apply a systems thinking approach (Churchman, 1968) to this question. This approach would argue that it is futile to try and assign a particular component of the art creating system the definitive responsibility for producing the art work. Instead, there is a complex system of interacting agents and properties that lead to the work being realised (or not!) in the form that it ends up. By doing this, we are not throwing our hands in the air and saying that nothing can be said about how the work is produced. Instead, we are arguing that there is a complex system of interactions which in itself needs to be studied. Indeed, Csikszentmihalyi (1988) has explored a similar approach to explaining human creativity. Perhaps we can modify the Colton/Wiggins definition in the following way: the philosophy, science and engineering of computational systems which, by playing a role in an interactive system, contribute to that system producing behaviours that unbiased observers would deem to be creative. Note that the systems have to play a role in the system; this opens up the possibility of many different possible roles. This would seem to bring many activities that are currently seen as part of computational creativity squarely into the definition. For example, Veale (2011, 2013) has discussed the idea of creativity as a service, i.e. the provision of computational components that are designed to be part of a larger creative system, glued together using web services frameworks. The main point, however, is not to contribute to a pedantic (if sometimes enlightening) debate on definitions, but to shift the emphasis of computational creativity research. Rather than trying to identify the single actor in a complex, interactive system that is responsible for the creativity, instead we should recognise that this responsibility is diffuse and part of the behaviour of a complex human/computer system. That then leads onto much more interesting questions about how such systems give rise to creativity, how components can be engineered for such systems, and how interactions in such systems can be managed, rather than searching for the single romantic hero who is the fount of all creativity in the system. 2014_33 !2014 Towards Dr Inventor: A Tool for Promoting Scientific Creativity D.P. O'Donoghue1, H Saggion2, F. Dong3, D. Hurley1, Y. Abgaz1, X. Zheng3, O. Corcho4, J.J. Zhang5, J-M Careil6, B. Mahdian7, X. Zhao8 1 National University of Ireland Maynooth, Ireland.
2 Universitat Pompeu Fabra, Barcelona, Spain; 3 University of Bedfordshire, UK; 4 Universidad Politécnica de Madrid, Spain; 5 Bournemouth University, UK; 6 Intellixir, Manosque, France; 7 ImageMetry, Prague, Czech Republic; 8 Ansmart, Wembley, UK Abstract We propose an analogy-based model to promote creative scientific reasoning among its users. Dr Inventor aims to find novel and potentially useful creative analogies between academic documents, presenting them to users as potential research questions to be explored and investigated. These novel comparisons will thereby drive its users' creative reasoning. Dr Inventor is aimed at promoting Big-C Creativity and the H-creativity associated with true scientific creativity. Introduction Reasoning with analogical comparisons is highly flexible and powerful, playing a significant role in the creativity of scientific and other disciplines (Koestler, 1964; Boden, 2009). The role played by various analogies in both helping and (implicitly) hindering scientific progress is discussed by Brown (2003). Dunbar and Blanchette (2001) found that analogies were used extensively by working scientists as part of their day-to-day reasoning, playing significant roles in processes from explanation to hypothesis formation. This paper discusses some initial work on an analogy-based model (called Dr Inventor), which will offer computational creativity as a web service to its users, who are practising scientists. Dr Inventor is focused on helping research scientists by discovering creative analogical comparisons between academic documents and related sources for their consideration. So Dr Inventor will act as a creativity assistant, while its cognitively inspired architecture also offers one possible model of people thinking creatively. The Web has become a ubiquitous source of publications, source code, data, research websites, wikis and blogs. These form the Research Objects (Belhajjame et al, 2012) used by Dr Inventor, a tool for the discovery and presentation of creative analogies between research objects. Dr Inventor is targeted on the Big-C Creativity (Gardner, 1993) sought by practising scientists. Indeed, the aspirations of Dr Inventor include supporting analogy-driven H-creativity (Boden, 1992; 2009). Analogies compare a source to a target problem, highlighting some latent similarity between them. A creative analogy uses a novel source to bring new and creative possibilities to light. Dr Inventor aims to discover novel analogies between academic resources, bringing unnoticed possibilities out of the shadows. Cognitive studies have shown that exposure to even a single analogical comparison can induce significant differences in people's response to a given problem (Gick and Holyoak, 1980; Thibodeau and Boroditsky, 2011). This paper is focused on identifying novelty and quality (Boden, 1992), essential qualities of creativity. Baydin's (2012) model generated creative analogs for a given target. CrossBee (Juršič et al, 2012) looked for bridging concepts between documents from two given domains of interest. Kilaza (O'Donoghue and Keane, 2012) generated creative analogies but it relied on hand-coded data. Dr Inventor will offer a more complete model of creative analogising and blending (Fauconnier and Turner, 1998; Veale, O'Donoghue and Keane, 2000), addressing a broad range of the aspects of creativity. Dr Inventor Overview Dr Inventor will include a multi-phase model of analogy encompassing representation, retrieval, mapping and validation.
It may become the first web-based system that supports the exploration of scientific creativity via a computational approach, offering creativity as a web service to its users, i.e. researchers. Dr Inventor is built upon the vision that technologies have great potential to enhance the broader discipline of scientific creativity. It will build on technologies such as information extraction, document summarization, the semantic web and visual analytics to exploit their great potential in supplementing human ingenuity. Dr Inventor will become researchers' personal research assistant, reporting to them on a wide variety of relevant concepts through machine-powered search and visualization. It will assess an input research document through comparison with recognized research approaches and suggest new research ideas to the users in an autonomous manner. Dr Inventor will, to a degree, replicate one mode of human creativity: combining diverse information resources to generate new concepts with unexpected features. The new concepts may come from radical transformations inspired by other semantically distant but analogically similar concepts. Dr Inventor will be based on computational models of analogical reasoning and conceptual blending. Computational models can arguably offer greater creative ability than human reasoners for at least three specific reasons. Firstly, problem fixation frequently acts to limit people's ability to think creatively (Lopez et al., 2011). Secondly, people often fail to notice analogies when they are present (Gick and Holyoak, 1980). Thirdly, people often discard useful distant analogies once they have been discovered (Lopez et al., 2011); people tend to rate distant analogies as less useful even when they produce better results. People also suffer from memory limitations, selective thinking, perception limitations, biases, etc. A computational model may help to address some of these limitations. Dr Inventor will synergistically combine techniques for information extraction, document summarization and semantic identification to support the analysis of research objects and the generation of new ontologies for scientific creativity. Interactive visual analytics will be applied to support a user-centred creative process. The outcome will be evaluated through appropriately developed evaluation metrics, baselines and benchmarks. Dr Inventor will focus its evaluation on a specific scientific domain (computer graphics), exploring Research Objects (ROs) from various sources. First, these will include free research papers on the Web, research websites, Wikipedia, Internet forums, and the home pages of research institutes, groups and individual researchers; research sites and social networks such as CiteSeerX, ResearchGate and Google Scholar offer large numbers of freely accessible research papers; research source code is available from GitHub, SourceForge, etc.; and data can also be downloaded for research in computer graphics and image processing, e.g. from Flickr and from benchmarking archives. Second, it will use scholarly open-access journals. Finally, it will use online professional digital libraries for top-class research publications. Patents will also be considered within the scope of analysis by Dr Inventor. The Dr Inventor Model Research objects will be represented by skeletons to allow further processing. A Research Object Skeleton (ROS) represents the key concepts and relationships extracted from each RO.
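To make the ROS concrete: a minimal sketch of how a skeleton might be held as a labelled directed graph, using NetworkX. The triple [paper](propose)[FFD] echoed later in this paper seeds the example; the second triple and the helper name build_ros are our illustrative assumptions, not the project's actual API.

    import networkx as nx

    def build_ros(triples):
        """Build a Research Object Skeleton from (concept, relation, concept) triples."""
        g = nx.DiGraph()
        for source, relation, target in triples:
            g.add_edge(source, target, relation=relation)
        return g

    # triples as they might emerge from the information-extraction stage
    ros = build_ros([
        ("paper", "propose", "FFD"),
        ("FFD", "deform", "surface"),  # hypothetical second triple
    ])
    print(ros["paper"]["FFD"]["relation"])  # -> "propose"

Keeping relations as edge attributes means standard graph machinery (isomorphism tests, degree statistics) applies directly when skeletons are later compared.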
Retrieving and representing these ROS is the first task for the Dr Inventor model. The main challenges of the Dr Inventor project are now described in turn. Information Extraction, Summarization, and RO Skeleton Generation Information Extraction (IE) and Text Summarization (TS) (Poibeau et al., 2013) are two key technologies for transforming document content into concise, manageable semantic representations for use by our creativity model. Dr Inventor's IE aims to find not only general scientific concepts and relations, such as authors, institutions, research objectives, methods, citations, results, conclusions, developments, hypothesis postulation, hypothesis rejection and comparisons, but also domain-specific computer graphics concepts and relations, such as algorithms, 3D modelling and rendering techniques. Initial investigations have identified difficulties in extracting text from papers in PDF format. Issues include: ff and fi being represented as single characters, word-flow problems (particularly in multi-column documents), the representation of mathematical expressions, and footnotes and page numbers appearing within the text. PDFX (Constantin et al., 2013) will be used to assist in the text extraction process. The inventory of entities to be extracted from different data sources will be modelled in a domain ontology developed for Dr Inventor (see the next section). The most important methods to be used for IE are based on machine learning, both supervised and semi-supervised. Indeed, in order for our methods to be applicable to different domains, techniques which are able to learn conceptualizations from raw text and propose new concepts are needed (Saggion, 2013); in this way IE will closely interact with ontology learning so as to expand scientific ontologies with specialized domain information. The GATE system (http://gate.ac.uk) provides us with the basic infrastructure for developing and integrating basic and advanced IE components. Our current IE system is composed of modules for entity recognition (Ronzano et al., 2014) based on support vector machines (Li et al., 2009) and a rule-based approach for relation extraction based on dependency parsing output (Bohnet, 2010). Summarization research in Dr Inventor is focusing on adapting summarization to scientific data by developing content relevance measures that take into account, among other things, the rhetorical structure of scientific articles. We are producing annotated data using an annotation schema based on the work of Liakata et al. (2010). Summaries will be used both as textual surrogates to allow scrutiny by scientists and as content briefers to identify the main semantic information in the input. The work builds on available generic summarization technology, adapted to the scientific domain (Saggion, 2008). Methods to produce these generic summaries are currently based on statistical techniques; however, adaptation will be required to target the rich information present in scientific documents, e.g. Qazvinian et al. (2010). To generate the ROS we need to extract sentence components such as the nouns and verbs, and the structure joining them. For example, from the sentence 'This paper in contrast, proposes a surface-oriented FFD', we extract the grammatical subject of the sentence: paper, the grammatical object: FFD, and the relationship holding between them: propose.
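A minimal sketch of that subject-relation-object extraction, using spaCy's dependency parser as a stand-in for the paper's rule-based approach over Bohnet's parser; the function name and the choice of dependency labels are our assumptions:

    import spacy  # assumes the en_core_web_sm model is installed

    nlp = spacy.load("en_core_web_sm")

    def extract_triples(sentence):
        """Return (subject, relation, object) lemma triples for each verb."""
        triples = []
        for token in nlp(sentence):
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ == "dobj"]
                for s in subjects:
                    for o in objects:
                        triples.append((s.lemma_, token.lemma_, o.lemma_))
        return triples

    print(extract_triples("This paper in contrast, proposes a surface-oriented FFD."))
    # expected, model permitting: [('paper', 'propose', 'FFD')]

A rule set this small misses passives, copulas and clausal objects; the project's dedicated relation-extraction rules would need to cover those cases.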
In addition to propositions, information regarding the structure of the article is also available (e.g. the fact that a proposition is extracted from a 'purpose' rhetorical zone in the article). Semantic Technologies & Ontology We will use existing semantic technologies to build up concepts and to identify the relationships between them. Domain ontologies will be built through learning from a wide variety of research objects, including documents, datasets, scripts, etc. Domain ontologies will also be used and connected to an upper-level ontology network, which will likewise be developed in Dr Inventor, reusing existing ontologies covering scientific discourse, document structures, bibliographies and citations (e.g. DoCO, BIBO, EXPO, SPAR) (Belhajjame et al., 2012). The extracted information related to authors, co-authors, affiliations, impact factors, h-indices, etc. will be used to facilitate the retrieval and ranking of ROs, but it will not be required in the analogy-based model. We will also focus on knowledge extraction from user-defined tags associated with research objects and their aggregated objects, following current work in ontology learning from folksonomies. In addition, extending existing work on social recommendation of research objects, we will be able to discover implicit relationships between pieces of work that their authors did not originally consider, a literature-exploration activity that can enhance creativity in research. Such an ontology network will be designed to allow the representation of scientific discourse for scientific creativity. With respect to ontology matching, we want to make use of existing techniques (Shvaiko and Euzenat, 2013) to apply structured similarity evaluations between the aggregations of objects that are represented by research objects. In this context, knowledge extracted from documents and other artefacts should be seen as a skeleton set of information that summarizes key ideas, allowing researchers to explore the content of existing ROs in the process of their evaluation and of the generation of scientific innovation. This will contribute to the similarity measure for comparing research object skeletons in the creativity process. Finally, ontologies will also be used to provide personalized recommendations of scientific ROs, using different sets of recommendation techniques. Retrieval Model Retrieval will combine several techniques to identify homomorphic skeletons. A vector space model will enable quick, inexpensive comparison between skeletons, using numeric qualities representing the topology of each skeleton. This will also account for the inferences we expect to find in creative source domains. Analogy/Blending Model Dr Inventor's comparison model will identify and extend detailed similarities between ROS. It will typically search for a source to reinterpret a given target problem, but it can also select its own targets. Dr Inventor's final structure may be best seen as a conceptual blending (Fauconnier and Turner, 1998) model: it accepts two ROS as input, a generic space represents ontological and other commonalities, and the output space represents the new creative concept (the blend). (Space doesn't permit proper treatment of the similarities and differences between analogy and conceptual blending.)
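Returning briefly to the retrieval model above: a sketch of the vector-space idea, summarizing each skeleton by a few numeric topology features and ranking candidate sources by cosine similarity to the target. The particular features here are illustrative guesses, not the project's actual inventory:

    import numpy as np
    import networkx as nx

    def topology_vector(g):
        """Cheap numeric summary of a skeleton's topology."""
        degrees = [d for _, d in g.degree()] or [0]
        return np.array([
            g.number_of_nodes(),
            g.number_of_edges(),
            max(degrees),
            float(np.mean(degrees)),
        ])

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def rank_sources(target, sources):
        """Order candidate source skeletons by topological similarity to the target."""
        t = topology_vector(target)
        return sorted(sources, key=lambda name: cosine(t, topology_vector(sources[name])), reverse=True)

Such a coarse pre-filter reserves the expensive mapping phase for the few most promising sources.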
Dr Inventor presents many challenges to similarity-based discovery, such as identifying a compelling source ROS, balancing structural and semantic factors in the mapping phase, and performing quality assurance on the resulting inferences. Choosing the correct interpretation(s) of each domain to find an appropriate mapping will also be crucial. The analogy-based model envisages the re-description of any given target using a pre-stored collection of sources with which to re-interpret that problem. This requires a rich memory of background knowledge to seek creative interpretations of the targeted problem through extensive analogical comparison to a wide range of objects. In this context, Dr Inventor aims at exploring the potential of web resources to promote scientific creativity. From the previous example sentence in the IE section, we have the graph [paper](propose)[ffd], where [] is a concept node and () denotes a relation connecting concept nodes. Visual Analytics In Dr Inventor, visual analytics will serve to visualize the analogical reasoning and conceptual blending processes. Graph visualization is a natural choice for visualizing the ROS, and it can be supported by other kinds of visualization. This could involve a large number of skeletons with a considerable level of uncertainty originating from the similarity measures between the ROSs. To allow effective handling of large-scale visualization, we will also investigate aggregation techniques such as binning, abstraction and hierarchical clustering, to aggregate data effectively at different levels of detail. User interaction with a creative system is an interesting research issue. The interaction techniques are categorized as select, explore, reconfigure, encode, abstract/elaborate, filter, and connect. An important task of user interaction is to help users navigate the data. To this end, the interaction will follow the recommendation of 'overview first, details on demand', working together with the data aggregation. Techniques that support zooming within local areas, focus+context, and coordinated views will also help users to interactively explore comparisons without losing sight of the overall data structure. Web-Based Creativity Service Dr Inventor will present a web-based system for exploring scientific creativity. It will offer a front-end web interface and a back-end mechanism addressing data transfer, access and federation, resource management, etc. The back-end crawler will constantly gather research objects from the web, extending the ROS repository. Information extraction and the subsequent activities will be applied, as previously discussed. At the front end, a web-based interface will be built to allow interactive browsing, search and visualization of the analogies between ROSs from the repository of analogically matched skeletons, to inspire user creativity; to assess an input RO; and to drive a creativity-inspiration engine that promotes scientific creativity in highly interactive ways. The system is expected to be linked to a social network service (e.g. LinkedIn, Facebook or Twitter) to enhance interaction and to explore common interests among researchers. Finally, APIs will be developed to support further development. Evaluation Among the remaining significant challenges will be the evaluation of Dr Inventor, assessing its impact on the creativity of its user groups.
This will rely heavily on access to a group of domain (computer graphics) experts for assessment and evaluation. Just as important to Dr Inventor is the development of a set of benchmarks and metrics for evaluating the progress of this project. Conclusion Models of analogical reasoning are presenting new horizons for intelligently processing information, unearthing creative possibilities in new and surprising ways. Applying analogy-based models to academic resources is a broad and open-ended challenge, requiring advances in areas like document analysis, representation, ontology, analogy and blending, and visualization. Dr Inventor aspires to Big-C Creativity (Gardner, 1993), hoping to support the transformational creativity (Boden, 1992) associated with significant scientific progress. Boden (1998) identifies two major bottlenecks for transformational creativity: firstly, the domain expertise required for mapping the conceptual spaces to be transformed, and secondly, valuing the results produced by a transformationally creative system. We believe that both challenges will be addressed by the combined efforts of the different activities in Dr Inventor, leading to a powerful tool that will invigorate the research communities, opening up new and exciting possibilities. A number of high-level issues arise related to Dr Inventor. Firstly, is documented information sufficiently complete to allow fruitful comparisons to be drawn between research papers, collections of papers or other sources? Can Dr Inventor adequately identify creative analogies from such sources? Will users be sufficiently receptive to accept creative inspiration from Dr Inventor? How can we maximize the impact of each component of Dr Inventor to produce comparisons with the greatest effect on its users? These and many other challenges await. Acknowledgements The research leading to these results has received funding from the European Union Seventh Framework Programme [FP7/2007-2013] under grant agreement 611383.
2014_34 !2014 Combining Representational Domains for Computational Creativity Agnese Augello, Ignazio Infantino, Giovanni Pilato, Riccardo Rizzo, Filippo Vella, Istituto di Calcolo e Reti ad Alte Prestazioni, Consiglio Nazionale delle Ricerche, sede di Palermo, ed. 11. e-mail: name.surname@cnr.it Abstract The paper describes a combinatorial creativity module embedded in a cognitive architecture. The proposed module is based on the focus-of-attention model proposed by Gabora (2002) and is implemented using Self Organising Map (SOM) neural networks. Introduction Creativity is mainly perceived as a high-level cognitive characteristic, which always refers to a conceptual space, whether it is conceived to explore or to transform that space (Boden 2009). One of the components of creativity is an associative memory capable of restoring an incomplete sensory input stimulus by adjusting the focus of attention. A cognitive model for creativity based on the ability to adjust the focus of attention has been proposed in (Gabora 2002). According to this model, a variable focus of attention, while pointing at the basic idea, also collects other concepts that are part of the stream of thought. The focus of attention can be considered as a basic idea, a framework that drives the creative process and is connected to the analytical mode of thought. At the same time, another basic component of the cognitive model proposed by Gabora is the associative memory.
By means of associations between different concepts and completion mechanisms, new and surprising results can emerge (Bogart and Pasquier 2013). This kind of creative process can be tied to what Boden calls combinatorial creativity, which is about making unusual combinations, consciously or unconsciously generated, of familiar ideas (Boden 2009). An Arcimboldo painting is a good example of what we mean. The painting of a human figure presumes a very precise framework constituted by the figure's details, such as nose, eyes and lips, by rules for their relative positioning, and by all the other details that make up a human figure. The attention focus is what we use to navigate the framework: it points at the details of the figure, which can be substituted with elements belonging to another domain (as in the painting in Fig. 1) by exploiting the associative memory. We believe that, during the creative process of imagining the painting, the attention is relaxed, and other images, sought in another domain, come to mind and take the place of the original parts of the human figure. We interpret the completion operation in a very broad sense. The basic point of combinatorial creativity is to mix together parts coming from different sources; in this sense, completion is a way to enrich a framework with new items in order to obtain new combinations. In our opinion it is possible to obtain robust fusion and completion through the combination of various models of neural networks: an example of such an approach is described in (Thagard and Stewart 2011), which emphasises associations useful for generating creative ideas by simple vector convolution. The importance of associative mechanisms is also underlined by neurobiological models of creativity, many of which are based on the simultaneous activation of, and communication between, brain regions that are generally not strongly connected (Heilman, Nadeau, and Beversdorf 2003). In this paper we illustrate an approach aimed at supporting the execution of an artificial digital painter (Augello et al. 2013b; Augello et al. 2013a). The approach is exploited by the Long Term Memory (LTM) module of the cognitive architecture presented in (Augello et al. 2013b) and reported in Fig. 2. The proposed approach is based on a multilayer mechanism that implements an associative memory based on Self Organizing Maps (SOMs) (Kohonen, Schroeder, and Huang 2001) and is capable of properly mixing elements belonging to different domains. Figure 1: A detail from Spring (1563), an Arcimboldo painting (image from Wikipedia). Architecture In (Augello et al. 2013b) we defined the mechanisms to support creativity in a cognitive framework. In this work we use the same architecture (see Fig. 2), but we adopt a new version of the LTM (Long Term Memory) that implements an associative mechanism described in detail below. Figure 2: The general cognitive framework used for the proposed system. Light grey blocks are neglected in this implementation. As said before, one of the basic components is an associative memory capable of restoring an incomplete sensory input stimulus. Completion is guided by context: when we interpret fuzzy or confused handwritten characters, we use associations with memorised handwritten characters and then complete or rebuild the input, so that the most common associations are made using objects of the same context.
Objects coming from the same domain are probably represented by the same features and share the same concept space, as described by Gärdenfors (Gärdenfors 2004). Associations can also involve objects from different contexts in a more creative way; in this case the original context is discarded and objects come from different domains. Following these considerations, we have built a multilayer mechanism that connects memory locations related to a single domain. We have also built another layer that connects memory locations through a more general association mechanism, allowing associations that go beyond the domain. This second, upper layer will be used when the original domain is discarded, for example when we want to find other solutions or to mix different domains. The associations made at the second level are the ones made when the focus of attention is relaxed, so that associative connections can form even outside a specific domain. The structure we propose is represented in Fig. 3. Inputs from sensors are sent to the proper domain at the first level, where they are memorised or completed when necessary. The second level contains the associations among different domains, which will be further explained in the following paragraphs. The associative memory module that we propose is inspired by the work in (Morse et al. 2010) and is implemented using Self Organising Map (SOM) neural networks (Kohonen, Schroeder, and Huang 2001). Self Organising Maps are neural networks constituted by a single layer of neural units, usually organised in a 2D grid. After a successful training phase, each neural unit ideally approximates the centroid of a cluster of input patterns, and neighbouring units represent similar values. This way, each neural unit corresponds to a sort of average pattern for a cluster of inputs. Figure 3: The overall schema of the proposed architecture for the Long Term Memory module (LTM). The architecture proposed in (Morse et al. 2010) is made up of multiple SOMs, each one receiving inputs from a different sensory modality. In our architecture, the SOM array in the upper part of Fig. 3 receives inputs from different features extracted from the same sensory input, so that one SOM in the set may receive colour features from the image, another the image boundaries, another texture information, and so on. The values of the SOMs are collected by the hubSOM, which synthetically represents the object by gathering the representations of the different SOMs. This process is sketched in Fig. 4, where the different features are different parts of the image. Figure 4: The associative domain memory training. While the SOM set and the hubSOM constitute the associative module for a domain, there is also another SOM, named the second level SOM, where the association among different domains takes place. The information from the domain modules is represented at this second level using more general features. For example, if a domain is used to memorise images of trees and one of the SOMs in the array in Fig. 3 memorises the shape of the leaves, the second level SOM can use the dimensions of the bounding box (the rectangle surrounding an image detail) as a feature. When we want to mix together objects from other domains, we can consider objects that have the same bounding box. The substitution will be driven by the second level SOM, whose aim is to faithfully reflect the general structure of the image.
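A minimal sketch of the SOM building block just described, assuming a plain NumPy implementation with a Gaussian neighbourhood and a decaying learning rate; the hyperparameters are illustrative, and the paper's own maps are trained with the fast procedure of (Rizzo 2013) instead:

    import numpy as np

    class SOM:
        """Single layer of neural units organised in a 2D grid."""

        def __init__(self, rows, cols, dim, seed=0):
            rng = np.random.default_rng(seed)
            coords = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
            self.grid = np.stack(coords, axis=-1).reshape(-1, 2)  # unit positions
            self.w = rng.random((rows * cols, dim))               # one weight vector per unit

        def bmu(self, x):
            """Index of the best matching unit (the recalled memory location)."""
            return int(np.argmin(((self.w - x) ** 2).sum(axis=1)))

        def train(self, data, epochs=20, lr0=0.5, sigma0=3.0):
            for t in range(epochs):
                lr = lr0 * (1 - t / epochs)                   # decaying learning rate
                sigma = max(sigma0 * (1 - t / epochs), 0.5)   # shrinking neighbourhood
                for x in data:
                    b = self.grid[self.bmu(x)]
                    d2 = ((self.grid - b) ** 2).sum(axis=1)   # grid distance to the BMU
                    h = np.exp(-d2 / (2 * sigma ** 2))        # Gaussian neighbourhood
                    self.w += lr * h[:, None] * (x - self.w)  # pull units toward x

After training, bmu(x) returns the unit whose weight vector best approximates x; in the architecture above, the two grid coordinates of that unit are what each feature SOM passes up to the hubSOM.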
A substitution according to the bounding box dimensions is a simple criterion, but a more general set of features could also be employed. This second level SOM implements the spreading of the attention focus, because it mixes objects from different domains and groups them by considering only very rough characteristics. The next two subsections explain how we can implement the effects of a variable focus of attention: with a narrow focus we obtain simple completion inside the same domain, while by spreading the focus we can recover objects from different domains. Completion in the same domain Completion in the same domain is the simplest form of completion. For example, let us assume that a domain is trained to memorise simple images, and imagine that inside this domain there are SOMs that memorise very specific parts of the image: we can think of each SOM memorising a quadrant of the image or, when representing faces, segments of human faces. In this case the basic components would be eyes, lips, noses and so on, memorised along with their positions in different SOMs. The hubSOM takes into account all the positions of the components. This is sketched in Fig. 4. If a part of an image is missing, only some of the SOMs can recall the corresponding memory locations and help to reconstruct the memorised image: one or more SOMs will not answer, because they have no input. The hubSOM accomplishes the task of recalling the necessary memory locations from the SOMs that have no input, in order to put together all the pieces of the image. This procedure is depicted in Fig. 5: the missing piece of the image causes a recall failure in SOM Map 4, so the hubSOM, containing the reference to the whole, outputs the address of the location in SOM Map 4 and recalls the missing piece. Figure 5: The completion procedure in the domain. Completion in a different domain When completion is obtained using parts or memories that are outside the domain of the original image or input, we are making an association that is not causal. This can happen when the recalled part is used to obtain memory contents from other, different domains. In this case the associations are the ones memorised in the second layer SOM, i.e. associations that correspond to features of a different kind. In Fig. 6 the whole process is sketched: the missing part is recalled as said before; however, in this case it is not sent to the output but to the second level SOM, where it is used for recalling objects from different domains. The recovered information is used as a reference in order to obtain the missing part, which is sent to the second level SOM. This signal excites a unit of the second level SOM, and its output is sent back to the associative memories of all the other domains. Each domain answers with a list of the excited units, which point to a set of signals corresponding to the memorised objects. As indicated in Fig. 6, all these objects are proposed as substitutions for the missing part. At this point the completion proposed by the original domain is again used as a reference: all the proposed substitutions are compared to the original completion, and the most similar one is chosen as the substitute. This mechanism is implemented in the box 'Implementation with Expectation' in Fig. 6. Figure 6: The completion procedure outside the domain.
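A sketch of that final expectation-driven choice, assuming the in-domain completion and each cross-domain candidate are plain feature vectors; the variable names and toy data are ours, not the paper's:

    import numpy as np

    def select_substitute(expected, candidates):
        """Pick the cross-domain candidate closest to the in-domain completion."""
        return min(candidates, key=lambda name: np.linalg.norm(candidates[name] - expected))

    expected = np.array([0.35, 0.65])   # completion proposed by the original domain
    candidates = {                      # objects recalled from the other domains
        "leaf_17": np.array([0.20, 0.80]),
        "flower_03": np.array([0.40, 0.60]),
    }
    print(select_substitute(expected, candidates))  # -> "flower_03"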
Some Experimental Results and Conclusive Remarks The experiments were mainly performed to evaluate the effectiveness of the mechanism that replaces parts of images using the associative memory previously described. We chose a face domain, which allows immediate recognition, and a leaves-and-flowers domain, in order to resemble the effect of the Arcimboldo style. A sample of the images in our dataset is given in Fig. 7. The system was trained using 113 grey-scale images of faces and 100 images of leaves and flowers. Each image is 100 × 100 pixels, in order to keep the neural architecture to a manageable size. To reproduce the completion mechanism and to partially simulate the mechanism of the focus of attention, each quarter of the image was memorised in a different map of the array of SOMs (see Fig. 4). We tried a quad-tree decomposition and the learning process described above; an example is reported in Fig. 9. Each SOM in the array has a size of 20 × 20 units and is trained with segments in the same position of the image, using the fast training procedure described in (Rizzo 2013); the result is shown in Fig. 8. Figure 8: The maps in the array of SOMs after training are used as memory units. At the end of the process it is possible to train the hubSOM by submitting the images of the training set to the SOM array; for each SOM we get the two-digit coordinates, on the array of neural units, of the most similar exemplar (often called the best matching unit, or b.m.u.). These coordinates are submitted to the hubSOM, which learns this 8-digit image coding and, after training, is able to rebuild the correct coding for each image. This kind of representation is too precise to be used at the higher level, where we want to mix together different things. At higher levels we want a representation that captures just some of the characteristics of the images, for example colour masses, boundaries, shapes, and so on. For this reason we used Haar and Gabor features, which contain less information. Figure 9: Final artwork obtained by our approach. Conclusion The preliminary experimental results show that the proposed associative memory module is promising for the implementation of a sort of combinatorial creativity mechanism. Future work will address the modelling of artists' behaviour and motivation, the choice of domains during the completion process, and the evaluation of both the creative process and the produced artworks, according to the literature (Pease and Colton 2011; Colton and Wiggins 2012; Jordanous 2012). Acknowledgment This work has been partially supported by the PON01 01687 - SINTESYS Research Project.
2014_35 !2014 Exploring Conceptual Space in Language Games Using Hedonic Functions Anhong Zhang and Rob Saunders, Design Lab, University of Sydney, Sydney, NSW 2010, Australia. azha3482@uni.sydney.edu.au, rob.saunders@sydney.edu.au Abstract The ambiguity of natural language can be an important source of creative concepts. In compositional languages, a many-to-many network of associations exists linking concepts by the polysemy and synonymy of utterances. This network allows utterances to represent the combination of concepts, forming new and potentially interesting compound meanings. At the same time, new experiences of external and internal contexts provide abundant materials for the evolution of language.
This paper focuses on exploring the role of compositional language in social creativity through the simulation of language games running on multi-agent systems, using a hedonic function to evaluate the interest of utterances, treated as design requirements, and of the resulting design works. Introduction A single word may be associated with multiple meanings, while one meaning can be represented by multiple words. Such ambiguity of polysemy and synonymy can be a source of creative inspiration, allowing the exploration of conceptual spaces by traversing the many-to-many mappings between words and meanings. Many-to-many mappings between utterances not only construct connections between seemingly unrelated concepts, but also provide more opportunities to recombine sub-utterances into new utterances representing novel meanings. The function of the ambiguity of language in social creativity can be explored through the use of language games combined with multi-agent simulation. In the guessing game (Steels, 1995), a speaker-agent describes an object using an utterance to a listener-agent, who attempts to identify the topic of the utterance based on its experience of previous utterances and the current context. By repeating the guessing game over many generations, a simple language, grounded in use, may evolve (Steels, 1995). In the generation game (Saunders and Grace, 2008), agents that were previously speakers or listeners in a guessing game explore a conceptual space using communication between a client-agent and designer-agents. A requirement is expressed as an utterance by a client-agent and may be related to various meanings by multiple designer-agents that have different experiences of similar utterances. The creativity of the communication primarily depends on the client-agent generating an interesting requirement and selecting interesting design works produced by the designer-agents in response. The evaluation of interest can be modelled using a hedonic function, e.g. the Wundt Curve (see Figure 1), where similar-but-different perceptual experiences are preferred. Figure 1. The Wundt Curve, a hedonic function for evaluating interest based on agents' confidence. Methods The language games used in the simulations described in this paper produce utterances as a result of a compositional language. Compositional languages, as opposed to holistic languages, permit utterances to be composed of multiple words. Composition can be utilized to generate new utterances denoting valuable concepts. For example, given the previous utterances RED SQUARE, RED TRIANGLE and BLUE TRIANGLE, a new utterance such as BLUE SQUARE may be generated by recombining the evolved sub-utterances BLUE and SQUARE. The agents in the simulation use Adaptive Resonance Theory (ART) networks to categorize utterances and concepts. ART networks are both stable and dynamic: they can not only retain existing categories but also add new categories for unfamiliar inputs which exceed the recognition threshold of the ART system (Saunders, 2002). Experiment The experiment explores how combining existing utterances can generate new utterances representing interesting meanings, through communication between agents who play the roles of speaker and listener in the guessing game, as well as client and designer in the generation game. Experiment Settings 1. Initial settings The first set of experiments was initialized with 50 samples randomly selected from 121 objects, which were generated by combining 11 colors and 11 shapes.
Each of the shapes is represented by a value from a list, e.g. [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]. The population of agents is 6. The language game uses one combination rule combining two features, color and shape; each feature is represented by one letter as its name, so the length of an utterance is limited to 2. For example, the utterance for {color:0.2, shape:0.3} may be 'ha'. 2. Guessing game settings In the guessing game, 8 topics are selected randomly from the available samples for each exchange between a speaker and a listener, themselves selected at random from the 6 agents. When the success rate (see Equation (1)) rises above 60%, the guessing game is finished and the generation game is started. rate_success = times_success / (times_success + times_failure) (1) 3. Generation game settings In the generation game, four types of procedures, with or without evaluation of the interest of requirements (utterances) or works using the Wundt Curve (Figure 1), are implemented. Each design cycle is repeated 1000 times. In every cycle, the last agent plays the role of client while the others play the role of designers. Experiment Procedures Procedure 1. Guessing game The following guessing game is implemented repeatedly until the success rate reaches 60%. 1. The speaker selects one topic randomly from a randomly generated context. 2. The speaker generates an utterance representing the selected topic and tells the listener. 3. The listener guesses the topic by exploring its existing associations between utterances and its ART categories. If an appropriate association cannot be found, a new association between the utterance and the topic in the current context is generated. The listener then tells the speaker its guess. 4. If successful, both speaker and listener increase the weight of their association between the topic's ART category and the utterance, and either increase the frequency of each related instance or generate a new association connecting the selected topic with the utterance. Otherwise, if the guess failed, the listener decreases the weight of the related association, generates a new association between the topic's ART category and the utterance, then increases the weight of the newly generated association and generates a new association connecting the correct topic and the utterance. After completing the guessing game, the agents (Group A) are cloned three times to get three new groups of agents (Group B, Group C and Group D) to implement different procedures for the generation game. Procedure 2. Generation game without evaluation of interest The generation game is implemented for 1000 generations by Group A without evaluating the interest of requirements and works. 1. Each designer-agent generates a set of design works (topics) by searching its existing associations, or by generating a new association connecting a related ART category with the client-agent's requirement (utterance). 2. The client-agent generates an utterance by combining two randomly selected names of each feature's ART category (prototype), without evaluation. 3. The client-agent selects the design works most similar to its requirement-associated topic. But if the most similar works do not belong to the same ART category as the client-agent's associated topic, the game fails, and all designer-agents decrease the weights of their own selected associations.
Otherwise, the client-agent finds its own relevant association, or generates a new association connecting its own ART category of the selected design works with the utterance; it then increases the weight of the association and either increases the frequency of the related instance or generates a new instance connecting the works and the utterance. At the same time, the successful designer-agents increase the weight of the related rule and increase the frequency of the related instance or generate a new instance, while the other designer-agents decrease the weights of their related associations. Procedure 3. Generation game with evaluation of works' interest The generation game is implemented for 1000 generations by Group B. The procedure is the same as Procedure 2 except that the client-agent selects design works using the Wundt Curve. In the process of selecting design works, the distances between each design work's features and the client-agent's original topic's features are first measured, and then their hedonic values are evaluated. The design works with the highest interest are selected by the client-agent. If all interests
Figure 2. An example of the distributions of agents' instances.
Procedure 4. Generation game with evaluation of requirements' interest The generation game is implemented for 1000 generations by Group C. Each time, the procedure is the same as Procedure 2 except that the client-agent generates several requirements (utterances) and selects the most interesting one. Firstly, the weight of each single utterance in every requirement is calculated by summing the frequencies of the utterance's use across all instances. Then the interest values of the requirements are calculated by summing the interests of their own utterances. Finally, the requirement with the highest interest is selected. Procedure 5. Generation game with evaluation of both requirements' and works' interest The generation game is implemented for 1000 generations by Group D. In each generation game, the procedure is the same as Procedure 2 except for the generation of interesting requirements and the selection of interesting design works by the client-agent. The process of generating interesting utterances is the same as in Procedure 4; the process of selecting interesting design works is the same as in Procedure 3. Results In Figure 2, the radius of each circle represents the frequency of an instance used by an agent. If a topic is associated with more than one utterance, several circles are drawn at the same place, resulting in a darker color. The results of the experiments show that agents explored a greater number of new topics and generated more instances (associations between topics and utterances) when the client-agent used the Wundt Curve only for selecting interesting design works; see Figure 2(C3). But the frequency differences between the instances are not distinctive compared with when the client-agent utilized the Wundt Curve not only for selecting interesting works but also for generating interesting requirements; see Figure 2(C5). This suggests that the client-agent preferred using a small set of interesting utterances frequently. Hence, the frequency distribution of instances is non-uniform; comparing Figure 2(C5) with Figure 2(C3), this uneven frequency distribution resembles a signature of life.
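Procedures 3 to 5 above score candidate works or requirements by hedonic value. A minimal sketch of a Wundt-style hedonic function, assuming the common formulation as a reward sigmoid minus a later, stronger punishment sigmoid over novelty; the constants are illustrative, not taken from this paper:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def wundt(novelty, reward_mid=0.4, punish_mid=0.7, gain=20.0, punish_scale=1.2):
        """Hedonic value of a stimulus: moderate novelty scores highest."""
        reward = sigmoid(gain * (novelty - reward_mid))                     # rises with novelty
        punishment = punish_scale * sigmoid(gain * (novelty - punish_mid))  # kicks in later, harder
        return reward - punishment

    # the client-agent keeps the candidate with the highest hedonic value
    candidates = {"work_a": 0.1, "work_b": 0.5, "work_c": 0.9}  # novelty w.r.t. experience
    print(max(candidates, key=lambda k: wundt(candidates[k])))  # -> "work_b"

Both the overly familiar work_a and the overly novel work_c score low, which is what drives the preference for similar-but-different experiences noted in the Introduction.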
The number of designer-agents' instances is smaller than the number of the client-agent's instances (see Figure 2(C2-D5)), except for those generated in the guessing game (see Figure 2(C1,D1)). This is because in most successful interactions only one designer-agent's works could be accepted by the client-agent, leaving the other designer-agents no opportunity to update their instances, whereas in the generation game the client-agent can update its instances on almost every success. The average numbers of instances generated by the client-agent and by the designer-agents in a generation game are shown in Figure 3. When the Wundt Curve is used only to assess the interest of design works, the number of instances increases sharply, especially for the client-agent. However, when the Wundt Curve is used to evaluate not only the interest of design works but also that of utterances, the number of instances decreases even below the number generated without any evaluation of interest; the exception is the average number of designers' instances generated in Procedure 5, which is slightly higher than in Procedure 2 but still lower than in Procedure 3. The average max degrees of the graph networks of instances generated by the client-agent and the designer-agents, respectively, are illustrated in Figure 4. As can be seen, the highest average max degree belongs to the instances generated using the Wundt Curve to select both interesting requirements and interesting works. The average max degree associated with the evaluation of requirements only is higher than that of works only. Therefore, the evaluation of the interest of utterances may be more important than that of works. Figure 3. The average number of agents' instances generated with or without evaluation of interest in generation games. Discussion Based on the results of the simulations, client requirements may be more important than designers' works, because the final pattern of the distribution of utterances and design works is primarily determined by the client-agent rather than the designer-agents. Interesting requirements narrow the combination area of utterances initially generated via crossover of two randomly selected utterances, resulting in the selection of interesting artifacts. Figure 4. The average max degree of the graph networks of agents' instances generated with or without evaluation of interest in generation games. According to the illustrations in both Figure 3 and Figure 4, 'less is more' is realized as fewer instances and more connections. In other words, many meanings may be associated with one utterance, while the total number of utterances can be relatively small when using a hedonic function to select randomly combined utterances. Consequently, more connections may lead to discovering more new concepts. Therefore, the procedures of the language games described in this paper could be adopted in brainstorming by both clients and designers, evolving original requirements and novel concepts. The combination of the guessing game and the generation game can also be utilized in artificial collaborative systems to handle the evolution of compositional language for creative design. Conclusions The results of the simulations suggest the following conclusions: 1. Using a hedonic function to evaluate the interest of utterances affects the direction of the exploration of conceptual space. 2. The ambiguity of language, especially that caused by polysemy, may play an important role in creative communication using compositional language. 3.
Client-demand-driven design may be more important than content-driven design in social creative systems. Future work Graph theory has been used in this paper for evaluating the degree of connections between utterances and meanings. Other graph-theoretic functions (Hagberg, Swart, and Schult 2008), such as density and diameter, related to the evaluation of social creativity, will be explored. Language games based on fuzzy sets have been implemented in our most recent experiments, so simulations using fuzzy sets to represent vague and ambiguous concepts will be studied in the near future.
2014_36 !2014 The apprentice framework: planning and assessing creativity Santiago Negrete-Yankelevich and Nora Morales-Zaragoza, División de Ciencias de la Comunicación y Diseño, Universidad Autónoma Metropolitana-Cuajimalpa, Vasco de Quiroga 4871, Santa Fe, Cuajimalpa de Morelos 05348, D.F., México. {snegrete, nmorales}@correo.cua.uam.mx Abstract In this paper we introduce and discuss the apprentice framework, which we speculate can be used to plan and evaluate computational creativity projects. The framework defines a sequence of phases a system must follow in order to reach a level of creativity acceptable to a set of human judges. It also establishes four aspects of a creative piece that are susceptible to creative work. We mention some examples from different artistic disciplines. Our work focuses on establishing an environment, as well as a team of people and machines, to foster, study and monitor the emergence of creativity. On Human and Machine Creativity Assessing creativity in machines has become a prime issue in computational creativity now that many systems have been built that exhibit behavior that can intuitively be considered creative. The domains of such systems are so varied, and the versions of each one so many, that comparing them to one another, or with different versions of themselves, has become a hard task. Every system that claims to be creative must have associated criteria specifying what kind of creativity it aims to achieve. After all, in different domains, different notions of creativity may be established. Even so, several frameworks and models of creativity have recently been advanced to try to capture a generic or general notion of creativity for computer systems (Ritchie 2007; Wiggins 2006; Jordanous 2012; Maher et al. 2013; Colton et al. 2011). They all propose a practical method to unify criteria across the community so that creativity can be measured in systems from different disciplines of application. Computer programs, so far, are designed to produce valuable things for humans, so their creativity is always assessed against human values or needs. This is an unfair situation, since it is very hard to program computers to produce valuable objects for humans when those values are not well defined and only humans can say whether the objects are valuable. If computers were subject to a survival economy like living things on the planet, as Stuart Geiger suggests (2012), then it would be easier to establish what is valuable for them, and hence a process parallel to human creativity could be defined to assess creativity by computers. But, for the time being, computers are still doomed to serve our purposes, and their creativity will be assessed by human standards. Creative computer systems are still considered, in practice, to be in a separate realm from human creativity.
They are often assessed against toy scenarios, or their products are described as computational creativity (as opposed to simply creative) to avoid measuring them against human products or creativity. This leaves to the observer the decision of whether the system's behavior is creative in general terms, or could be generalized to reach a state where it could be considered so. In creativity we still don't hold machines and humans to the same expectations, yet the very concept is defined with respect to the latter. So human and machine creativity are not the same, in the sense that they neither fulfill the same expectations nor are assessed by the same standards. But as computers get more involved in creative processes, it is possible to view them as participants and describe what they do as playing a role in a team (Jones et al. 2012). It is possible to interpret the process leading to a creation as collaboration between humans and computers and to assign roles to all of them according to their activities. Our view of creativity evaluation is that, although there may be many axes along which it can be measured that seem common to several disciplines, actual criteria ultimately seem elusive, arbitrary, subjective and ever-changing. These characteristics of creativity don't seem to be problematic for society, and most people accept them. It is when we require precision to measure the performance of computer systems that vagueness becomes problematic. The only way we have to tell whether a computational system is creative is to insert it into a human environment and ask humans to assess whether the outcome of the process is creative in the general sense of the term. Thus a concert composed by a computer, for example, will have to be listened to by the same group of human experts who would decide whether a composition made by a human is creative, in the general sense applied in music. In this text, we describe a framework we call the apprentice framework, for planning and evaluating CC projects. It derives from our multidisciplinary experience in an ongoing project called e-Motion (Negrete, S. & Morales, N. 2013), aimed at building a creative system to produce animatics. These animated shorts, precursors to a final animation, are an essential element of the overall creative process. In the project we examine the relationship between a computing system and the human counterparts that collaborate with it within a successful, creative team. Where is Creativity? Creative products are the result of creative processes. These can take infinite forms, but we identify four aspects of creations (creative pieces). Aspects are properties of creations that may be the result of creative work. They are identifiable as the results of separate mental processes that may have occurred at separate times, and may even have been performed by different people: Structure is the basic architecture of a piece; it is what allows spectators to make out its different parts, and to analyze it to understand its main organization. Plot is the specialization scaffold of the structure for one purpose; it is the basis for narrative and the most detailed part of the planned structure. It is upon plots that pieces are rendered. Rendering is a particular way in which the plot is developed and filled with detail in order to be delivered to the audience. Remediation is the transformation of an already rendered creative piece into another one, re-rendered, possibly in another medium.
We now discuss some examples of this model in different creative disciplines. Music. If we consider a piece of music, a composer can be innovative in the structure: a new form of concerto, symphony, or even something unknown to this day. She may also be innovative in the plot: a new score, that is, a new piece of written music; a new concerto, for instance. Musicians can also be innovative in the rendering of the piece, that is, the execution of the score with the realization of all the details needed to deliver the piece to the audience: a performance. Or they can do remediation: transcribe an already composed piece of music from a string quartet to a rock band, for example. Literature. Here the structure refers to the genre, the most general structure of texts: tragedy, satire, comedy, etc. Plot is the structure of a particular story, and rendering is the process of transforming a plot into a complete literary piece that the audience can read. The rendered piece can also undergo a process of remediation and be adapted to cinema or theater, etc. Performing arts usually concentrate on elaborating different renderings for given plots (scores). Each staging of a play in a theater is, in the terms used here, the rendering of a plot, that is, the specification of all the details needed for the audience to receive the original idea. If the performance is improvised, then the performers create both plot and rendering for the audience at the same time. Visual arts. In painting, we can consider the plot to be the sketch on a canvas, the initial drawings where the composition is outlined and the main elements designed. The rest of the work has to do with filling in the details to complete the painting: details, colors, texture, etc. This is what we call the rendering. In this context, the structure aspect of the painting is its general description as a piece of art: oil on canvas. The audience perceives, in the first instance, the rendering, then the plot, and finally the structure; they go from the most emotional aspect of the set to the most intellectual or logical one. Remediation may or may not be part of the piece; it is only included when a certain translation from medium to medium is needed. The rendered piece produces emotion in the audience, while the plot produces understanding of the design behind the piece. Plot and structure enable communication; rendering and remediation, expression. All aspects are present in a piece in different degrees: in some works of abstract art, like Jackson Pollock's paintings, plot plays a minimal role and rendering is the most important aspect; there is hardly any structure, and the emotion produced by the lines and colors constitutes the main expressive motif. In some pieces of conceptual art, on the other hand, structure and plot are the most important aspects, while rendering is not as important. In Gabriel Orozco's Cats and Watermelons, the number, order, size or disposition of the cans and the fruits (rendering) is not as important as the idea behind it all, the plot. One important thing about these four aspects of creativity is that they are not stages in the creative process; they emerge during the process, in any order or simultaneously. These properties might be the result of the creative activity of an individual or a collective, and they influence each other. Distinctions between them can characterize different forms of creativity. An art piece puts more emphasis on rendering, while design puts it on plot.
Literature and visual narratives strive to attain a balance between the two, in order to maintain the equilibrium between clarity and expression to be enjoyed by an audience (McCloud, S. 2006). The Role of the Computer in a Creative Team Computers and computer programs are often used in creative processes. They can be used to store information, as tools, as means of displaying work, and more. But not all of these uses involve the same degree of creativity. We therefore distinguish five roles a computer program can play in a creative process: Environment. The computer is a medium where other members of the team can store, display and transmit the work; in general, it acts as an environment where the work is created. Toolkit. The computer is used by members of the team as a set of tools to transform and shape the work. Generator. The computer has been programmed to generate specimens or prototypes of partial or complete pieces of work that meet correctness rules. That is, the specimens belong to the desired kind (chairs, paintings, sonatas, stories, etc.), and team members can adjust parameters in order to vary the specimens generated. The final piece of work is either selected from the set generated or is an elaboration of some elements of the set. Apprentice/assistant. The computer produces a reduced set of prototypes that, besides being correct members of the desired kind, also fulfill some of the properties of creative products: e.g. they are valuable, innovative, surprising, etc. In this case, other team members have to choose the best of the candidates proposed by the system according to more subjective human criteria (e.g. trendiness, politics, commission requirements, etc.). Master. The computer produces a complete and finished work that is considered creative by the designated experts. The rest of the team does management and configuration of the system and handling of the finished work. The environment role is the most common use of computers for creative purposes. Many people performing creative tasks have found that computers provide them with a suitable environment to work digitally on their subject matter. Working within a computational environment is often simpler, cheaper and more efficient. Another common role for computers is that of toolkit. Programs like Photoshop, and many more like it, that provide a set of tools a user can apply interactively while watching the work progress, are ever more popular amongst artists. These systems have become indispensable for artists and creators, and many activities, like photography, have already integrated tools like Photoshop into the basic toolset of the profession. Many sophisticated systems apply a set of well-studied rules to produce correct pieces of work. These works are easily identifiable as members of the desired kind (poem, tale, motet, sculpture, etc.). It is useful to develop systems like these because they raise the level of abstraction in the process of creation. The programs generate works that can be considered candidates (or nearly so) for a final piece. People using the system modify its parameters in order to alter the generation process and thus obtain better specimens. The user can be freed from the problem of assembling a product and concentrate on a new process by which the machine assembles the product and the user considers whether it is good enough or needs to be modified somehow. Works produced by a generator system may be novel to the system itself, but not necessarily to the rest of the world.
As we have said before, it takes a human eye to tell. Yet a generator may expedite the overall process by speeding up the trial-and-error cycle. An apprentice system is one that has reached a new level of sophistication, showing a degree of knowledge that produces work specimens fulfilling general criteria for creativity (e.g. valuable, innovative, and surprising). Going from generator to apprentice is perhaps the challenge most computational creativity systems face these days. It can be seen as a search problem: moving from trial and error up to informed search methods. The master level is set as a reference, an upper limit: a system that does all the important work and delivers a finished product that can be ascribed to a creative process is the ultimate capacity a computer system can acquire. We find several advantages to the model just described for the development of computational creativity:

1. When a machine is embedded in a creative process, any development of the programs in it can be checked for its impact on the overall creative process. In particular, it is possible to verify whether the outcome of the whole process is still creative, thus eliminating the problem of generalizing from toy worlds.

2. Versions of programs can be benchmarked according to the roles they are expected to play.

3. A staged plan for a research program can be drawn up, with clear goals and strategies based on the roles assigned to participants.

4. The four aspects of creative products we described help teams to identify, for a particular role, what it is trying to achieve, and to decide how to evaluate its performance.

Evaluating Creativity as Participation

We have just described a framework that, we believe, can be used to assess creativity in a computational system by combining it with methods already known from other fields, like design and the applied arts. These fields use participative and integrative methods to find out what is desirable and valuable for people (Ranjan, P.M. 2013). Participatory approaches are about including participants in the process of creating a product or experience. Evaluation in this kind of project is needed to measure impact and performance in the roles participants play. Nina Simon has developed a way to evaluate the impact of visitor participation in the context of museums, and we think it is relevant to our task. Her method consists of three main steps:

1. Stating the project goals.

2. Defining behaviors and outcomes that reflect those goals.

3. Measuring or assessing the incidence and impact of the outcomes via observable indicators.

Based on Simon's model and using the apprentice framework described above, we know how to proceed in order to either develop or assess a creative computational system. We should start by identifying which aspect of creativity is being emphasized; by doing so we set constraints and frame the context of the work. This drives the statement of project goals. Then we need to identify a particular role and skill that the computational system is expected to have, by taking explicit knowledge from the human members. Setting the skills and responsibilities of the computer in the overall creative process provides the criteria for constant evaluation and modification of the system. It is important to stress that participatory projects often benefit from incremental and adaptive measurement techniques. Many creative outputs are process-based.
They therefore have to be valued many times, incrementally, before the project ends, so that they stay aligned with the goals and all those involved are satisfied (Simon, N. 2012).

Conclusion

Our experience with eMotion has led us to question many of the underlying principles of CC. We have found, by looking at a complex creative human team, that it is difficult to pinpoint where creativity really lies. All members of the team can be credited with some percentage of the overall creativity. In the very same way, machines partaking in the process can be assigned their own share of the credit and be considered creative too. This view of creativity as part of a process that also provides context seems more promising as a generic framework for developing creative systems than the traditional view of a system designed to perform well in the whole process from the start. Often, the parameters of creative behavior in media projects are either not known from the beginning or highly subjective. Therefore, setting out to develop a computer system that plays a creative role in a team, and that can be readily assessed by the other members of the team as would happen with human members, gives a perspective in which several levels of proficiency can be planned ahead and assessed. Other frameworks share similar ideas with ours (Jones et al. 2012, Colton et al. 2011). The main difference is that we erase the distinction between assessing human creativity and machine creativity and try to establish a common methodology. Our framework seeks to evaluate different roles within a creative group, regardless of whether they are played by a person or a computer program. A creation (the result of a creative process) can have several creative components, built by sub-processes that can also be considered creative. In some disciplines this is recognized explicitly: in cinema, many prizes around the world, like the Oscars, recognize a whole movie, but also, separately, other creative sub-processes and products: script, musical score, set design, etc. Each of these is valued under a different set of rules and criteria by people who are experts in those areas. Yet all those sub-processes contribute to a whole movie, which, in turn, is valued in its own right. In many creative projects these sub-products can be identified and evaluated separately. CC systems can be inserted into a team to take on a specific role and create a particular creative sub-product. This view provides CC systems with a context in which their development can be planned and they can be properly evaluated. The four aspects of creative products we have mentioned in this paper allow teams to decide where a particular role is supposed to be innovative and, therefore, how it ought to be assessed.

2014_37 !2014 Criteria for Evaluating Early Creative Behavior in Computational Agents Wendy Aguilar, Posgrado en Ciencia e Ingeniería de la Computación, UNAM, México D.F. weam@turing.iimas.unam.mx; Rafael Pérez y Pérez, Departamento de Tecnologías de la Información, División de Ciencias de la Comunicación y Diseño, UAM Cuajimalpa, México D.F. rperez@correo.cua.uam.mx

Abstract

Our research is focused on the study of the genesis of the creative process. With this purpose we have created a developmental computational agent, which allows us to watch the generation of the first behaviors we could consider creative. It is very important to develop methodologies to evaluate the behaviors generated by this kind of agent. This paper represents our first effort towards that end. Here we propose five criteria for such evaluation, and we use them to test the behaviors created by our developmental agent.
Introduction

The construction of artificial systems which simulate the creative process is currently a topic of great interest in the artificial intelligence and cognitive sciences community. A great effort has been made to build methodologies that help us evaluate such systems (e.g. Ritchie, 2007; Colton, 2008; and Jordanous, 2012). Nevertheless, it has not been an easy task, and more research is necessary. This article is intended to contribute to it. Our research is focused on the study of the genesis of the creative process, which takes place during the first years of our lives, as explained in the next section. With this purpose we have created a computational agent that simulates cognitive development (introduced in Aguilar and Pérez y Pérez, 2013), which allows us to watch the generation of the first behaviors we could consider creative. In this article, we focus on proposing some criteria which may let us evaluate the behaviors generated by this kind of agent.

Concepts and Definitions

Creative Behavior

In the literature on the subject we can find a number of definitions of creative behavior. For example, from a behaviorist viewpoint, Razik (1976) defines creative behavior as a unique response or pattern of responses to an internal or external discriminative stimulus. From the point of view of artificial systems, Maher, Merrick, and Saunders (2008) propose that the development of creative behavior in artificial systems focuses on the automatic generation of sequences of actions that are novel and useful. This article is based on the point of view of Cohen (1989), who describes creativity as a series of adaptive behaviors in a continuum of seven levels of development: initially, creativity involves adaptation of the individual to the world; at higher levels, it involves adaptation of the world to the individual. In the context of this article, we will focus only on the first level, called Learning Something New: Universal Novelty. This kind of creativity results in behaviors that are useful and new to the individual, but neither strange nor valuable to others. Cohen considered that it can be observed in babies and toddlers as a result of their need to start adapting to the world.

Adaptation

For Piaget (see for example Piaget, 1952), adaptation takes place by means of two complementary processes he called assimilation and accommodation. The assimilation process allows children to face new situations by using their knowledge from past experiences. However, in some cases the situations they face contradict their current knowledge of the world. In these cases of conflict, the accommodation process allows children to deal with new situations by progressively modifying their knowledge (throughout the continuous interaction with their environment) in order to include the results of their new experiences. In this way, Cohen's first-level creative-adaptive activity helps us adapt to our world either by modifying our perception of the environment so that it fits the knowledge acquired from past experiences (that is, adaptation by assimilation), or by modifying and producing new knowledge when our knowledge does not match reality (that is, adaptation by accommodation).
Cognitive Development

Piaget considered that when children interact with their environment by using their previously acquired knowledge, they are in a state called cognitive equilibrium. When the interaction with their environment causes a conflict between their knowledge and reality, they experience a crisis moment called cognitive disequilibrium. He also suggested that the change from equilibrium to disequilibrium and back to equilibrium (through accommodation) drives children's evolution across four continuous, qualitatively different stages, from birth to adulthood (the interested reader can find a brief summary of his theory in Crain, 2010, chapter 6). The first of them is the sensorimotor stage, which starts at birth and finishes at around 2 years of age (approximately the age at which Cohen's first-level adaptive creativity is observed). According to his theory, the sensorimotor stage is subdivided into six substages, each characterized by the emergence of new behaviors. For example, the second substage is characterized by the acquisition of behaviors centered on the body, such as learning how to follow any object visually or how to keep objects of interest grasped; whereas the third substage is characterized by the acquisition of behaviors involving consequences on external objects, such as learning how to squeeze a rubber duck in order to make it quack. This first stage of development is quite interesting from the point of view of creativity. On the one hand, it is in this stage that children's behaviors start to be goal oriented; this is the beginning of means-end differentiation, a basic skill for becoming capable of solving problems, and problem solving has been considered a form of creativity (Runco 2007). On the other hand, Piaget himself considered it the most creative period of life, since it is during this stage that newborns must start to build their knowledge of the world, and such construction requires creativity (Runco and Pritzker 1999, p. 13). So it is during this period that children's first manifestations of creative behavior arise. Piaget called the evolution through the different substages and stages cognitive development.

Evaluation Criteria

Inspired by the paper of Maher, Merrick, and Saunders (2008), we propose that an artificial agent that simulates cognitive development (e.g. Stojanov 2001; Aguilar and Pérez y Pérez 2013) generates creative behaviors if they comply with the following characteristics:

Novelty. A behavior is considered novel if it did not exist explicitly in the agent's initial knowledge base.

Usefulness. A behavior is considered useful if it serves as a basis for the construction of new knowledge that eventually leads the agent to acquire new skills characteristic of its next developmental stage. For example, those driving it from behaviors characteristic of the second substage of the sensorimotor period (behaviors based on the body) to behaviors characteristic of the third substage (behaviors involving consequences on external objects).

Emergence. Based on Steels's (2014) definition, we propose to consider a behavior as emergent if its origin cannot be directly traced back to the system's components, but rather originates in the way such components interact.

Motivations. Amabile (1983, 1999) distinguished between two types of creativity: intrinsically and extrinsically motivated creativity.
Intrinsic motivation refers to behavior that is driven by internal rewards, while extrinsic motivation is focused on external reward, recognition, or the avoidance of punishment. In this article we propose that a behaviour an agent develops should be considered creative only if it resulted from an intrinsic and/or extrinsic motivation.

Adaptation to the environment. The ability to adapt to our environment has traditionally been seen (perhaps since Darwin) as a necessary condition for really creative behavior (Runco 2007, p. 398). We therefore propose that a behavior an agent develops should be considered creative only if it resulted from the agent's process of adaptation to its environment.

Case Study

In order to illustrate the application of the evaluation criteria we propose, we assessed the agent presented in (Aguilar and Pérez y Pérez 2013).

Brief description of the agent

The agent lives in a 3D virtual environment with which it interacts. It can lift its head, move it down, and turn left and right, as well as open and close its hand. It has a visual and a tactile sensor. The visual sensor is implemented as a virtual camera with a field of vision of 60 degrees. Its field of vision is divided into the nine areas shown in Figure 1b. It implements five main cognitive capabilities: 1) it can see and touch its world; 2) it simulates an attention process; 3) it simulates affective responses of pleasure and displeasure (represented by variables with value -1 for displeasure and +1 or +2 for two intensities of pleasure), emotional states of interest, surprise and boredom, and an intrinsic motivation of cognitive curiosity (represented by boolean variables with value true when the agent shows such a state or motivation); 4) it has a memory where it stores its knowledge of how to interact with its world, in structures called schemas, of which there are two types: basic schemas representing default or innate behaviors (defined as two-part structures consisting of a context and an action), and developed schemas representing behaviors created by the agent as it interacts with its world (defined as three-part structures consisting of a context, an action, and a set of expectations); and 5) it simulates an adaptation process inspired by Piaget's theory.

Figure 1: The agent and its virtual world. (a) The agent in its virtual world; (b) the division of the agent's field of vision.

The agent interacts with its environment: 1) by sensing its world, 2) by choosing one of the sensed objects as its center of attention, 3) by choosing what action to carry out, and 4) by executing the chosen action. Steps 1 to 4 are called the perception-action cycle. Its central component is its adaptation module, called Dev E-R (Developmental Engagement-Reflection), implemented with a new, extended version of the Engagement-Reflection computational model of the creative process (Pérez y Pérez and Sharples 2001).

Figure 2: Initial basic schemas. They represent the initial behaviors the agent knows for interacting with its world. Basic Schema1 represents the tendency to preserve a pleasant stimulus, and Basic Schema2 represents the tendency to perform a groping movement to get a pleasant stimulus back when it disappears.

Dev E-R simulates the assimilation process by searching memory for schemas representing contexts similar to the currently perceived situation (which is defined in terms of the agent's current affective responses, emotional states and motivations).
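To make the schema machinery concrete, here is a minimal Python sketch of the two schema types and of the assimilation step just described: searching memory for a schema whose stored context matches the currently perceived situation. The encoding of contexts, the exact-match rule, and all names are our own illustrative assumptions, not the actual Dev E-R implementation.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Context:
    # A perceived situation: affective responses, emotional states,
    # and motivations, as described above (illustrative encoding).
    pleasure: int             # -1 displeasure; +1 or +2 pleasure
    interest: bool = False
    surprise: bool = False
    boredom: bool = False
    curiosity: bool = False   # intrinsic motivation of cognitive curiosity

@dataclass
class BasicSchema:            # innate behavior: context -> action
    context: Context
    action: str

@dataclass
class DevelopedSchema(BasicSchema):
    # Built by the agent; adds a set of expectations to context and action.
    expectations: list = field(default_factory=list)

def assimilate(memory, perceived):
    """Assimilation (hypothetical matching rule): return a schema whose
    context matches the perceived situation; None signals a conflict
    that accommodation must then resolve."""
    for schema in memory:
        if schema.context == perceived:
            return schema
    return None

Accommodation, described next, is what happens when assimilate finds no match: the schema set itself is revised.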
On the other hand, Dev E-R models the accommodation process by creating new schemas and/or modifying existing ones as a result of dealing with new situations in the world. The creation and modification of schemas takes place by means of one of two methods: generalization or differentiation. In this way, the agent interacts with its virtual world, assimilating and accommodating its knowledge, until it manages to reach a state of cognitive equilibrium; that is, until it has not needed to modify its schemas during the last NC cycles, because they allow it to interact with its environment satisfactorily. When the agent reaches a cognitive equilibrium state, it is able to face new situations by partially reusing its knowledge from past experiences. This may cause new schemas to be built, making the agent enter a cognitive disequilibrium state again. In this way the agent moves between equilibrium and disequilibrium states, until it stops its development because it maintains equilibrium for a certain number of cycles. It is then that its execution ends.

Testing

In Aguilar and Pérez y Pérez (2013) it was reported that the agent interacted with the environment shown in Figure 1a for 9,000 cycles. In this world there were balls of different colors and sizes, moving from left to right and downwards, independently. Nevertheless, they never made contact with the agent's hand. It was also reported that the agent was initialized with the two basic schemas representing behaviors characteristic of the first substage of the sensorimotor period (shown in Figure 2).

Novelty

At the beginning of the agent's execution, it constantly lost the objects of its interest. This was due to the fact that it could only use its basic predefined behaviors to interact with its world. When this happened, the behavior it showed was a random movement of its head (resulting from the use of its Basic Schema2). Nevertheless, after letting the agent interact with its environment for 9,000 cycles, it had built 13 new schemas that did not exist in its initial knowledge base. The first seven, called schemas type 1, represented behaviors meant to recover the objects of interest it had lost in different positions of its field of vision. For example, if it lost an object on its right, then its schema indicated turning its head right, generating the expectation of recovery. The next six schemas, called schemas type 2, represented behaviors meant to keep the objects of interest within its field of vision, causing the number of objects it lost to progressively decrease and even reach zero. The 13 schemas the agent built differ in structure and contents; and, more importantly, they represent behaviors different from those it was initialized with. It is also important to note that the seven schemas of the first type are very similar among themselves, since all of them represent recovery behaviors for lost objects of interest. Similarly, the six schemas of the second type are very similar among themselves, since all of them represent how to keep the objects of interest within the field of vision. We can therefore conclude that the agent built two groups of schemas, schemas type 1 and schemas type 2, that represent behaviors entirely novel to the agent (recovering and keeping objects of interest), and within those groups it built 7 and 6 schemas, respectively, representing behaviors that are less novel among themselves.
Usefulness

In order to evaluate the usefulness of the behaviors the agent developed, let us recall that it was initialized with behaviors characteristic of the first substage of the sensorimotor period. From there, through interaction with its environment, it built its first seven schemas, related to recovering objects of interest (schemas type 1). These were later used as a basis, being partially reused in the new situations it faced, to build the following six schemas, related to keeping objects of interest (schemas type 2). The use of these 13 schemas together caused the agent to show the new behaviors of following the objects of interest visually (by moving its head) and of keeping them centered within its field of vision most of the time. These two new abilities the agent acquired were described by Piaget as two of the main vision-related abilities children develop during the second substage of the sensorimotor period. Therefore, the schemas the agent developed are considered useful, since they allowed it to go from predefined or innate behaviors (typical of the first substage of the sensorimotor period) to behaviors based on its body (typical of the second substage of the sensorimotor period).

Emergence

The construction of the different behaviors the agent develops depends on various factors, among them: 1) the characteristics of its environment, 2) its physical characteristics, and 3) its current knowledge. For example, regarding the first point, if the agent lived in a world in which it always had to lift its head in order to recover objects, it would build schemas representing that particular characteristic of its environment. Similarly, if the agent were endowed with the ability to touch but not to see its world (that is, if it were blind), it would develop behaviors different from those it developed with vision (using exactly the same adaptation processes in both cases), except that the new abilities would be related to touching. For example, it would learn how to keep the object of its interest grasped. Also, regarding the third point, the behaviors the agent develops depend on its current knowledge. For example, schemas type 2 require the prior construction of schemas type 1, since it is not until the latter exist and are stable that schemas type 2 can originate. This is because the agent uses its knowledge of how to recover objects of interest in different positions in order to learn how to keep them within its field of vision. We can therefore conclude that the behaviors created by the agent emerged as a result of the way the different components of the system interacted among themselves, since the new behaviors were not set up by default, and also because they are contextual; that is, they depend on the agent's interaction with the environment, on its sensory abilities, and on its current knowledge.

Motivations

One of the core features of the agent is that it simulates affective responses, emotional states, and an intrinsic motivation of cognitive curiosity that push it to act. In particular, regarding the development of new schemas, they are created, modified or eliminated as a result of the triggering of: 1) an emotional state of surprise (for example, caused by the unexpected recovery of an object of interest), or 2) a cognitive curiosity motivation (generated when dealing with unknown situations that contradict its current knowledge of the world).
So, in this model, the emotional state of surprise and the intrinsic motivation of cognitive curiosity trigger the need to modify and build new schemas.

Adaptation to the environment

The schemas the agent developed originated as a consequence of its facing new, unknown situations, to which it reacted either by assimilating the new situation into its acquired knowledge (by means of the process of searching memory for a schema representing a situation similar to its current context) or by accommodating its knowledge so that it fitted the new experience (by creating a new schema or by differentiating, generalizing or deleting an existing one). Therefore, the construction of new schemas took place as a result of complementary assimilation and accommodation processes. In other words, they originated from a process of adaptation of the agent to its world.

Conclusions

In this article we propose five criteria to evaluate whether the behaviors created by agents that simulate cognitive development can be considered creative. The criteria we propose are: novelty, usefulness and emergence; additionally, such behaviors are required to have originated as a result of intrinsic and/or extrinsic motivations, as well as from the agent's adaptation to its environment. The results of the evaluation of the agent in the case study showed that, under these criteria, the first behaviors it develops (learning how to follow objects of its interest visually and how to keep them centered within its field of vision) are considered creative. These results represent our first approach to the evaluation of this kind of agent. There is still much more research to do on this matter.

Acknowledgements

This research was sponsored by the National Council of Science and Technology in México (CONACYT), project number 181561 and doctoral scholarship number 239740.

2014_38 !2014 COINVENT: Towards a Computational Concept Invention Theory Marco Schorlemmer,1 Alan Smaill,2 Kai-Uwe Kühnberger,3 Oliver Kutz,4 Simon Colton,5 Emilios Cambouropoulos6 and Alison Pease7 1Artificial Intelligence Research Institute, IIIA-CSIC, Spain 2School of Informatics, The University of Edinburgh, UK 3Institute of Cognitive Science, University of Osnabrück, Germany 4Institute of Knowledge and Language Engineering, Otto-von-Guericke University Magdeburg, Germany 5Department of Computing, Goldsmiths, University of London, UK 6School of Music Studies, Aristotle University of Thessaloniki, Greece 7School of Computing, University of Dundee, UK

Abstract

We aim to develop a computationally feasible, cognitively-inspired, formal model of concept invention, drawing on Fauconnier and Turner's theory of conceptual blending, and grounding it on a sound mathematical theory of concepts. Conceptual blending, although successfully applied to describing combinational creativity in a varied number of fields, has barely been used at all for implementing creative computational systems, mainly due to the lack of sufficiently precise mathematical characterisations thereof. The model we will define will be based on Goguen's proposal of a Unified Concept Theory, and will draw on interdisciplinary research results from cognitive science, artificial intelligence, formal methods and computational creativity. To validate our model, we will implement a proof of concept of an autonomous computational creative system that will be evaluated in two testbed scenarios: mathematical reasoning and melodic harmonisation.
We envisage that the results of this project will be significant for gaining a deeper scientific understanding of creativity, for fostering the synergy between understanding and enhancing human creativity, and for developing new technologies for autonomous creative systems.

Introduction

Of the three forms of creativity put forward in (Boden 1990) (combinational, exploratory, and transformational), the most difficult to capture computationally turned out to be the combinational type (Boden 2009), i.e., when novel ideas (concepts, theories, solutions, works of art) are produced through unfamiliar combinations of familiar ideas. Although generating novel ideas, or concepts, by combining old ones is not complicated in principle, the difficulty lies in doing this in a computationally tractable way, and in being able to recognise the value of newly invented concepts for better understanding a certain domain, even without it being specifically sought, i.e., by serendipity (Boden 1990, p. 234), (Pease et al. 2013). To address this problem, we will concentrate on an important development that has significantly influenced the current understanding of the general cognitive principles operating during creative thinking, namely Fauconnier and Turner's theory of conceptual blending, also known as conceptual integration (Fauconnier and Turner 1998). Fauconnier and Turner proposed conceptual blending as the fundamental cognitive operation underlying much of everyday thought and language, and modelled it as a process by which people subconsciously combine particular elements, and their relations, of originally separate mental spaces into a unified space, in which new elements and relations emerge, and new inferences can be drawn. For instance, a houseboat or a boathouse are not simply the intersection of the concepts of house and boat. Instead, the concepts houseboat and boathouse selectively integrate different aspects of the source concepts in order to produce two new concepts, each with its own distinct internal structure (see Figure 1 for the houseboat blend).

Figure 1: Houseboat blend, adapted from (Goguen and Harrell 2010).

The cognitive, psychological and neural basis of conceptual blending has been extensively studied (Fauconnier and Turner 2003; Gibbs, Jr. 2000; Baron and Osherson 2011). Moreover, Fauconnier and Turner's theory has been successfully applied for describing existing blends of ideas and concepts in a varied number of fields, such as linguistics, music theory, poetics, mathematics, theory of art, political science, discourse analysis, philosophy, anthropology, and the study of gesture and of material culture (Turner 2012). However, the theory has hardly been used for implementing creative computational systems. Indeed, since Fauconnier and Turner did not aim at computer models of cognition, they did not develop their theory in sufficient detail for conceptual blending to be captured algorithmically. Consequently, the theory is silent on issues that are relevant if conceptual blending is to be used as a mechanism for designing creative systems: it does not specify how input spaces are retrieved; or which elements and relations of these spaces are to be projected into the blended space; or how these elements and relations are to be further combined; or how new elements and relations emerge; or how this new structure is further used in creative thinking (i.e., how the blend is run). Conceptual blending theory does not specify how novel blends are constructed.
Nevertheless, a number of researchers in the field of computational creativity have recognised the potential value of Fauconnier and Turner's theory for guiding the implementation of creative systems, and some computational accounts of conceptual blending have already been proposed (Veale and O'Donoghue 2000; Pereira 2007; Goguen and Harrell 2010; Thagard and Stewart 2011). They attempt to concretise some of Fauconnier and Turner's insights, and the resulting systems have shown interesting and promising results in creative domains such as interface design, narrative style, poetry generation, and visual patterns. All of these accounts, however, are customised realisations of conceptual blending, which are strongly dependent on hand-crafted representations of domain-specific knowledge, and are limited to very specific forms of blending. The major obstacle to a general account of computational conceptual blending is currently the lack of a mathematically precise theory suitable for the rigorous development of creative systems based on conceptual blending.

A Formal Model of Conceptual Blending

To address the relative lack of study of the computational potential of conceptual blending, in the FP7-ICT project COINVENT (www.coinvent-project.eu) we are setting out to:

1. develop a novel, computationally feasible, formal model of conceptual blending that is sufficiently precise to capture the fundamental insights of Fauconnier and Turner's theory, while being general enough to address the syntactic and semantic heterogeneity of knowledge representations;

2. gain a deeper understanding of conceptual blending and its potential role in computational creativity, by linking this novel formal model to relevant, cognitively-inspired computational models, such as analogical and case-based reasoning, induction, semantic alignment, and coherence-based reasoning;

3. design a generic, creative computational system based on this novel formal model, capable of serendipitous invention and manipulation of novel abstract concepts, thus enhancing the creativity of humans when this system is instantiated to particular application domains for which conceptual blending is a core process of creative thinking;

4. validate the model and its computational realisation in two representative working domains: mathematics and music.

The only attempt so far to provide a general and mathematically precise account of conceptual blending has been put forward by Goguen, initially as part of algebraic semiotics (Goguen 1999), and later in the context of a wider theory of concepts: Unified Concept Theory (Goguen 2005a). He has also shown its aptness for formalising information integration (Goguen 2005b) and reasoning about space and time (Goguen 2006). Goguen's intuition was that conceptual blending could be modelled with the colimit construct of category theory, a field of abstract mathematics that has provided deep insights in mathematical logic and computer science, and has often been used as a guide for finding good definitions and research directions. In his Categorical Manifesto, he intuitively describes this construct as follows: 'Given a category of widgets, the operation of putting a system of widgets together to form some super-widget corresponds to taking the colimit of the diagram of widgets that shows how to interconnect them.' (Goguen 1991)
To model conceptual blending, we would start with a collection of input spaces (Goguen defines them as semiotic spaces of signs and their relations) and of structure-preserving mappings between them, capturing how the structure of these spaces is related. The colimit would be the optimal way to put these spaces together into one single space, taking into account how they were originally connected by structure-preserving mappings. Here 'optimal' means that the colimit includes all the structure of the input spaces, but no more, and that it makes no unnecessary fusion of structure. An important property of colimits is that they are unique up to isomorphism. But since conceptual blending does not in general operate under this notion of optimality, Goguen suggested extending this idea by including a notion of quality of structure-preserving mappings between mental spaces, to cope with the idea of partial mappings that selectively map only certain structure into the blend, and to model conceptual blends as colimits in this extended setting. As it stands, Goguen's account is still very abstract and lacks concrete algorithmic descriptions. There are several reasons, though, that make it an appropriate candidate theory on which to ground the formal model we are aiming at:

It is an important contribution towards the unification of several formal theories of concepts, including the geometrical conceptual spaces of (Gärdenfors 2004), the symbolic conceptual spaces of (Fauconnier 1994), the information flow of (Barwise and Seligman 1997), the formal concept analysis of (Ganter and Wille 1999), and the lattice of theories of (Sowa 2000). This makes it possible to potentially draw from existing algorithms that have already been developed within each of these frameworks.

It covers any formal logic, even multiple logics, thus supporting the integration and processing of concepts under various forms of syntactic and semantic heterogeneity. This is important, since we cannot assume conceptual spaces represented in a homogeneous manner across diverse domains. Current tools for heterogeneous specifications, such as Hets (Mossakowski, Maeder, and Lüttich 2007), allow parsing, static analysis and proof management incorporating various provers and different specification languages.

By developing a formal model of conceptual blending building on Goguen's initial account, we aim to provide general principles that will guide the design of computer systems capable of inventing new, higher-level, more abstract concepts and representations out of existing, more concrete concepts and interactions with the environment, and to do so based on the sound reuse and exploitation of existing computational implementations of closely related models, such as those for analogical and metaphorical reasoning (Falkenhainer, Forbus, and Gentner 1989), semantic integration (Schorlemmer and Kalfoglou 2008), or cognitive coherence (Thagard 2000). With such a formal but computationally feasible model, we will ultimately bridge the existing gap between the theoretical foundations of conceptual blending and their computational realisations. This, in turn, will contribute to the much-needed foundations for the design of creative systems that effectively enhance both artificial and human creativity when deployed in the kinds of genuinely creative tasks underlying the sort of abstract reasoning common to many branches of the sciences and the arts.
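As a toy illustration of the colimit idea, the following Python sketch computes a pushout, the colimit of two input spaces connected through a shared generic space, in the simplest possible setting: spaces are plain sets of signs and the mappings are functions between them. This deliberately flattens away all the logical structure that Goguen's account carries; the representation and the houseboat example below are our own simplification.

def pushout(generic, space1, space2, f1, f2):
    """Pushout of  space1 <- generic -> space2  in the category of sets:
    the disjoint union of space1 and space2, with f1(g) glued to f2(g)
    for every sign g in the generic space (computed with union-find)."""
    elems = [(1, x) for x in space1] + [(2, x) for x in space2]
    parent = {e: e for e in elems}

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]   # path compression
            e = parent[e]
        return e

    for g in generic:                       # glue images of each generic sign
        parent[find((1, f1[g]))] = find((2, f2[g]))

    classes = {}
    for e in elems:                         # group elements by representative
        classes.setdefault(find(e), set()).add(e)
    return list(classes.values())

# A crudely flattened houseboat blend: the generic space says both
# concepts have an occupant and a supporting medium.
generic = {"occupant", "medium"}
house = {"resident", "land", "roof"}
boat = {"passenger", "water", "hull"}
f1 = {"occupant": "resident", "medium": "land"}    # generic -> house
f2 = {"occupant": "passenger", "medium": "water"}  # generic -> boat
print(pushout(generic, house, boat, f1, f2))
# Four blend elements: {resident~passenger}, {land~water}, {roof}, {hull}

Note that this plain pushout fuses exactly what the generic space tells it to fuse; Goguen's point, echoed above, is that realistic blends need partial, quality-ranked mappings rather than this total gluing.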
Working Domains

To explore the genericity of the proposed formal model of concept invention and of the computational realisation we are after, we will focus on two representative working domains of creativity: mathematics and music, 'the most sharply contrasted fields of intellectual activity which one can discover, and yet bound together, supporting one another as if they would demonstrate the hidden bond which draws together all activities of our mind, and which also in the revelations of artistic genius leads us to surmise unconscious expressions of a mysteriously active intelligence', as noted wisely in (von Helmholtz 1885). In mathematics, the creative act of providing novel definitions, conjectures, theorems, examples, counter-examples, or proofs can be seen as a particular case of concept invention (Montano-Rivas et al. 2012). In music, concept invention may apply to the generation of new melodies, harmonies, rhythms, or counterpoints (and their combinations) (Mazzola, Park, and Thalmann 2011), and to the integration of musical and textual spaces to achieve novel musical metaphors (Zbikowski 2002). The following examples illustrate the sort of creative activity we want to address with our formal model and its computational realisation.

Example 1. The historical example of Hamilton's discovery of the quaternions is well documented (e.g., (Hersh 2011)), so much is known about the intermediate steps involved in the discovery. It can be treated by our approach by taking as starting point the unproblematic blend between the algebraic structure of the complex numbers as a field (with addition, multiplication and division) and the geometric structure of the 2-dimensional real plane as a real vector space (with addition and scalar multiplication). In our terms, Hamilton wanted to find a similar blend involving an algebraic structure corresponding to 3-dimensional real vector space. He ended up, however, finding a blend involving a 4-dimensional real vector space and the algebra of the quaternions, which involves leaving out the commutativity of multiplication from the algebraic theory. The diagram of Figure 2, where the arrows indicate morphisms in Unified Concept Theory, shows the characteristic features of blending: there are two given concepts, commutative fields and (4-dimensional) real vector spaces; a common concept, structurally similar to some aspects of the given concepts, is identified (Common); the initial concepts are blended, respecting the common aspects, and an initially inconsistent blended concept of quaternions is obtained; this is modified by dropping an initial feature (commutativity of multiplication), to obtain a consistent concept.

Figure 2: Blend for quaternions.

By deploying COINVENT-based technology in this working domain, our ultimate goal is to transcend the capabilities of current state-of-the-art automated reasoning support tools, which as of today are reluctantly accepted by their users and perceived more as an obstacle to than a facilitator of creative thinking. The choice of the domain of mathematics is further supported by the following reasons:

Evidence from cognitive science, education, and the history of mathematics suggests that the hierarchy of mathematical concepts is grounded in some simple numerical abilities humans have, combined with know-how about physical scenarios of interaction with the environment (Lakatos 1976; Lakoff and Núñez 2000).
This means that by tackling the case of mathematics, we need to address problems concerning the situatedness of agents.

The span of usage of mathematical concepts goes from rather concrete situations (children learning to count how many toys you give them) to the very abstract (as when professional mathematicians do research); see (Lakoff and Núñez 2000; Alexander 2011).

Mathematics allows us to explore the social dimension of concept invention and the forces external to cognition that shape the process of conceptual blending over time, which are crucial in educational and research environments (Lakatos 1976; Goguen 1997).

Currently, there is no cognitive model of the way in which people invent mathematical concepts; there are, to our knowledge, no models of how humans create mathematics. Hence only a few computational creativity systems exist that support creative mathematical thinking, such as (Colton 2002).

Example 2. Devising appropriate chordal harmonisations for melodies derived from non-Western cultures, or even for new creations, could potentially be tackled computationally with our approach. A computational system could autonomously explore different chordal spaces, generating novel harmonic combinations/blends appropriate for the melodies at hand. This could be applied in the design of an interactive compositional tool or computer game where the user inputs a melody (they may sing in a melody) and the automatic harmonisation system interactively produces novel harmonisations that creatively combine harmonic properties from different music idioms. It could also be applied, for instance, to video-game design and programming, by endowing game creations with the capacity to generate new harmonisations on the fly; the creative melodic harmonisation assistant could provide appropriate harmonisations following the mood changes, activity, or gestural patterns emerging as the game unfolds. In Figure 3, a traditional melody is harmonised in radically different ways corresponding to individual harmonic spaces (tonal, modal, atonal). The creative harmonisation assistant may generate such original harmonisations or enable the emergence of new, unpredicted harmonisations stemming from blends between such spaces.

Figure 3: Four different harmonisations of a traditional melody (first four-bar phrase), created by C. Tsougras (Aristotle University of Thessaloniki).

By deploying COINVENT-based technology in this working domain, our ultimate goal is to make software go beyond a mere application of compositional rules, so as to refute the common belief that creativity is separate from the computational processes used in music composition, and that these processes just do uncreative calculations. The choice of the domain of music is further supported by the following reasons:

The conceptual level of music, together with the role of cognitive models such as conceptual blending in musical analysis, has gained increased attention in the field of music theory (Zbikowski 2002).

A substantial body of contemporary research on musical creativity, from the philosophy of computer modelling, through music semiotics, education, performance and neuroscience, to experimental psychology (Deliège and Wiggins 2006; Mazzola, Park, and Thalmann 2011), provides the necessary background for exploring computational creativity in a scientific manner in the domain of music.

Traditional music analysis has weak conceptual power for studying complex constructions.
Formal theories of musical structure and processes, as employed in contemporary computational modelling of music (Anagnostopoulou and Cambouropoulos 2012; Conklin and Anagnostopoulou 2006; Steedman 1996), are considered an adequate tool for computer-aided composition of advanced music.

The language of modern mathematics, whose conceptual character has been stressed by contemporary mathematicians (Lawvere and Schanuel 1997; Boulez and Connes 2011), has been advocated as a way forward in the analysis of its effectiveness in musical creativity (Future and Emerging Technologies 2011).

Musical creativity, particularly musical performance, is ultimately contextualised, situated, and embodied (Goguen 2004). In particular, in musical gesture theory, conceptual blending has been suggested as a powerful model of musical interpretation (Echard 2006).

We believe that the exploration of the domains of mathematics and music should reveal very general principles applicable to other creative domains.

Relevant Prior Research

COINVENT is a collective effort to advance the understanding of creativity through a precise formalisation of an important cognitive model and a concrete computational realisation thereof. We shall do so informed by the main contributions towards a science of creativity (Sternberg 1999) and drawing on several foundational theories that have hitherto largely been pursued independently. During the last decades, scholars and researchers in cognitive linguistics and cognitive psychology have made significant contributions to the understanding of the fundamental role that metaphor and analogy play in cognition (Lakoff and Johnson 1980; Gentner, Holyoak, and Kokinov 2001; Fauconnier and Turner 2003), at the same time that significant evidence has been gathered supporting a philosophy of mind grounded in the embodiment of mind and meaning (Maturana and Varela 1987; Varela, Thompson, and Rosch 1992; Lakoff and Johnson 1999; Johnson 2007). This research has been heavily influenced by the dramatic progress in imaging techniques in the field of neuroscience, such as functional MRI. In parallel, the development of the field of category theory has led to a remarkable unification and simplification of mathematics (Mac Lane 1971; Lawvere and Schanuel 1997), which has helped to reach a deep understanding across fields as different as computer science, mathematical logic, physics, and linguistics. More recently, these techniques have been applied to obtain some preliminary formalisations of conceptual metaphor and blending (Goguen 1999; Old and Priss 2001; Guhe, Smaill, and Pease 2009), by applying techniques such as institution theory (Goguen and Burstall 1992) or information flow theory (Barwise and Seligman 1997), which are based on category theory. Automated reasoning techniques from artificial intelligence, either based on cognitive principles, such as case-based reasoning (Aamodt and Plaza 1994), grounded in the prototype theory of categorisation (Rosch 1973), and reasoning by analogy making (Gentner 1983), or based on formal methods for inductive reasoning, such as anti-unification (Plotkin 1971), will be some of the seed technologies for the computational realisation of our model.
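Since anti-unification is named as a seed technology, a minimal sketch may help fix ideas. Plotkin's (1971) anti-unification returns the least general generalization of two first-order terms: shared structure is kept, and differing subterms are abstracted by variables, with the same pair of subterms always abstracted by the same variable. The Python encoding below is our own; HDTP itself extends this idea to a restricted form of higher-order anti-unification.

from itertools import count

def anti_unify(t1, t2, subst=None, counter=None):
    """Least general generalization of two first-order terms.
    Terms are strings (constants/variables) or tuples (functor, args...)."""
    if subst is None:
        subst, counter = {}, count()
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        # Same functor and arity: keep it, generalize argument-wise.
        return (t1[0],) + tuple(anti_unify(a, b, subst, counter)
                                for a, b in zip(t1[1:], t2[1:]))
    if t1 == t2:
        return t1
    # Differing subterms are abstracted by a shared fresh variable.
    if (t1, t2) not in subst:
        subst[(t1, t2)] = "X%d" % next(counter)
    return subst[(t1, t2)]

# lgg of plus(s(0), x) and plus(s(s(0)), y)  is  plus(s(X0), X1)
print(anti_unify(("plus", ("s", "0"), "x"),
                 ("plus", ("s", ("s", "0")), "y")))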
Some preliminary steps have already been taken, in joint research by some of the consortium members of COINVENT, by taking ideas from Lakatos (Lakatos 1976) and from (Lakoff and Núñez 2000) as starting points and extending the HDTP system (Heuristic-Driven Theory Projection, developed at the University of Osnabrück (Gust, Kühnberger, and Schmid 2006; Schwering et al. 2009) and based on anti-unification) to give a computational account of how these processes can give rise to basic concepts of arithmetic (Guhe et al. 2011). Another set of important seed technologies for COINVENT originates in research carried out originally at the University of Bremen, and now at the University of Magdeburg, and addresses the knowledge representation and reasoning layer of the project. This includes the distributed ontology language DOL, currently standardised within the Object Management Group OMG (www.ontoiop.org), a major international effort with over 40 experts involved worldwide, which supports an extensible number of logical languages, major modularisation and logical structuring techniques, and, in particular, the specification of basic blending diagrams as formalised by Joseph Goguen. Moreover, the Hets system (see http://www.informatik.uni-bremen.de/agbkb/forschung/formal_methods/CoFI/hets/index_e.htm) will serve as a central, and extensible, reasoning infrastructure with which other tools developed within COINVENT will interface. Lastly, the technology developed in the OntoHub.org project will allow the building of a dedicated semantic repository for formalised concepts in the mathematics and music domains, supporting heterogeneous specifications in a semantically-backed logical context, and providing interfaces for sharing, browsing, and the integration of reasoning services. This repository will be hosted at conceptportal.org.

In addition, the consortium members of COINVENT have substantial experience in the development and application of the above foundational theories and seed technologies in a wide variety of fields, in computational creativity and other related areas: by studying the combination of case solutions and knowledge transfer in case-based reasoning (CBR) (Ontañón and Plaza 2010; Ontañón and Plaza 2012) and its application to computational creativity (Ontañón and Plaza 2012; Arcos 2012); by providing formal foundations for distributed reasoning with heterogeneous logics and their representations (Mossakowski, Maeder, and Lüttich 2007), and by applying them to achieve semantic alignment and integration (Schorlemmer and Kalfoglou 2008; Kutz, Mossakowski, and Lücke 2010; Kalfoglou and Schorlemmer 2010; Kutz et al. 2012); by proposing novel architectures for coherence-driven, cognitively-inspired (BDI) agents (Joseph et al. 2010) and computational frameworks for multi-agent interaction-based agreement on concepts and their semantics (Ontañón and Plaza 2010; Atencia and Schorlemmer 2012); and by formalising Lakatos-style automated theorem proving (Colton and Pease 2004) and mathematical theory formation (Colton 2002).

Expected Contributions

We expect that a mathematically precise theory, such as the one we are proposing in the context of the COINVENT project, will lead to the following contributions.

Theory and Technology. Computational implementations of cognitive and psychological models serve, in general, two main purposes:
1. Computational implementations are tools for exploring the implications of the ideas embedded in a particular model, beyond the limits of human thinking. Thus, they are vehicles of further scientific inquiry into the cognitive and psychological processes that the model seeks to describe. In this sense, the formal model coming out of the COINVENT project, together with its computational realisation, will be an important tool for exploring the implications of Fauconnier and Turner's theory of conceptual blending for understanding creative thinking. One such implication is the role concept creation and invention play in serendipitous reasoning, i.e., in recognising the value of newly invented concepts not only for better understanding a certain domain, but even for advancing the understanding of a previously unidentified problem that was initially not the concern of inquiry. If our model advances the understanding of implications such as how serendipity might work, cognitive science and psychology could take these results to explore serendipitous reasoning from a cognitive and psychological point of view. This alone would already be an important step forward in developing a science of creativity. By grounding our research in Goguen's proposal for a Unified Concept Theory, we will build upon the deep understanding gained by relating different approaches to the notion of concept invention, and do so on a firm mathematical foundation that is consequently of great help in providing precise descriptions of what can and should be implemented in a computational system.

2. Computational implementations make a general model that is usually stated in abstract terms more concrete, facilitating the study of its formal and computational properties, and guiding the design and implementation of computer systems that attempt to display the cognitive capabilities captured in the model. Hence, they provide direct engineering advances. We will demonstrate these advances through two prototype implementations of autonomous creative systems that display creative activity through the accomplishment of concept creation and invention in the domains of mathematics and music. Ideally, these systems will be developed with the following properties: an ability to form abstractions over both semantic and syntactic aspects of a domain; an ability to form new representations, by conceptual blending; an ability to revise representations on the basis of new concrete information that fits badly with the current conceptualisation (using ideas from Lakatos); and heuristically guided algorithms to solve problems, based on combinations of the above abilities. If our intuitions about the power of conceptual blending to boost the capabilities of autonomous creative systems are right and our project is successful, our contribution could go even beyond that direction, by developing novel ways to use methodologies from cognitive science in systems engineering, and vice versa.

Working Domains. In the domain of mathematics, we plan to build a computational system that aids mathematicians by supporting their reasoning at a conceptual level and in their creative work, for example by proposing potentially interesting novel definitions, theories, and conjectures that are motivated by conceptual (not only formal) reasons, and by evaluating the potential of such ideas when proposed by the mathematician.
Not only mathematicians, but also others engaged in similar sorts of reasoning when developing new concepts and theories, can benefit greatly from processes that build new conceptualisations from combinations of existing conceptualisations and particular examples and counter-examples. The particular system we propose as our proof of concept would be the first of its kind in mathematics, as it goes well beyond what proof assistants do. More importantly, if, as we intend, the system turns out to be judged by mathematicians as attractive and even potentially useful in their work of conceptually advancing mathematics, this would open the door to something not seen before. The system resulting from this project will, therefore, be a showcase of how systems like proof assistants can be improved so that they are useful to mathematicians. In the domain of music, we plan to build a pioneering computational system that aids musicians in composition, namely in melodic harmonisation, and that allows the exploration of novel, uncharted conceptual territories: for example, proposing new harmonic concepts emerging from learned harmonic spaces, examples and counter-examples; or suggesting new harmonic conceptualisations emerging from combinations/blends of different harmonic spaces that give rise to potentially interesting new harmonies. Computer-aided compositional systems are often accused of merely replicating/mimicking given music styles and of being confined to the initial musical space that has been explicitly modelled in the system. The creativity of such systems is considered rather limited, as the system cannot supersede its built-in concepts and cannot generate new, unforeseen concepts. The particular system we propose as our proof of concept would be the first of its kind to go well beyond what current melodic harmonisation systems are capable of doing. It would open the way, more generally, to music/art creativity assistance tools that enable people to explore the borders of their artistic creativity by giving them new, original ideas for further exploration.

Measures of Creativity. The computational creativity community needs concrete evaluation measures to enable us to make objective, falsifiable claims about the progress made from one version of a program to another, or to compare and contrast different software systems for the same creative task. There are currently three main models of evaluation (Ritchie 2007; Colton, Pease, and Charnley 2011; Jordanous 2011), but they are still rarely used, and there are problems with each. We will extend these measures: for instance, serendipity, which is an important aspect of human creativity, currently does not feature in any of the evaluation models. We will formulate ways of evaluating this and other under-represented notions. We will also contribute to the methodology of computational creativity by applying all three models, as well as the new measures we develop, to our system and to other creative systems. One of the best ways to evaluate and improve measures of creativity is to apply them in a reflective manner. We will furthermore evaluate each model of evaluation according to principles in the philosophy of science, and survey other experts for ease of use and adherence to intuitions about creativity.

Acknowledgements

The project COINVENT acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open Grant number 611553.
2014_39 !2014 Creativity in Conceptual Spaces Antonio Chella1, Salvatore Gaglio1,3, Gianluigi Oliveri2,3, Agnese Augello3, Giovanni Pilato3 (1) DICGIM, University of Palermo; (2) Dipartimento di Scienze Umanistiche, University of Palermo; (3) ICAR, Italian National Research Council; Viale delle Scienze, Palermo, Italy (antonio.chella,salvatore.gaglio)@unipa.it gianluigi.oliveri@unipa.it (augello,pilato)@pa.icar.cnr.it Abstract The main aim of this paper is to contribute to what in the last few years has become known as computational creativity. This will be done by showing the relevance of a particular mathematical representation of Gärdenfors's conceptual spaces to the problem of modelling a phenomenon which plays a central role in producing novel and fruitful representations of perceptual patterns: analogy. Introduction There is an old tradition going back to Plato for which the phenomena which fall under the concept of creativity are those associated with the acquisition and mastery of some kind of craft (techne), rather than with random activity and aimless chance. According to this way of thinking, there is no reason to believe that an unschooled little ant that happens to draw in its course on the sand the first page of the score of the St. Matthew Passion is engaged in a creative activity. Indeed, for the supporters of this tradition, including the later Wittgenstein, creativity presupposes the existence of a high-level linguistic competence typical of human beings. Here, of course, painting and music making (when seen as profoundly different from doodling or from casual humming) are considered to be activities involving the use of some kind of articulated visual or auditory vehicles which give expression to feelings, emotions, etc., articulated visual or auditory vehicles which come with a syntax. If we were successful in our attempt to model analogy within the particular mathematical representation of Gärdenfors's conceptual spaces we have chosen, this, besides scoring a point in favour of the computational creativity research programme (Cardoso, Veale, and Wiggins 2009), (Colton and Wiggins 2012), would also have important consequences with regard to the tenability of the old traditional view of creativity we mentioned above. For, since Gärdenfors's conceptual spaces, as we shall see in what follows, are placed at the sub-linguistic level of the cognitive architecture of a cognitive agent (CA), there would be at least one phenomenon intuitively belonging to creativity which could be represented independently of language. After a section dedicated to a brief survey of some of the central contributions to the study of the connection between analogical thinking and computation, the paper proceeds to an explanation of how analogy is related to creativity. The article then develops by means of an illustration of the cognitive architecture of our CA, in which the nature and function of Gärdenfors's conceptual spaces is made explicit. A characterization of two conceptual spaces present in the library of our CA (the visual and the music conceptual spaces) is then offered, and visual analogues of music patterns are examined. The theoretical points made in the paper are, eventually, illustrated in the discussion of a case study. Analogical thinking and computation Human cognition is deeply involved with analogy-making processes.
Analogical capabilities make it possible to perceive clouds as resembling animals, to solve problems through the identification of similarities with previously solved problems, to understand metaphors, to communicate emotions, to learn, etc. (Kokinov and French 2006), (Holyoak et al. 2001). Analogical reasoning is ordinarily used to transfer structures, relational properties, etc. from a source domain to a target domain, and is clearly involved in that human ability which consists in producing generalizations. Many models of analogical thinking are present in the literature. They are characterized by: (1) the way of representing the knowledge on which the analogical capability is based, (2) the processes involved in realizing the analogical relation, and (3) the manner in which the analogical transfer is fulfilled (Krumnack, Kühnberger, and Besold 2013). A well-known class of computational models for analogy-making are those based on Gentner's (1983) Structure Mapping Theory (SMT). This theory was the first to focus on the role of the structural similarity existing between source and target domains, a structural similarity which is generated by common systems of relations obtaining between objects in the respective domains. The structure mapping theory uses graphs to represent the domains and computes analogical relations by identifying maximal matching sub-graphs (Krumnack, Kühnberger, and Besold 2013). Other models are based on a connectionist approach: for example, we can mention here the Structured Tensor Analogical Reasoning (STAR) model (Halford et al. 1994) and its evolution STAR-2 (Wilson et al. 2001), which provide mechanisms for computing analogies using representations based on the mathematics of tensor products (Holyoak et al. 2001), and the framework for Learning and Inference with Schemas and Analogies (LISA) (Hummel and Holyoak 1996), which exploits temporally synchronized activations in a neural network to identify a mapping between source and target elements. In 1989 Keith Holyoak and Paul Thagard (Holyoak and Thagard 1989) proposed a theory of analogical mapping based upon interacting structural, semantic, and pragmatic constraints that have to be satisfied at the same time, implementing the theory as an emergent process of activation states of neuron-like objects. According to (French 1995), metaphorical language, analogy making, and counterfactualization are all products of the mind's ability to perform slippage (i.e., the replacement of one concept in the description of some situation by another related one) fluidly. All analogies involve some degree of conceptual slippage: under some pressure, concepts slip into related concepts. The notion of conceptual slippage underlies Copycat, a model of analogy making developed in 1988 by Douglas Hofstadter et al. (Hofstadter and Mitchell 1994). In (Kazjon Grace and Saunders 2012), a computational model of associations, based on an interpretation-driven framework, was put forward and applied to the domain of real-world ornamental designs, where an association is understood in terms of the process of constructing new relationships between different ideas, objects, or situations. In (Grace, Saunders, and Gero 2008) a computational model for the creation of original associations was presented. The approach is based on the concept of interpretation, which is defined as 'a perspective on the meaning of an object; a particular way of looking at an object' ((Grace, Saunders, and Gero 2008), Section 2, page 2), and acts on conceptual spaces, where concepts are defined as regions in that space.
In this context the authors represent the interpretation process as a transformation applied to the conceptual space from which feature-based representations are generated. The model tries to identify relationships that can be built between a source object and a target object. A new association is constructed when the transformations applied to these objects contribute to the emergence of some shared features which were not present before the application of the transformations. Creativity and Analogy It is intuitively correct to say that the use of a stick made by a bird to catch a larva in the bark of a tree is creative, as is the writing of a poem or the introduction of a new mathematical concept. Creativity, indeed, covers a large variety of phenomena which also differ from one another in their degree of abstractness, i.e., the creativity of the hunting technique of the bird is much less abstract than that displayed by Beethoven in the writing of the Fifth Symphony. It is not our intention in this paper even to attempt to give a definition of creativity. What we want to do here is simply to focus on the concept of analogy (the relation in which A is to B is the same as the relation in which α is to β), which is at the heart of much of what we can correctly describe as creative activity. A traditional model of analogical thinking is provided by the concept of proportion: A : B = α : β, where A and B are entities homogeneous to each other, as α and β are homogeneous to each other, but A and B are non-homogeneous to α and β. Analogical thinking allows the emergence/recognition of a pattern in a certain environment E which is similar to/the same as that which has already emerged/been recognized in another environment E′. Much of the work to be done in what follows will consist in rendering mathematically rigorous what we have called pattern, environment E, analogy as similarity of patterns given in different environments, identity of patterns given in different environments, etc. Let us say that patterns are here understood as relational entities (structures) defined on a given domain. (For the special case represented by mathematical patterns see (Oliveri 1997), (Oliveri 2007), ch. 5, and (Oliveri 2012).) And since a necessary condition for the emergence/recognition of a pattern is the presence of a system of representation, we are going to identify the environment E with such a system, and choose as a model of such a system of representation Gärdenfors's conceptual spaces. Moreover, two patterns π1 and π2 given in two different conceptual spaces V1 and V2 are said to be analogous to one another if there is a homomorphism between π1 and π2, whereas they are said to exemplify the same pattern if there is an isomorphism between π1 and π2. A cognitive architecture based on Conceptual Spaces The introduction of a cognitive architecture for an artificial agent implies the definition of a conceptual representation model. Conceptual spaces (CS), employed extensively in the last few years (Chella, Frixione, and Gaglio 1997) (De Paola et al. 2009) (Jung, Menon, and Arkin 2011), were originally introduced by Gärdenfors as a bridge between symbolic and connectionist models of information representation. This was part of an attempt to describe what he calls the geometry of thought. In (Gärdenfors 2000) and (Gärdenfors 2004) we find a description of a cognitive architecture for modelling representations.
This is a cognitive architecture in which an intermediate level, called geometric conceptual space, is introduced between a linguistic-symbolic level and an associationist sub-symbolic level of information representation. The cognitive architecture (see Figure 1) is composed of three levels of representation: a subconceptual level, in which data coming from the environment are processed by means of a neural-network-based system; a conceptual level, where data are represented and conceptualized independently of language; and, finally, a symbolic level which makes it possible to manage the information produced at the conceptual level at a higher level through symbolic computations. The conceptual space acts as a workspace in which low-level and high-level processes access and exchange information, respectively from bottom to top and from top to bottom. The description of the symbolic and subconceptual levels goes beyond the scope of this paper. Figure 1: A sketch of the cognitive architecture. According to the linguistic/symbolic level: 'Cognition is seen as essentially being computation, involving symbol manipulation' (Gärdenfors 2000), whereas, for the associationist sub-symbolic level: 'Associations among different kinds of information elements carry the main burden of representation. Connectionism is a special case of associationism that models associations using artificial neuron networks' (Gärdenfors 2000), where the behaviour of the network as a whole is determined by the initial state of activation and the connections between the units (Gärdenfors 2000). Although the symbolic approach allows very rich and expressive representations, it appears to have some intrinsic limitations, such as the so-called symbol grounding problem (how to specify the meaning of symbols without an infinite regress deriving from the impossibility for formal systems to capture their semantics; see (Harnad 1990)) and the well-known A.I. frame problem (having to give a complete description of even a simple robot's world using axioms and rules to describe the results of different actions and their consequences leads to a combinatorial explosion of the number of necessary axioms). On the other hand, the associationist approach suffers from its low-level nature, which makes it unsuited for complex tasks and representations. Gärdenfors's proposal of a third way of representing information exploits geometrical structures rather than symbols or connections between neurons. This geometrical representation is based on a number of what Gärdenfors calls quality dimensions, whose main function is to represent different qualities of objects such as brightness, temperature, height, width, and depth. Moreover, for Gärdenfors, judgments of similarity play a crucial role in cognitive processes. And, according to him, it is possible to associate the concept of distance with many kinds of quality dimensions. This idea naturally leads to the conjecture that the smaller the distance between the representations of two given objects, the more similar to each other the objects represented are. According to Gärdenfors, objects can be represented as points in a conceptual space, called knoxels (Gaglio et al. 1988; the term, coined by analogy with 'pixel', denotes the epistemologically primitive element at the considered level of analysis), and concepts as regions within a conceptual space.
These regions may have various shapes, although to some concepts (those which refer to natural kinds or natural properties) correspond regions which are characterized by convexity (a set S is convex if and only if, whenever a, b ∈ S and c is between a and b, then c ∈ S). For Gärdenfors, this latter type of region is strictly related to the notion of prototype, i.e., to those entities that may be regarded as the archetypal representatives of a given category of objects (the centroids of the convex regions). One of the most serious problems connected with Gärdenfors's conceptual spaces is that these have, for him, a phenomenological connotation. In other words, if, for example, we take the conceptual space of colours, this, according to Gärdenfors, must be able to represent the geometry of colour concepts in relation to how colours are given to us. However, we have chosen a non-phenomenological approach to conceptual spaces in which we substitute the expression 'measurement' for the expression 'perception', and consider a cognitive agent which interacts with the environment by means of the measurements taken by its sensors, rather than a human being. Of course, we are aware of the controversial nature of our non-phenomenological approach to conceptual spaces. But, since our main task in this paper is characterizing a rational agent with the view of providing a model for artificial agents, it follows that our non-phenomenological approach to conceptual spaces is justified independently of our opinions on perceptions and their possible representations within conceptual spaces. Although the cognitive agent we have in mind is not a human being, the idea of simulating perception by means of measurement is not so far removed from biology. To see this, consider that human beings, and other animals, need a fairly good ability to estimate distance in order to survive. The frog unable to determine whether a fly is within reach or not is, probably, not going to live a long and happy life. Our CA is provided with sensors which are capable, within a certain interval of intensities, of registering different intensities of stimulation. For example, let us assume that the CA has a visual perception of a green object h. If the CA makes the measure of the colour of h its present stereotype of green, then it can, by means of a comparison of different measurements, introduce an ordering of gradations of green with respect to the stereotype; and, of course, it can also distinguish the colour of the stereotype from the colour of other red, blue, yellow, etc. objects. In other words, in this way the CA is able to introduce a green dimension into its colour space, a dimension within which the measure of the colour of the stereotype can be taken to perform the role of 0. The formal model of a conceptual space that at this point immediately springs to mind is that of a metric space, i.e., that of a set X endowed with a metric.
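To make this idea concrete, here is a toy sketch in Python (not from the paper; the representation and all values are hypothetical) of a conceptual space modelled as tuples of sensor measurements endowed with a metric, so that smaller distances correspond to greater similarity and the stereotype performs the role of 0 on its quality dimension:

import math

def distance(k1, k2):
    # Euclidean metric on knoxels represented as tuples of sensor measurements.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(k1, k2)))

# The stereotype of green anchors the green quality dimension at 0;
# measurements closer to it represent more similar shades.
stereotype_green = (0.62,)  # hypothetical sensor reading
print(distance((0.70,), stereotype_green) < distance((0.20,), stereotype_green))  # True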
However, since the metric space X which is the candidate for being a model of a conceptual space has dimensions, dimensions whose elements are associated with coordinates which are the outcomes of (possible) measurements made by the CA, perhaps a better model of a conceptual space might be an n-dimensional vector space V over a field K, like, for example, R^n (with the usual inner product and norm) over R. Although this suggestion is interesting, we cannot help noticing an important disanalogy between an n-dimensional vector space V over a field K and the biological conceptual space that V is supposed to model: human, animal, and artificial sensors are strongly nonlinear. In spite of its cogency, at this stage we are not going to dwell on this difficulty, because (1) we intend to examine the ideal case first, and because (2) we hypothesize that it is always possible to map a perceptual space into a conceptual space where linearity is preserved, either by performing, for example, a small-signal approximation, or by means of a projection onto a linear space, as is done in kernel systems (Schölkopf and Smola 2001). The Music and Visual Conceptual Spaces Let us consider a CA which can perceive both musical tones and visual scenes. The CA is able to build two types of conceptual spaces in order to represent its perceptions. As reported in (Augello et al. 2013a) (Augello et al. 2013b), the agent's conceptual spaces are generated by measurement processes; in this manner each knoxel is, directly or indirectly, related to measurements obtained from different sensors. Each knoxel is, therefore, represented as a vector k = (x1, x2, ..., xn), where xi belongs to the quality dimension Xi of our n-dimensional vector space. The conceptual spaces can also be manipulated according to changes of the focus of attention of the agent (Augello et al. 2013a) (Augello et al. 2013b); however, this process goes beyond the scope of this paper and will not be described here. Visual conceptual space According to Biederman's geons theory (see (Biederman 1987)), the visual perception of an object is processed by our brain as a proper composition of simple solids of different shapes (the geons). Following Biederman's main ideas, we exploit a conceptual space for the description of visual scenarios (see Fig. 2) where objects are represented as compositions of super-quadrics, and super-quadrics are vectors in this conceptual space. Figure 2: Visual perception and corresponding CS representation. For those who are not familiar with the concept of super-quadric, let us say that super-quadrics are geometric shapes derived from the quadrics parametric equation with the trigonometric functions raised to two real exponents. The inside/outside function of the super-quadric in implicit form is F(x, y, z) = ((x/ax)^(2/ε2) + (y/ay)^(2/ε2))^(ε2/ε1) + (z/az)^(2/ε1), where the parameters ax, ay, az are the lengths of the super-quadric axes and the exponents ε1, ε2, called form factors, are responsible for the shape's form: values approaching 1 render the shape rounded. To see this, let us suppose that the vision system can be approximated and modeled as a set of receptors, and that these receptors give as output, corresponding to the external perceived stimulation, the set of super-quadric parameters associated with the perceived object. This leads to a super-quadric conceptual representation of a 3D world.
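As a minimal sketch of the formula above (assuming positive axis lengths and form factors; parameter names follow the text), the inside/outside function can be coded directly. F(x, y, z) < 1 holds for points inside the super-quadric, F = 1 on its surface, and F > 1 outside:

def superquadric_F(x, y, z, ax, ay, az, eps1, eps2):
    # Inside/outside function of a super-quadric in implicit form.
    xy = (abs(x / ax) ** (2.0 / eps2) + abs(y / ay) ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy + abs(z / az) ** (2.0 / eps1)

# A unit sphere is the special case ax = ay = az = 1, eps1 = eps2 = 1;
# the point (0, 0, 1) lies on its surface:
print(superquadric_F(0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))  # 1.0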
The situation is illustrated in Fig. 2, where an object positioned in the 3D space, let us say an apple, is approximately perceived as a sphere and is consequently mapped as a knoxel in the related conceptual space. In particular, a knoxel in the visual conceptual space can be described by the vector k = (ax, ay, az, ε1, ε2, px, py, pz, φ, θ, ψ)^T, where (px, py, pz) gives the super-quadric's position and the last three components its orientation. In this perspective, knoxels correspond to simple geometric building blocks, while complex objects or situations are represented as suitable sets of knoxels (see Figure 3). Figure 3: A representation of a hammer in the visual conceptual space as a composition of two super-quadrics. Music Conceptual Space In (Gärdenfors 1988), Gärdenfors discusses a program for musical space analysis directly inspired by the framework of vision proposed by Marr (Marr 1982). This discussion has been further analysed by Chella in (Chella 2013), where a music conceptual space has been proposed and placed into the layers of the cognitive architecture described in the previous sections. As reported in (Shepard 1982), for the music of all human cultures the relation between pitch and time appears to be crucial for the recognition of a familiar piece of music. In consideration of this, the representation of pitch becomes prominent for the representation of tones. In the music CS the quality dimensions represent information about the partials composing musical tones. This choice is inspired by empirical results about the perception of tones to be found in (Oxenham 2013). We model the functions of the ear as a finite set of filters, each one centred on the i-th frequency (we suppose, for example, to have N filters ranging from 20 Hz to 20 kHz at proper intervals). In this manner, a perceived sound will be decomposed into its partials and mapped as a vector V = (c1, c2, ..., cn), whose components correspond to the coefficients of the n frequencies that compose the sound (ν1, ν2, ..., νn), as illustrated in Fig. 4. The supposition here is that we use the discrete Fourier transform, which is commonly used in signal processing, considering not only music but also other time-variant signals such as speech. The vector V is, therefore, a knoxel of the music conceptual space. The partials of a tone are related both to the pitch and the timbre of the perceived note. Roughly, the fundamental frequency is related to the pitch, while the amplitudes of the remaining partials are also related to the timbre of the note. A similar choice is to be found in Tanguiane (Tanguiane 1993). A knoxel in the music CS will change its position when the perceived tone changes its pitch or its loudness or timbre. Figure 5: A representation of two chords in the music conceptual space. From Visual Patterns to Music Patterns A cognitive agent is able to represent its different perceptions in proper conceptual spaces; as soon as the agent perceives visual scenes or music, a given geometric structure will emerge. This structure will be made of vectors and regions, conceptual representations of perceived objects. Music and visual conceptual spaces are two examples of conceptual representations that can be thought of as a basis for the computational simulation of analogical thinking, providing the agent with some sort of creative capability. Knowledge and experiences made in a very specific domain of perception can be exploited by the agent in order better to understand, or to express in different ways, the experiences and the perceptions that belong to other domains.
This process resembles synaesthesia (a condition in which the stimulation of one sense causes the automatic experience of another sense), which affects some people and allows them to make analogies between elements and experiences belonging to different sensory areas. Analogical thinking reveals similarities between patterns belonging to different domains. As far as the music and vision domains are concerned, several analogies have been discussed in the literature. As an example, Tanguiane (Tanguiane 1993) compares visual and music perception, considering three different levels and both static and dynamic points of view. In particular, from a static point of view, a first visual level, that is, the pixel perception level, can correspond to the perception of partials in music. At the second level, the perception of simple patterns in vision corresponds to the perception of single notes. Finally, at the third level, the perception of structured patterns (patterns of patterns) corresponds to the perception of chords. Concerning dynamic perception, the first level is the same as in the case of static perception, while at the second level the perception of visual objects corresponds to the perception of musical notes, and at the third and final level the perception of visual trajectories corresponds to the perception of music melodies. Gärdenfors (Gärdenfors 1988), in his paper on Semantics, Conceptual Spaces and Music, discusses a program for musical space analysis directly inspired by the framework of vision proposed by Marr (Marr 1982), where the first level is related to pitch identification; the second level is related to the identification of musical intervals; and the third level to tonality, where scales are identified and the concepts of chromaticity and modulation arise. The fourth level of analysis is that at which the interplay of pitch and time is represented. In what follows we are going to illustrate a framework for possible relationships between the visual and musical domains. The mapping is one among many possible, and it has been chosen in order to make the whole process clear and easily understandable. As we have already said, it is possible to represent complex objects in a conceptual space as a set of knoxels. In particular, in the visual conceptual space a complex object can be described as the set of knoxels representing the simple shapes of which it is composed, whereas in the music conceptual space we have seen how to represent chords as the set of knoxels representing the different tones played together. In the two spaces recurrent patterns will emerge, given respectively by proper configurations of shapes and tones which occur more frequently. A fundamental analogy between the two domains can be highlighted, concerning the importance of the mutual relationships between the parts composing a complex object. In fact, in the case of the perception of complex objects in vision, their mutual positions and shapes are important in order to describe the perceived object: e.g., in the case of a hammer, the mutual positions and the mutual shapes of the handle and the head are obviously important to classify the complex object as a hammer. At the same time, the mutual relationships between the pitches (and the timbres) of the perceived tones are important in order to describe the perceived chord (to distinguish, for example, a major from a minor chord on the same note).
Therefore, spatial relationships in static scene analysis are in some sense analogous to sound relationships in the music conceptual space. Although in this work we are overlooking the dynamic aspect of perception in the two domains of analysis, we can also mention some possible analogies; for example, we could correlate the trajectory of a moving object with a succession of different notes within a melody. As certain movements are harmonious or not, so in music the succession of certain tones creates pleasant feelings or not. Visual representation of musical objects: a case study In what follows, we describe a procedure capable of simulating some aspects of analogical thinking. In particular, we consider an agent able to: (1) represent tones and visual objects within two different conceptual spaces; and (2) build analogies between auditory perceptions and visual perceptions. At the heart of this procedure there is the ability on the part of the CA to individuate the appropriate homomorphism f : R^n → R^m which maps a knoxel belonging to an n-dimensional conceptual space R^n (the acoustic domain) onto another knoxel in a different m-dimensional conceptual space R^m (the visual domain). For the sake of clarity we simplify the previously illustrated model of both the music and visual conceptual representations of the agent. In particular: for what concerns the visual perceptions, we consider only a visual coding of spheres; this leads to the assumption that every observed object will be perceived by the agent as a sphere or as a composition of spheres. For what concerns the auditory perceptions, we consider only a limited set of discrete frequencies which the agent perceives. All information about pitch, loudness, and timbre is implicitly represented in the auditory conceptual space by the Fourier analysis parameters. Figure 6 illustrates the mapping process leading from sensing and representation in the music conceptual space to a pictorial representation of the heard tone. The mapping is realized through an analogy transformation which lets a visual knoxel arise in the visual conceptual space. The analogy process of the agent can be outlined in the following steps: the agent perceives a sound (A); the sound is sensed and decomposed through Fourier transform analysis (A); the measurements on the partials lead to a conceptual representation of the perceived sound as a knoxel in the acoustic space (A); the knoxel kA in the acoustic space is transformed into a knoxel kV in the visual conceptual space (B); the mapping lets arise a conceptual representation of an object that is not actually perceived, but only imagined by analogy (C); and the birth of this new item in the visual conceptual space is directly related to the birth of an image which, most importantly, is simply imagined and not perceived (D). Figure 6: Mapping process leading from sensing and representation in the music conceptual space to a pictorial representation of the heard tone. Given two conceptual spaces R^n and R^m, the mapping can be any multidimensional function that realizes the appropriate transformation f : R^n → R^m. The function f can be learned in a supervised or unsupervised way through machine learning algorithms. At present, we superimpose the structure of f. In order to make a choice for f we take some inspiration from Shepard (Shepard 1982); a minimal sketch of the resulting pipeline is given below.
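The following sketch makes the (A)-(D) steps concrete in Python. It is a hypothetical reconstruction, not the authors' implementation: it assumes a mono signal at a fixed sample rate, keeps only the dominant partial, and anticipates the Shepard-inspired helix mapping described in the next paragraph; all names and parameter values are illustrative.

import numpy as np

def acoustic_knoxel(signal, sample_rate):
    # (A) Decompose the sound into partials via the FFT; the coefficients
    # over the frequency bins form the knoxel in the music conceptual space.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, spectrum

def analogy_map(freqs, spectrum, r=1.0, c=0.5):
    # (B)-(D) Map the dominant partial onto a visual knoxel: a sphere whose
    # centre lies on a pitch helix (one turn per octave) and whose radius
    # grows with loudness. The resulting sphere is imagined, not perceived.
    i = np.argmax(spectrum[1:]) + 1          # skip the DC component
    pitch, loudness = freqs[i], spectrum[i]
    chi = np.log2(pitch)
    centre = (r * np.cos(2 * np.pi * chi), r * np.sin(2 * np.pi * chi), c * chi)
    radius = loudness / spectrum.sum()
    return centre, radius

# Usage: a 440 Hz tone sampled for one second.
t = np.linspace(0, 1, 44100, endpoint=False)
freqs, spec = acoustic_knoxel(np.sin(2 * np.pi * 440 * t), 44100)
print(analogy_map(freqs, spec))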
Many geometrical mappings have been proposed for pitch: the simplest uses a one-dimensional logarithmic scale, where each pure tone is related to the logarithm of its frequency. However, according to the two-component theory (Revesz 1954) (Shepard 1982), the best manner to pictorially represent pitches is a helix or 3D spiral instead of a straight line. A mapping based on this theory is illustrated in Fig. 7, where simple sounds are drawn on the helix as spheres of different sizes, according to their associated loudness. That mapping completes one turn per octave and achieves the necessary geometric proximity between points which are an octave apart. The strong point of the uniform helix representation is that the distance corresponding to any particular musical interval is invariant throughout the representation. Each tone can be mapped onto a spiral lying on a cylinder, where vertically aligned points correspond to the same tone in different octaves. This projective property holds regardless of the slope of the helix (Shepard 1982). In superimposing f we suppose that when the agent perceives a sound which is louder than another one, this evokes in its mind the view of something that is more cumbersome than the other. We assume that this perceived object has no preferred direction or shape; therefore the easiest way to represent it is a sphere, whose radius can be associated with the loudness of the perceived sound. The other parameter is the pitch. As soon as the agent perceives different pitches, it tries to visualize them, imagine them, and locate them according to the helix, whose equations are: x = r cos(2πχ) (1), y = r sin(2πχ) (2), z = cχ (3), where χ is the angular coordinate associated with the pitch (one turn per octave) and r and c are constants. If we consider a simple tone of given frequency ν, the pitch will be represented by a point p(x, y, z) on the spiral, while its loudness L will be represented by a sphere having its centre in p(x, y, z) and a radius whose length is related to the perceived loudness. The sphere corresponds to a knoxel in the visual conceptual space, while the perceived tone corresponds to a knoxel in the music conceptual space. The agent therefore will visually imagine the perceived sound as a sphere whose radius is proportional to the perceived loudness, while its position corresponds to a point lying on the helical line representing all the tones that can be perceived by the agent; a chord will be imagined as a set of spheres in this 3D space. Conclusions We have illustrated a methodology for the computational emulation of analogy, which is an important part of the imaginative process characterizing the creative capabilities of human beings. The approach is based on a mapping between geometric conceptual representations which are related to the perceptive capabilities of an agent. Even though this mapping can be built up in several different ways, we presented a proof-of-concept example of some analogies between music and visual perceptions. This allows the agent to associate imagined, unseen images with perceived sounds. Figure 7: Visual representation of music chords deriving from a mapping based on the two-component theory. It is worthwhile to point out that, in a similar way, it is possible to imagine sounds associated with visual scenes, and the same can be done with different kinds of perceptions. We claim that this approach could be a step towards the computation of many forms of the creative process. In future work, different types of mapping will be investigated and properly evaluated.
Acknowledgment This work has been partially supported by the PON01 01687 - SINTESYS (Security and INTElligence SYStem) Research Project. 2014_4 !2014 Autonomously Managing Competing Objectives to Improve the Creation and Curation of Artifacts David Norton, Derrall Heath, Dan Ventura Computer Science Department Brigham Young University Provo, UT 84602 USA dnorton@byu.edu, dheath@byu.edu, ventura@cs.byu.edu Abstract DARCI (Digital ARtist Communicating Intention) is a creative system that we are developing to explore the bounds of computational creativity within the domain of visual art. As with many creative systems, as we increase the autonomy of DARCI, the quality of the artifacts it creates and then curates decreases, a phenomenon Colton and Wiggins have termed the latent heat effect. We present two new metrics that DARCI uses to evolve and curate renderings of images that convey target adjectives without completely obfuscating the original image. We show how we balance the two metrics and then explore various ways of combining them to autonomously yield images that arguably succeed at this task. Introduction There has been a recent push in computational creativity towards fully autonomous systems that are perceived as creative in their own right. One of the most significant problems facing modern creative systems is the level of curation that is occurring in these systems. If a system is producing dozens, hundreds, or even thousands of artifacts from which a human is choosing a single valued artifact, then is the system truly fully autonomous? Colton has argued that for a system to be perceived as creative, it must demonstrate appreciation for its own work (Colton 2008). A strong implication of this is that the system must be able to do its own curation by autonomously selecting an artifact for human judgment. DARCI (Digital ARtist Communicating Intention) is a creative system that we are developing to explore the bounds of computational creativity within the domain of visual art. DARCI is composed of several subsystems, each with its own creative potential, and each designed to perform an integral step of image creation, from conception of an idea, to design, to various phases of implementation, to curation. The most complete subsystem, and the one that is the focus of this paper, is called the image renderer. The image renderer uses a genetic algorithm to discover a sequence of image filters that will render an image composition (produced by another subsystem) so that it will reflect a list of adjectives (selected by yet another subsystem). After evolving a population of candidate renderings, the image renderer must select an interesting candidate that reflects both the original image and the given adjectives; in other words, it must curate the finished artifacts. Historically, DARCI has been successful at producing such images when curation is a joint effort between DARCI and a human (Norton, Heath, and Ventura 2011b; Heath, Norton, and Ventura 2013). In these cases, DARCI selects a number of artifacts, and a human chooses their favorite from that selection. When DARCI curates on its own, the results have been significantly less successful. This decrease in quality is to be expected and is a phenomenon Colton and Wiggins call the latent heat effect: 'as the creative responsibility given to a system increases, the value of its output does not (initially) increase ...' (emphasis added) (Colton and Wiggins 2012).
Since we know DARCI is capable of producing interesting images, we are interested in increasing the value of the artifacts the system produces when curating alone, thus decreasing the latent heat effect. DARCI's image renderer uses a combination of two conflicting metrics as a fitness function to evaluate and assign fitness scores to candidate artifacts. The fitness score not only drives the evolution of artifacts using a genetic algorithm, it is also used to curate the population of candidate artifacts when evolution is complete. For this paper we have made improvements to the fitness function in order to improve the quality of artifacts DARCI produces. Previously, the fitness function was the combined average of an ad hoc interest metric and an adjective matching metric. In this paper, we abandon the interest metric in favor of a new similarity metric, and combine it with an improved adjective matching metric. While we take measures to ensure that both metrics output real values in a similar range, experience has shown that the two metrics are not measuring attributes of equal quality. This has led to the observation that, if combining metrics with an average, the algorithm will give disproportionate weight to the metric that is easier to maximize. Thus, we will investigate different means of combining these two metrics in an attempt to more effectively balance the requirements put upon the image rendering subsystem and decrease the latent heat effect. We show the results of these new fitness functions in figures curated strictly by DARCI. Image Rendering The image rendering subsystem uses a series of image filters to render pre-existing images which we refer to as source images. The subsystem has access to Photoshop-like filters with varying parameters. It uses a genetic algorithm to discover the configuration and parameter settings of these image filters so that candidate artifacts will reflect target adjectives without over- or under-filtering the source image (Norton, Heath, and Ventura 2011b; 2013). A genetic algorithm is used because evolutionary approaches elegantly facilitate the creation of artifacts through both combination and exploration, two processes described by Boden for generating creative products (Boden 2004). Gero has also outlined how the processes underlying evolution are ideal for producing novel and unexpected solutions, a crucial part of creativity (Gero 1996). Finally, we have shown how evolutionary algorithms approximate some aspects of the creative process in human artists (Norton, Heath, and Ventura 2011a). In this section we will describe in detail the two metrics used in this paper: adjective matching and similarity. Adjective Matching The adjective matching metric is the output of a learning subsystem of DARCI called the Visuo-Linguistic Associator (VLA). The VLA is a collection of artificial neural networks (ANNs) that learns to associate image features with adjectives through backpropagation training. The original VLA has been described in detail previously (Norton, Heath, and Ventura 2010). Here we introduce an improved VLA. While DARCI is designed to function as an online system, the original VLA required subsystem resets whenever it was time to introduce new training data, essentially learning in batch. Thus, in order for DARCI to adapt, human intervention was needed at regular intervals.
The new VLA uses an approach closer to incremental learning, to better facilitate the desired autonomous online functionality. Additionally, the new VLA uses a more accurate and complete approach to predicting additional training data. In this section we describe the new VLA without any assumptions that the reader is familiar with the previous system. Training Data Training data for DARCI is contained in a database. Each data point consists of an adjective (the label), the sentiment toward the adjective (positive or negative), the image features associated with the adjective (the image), and a time stamp. In our research, the term adjective always refers to a unique adjective synset as defined in WordNet (Fellbaum 1998). Hence, different senses of the same word will belong to different synsets, or adjectives. Data points are added to the database as they are submitted by volunteers using a training website (Heath and Norton 2009). Whenever the training algorithm is invoked, new relevant data points are introduced to the learner one at a time in the submitted order. The learner consists of a series of binary ANNs, one for each relevant adjective. An adjective, and any corresponding data point, is considered relevant once there are at least ten distinct positive and ten distinct negative instances of the adjective in the database. Here, distinct means occurrences of the adjective with unique sets of image features (i.e., if an adjective is used to label the same image multiple times, it only counts as one occurrence). At the moment the learner is invoked, a new neural network is created for any new adjectives that have become relevant.

Table 1: Image features used to train neural networks.
Color & Light: 1. Average red, green, and blue; 2. Average hue, saturation, and intensity; 3. Saturation and intensity contrast; 4. Unique hue count (from 20 quantized hues); 5. Hue contrast; 6. Dominant hue; 7. Dominant hue image percent.
Shape: 1. Geometric moment; 2. Eccentricity; 3. Invariant moment (5x vector); 4. Legendre moment; 5. Zernike moment; 6. Pseudo-Zernike moment; 7. Edge direction histogram (30 bins).
Texture: 1. Co-occurrence matrix (x4): maximum probability, first order element difference moment, first order inverse element difference moment, entropy, uniformity; 2. Edge frequency (25x vector); 3. Primitive length: short primitive emphasis, long primitive emphasis, gray-level uniformity, primitive uniformity, primitive percentage.

The reason we only create and train the learner on relevant data points is a matter of practicality. There are over 18000 adjective synsets in WordNet, and at the time of this writing more than 6000 adjective synsets in DARCI's database. However, most of the adjectives in DARCI's database are rare, with only one or two positive data points. This is not enough data to successfully train any learner in a complex domain such as image annotation. Since performance speed is important for DARCI, accessing 6000 neural nets, most of which would be insufficiently trained, to annotate an image is impractical. As of this writing, DARCI has 237 relevant adjectives, a much more useful and manageable number. Taking synonyms into consideration, these relevant adjectives cover most standard adjectives. The learner's neural networks are trained using standard backpropagation with 102 image features as inputs.
These image features are widely accepted global features for content-based image retrieval, and most of them are available through the DISCOVIR (DIStributed COntent-based Visual Information Retrieval) system (King, Ng, and Sia 2004; Gevers and Smeulders 2000). A summary of the features we use can be found in Table 1. These features describe the color content, lighting, textures, and shape patterns found in images. Specific to the art domain, several researchers have shown that such features are useful in classifying images according to aesthetics (Datta et al. 2006), painting genre (Zujovic, Gandy, and Friedman 2007), and emotional semantics (Wang, Yu, and Jiang 2006). As many of these researchers have found color to be particularly useful in classifying images, we added four color-based features inspired by Li's colorfulness features (Li and Chen 2009) to those contained in DISCOVIR. In Table 1 these colorfulness features are Color & Light numbers 4-7. When training neural networks in batch, backpropagation requires many epochs of training to converge. During each epoch, all of the training data is presented to the neural network in a random order. To imitate this with incremental learning, each new data point is introduced to the appropriate neural network along with a selection of previous data points. Along with this recycled data, additional data points are predicted from the co-occurrences of adjectives with images. By including predicted data we are able to augment the limited data we do have. Similar, but less complete, approaches to augmenting training data have been successful in the past (Norton, Heath, and Ventura 2010). Recycling Data For each new data point presented to a neural network for a given adjective a, n positive data points from the set of all previous positive data points for the given adjective, Da+, and n negative data points from the set of all previous negative data points for the given adjective, Da-, are selected. The data points are selected with replacement according to the probability P(rank(d)), where d ∈ Das, s is the sentiment of the set (- or +), and rank(d) is the temporal ordering of element d in Das. The most recent element has a rank of |Das| and the oldest element has a rank of 1. The equation for P(rank(d)) is as follows:

P(rank(d)) = rank(d) / (Σ_{i=0}^{|Das|} i)    (1)

The value for the number of previous data points chosen, n, is defined by n = min(r, |Da+|, |Da-|), where r is a parameter setting the maximum number of data points to recycle each time a new data point is introduced. For the experiments in this paper, this value is set to 100. Informally, every time a new data point is presented to a neural network, an equal number of positive and negative data points are selected from the previous data points for that neural network. These are selected randomly, but with a higher probability given to more recent data. Predicting Data To augment the training data we collect from DARCI's website, we analyze the co-occurrence of relevant adjectives to predict additional data points. Here we say that two adjectives co-occur whenever the same image is labeled with both adjectives at least once; these labels can be negative or positive. As each new data point is introduced to the learner, co-occurrence counts (distinct images) are updated for all pairings of relevant adjectives across all four combinations of sentiment.
For example, as of this paper, scary has 26 co-occurrences with disturbing (or scary co-occurs with disturbing in 26 distinct images) and 0 co-occurrences with not disturbing, while not scary has 5 co-occurrences with disturbing and 32 co-occurrences with not disturbing. Once the co-occurrence counts have been updated, they are used to predict m positive and m negative data points to augment the new data point. m is calculated as ⌊pn⌋, where p is a prediction coefficient and n is defined above. For this paper, p is set to 0.3. These predicted data points are not added to the database. To predict new data points for the given adjective a, the system first calculates each of the likelihoods that an image will be labeled with a or ¬a, given that the image is labeled positively or negatively with each of the adjectives ai in A, the set of all relevant adjectives. Likelihood is calculated as:

L(a|ai) = co(a, ai) / supp(ai)    (2)

where co(a, ai) is the co-occurrence count for a and ai, and supp(ai) is the support of ai (i.e., the number of distinct images labeled with ai). Predicted data points for a are chosen using two probability distributions created from the above likelihoods, one for positive data points and the other for negative. The positive probability distribution is created by choosing the set of likelihoods, Λ+, that is, the set of all likelihoods of the form L(a|ai) and L(a|¬ai) that are greater than some threshold λ and less than 1. In this paper, λ is set to 0.4. A likelihood of 1 is omitted because it is guaranteed that there will be no new images to predict with label a. The positive probability distribution is then created by normalizing Λ+. The negative probability distribution is created in the same way, except using the set of all likelihoods, Λ-, of the form L(¬a|ai) and L(¬a|¬ai) satisfying the same conditions. For each data point to be predicted, a likelihood distribution from either Λ+ or Λ- is selected using the above probability distributions. Then an image is selected, using a uniform distribution, from all those images with the likelihood's label (either ai or ¬ai) that are not labeled with a. The label for the new predicted data point is a, the sentiment is the sentiment of the distribution Λ, and the features are the image features of the selected image. Informally, data points are predicted by assuming that images labeled with adjectives that frequently co-occur with a given adjective can also be labeled with the given adjective. Artificial Neural Networks Once recycled and predicted data points for a particular incoming data point are selected, they are shuffled with the incoming data point and given as inputs to the appropriate neural network. The incoming data point then immediately becomes available as historical data for subsequent training. This process is repeated for each new data point introduced to the learner. Assuming that there is sufficient data, each new data point will be accompanied by a total of 2n + 2m data points; in the case of this paper, that's 260 recycled or predicted data points, evenly balanced between positive and negative sentiments. As previously mentioned, one binary artificial neural network is created for each relevant adjective. These neural networks have 102 input nodes for the image features previously described. For this research, based on preliminary experimentation, the neural networks have 10 hidden nodes, a learning rate of 0.01, and a momentum of 0.1.
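As an illustration of the rank-weighted recycling of Equation (1), the sketch below samples previous data points with replacement, with probability proportional to recency. It is a minimal reconstruction under the stated definitions, not the authors' code; the data-point representation is hypothetical.

import random

def recycle(previous_points, n):
    # previous_points is ordered oldest to newest, so the element at index i
    # has rank i + 1; P(rank) = rank / (1 + 2 + ... + N), as in Equation (1).
    N = len(previous_points)
    weights = list(range(1, N + 1))
    return random.choices(previous_points, weights=weights, k=n)

# Usage: recycle 3 of 5 stored points, biased toward the more recent ones.
print(recycle(["d1", "d2", "d3", "d4", "d5"], n=3))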
When the VLA is accessed for the adjective matching metric, the candidate artifact being evaluated is analyzed by extracting the 102 image features. These features are then presented to the appropriate neural network, and the output is used as the actual metric. Thus, as Baluja and Machado et al. have done previously, we essentially build and use a model of human appreciation to guide the creation process so that we will hopefully produce images that humans can value (Baluja, Pomerleau, and Jochem 1994; Machado, Romero, and Manaris 2007). Unlike Baluja and Machado, however, our model associates images with language and meaning (adjectives), an important step in building a system that communicates intention with its artifacts. Similarity The similarity metric borrows from the growing research on bag-of-visual-words models (Csurka et al. 2004; Sivic et al. 2005) to analyze local features rather than global ones as we have done previously (Norton, Heath, and Ventura 2011b). Typically, these local features are descriptions of points in an image that are the most surprising, or, said another way, the least predictable. After such an interest point is identified, it is described with a vector of features obtained by analyzing the region surrounding the point. Visual words are quantized local features. A dictionary of visual words is defined for a domain by extracting local interest points from a large number of representative images and then clustering them (typically with k-means) by their features into k clusters, where k is the desired dictionary size. With this dictionary, visual words can be extracted from any image by determining to which clusters the image's local interest points belong. A bag-of-visual-words for the image can then be created by organizing the visual word counts for the image into a fixed vector. This model is analogous to the bag-of-words construct for text documents in natural language processing. These fixed vectors can then be compared to determine image similarity. For the similarity metric used in this paper, we use the standard SURF (Speeded-Up Robust Features) detector and descriptor to extract interest points and their features from images (Bay et al. 2008). SURF quickly identifies interest points using an approximation of the difference-of-Gaussians function, which will often identify corners and distinct edges within images. To describe each interest point, SURF first assigns an orientation to the interest point based on surrounding gradients. Then, relative to this orientation, SURF creates a 64-element feature vector by summing both the values and magnitudes of Haar wavelet responses in the horizontal and vertical directions for each square of a four-by-four grid centered on the point. We build our visual word dictionary by extracting these SURF features from more than 2000 images taken from the database of images we've collected to train DARCI. The resulting interest points are then clustered into a dictionary of 1000 visual words using Elkan k-means (Elkan 2003). Similarity is determined by comparing candidate artifacts with the source image. We create a normalized bag-of-visual-words for the source image and each candidate artifact using our dictionary, and then calculate the angular similarity between these two vectors. Angular similarity between two vectors, A and B, is calculated as follows:

similarity = 1 - cos⁻¹(A·B / (‖A‖‖B‖)) / π    (3)
This metric effectively measures the number of interest points that coincide between the two images by comparing the angle between vectors A and B. In text analysis, cosine similarity (the parenthetical expression contained in Equation 3) is typically used to compare the similarity of documents. With this metric, as the sparseness of vectors increases, the similarity between arbitrary vectors approaches 0. In our case, as vectors are quite sparse, artifacts that are even slightly different from the source would have low scores using this measure. Nevertheless, creating renderings that are very similar to the source image is trivial, as it requires simply using fewer and less severe filters. Thus, despite encountering low scores from only small differences, the genetic algorithm would be able to easily converge to near-perfect or even perfect scores. This interplay between a harsh similarity metric and relative ease of convergence would place too much weight on the similarity metric. In fact, auxiliary experiments have shown that when using cosine similarity, the adjective matching metric is almost ignored in artifact production. Since the bag-of-visual-words vectors can only contain positive values, using angular similarity instead of cosine similarity naturally constrains the output to between 0.5 and 1.0. This smaller spread in potential scores significantly reduces the negative impact of sudden jumps in similarity score due to small changes in the candidate renderings. It should be noted that in cases where a candidate artifact has no detected interest features (‖B‖ = 0), the similarity will default to 0. This is the only case where the similarity score can be below 0.5, as the metric cannot make a comparison. Experimental Design Six fitness functions are explored in this paper. They are referred to as similarity, adjective, average, minimum, alternate, and converge. Similarity and adjective are the similarity and adjective matching metrics in isolation. The other four combine these two conflicting metrics in different ways. Average is the approach we have used in the past: the two metrics are averaged together with equal weight. With minimum, the fitness function is the minimum of the two metrics. Alternate uses one metric at a time for the fitness function, but it alternates between the two every generation, beginning with adjective matching. Finally, converge also uses one metric at a time; however, it alternates every 20 generations, also beginning with adjective matching. The two conflicting metrics result in a process that is arguably transformational in nature, at least to a limited degree. Boden describes transformational creativity as that which transforms the conceptual space of a domain (Boden 1999). While the space of possible artifacts cannot change (the filters available for rendering images do not change), the evaluation of the artifacts does change through the interplay of the two metrics. This interplay occurs organically in the minimum fitness function, by forcing the system to emphasize the metric that it is struggling to optimize at any given epoch during the evolutionary algorithm. The interplay of divergent metrics occurs more mechanically in the alternate and converge fitness functions, by scheduling the emphasis; however, the sudden shift in metric could result in more unexpected results, a criterion of creativity emphasized by Maher (Maher 2010; Maher, Brady, and Fisher 2013).
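For concreteness, the sketch below shows Equation (3) and the four combination strategies as they are described above. It is a hypothetical reconstruction rather than DARCI's code; the two metric scores passed in are assumed to lie on comparable scales.

import math

def angular_similarity(a, b):
    # Equation (3): 1 - angle(A, B)/pi, computed on bag-of-visual-words vectors.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # the paper defaults to 0 when no interest features are detected
    cos_angle = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # clamp against floating-point drift
    return 1.0 - math.acos(cos_angle) / math.pi

def combined_fitness(scheme, adjective_score, similarity_score, generation):
    # The four ways of combining the two conflicting metrics.
    if scheme == "average":
        return (adjective_score + similarity_score) / 2.0
    if scheme == "minimum":
        return min(adjective_score, similarity_score)
    if scheme == "alternate":  # switch metric every generation, adjective first
        return adjective_score if generation % 2 == 0 else similarity_score
    if scheme == "converge":   # switch metric every 20 generations, adjective first
        return adjective_score if (generation // 20) % 2 == 0 else similarity_score
    raise ValueError(scheme)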
The scheduled approaches were inspired by DiPaola and Gabora's work with Evolving Darwin's Gaze, an installation that also evolves images under two shifting criteria (DiPaola and Gabora 2009). Their criteria are a pixel matching metric comparing artifacts to a specific portrait of Charles Darwin, and an artistic heuristic. We anticipate that our less restrictive metrics will ultimately allow for even more surprise and variation in artifacts, while also communicating meaning (adjectives).

Figure 1: The three source images used in all experiments. Images A and C have resolutions of 1600x1200. Image B has a resolution of 1920x1200.

Each of the above fitness functions except for similarity was run on three source images across five adjectives, for a total of fifteen experiments per approach. Similarity was only run once for each source image, since no adjective was needed. For algorithmic efficiency, the artifacts produced in the experiments were scaled down to a maximum width of 800 pixels. Each experiment ran for 100 generations. The five adjectives used were happy, sad, fiery, wet, and peaceful. These were chosen because they were well represented in our adjective matching training data and because they depict a range of distinct meanings and emotional valence. The three source images (referred to as images A, B, and C) are shown in Figure 1 with their corresponding resolutions.

As mentioned previously, optimizing to the similarity metric alone is trivial for the genetic algorithm, since it need only remove filters to do so. However, there is no such trivial approach to optimizing to the adjective metric. Historically, near-perfect similarity scores are common, while near-perfect adjective matching scores are non-existent. In order to balance the quality of the two metrics in our experiments, the source images were not scaled down to match the resolution of the artifacts. A source image and its otherwise unaltered counterpart will yield similar but not identical bags-of-visual-words when analyzed for the similarity metric. This means that the genetic algorithm will no longer be able to trivially achieve perfect similarity. The similarity scores of each source image compared to the scaled-down version of itself are, for images A, B, and C respectively: 0.826, 0.739, and 0.843, with an average score of 0.803. This means that for our experiments, the range of similarity is now more or less between 0.5 and 0.803, with a soft ceiling. This is much closer to the range we have seen from adjective matching in auxiliary experiments: 0.144 to 0.714.

Results
In this section we will discuss DARCI's artifact selection for each experiment. While all interpretations of the images themselves are clearly subjective, we attempt to be conservative and consistent in our observations. We will discuss the artifacts in terms of the objectives of the image rendering subsystem: to depict the source image and adjective together in an interesting way. By interesting we specifically mean that extensive filtering (more than basic color filtering or use of inconspicuous filters) has occurred without removing all trace of the source image. Any hint of the source image will be considered acceptable in attributing interest to an artifact. This definition of interesting is derived from two commonly proposed requirements for creativity, applied to the specific goal of DARCI's image rendering subsystem.
These two requirements are, as defined by the American Psychological Association, functionality and originality; or, as Boden described them for the domain of computation, quality and novelty (Boden 1999). Since the purpose of the image renderer is to alter a source image, elimination of the source image would not be functional. Ritchie describes a related requirement that is also applicable here: that of typicality (Ritchie 2007). Ritchie defines typicality as the extent to which an artifact is an example of its intended class. In our case this would be a rendering of a source image, as opposed to an entirely new image. The second requirement, novelty, requires that the image renderer produce renderings that are distinctive. Thus, minor or no changes to a source image would clearly suggest a failure at novelty. In an attempt to reduce the amount of subjectivity in our analysis, DARCI's artifacts are either interesting by this definition or not; there is no attempt to rate the degree of interest.

In addition to being interesting, DARCI's artifacts must match the intended adjective. In order to be as objective as possible, we will compare DARCI's artifacts to images from the VLA training data for each given adjective. These images are representative of the types of images one would find when searching Google Images for a specific adjective. Examples of these images can be found in Figures 2-6. Since DARCI is rendering, as opposed to composing, and due to the limitations of DARCI's image analysis features (and indeed the limitations of the entire field of computer vision), we will be looking for similarities in color, light, and texture as opposed to similar object content.

Figure 2: Sample sad images from training data.
Figure 3: Sample happy images from training data.
Figure 4: Sample fiery images from training data.
Figure 5: Sample wet images from training data.

The sad training images (Figure 2) tend to be desaturated, even black and white, and/or dark, with an emphasis on dull colors. The happy training images (Figure 3) trend towards bright and colorful, often containing a full spectrum of colors. The fiery training images (Figure 4) usually have distinct flame textures, are bright, and most are monochromatic, typically orange. The wet training images (Figure 5) consist of cool colors, usually blue, and have frequent specular highlights and/or wavy patterns. Finally, the peaceful training images (Figure 6) contain a variety of soft or pastel colors with a lot of smooth textures.

Ideally, the most fit artifact discovered by the genetic algorithm should be the one that best satisfies the objectives for image rendering outlined above. Thus, for most of the fitness functions, we used this method of selection. However, we anticipated that for two of the fitness functions, alternate and converge, this would not be an appropriate approach. The reason is that both of these fitness functions use only one metric at a time, meaning that the most fit artifact discovered could only have been optimized for a single metric. The expected result would be the same as a selection from one of the control fitness functions, not an ideal balance of metrics. We will first discuss the results of the fitness functions that use the most-fit selection process: similarity, adjective, average, and minimum. Later we will discuss alternate and converge, which use a different selection criterion; we will evaluate each selection process by the proportion of artifacts that meet the interest and adjective matching requirements.

Figure 6: Sample peaceful images from training data.

Most Fit Selection
The most fit artifact discovered for each source image in the similarity control experiments is shown in Figure 7.
The most fit artifact discovered in each of the other experiments is shown in Figures 8-12.

Figure 7: The most fit artifacts for each indicated source image discovered using the similarity fitness function.
Figure 8: The most fit artifacts for each indicated source image and fitness function for the adjective happy.

First looking at the similarity results (Figure 7), we see that, with the exception of image A, DARCI did not select nearly identical images as we might have expected. This illustrates the effect of not scaling the source images. The chosen artifacts actually had slightly higher fitness scores than the strictly scaled-down source images demonstrated earlier. For comparison, the fitness score of each of these artifacts is, for artifacts produced from images A, B, and C respectively: 0.836, 0.762, and 0.860, with an average score of 0.820. That being said, these artifacts are still quite close to the source images, and any resemblances to any of the specified adjectives are obviously happenstance.

For the average fitness function, arguably all three of the happy images convey their adjective by applying bright colored filters (Figure 8). All three of the sad images are made more sad by converting them to dark black-and-white images (Figure 9). Two out of the three fiery images are fiery, primarily through coloring with oranges and reds (Figure 10). Image B also looks bright and molten in texture, and some of the buildings in the background of image C almost look on fire. All three wet images are debatably wet, mostly through blue filters (Figure 11), although image B actually looks like it is being viewed through a window soaked during a downpour. None of the peaceful images look any more peaceful than their sources, and very little if anything has changed (Figure 12). With the odd exception of the peaceful images, average does quite well at conveying adjectives; however, most of the images don't use much more than simple color filters to do so. In our estimation, for the average artifacts, happy B and C, fiery B and C, and wet B satisfy the objectives for image rendering as outlined earlier.

For the minimum fitness function, two of the happy images, A and C, are made happy by incorporating many bright colors. Image A looks kaleidoscopic and image C has some rainbow effects. Image B seems out of place, though close inspection reveals that it may have received a high fitness because of many bright colors as well. While perhaps difficult to notice at first, both images A and B maintain the presence of the source image. All of the sad images are quite dark, suggesting sadness. Images A and C may look like they have eliminated the source images, but the vague shape of the fish is visible within the squiggles of image A, and close inspection of image C reveals many of the city lights behind the heavy distortion. The three fiery images could be considered fiery: image A literally looks on fire and image C looks molten. All three wet images appear wet; as with average, this is primarily accomplished by making the images blue. Image B does look like the image is now reflected off of a lake, and image C is a bit bleary and wavy, giving it ever so slightly the look of being underwater.

Figure 9: The most fit artifacts for each indicated source image and fitness function for the adjective sad.
With the exception of image A, the peaceful images aren't even recognizable, nor do they look peaceful in the way peaceful is reflected in the training images. We are beginning to get a sense of how DARCI interprets peaceful, though. In our estimation, of the minimum images, happy A and C, all sad and fiery images, and wet B and C satisfy the objectives for image rendering. While happy B and peaceful A are interesting representations of the source image, they do not convey the adjective properly.

In the case of the adjective fitness function, we see that with three exceptions (happy A, sad A, and peaceful C), the source image is undetectable. Happy A and sad A do fit their adjectives, but peaceful C does not. Interestingly, in our estimation adjective does not depict the given adjectives as well as average or minimum. This can be attributed in part to the system exploiting the VLA's neural networks with extreme and unnatural image features.

With all three of these fitness functions, we have seen unsatisfactory performance with peaceful. However, this poor performance goes beyond DARCI's strange interpretation of what makes an image peaceful (apparently, being purple and noisy). That can be attributed to inadequate learning by the VLA, perhaps because of limited available training data; one could even make the case for it being a creative expression of peaceful. The other problem here is that, for peaceful artifacts, the three average artifacts were virtually unmodified from the source image, while two of the minimum artifacts completely obfuscated the source image. This issue can be explained by a problematic interaction between the similarity and adjective matching metrics for peaceful. The peaceful neural network output has very low variance compared to the other neural networks, and a mean slightly under 0.5. The variance is so low that the highest peaceful neural network outputs encountered are not much higher than the lowest similarity score possible (0.5). Thus, the minimum fitness function is effectively acting like the adjective fitness function for peaceful. In the case of average, the variance is so low that the smallest changes in similarity still overshadow any changes in adjective matching. This example illustrates that despite our best efforts to balance the two metrics, incongruities between the two can still occur. Thus, for future work, a dynamic solution that takes into consideration certain statistics about each metric may be in order.
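One possible form such a dynamic solution could take is a per-generation rescaling of each metric over the current population, so that a low-variance metric such as peaceful is not drowned out. This is purely a sketch of the idea floated above, not something implemented in DARCI; all names are illustrative.

def rescaled(scores):
    # min-max rescale one metric's scores across the population to [0, 1]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def balanced_average(adjective_scores, similarity_scores):
    # equal-weight average after rescaling, so neither metric dominates
    return [0.5 * (a + s) for a, s in
            zip(rescaled(adjective_scores), rescaled(similarity_scores))]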
Selection After Last Shift
As indicated earlier, the alternate and converge fitness functions need a different selection method than the one used above. As suspected, using most-fit selection resulted in artifacts that were either similar to those in Figure 7 or completely abstract, like the images produced with adjective. The assumption with alternate and converge is that even though only a single metric is in effect at each generation, the genetic algorithm will not be able to converge to either, because of constant shifts in the metric, and will instead find an interesting and unexpected solution. With this in mind, the selection criterion that we use here is to pick the most fit artifact from the last shift in metric. This is the point at which we would expect to find the most surprising artifacts. We define a shift in the metric as the change from the similarity metric to the adjective matching metric, or vice versa. For alternate this is the shift from similarity to adjective matching at generation 100, which we will call alternate-adjective; for converge it is the shift from adjective matching to similarity, also at generation 100, which we will call converge-similarity. Since the direction of the shift may strongly affect the outcome, we have also selected the most fit artifact from generation 99 for alternate (adjective matching to similarity) and generation 80 for converge (similarity to adjective matching). We will call these two approaches alternate-similarity and converge-adjective respectively. The results of these experiments are in Figures 13 to 15. In the interest of space, we curate these images by only showing those artifacts that are neither over- nor under-filtered (i.e. interesting), based on observations similar to those made for the earlier experiments.

Figure 11: The most fit artifacts for each indicated source image and fitness function for the adjective wet.

In the case of alternate-similarity, there were no artifacts produced that weren't under-filtered. Most had tinting or small distortions, but none were interesting. Figure 13 shows interesting artifacts that were selected with alternate-adjective. This particular fitness function and selection criterion yielded the most numerous interesting artifacts of the four configurations. In this case, all but one of the not-shown artifacts were too abstract. Of the remaining interesting artifacts, all but the unusual peaceful images arguably convey the intended adjectives. Next, Figure 14 shows interesting artifacts selected with converge-adjective. Most of the other artifacts selected obfuscated the source image too much. Here, with the exception of fiery A and perhaps fiery B, the images convey the intended adjectives. Finally, Figure 15 shows the interesting artifacts selected with converge-similarity. While the images shown are adequately interesting, we don't consider them as distinguished as those in the previous two examples. All of the other artifacts were too similar to the source image to warrant display. All of the displayed artifacts do convey the given adjectives.

Filter Sequence Length
Functionally, much of the quality of an artifact can be attributed to the length of the artifact's genotype. The genotype is the genetic encoding of the artifact, which in the image rendering subsystem is a sequence of image filters. The more filters used to render a source image, the more likely the artifact will become abstract; the fewer filters used, the more likely the artifact will not deviate from the source image. Figure 16 shows the average genotype length (in number of filters) for each fitness function explored in this paper over the 100 epochs of evolution. The top performing fitness functions show a comfortable balance between too many and too few filters; minimum does this the best.
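The genotype representation just described can be pictured as follows; this is a hedged sketch in which the Filter type alias and the render function are illustrative names, not DARCI's API.

from typing import Callable, List

Filter = Callable[[object], object]   # one parameterised image filter

def render(source, genotype: List[Filter]):
    """Apply the genotype's filters in order: longer genotypes drift
    towards abstraction, shorter ones stay close to the source image."""
    image = source
    for f in genotype:
        image = f(image)
    return image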
Conclusions
The motivation behind this work has been to improve DARCI's ability to independently curate its own artifacts. All of the artifacts displayed in this paper were fully curated by DARCI under various selection criteria, with only a few indicated exceptions for space. We show that DARCI is autonomously able to consistently create and select images that reflect the requested adjective for four out of five adjectives. This demonstrates the quality of the new adjective matching metric. We also demonstrate that the similarity metric functions as intended.

We explored a variety of fitness functions combining two metrics, with varying degrees of success. Each method of combining the metrics had its own biases but, from our analysis, the minimum fitness function performed the best. Over half of the artifacts selected with this fitness function satisfied the goals of the image rendering subsystem, arguably a significant step in decreasing the latent heat effect in DARCI. We attribute the success of minimum to the fact that it allows the genetic algorithm to naturally shift evolutionary focus to the metric that is suffering the most.

Figure 13: Artifacts selected for the indicated source images and adjectives for the alternate-adjective fitness function.

We are confident that the improvements made to the image rendering subsystem in this paper will significantly decrease the latent heat effect in DARCI. We intend to test this theory in the future by conducting a thorough online survey comparing this improved version of DARCI to other versions, and perhaps even to humans. To further improve the image rendering subsystem described in this paper, we also intend to pursue more adaptable variations of the metrics outlined here: metrics that adapt their output in response to the behaviour of the other metrics.

2014_40 !2014 The FloWr Framework: Automated Flowchart Construction, Optimisation and Alteration for Creative Systems John Charnley, Simon Colton and Maria Teresa Llano Computational Creativity Group, Department of Computing, Goldsmiths, University of London, UK ccg.doc.gold.ac.uk

Abstract
We describe the FloWr framework for implementing creative systems as scripts over processes, which can be manipulated visually as flowcharts. FloWr has been specifically developed to be able to automatically optimise, alter and ultimately generate novel flowcharts, thus innovating at process level. We describe the fundamental architecture of the framework and provide examples of creative systems which have been implemented in FloWr. Via some preliminary experimentation, we demonstrate how FloWr can optimise a given system for efficiency and yield, alter input parameters to increase unexpectedness, and build novel generative systems automatically.

Introduction
One of the main reasons people give for why software should not be considered creative is that it follows explicit instructions supplied by a programmer. One way to reduce such criticisms is to get software to write software, because if a program writes its own instructions, or the code of another program, some level of creative responsibility has clearly been handed over. Automated programming techniques such as genetic programming have been used in creativity projects, such as evolutionary art (Romero and Machado 2007), and software innovating at process (algorithmic) level has been studied in this context. Moreover, machine learning approaches such as inductive logic programming (Muggleton 1991) clearly perform automated programming. In both these cases, programs are generated for specific purposes. In contrast, we are interested here in how software can innovate at process level for exploratory purposes, i.e., where the aim is to invent a new process for a new purpose, rather than for a given task. Getting software to write code directly is a long-term goal, and we have performed some early work towards this with the invention of game mechanics at code level (Cook et al. 2013).
Such code generation will likely be organised at module level, so it seems sensible to study how programs can be constructed in formalisms such as flowcharts over given code modules, in order to study creative process generation. Flowcharts are used extensively for visualising algorithms, e.g., UML is a standard for representing code at class level (Rumbaugh, Jacobson, and Booch 2004). There are also a handful of systems which allow flowcharts to be developed and automatically converted into code. These include the MSDN VPL (msdn.microsoft.com/bb483088.aspx), the RAPTOR system (Carlisle et al. 2004), and IBM's WebSphere, which allows programmers to visualise the interaction between nodes and produce fully-functional systems on a variety of platforms (ibm.com/software/uk/websphere). Also, Visual Programming systems such as Blockly (code.google.com/p/blockly), AppInventor (appinventor.mit.edu) and Scratch (scratch.mit.edu) allow the structure of a program to be described by using different types of blocks.

We could certainly have investigated process-level innovation by implementing software to automatically control the flowcharting systems mentioned above. However, these systems have been developed to support human-centric program design, and we have had many difficult experiences in the past where we have wrestled unsuccessfully with programmatic interfaces to such frameworks. In addition, in line with usual software engineering paradigms, there is an emphasis on being able to explicitly specify what programs do, and an expectation of perfect reliability in the execution of those programs. We are more interested in a flowcharting system able to be given vague instructions (or indeed, none at all) and, with some level of automation, produce valuable, efficient flowcharts for generative purposes. For these reasons, we decided to build the FloWr (Flo)wchart (Wr)iter system from scratch, with a clear emphasis on automated optimisation, alteration and construction of systems. This paper describes the first release of this framework.

In the next section, we describe the fundamentals of the framework: how programs are represented as scripts which can be created and manipulated visually as flowcharts, and how developers can follow an interface to introduce new code modules to the system. Following this, we detail a FloWr flowchart for poetry generation which uses Twitter, and we use this in an investigation of flowchart robustness. We then present some preliminary experiments to test the viability of FloWr automating various aspects of flowchart design. In particular, we investigate ways in which it can alter and optimise given flowcharts, and we describe an experiment where FloWr invented novel flowcharts from scratch. Notwithstanding a truly huge search space, we show there is much promise for process-level innovation with this approach, and we conclude with a discussion of future research and implementation work.

text.retrievers.ConceptNet.ConceptNet_0
dataFile:simple_concept_net_1p0_sorted.csv
relation:IsA
rhsQuery:animal
minScore:0
#wordsOfType = answers[*]

...WordListCategoriser.WordListCategoriser_0
wordList:child;human;apple;
stringsToCategorise:#wordsOfType
#filteredFacts = textsWithoutWord[*]

text.retrievers.ConceptNet.ConceptNet_1
dataFile:simple_concept_net_1p0_sorted.csv
lhsQueries:#filteredFacts
relation:CapableOf
minScore:0
#propertyFacts = facts[*]

...TemplateCombiner.TemplateCombiner_0
templateText: What if there was a little c1Texts[*][0] who couldn't c1Texts[*][2]?
numRequired:1000
c1Texts:#propertyFacts
#whatifs1 = instantiatedTemplates[r5]

utility.saving.TextSaver.TextSaver_0
dir:/Output/Flow/whatifs
textsToSave:#whatifs1

Figure 1: Ideation script and corresponding flowchart.

The FloWr Framework
We aim to use the FloWr framework to investigate automatic process generation via the combination of code modules. As discussed in the subsections below, our approach has been to implement a number of such code modules, which we call ProcessNodes, engineer an environment where scripts direct the flow of data from module to module, and develop a graphical user interface (GUI) to enable the visual combination of ProcessNodes into scripts, using a flowcharting metaphor.

Individual ProcessNodes
Focusing on generative language systems, we have implemented a repository of 39 ProcessNodes for a variety of tasks, from the generation of new material, to text retrieval, to analytical and administrative tasks. For instance, in the repository there is a ProcessNode for downloading tweets from Twitter, one for performing sentiment analysis, and one for simply outputting text to a file. A new node must extend the Java ProcessNode base class by implementing its abstract process method, which will be called whenever the module is executed. The developer can write whatever software they see fit in the node, and this may call external code in any language. The developer can specify certain input parameters for the process, as public fields of the class, along with an optional list of allowed or default values for each parameter. As mentioned below, the scripting mechanism enables variables to be specified which hold output from processes and can be substituted in as the input parameters of other nodes. This facilitates the flow of data. At runtime, using Java's reflection mechanism, FloWr will set each ProcessNode's parameters according to the current state of processing, i.e., explicit assignments of the current value of variables to input parameters, prior to calling the process method for the node. The ProcessNode superclass provides a number of utility methods that a node developer can use during processing, such as determining the local location of the data directory which holds non-code resources. There are also methods for reporting processing errors during runtime, which developers can use to neatly handle exceptions and other failures. The process method of each ProcessNode returns an object of type ProcessOutput which holds all the output from the node; hence developers create a Java class that extends the ProcessOutput base class. This facilitates internal FloWr functionality for determining the types of output variables and checking whether a script specifies passing objects of the right type from one node to another (again using Java reflection). Developers can use bespoke or existing classes as fields within output classes, so they can create more complex data structures for node output. Developers should be aware, however, that most nodes take as input primitive types such as integers and strings, and collections of these, so if they want the output from their nodes to be used by others, their ProcessOutput classes will probably need to have fields at some level in a standard format.

A Scripting Mechanism
A FloWr system is a collection of task-specific ProcessNodes, with a description of how data from each node is selected as input to others, expressed using a script syntax. An example script, which has been edited a little to improve clarity, is given in figure 1.
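To make the node contract concrete before walking through that script, here is a rough Python mirror of the Java interface just described. FloWr itself is implemented in Java, and the WordListCategoriser logic shown is our guess at its behaviour based on the script in figure 1; all of this is illustrative, not FloWr code.

class ProcessOutput:
    """Base class for node output; FloWr inspects its fields via reflection."""

class ProcessNode:
    """Input parameters are public fields, set by the framework at runtime."""
    def process(self) -> ProcessOutput:
        raise NotImplementedError

class WordListOutput(ProcessOutput):
    def __init__(self, texts_without_word):
        self.textsWithoutWord = texts_without_word  # field read by the script

class WordListCategoriser(ProcessNode):
    wordList = ""             # e.g. "child;human;apple;"
    stringsToCategorise = []  # filled from a script variable such as #wordsOfType

    def process(self) -> WordListOutput:
        banned = {w for w in self.wordList.split(";") if w}
        kept = [s for s in self.stringsToCategorise
                if not any(b in s for b in banned)]
        return WordListOutput(kept)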
The functionality of this script is described in the subsection on automated optimisation. Each paragraph of the script describes a ProcessNode by specifying its type, configuration and output. The first line is the type of node, which refers to the Java class called when that node is run in a system. In figure 1, the first process uses the ConceptNet class, from the text.retrievers package. Suffixes are used to differentiate between multiple instances of the same node type in a script. When a script is parsed, each type must be an instantiable compiled subclass of ProcessNode in the stated package, which must also contain a valid ProcessOutput subclass. The next lines in the paragraph specify how the input parameters should be initialised at runtime, as name:value pairs. The name indicates the parameter to be initialised, and the value can be either a simple assignment or a variable representing some output from another node. Script parsing checks that each name refers to a publicly accessible field of the ProcessNode class, which can be validly assigned the specified value or the value of the variable indicated. Default parameter assignments are used where a parameter value is blank. Node developers can define any parameters, so they could develop a single node that operates with various input types, to build more robust systems.

Variable definitions consist of a #-prefixed alphanumeric label and an output specifier for a particular part of the output from a process. As mentioned above, each ProcessNode class must have an associated ProcessOutput class. The output specifier refers to the fields defined within this output class, which will be populated by the node at runtime. In its simplest form, the specifier indicates a particular field to assign to the variable. Alternatively, specifiers can be separated by dots, where each segment is a field relative to the specifier to its left. Where the indicated field is a list, square brackets are used to indicate a selection specifier, which identifies a subset of elements to be assigned to the variable. The acceptable selection specifiers are: *: all elements; fn: the first n elements; ln: the last n elements; mn: the middle n elements; and rn: n randomly chosen elements.

When a script is run, all processes are checked for syntax errors and data-type inconsistencies. FloWr determines the process run order by inspecting dependencies between output variables and input parameters, and errors are raised whenever there are problematic loops in a script. FloWr then steps through each node in the run order by instantiating an appropriate ProcessNode object, assigning its parameters according to the script, calling its process method to execute the node, and storing the output. In the example script of figure 1, we see that the ConceptNet_0 ProcessNode has output with an answers field, which is a list. The whole list (indicated by answers[*]) is assigned to the variable #wordsOfType, which is passed into the WordListCategoriser_0 ProcessNode as the input parameter stringsToCategorise. In this simple script, each node except the last one assigns a single aspect of its output to a variable, which is passed onto the next node.
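The selection specifiers have a direct implementation. The sketch below assumes that "middle n" means n elements centred in the list, which the text does not spell out; the function name is our own.

import random

def select(elements, spec):
    """Apply a selection specifier from the scripting syntax:
    '*' all, 'fN' first N, 'lN' last N, 'mN' middle N, 'rN' N random."""
    if spec == "*":
        return list(elements)
    kind, n = spec[0], int(spec[1:])
    if kind == "f":
        return elements[:n]
    if kind == "l":
        return elements[-n:]
    if kind == "m":
        start = max(0, (len(elements) - n) // 2)
        return elements[start:start + n]
    if kind == "r":
        return random.sample(elements, min(n, len(elements)))
    raise ValueError(f"unknown selection specifier: {spec}")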
A Flowcharting Interface
The FloWr GUI shown in figure 2 is the primary system development tool, where flowcharts are used to visually represent the interaction between ProcessNodes. The interface has several components. Firstly, the central panel displays the flowchart currently being worked upon, with individual ProcessNodes shown as boxes and the arrows between them indicating the transfer of data. The flowchart in figure 2, the functionality of which is described in the next section, has 16 nodes of 13 different types, with colour coding indicating nodes of the same type or which perform similar tasks. For instance, blue boxes in figure 2 represent ProcessNodes which categorise texts (using word sense, sentiment, regular expressions and a user-supplied word list). To add a new node to a flowchart, the user right-clicks the main panel and chooses from a series of popup menus. As might be expected, flowchart boxes can be dragged, resized, deleted, copied and renamed, and multiple boxes in sub-charts can be selected, moved, resized and deleted simultaneously. When a box is clicked, it gains a thick grey border, and the arrows going into/out of it gain circles which, when clicked, populate the mappings (upper) internal frame in the GUI with the output variables and input parameters of the two ProcessNodes joined by the arrow. The mappings between nodes can be edited by hand, and arrows are automatically generated whenever an output variable is used as an input parameter for another node. Clicking on a box populates the mappings frame with the input variables and output parameters for that ProcessNode, and populates the output (lower) internal frame with the values of the output variables, if they have previously been calculated via a run of the system. In figure 2, we see that the user previously selected the output for the SentimentSorter node (which is a poem about being abusive) in the output frame. They then selected the circle on the arrow between two nodes, and the output variables and input parameters for LineSplitter and SentimentSorter were displayed accordingly in the mappings frame.

A small black panel containing a play and a stop button, for executing and halting the script, is shown at the top right of the flowchart panel. When the user has chosen to execute the flowchart multiple times from a menu, a number indicating which run is executing is shown in this panel (the number 17 in figure 2). The user can double-click a node, and FloWr will run all the processes leading into that node, including it, but not nodes which occur later in the script run order. When the flowchart is running, the node which is actually executing is given a red border: in figure 2 the TextRankKeyphraseExtractor node is running. Nodes can take some time to finish executing, and it is often useful for their output, and the output for all the nodes earlier in the flowchart, to be frozen, i.e., calculated once and stored rather than regenerated when that process is run again. This can be done using the interface and is indicated with a pushpin in the flowchart box: in figure 2, the Twitter node has been frozen. The pushpins in the mappings frame and the output frame can be used to stop their context from changing when boxes on the flowchart are clicked.

An Example FloWr System
We have used FloWr to hand-craft a number of systems, including flowcharts for poetry generation in a manner similar to that of (Colton, Goodwin, and Veale 2012), where newspaper articles were manipulated to produce poems. We have also used FloWr to perform automated theory formation using the same production rule-based method employed by the HR system (Colton 2002), and we have re-implemented aspects of The Painting Fool software (Colton 2012).
Finally, as discussed in the next section, we have used FloWr scripts to produce fictional ideas, with experiments using this given in (Llano et al. 2014a) and (Llano et al. 2014b). In each of these instances, we have developed a fully-operational system, and the FloWr GUI has enabled a clear visualisation of the overall system, enabling us to design, edit and tweak each implementation. The ProcessNodes required and the flowcharts implementing these systems are available in the FloWr distribution.

The flowchart in figure 2 produces poems as a collection of related tweets from Twitter in a relatively sophisticated way. Execution begins with a Dictionary ProcessNode which selects all the 5,722 words from a standard dictionary with a frequency of between 90% and 95%, with word frequency determined using the Kilgarriff database (Kilgarriff 1997), which was mined from the British National Corpus. Such words are relatively common, but not too common or too uncommon in the language. Next in the flowchart, a WordSenseCategoriser selects the 772 words that are adjectives (in terms of their main sense) as per the British National Corpus tagset (Leech, Garside, and Bryant 1994). A SentimentCategoriser node then splits the adjectives into categories based upon how positive or negative a word is, using the AFINN sentiment dictionary (fnielsen.posterous.com/tag/afinn) expanded by adding synonyms from WordNet. From the list of 211 negative words, i.e., those scoring -1 or less for valency, a single word is randomly chosen as the poem theme, using the variable selection syntax [r1] in the underlying script, as described above.

Figure 2: The FloWr flowcharting graphical user interface.

The Twitter ProcessNode accesses the Twitter web service through the Twitter4J library (twitter4j.org), and retrieves a maximum of 1,000 tweets containing the theme word; there may be fewer if the word is not mentioned in many recent tweets. Tweets are cached to make retrieval quicker later on. Also, as part of the retrieval process, the tweets are filtered to remove copies and tweets containing a word which cannot be pronounced, as per the CMU pronunciation dictionary (CPD, at www.speech.cs.cmu.edu/cgi-bin/cmudict), or which cannot be parsed using the Twokenize tokenizer (bitbucket.org/jasonbaldridge/twokenize). We have found that the 90-95% word frequency previously mentioned ensures that there are usually sufficient tweets (counted in the hundreds) after the filtering process, but that the tweets tend to be less banal than usual, as the usage of a somewhat uncommon word requires some thought.

The retrieved tweets are used in two ways. Firstly, a TextRankKeyphraseExtractor node extracts keyphrases using an implementation (lit.csci.unt.edu) of the TextRank algorithm (Mihalcea and Tarau 2004) over the entirety of the tweets, collated as a paragraph of text. As an example, the poem theme in the run presented in figure 2 was abusive, and the keyphrases of abusive husband, abusive father and abusive boyfriend were extracted. Secondly, the tweets are passed through a triplet of WordListCategoriser nodes which are used to exclude tweets containing undesired words. The first filter removes tweets containing any of a pre-defined list of first names, discarding many tweets about particular people, which are too specialised for our purposes. The second removes tweets containing Twitter-related words such as retweet, and the third removes tweets containing certain profanities.
The RegexCategoriser ProcessNode then splits the tweets into two sets, based upon whether or not they contain personal pronouns (I, we, they, him, her, etc.). Only tweets containing personal pronouns are kept, which helps remove commercial service announcements, which are dull. In the abusive example, from the 1,000 tweets retrieved, 110 were removed as duplicates or for being unpronounceable/non-tokenisable. A further 80 were removed for including first names, 33 for including Twitter terms, 22 for including profanities, and 262 were removed because they included no personal pronouns, leaving 493 tweets for the construction of stanzas in the poem.

The remaining tweets are processed by a RhymeMatcher node which finds all pairs of tweets with the same two phonemes at the end, when parsed by the CPD. The number of matching phonemes can be changed to increase or decrease the amount of rhyming. From these, 250 pairs are randomly chosen (or the entire set, if there are fewer than 250). The tweets are likewise processed by the FootprintMatcher node, which counts the number of syllables, again using the CPD, and finds all pairs of tweets with the same footprint. As before, 250 pairs of tweets are chosen randomly.

On Being Eerie

Eerie me. Eerie feeling. Bit eerie.
I hate the basement level of buildings. You always lose reception and it's always quiet and eerie.
This doesn't quite capture the eerie pink glow of this morning.
Is pop culture satanic? In a spiritual (not religious) sense? I don't really know. But man, there are some eerie parallels. It's concerning.
I find it very eerie when someone is tinkering with your teeth and telling jokes. Or is that just me?
I hate winter and the cold, but I love how silent the night is during cold winter weather. It's eerie, but peaceful.
Old school! It was always eerie. No one around, and completely quiet. It's like being on the wrong end of the apocalypse.
Experiencing the eerie light of total eclipse. I'm going through it today.
The fact that I'm talking about my grandma in a past tense is eerie and weird to me.
I saw weird stuff in that place last night. Weird, strange, sick, twisted, eerie, godless, evil stuff. And I want in.
Yes - that is quite an eerie sound! It's so eerie listening to the crying in the background.
I can understand that. It just feels eerie to have it haunt you (word-for-word) by different users.
I hope the cloud stayed away for you. Wow, how was the eerie darkness? I thought I told you. Oops.
Weird, eerie, strange portraits and locations.
An antique metal ship and a candle make for eerie (and awesome) decorations.
I mean the art direction is eerie. I'm pretty sure it's hogwash.
Bit eerie. Eerie feeling. Eerie me.

Figure 3: Example Twitter poem: On Being Eerie.

Next, LineCollator constructs sets of 16 different tweets in quadruples of the form ABBA, where the As are a pair with equal footprints and the Bs are a pair which rhyme. An example quadruple is as follows (note that the two central lines rhyme, and the outer lines both have 17 syllables):

I hope the cloud stayed away for you. Wow, how was the eerie darkness? I thought I told you. Oops.
Weird, eerie, strange portraits and locations.
An antique metal ship and a candle make for eerie (and awesome) decorations.
I mean the art direction is eerie. I'm pretty sure it's hogwash.
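The ABBA construction can be sketched as follows, assuming the 250 footprint pairs and 250 rhyme pairs are lists of (tweet, tweet) tuples. This is our own illustration of the described behaviour; a real implementation would also ensure that all sixteen tweets in a poem are distinct.

import random

def make_stanza(footprint_pairs, rhyme_pairs):
    # one ABBA stanza: outer lines share a syllable footprint, inner lines rhyme
    a1, a2 = random.choice(footprint_pairs)
    b1, b2 = random.choice(rhyme_pairs)
    return [a1, b1, b2, a2]

def make_poem_body(footprint_pairs, rhyme_pairs, stanzas=4):
    # four stanzas of four tweets each give the 16 tweets per poem
    return [make_stanza(footprint_pairs, rhyme_pairs) for _ in range(stanzas)]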
The TemplateCombiner node brings all the processed information together into a poem, based upon a specified poem template. The inputs to this process are the theme word, which becomes part of the poem title; the keyphrases, which provide a context at the top and (reversed) at the bottom of the poem; and the quadruples from LineCollator, which each form a stanza of the poem. TemplateCombiner is told to produce 20 poems by choosing 20 sets of quadruples from LineCollator randomly. The LineSplitter ProcessNode takes each poem and splits any line where there is a period (tweets often contain two or more sentences), which tends to make the poems more poem-shaped. Finally, the SentimentSorter node selects the poem with the most negative affect, which is saved to a file by the TextSaver ProcessNode. This node is given the theme word as an input, and the file is named after it.

In general, we have found that these Twitter poems are surprising and interesting. In particular, the slight rhyming in the centre of the poems is noticeable, and the multiple voices expressed through 16 different tweets, coupled with the often rushed nature of the tweets, can give the poems a very dynamic feel. Another example poem is given in figure 3, where the theme was eerie. This poem was recited as part of a poetry evening during a festival of Computational Creativity in Paris in July 2013 (Colton and Ventura 2014).

The nature of the flowchart, including the ProcessNodes, the I/O connections and the parameterisation of the processes, was carefully specified and tweaked by hand over many hours, so as to produce a poem most of the time, for different adjectives. One of the benefits of the flowcharting approach is that variations can be easily tried out, but it would be frustrating if the yield of poems wasn't consistent. To investigate the robustness of the flowchart, we varied the word frequency parameters in the Dictionary node to test the retrieval of tweets containing less common words. We also made the poem construction more difficult. Firstly, we introduced all-rhyming stanzas (R1R2R2R1) rather than the footprint-rhyming structure (FRRF). Secondly, we introduced an additional SentimentCategoriser node to ensure that only tweets with an average (Neg)ative valency were used. Finally, we increased the number of stanzas from 4 to 6. For each of the 7 setups given in table 1, we provide the yield produced from 50 runs of the flowchart.

Freq(%)  Structure  Neg.   Stanzas  Yield(%)
85-90    FRRF       false  4        94
90-95    FRRF       true   4        94
80-85    FRRF       false  4        90
90-95    R1R2R2R1   false  4        80
90-95    R1R2R2R1   true   4        74
90-95    R1R2R2R1   false  6        46
90-95    R1R2R2R1   true   6        12

Table 1: Yield results for Twitter poetry flowchart.

We note that the flowchart is fairly robust to lowering the theme word frequencies, but the volume of tweets didn't support well the construction of more complex poems. In fact, only 12% of runs resulted in a poem when six R1R2R2R1 stanzas with only negative tweets were sought. This indicates that there is a limit to how far a successful flowchart can be tweaked before it loses its utility.

Automation Experiments
We present here some preliminary experiments to automatically alter, optimise and generate flowcharts. As mentioned previously, a driving force for the project is to study the potential of automated process generation. FloWr simplifies the process of constructing a system but, as highlighted in the previous section, fine-tuning a chart can be a laborious process.
For example, the flowchart/script in figure 1 was developed by hand for a project where the ConceptNet database of internet-mined facts (Liu and Singh 2004) was used for fictional ideation in the context of Disney cartoon characters, as described in (Llano et al. 2014a). Given a theme word like animal, the flowchart uses the first ConceptNet node to find all Xs for which there is a fact [X,IsA,animal], removes spurious results, such as [my husband,IsA,animal], with the WordListCategoriser, and then, for a given relation R, finds all the facts of the form [X,R,Y] using the second ConceptNet node. To produce the fictional idea, it inverts the reality of each fact using the TemplateCombiner node to produce an evocative textual rendering. For instance, the fact that [cat,Desires,milk] becomes "What if there was a little cat who was afraid of milk?". In further testing, we substituted animal for other theme words such as machine, and produced ideas such as: "What if there was a little toaster who couldn't find the kitchen?" (by inverting the LocatedNear relation in this case). There are 49 ConceptNet relations and a large number of couplings of these with theme words, many of which yielded no results. For instance, we found no facts about types of machines and the Desires relation, presumably as machines don't tend to desire things. Focusing on animals, it took around 2 hours to produce the first working flowchart which produced a non-zero yield of facts that could be usefully inverted for the invention of Disney characters. One of the benefits of automation we foresee would be a substantial reduction in this type of manual fine-tuning.

Flowcharts can be constructed and altered in several ways. ProcessNodes can be added, removed or replaced with alternatives. Parameterisations of nodes and the links between them can be amended by modifying, creating or deleting variables and changing input settings. The space of all possible constructions and alterations is vast and, at this early stage, we have restricted ourselves to a subset. Specifically, we have considered changes to the parameterisations of existing flowcharts, and we describe some experiments in the following section, followed by how these can be guided to achieve particular optimisation objectives. After this, we consider constructing flowcharts from scratch by sequentially adding ProcessNodes. In all cases, FloWr has generated flowcharts representing novel and interesting creative tasks whilst avoiding an element of manual construction effort.

Figure 4: Flowchart for automated regex generation.

S  NW   FWLen  WLCh   FLCh   LLCh  Yield(%)  Av.
1  3    3-6    equal  equal  none  55        48.6
2  3    3-6    equal  any    none  42        12.1
3  3-5  3-6    any    any    none  24        9.24
4  3    3-6    incr.  incr.  none  38        5.1
5  3-5  3-6    any    any    any   0         0

Table 2: Regex generation test yields (tongue twister texts).

Flowchart Alteration
When motivating the building of the FloWr framework in the introduction, we noted that we want the approach to produce unexpected results, with FloWr scripts being somewhat unpredictable. One way to increase unexpectedness is to randomly alter the input parameters to ProcessNodes at run-time. We investigated this via the generation of simple tongue-twister texts, by extracting word sequences using regular expressions. We implemented a RegexGenerator ProcessNode which produces regular expressions (regexes) such as:

\bs[a-zA-Z]{4}\b\s{1}\bs[a-zA-Z]{5}\b\s{1}\bs[a-zA-Z]{6}\b

When applied to a corpus of text, this regex extracts all triples of words of length 5, 6 and 7 which begin with the letter s.
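Regexes of this shape are easy to generate programmatically. The following sketch, with names of our own choosing, reproduces the example above, where each word starts with a fixed letter and the bracketed counts give the remaining word lengths:

import re

def same_letter_triple(first_letter="s", tail_lengths=(4, 5, 6)):
    # builds \bs[a-zA-Z]{4}\b\s{1}\bs[a-zA-Z]{5}\b\s{1}\bs[a-zA-Z]{6}\b
    parts = [rf"\b{first_letter}[a-zA-Z]{{{n}}}\b" for n in tail_lengths]
    return re.compile(r"\s{1}".join(parts))

# e.g. same_letter_triple().findall("... small screen success ...")
# returns ['small screen success']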
We applied this to a corpus of 100,000 Guardian newspaper articles, and it returned 21 triples, such as small screen success and short skirts showing. The input parameters to RegexGenerator specify the number of words (NW) in the phrases sought, what the first word's length (FWLen) should be, and how the word lengths should change (WLCh): increasing, decreasing, staying the same, or no(ne) change. The parameters also enable the specification of the first letter of the first word, and how subsequent first letters should change lexicographically (FLCh): increasing, decreasing, staying the same or no(ne) change. The last letter changes (LLCh) can similarly be specified. Importantly, FloWr can be instructed to choose each parameter randomly from a given range. For start and end letters, this range is a-z; for word length and letter changes it is {increase, decrease, equal, none}; and the integer values for NW and FWLen can be specified to be within a user-given range.

We implemented the flowchart in figure 4 to input the whole Guardian corpus and a generated regular expression into the RegexPhraseExtractor node, and output the resulting text (if any) to a file. We ran five sessions with different input parameter ranges for the RegexGenerator node. For each session, we specified that the first letter of the first word should be chosen randomly. In each session, we ran the flowchart 100 times and recorded the yield as the percentage of times when text was actually produced. We also recorded the average number of lines of text produced (i.e., the average number of hits for the regular expression in the corpus). The results are given in table 2. We see that (S)etup 5 is completely unconstrained, and the space which is randomly sampled from is dense in poor regexes which have no hits in the corpus: the yield is zero. However, with some constraining of the regex ranges allowed, the yield increases to almost 50%. Also, as expected, the average number of hits increased in line with the yield. The following are two tongue twisters found in the results, for setups 1 and 4 respectively:

posted pretax profit / please please please / petrol prices played / profit public policy / poorer people pushed

cancer despite everyone / classy devices emerging / carbon dioxide expelled / carbon dioxide emission / choice defense everyone

In other experiments, with the ideation flowchart of figure 1, we looked at automatically changing the theme word. To do this, a WordNet ProcessNode was used to find hypernyms of animal, which returned the words organism and being. We then requested the hyponyms of each of these, which generated 87 alternative themes, which were substituted for the theme in the flowchart. Several of the themes produced a high yield of invertible facts, with 13 theme/relation combinations generating more facts than the highest found by hand. Three of these used the theme word person, e.g., with the CapableOf relation, which generated 2,154 ideas, such as the concept of actors being able to face an audience. Similarly, the theme words individual and plant had high yields. However, one word that was identified automatically using this method was flora, which gave interesting invertible facts about trees, such as being homes for nesting birds and squirrels. These were not considered in our manual experiments using the plant theme. In a similar way, we used ConceptNet to find theme words by inspecting all the IsA relations in its database, from which it identified 11,000 themes.
Using these, we found the highest yield with the theme mammal and the relation NotDesires, which we hadn't found manually. This generated 568 facts, mainly about people, e.g., the ideas that people don't want to be eaten or bankrupt, both of which led to interesting fictional inversions.

Optimising Flowcharts
We performed some experiments in automating the task of finding high-yield configurations for the ideation flowchart of figure 1. To do this, we provided a list of themes and asked FloWr to consider all possible pairings of theme and ConceptNet relation. To assess the yield of a ProcessNode, FloWr uses Java reflection to traverse the structure of its output object and count the objects and sub-objects in individual fields or in lists. We have found this to be a reliable measure of output quantity, particularly when assessing relative sizes. It is also general, and will produce a useful yield measure irrespective of the nature of the node and its output. The manual process identified the theme word animal and the relationship CapableOf as producing the highest yield of 530 usable facts. The automated approach also identified this combination, but it highlighted a more productive relationship for animal, namely LocatedAt, which provides 1,010 facts. This combination had been overlooked during the manual process, in favour of using the LocatedNear relationship, which produced only 39 facts.

We also investigated optimising flowcharts for efficiency. Given a target time reduction and a minimum output level for the ProcessNodes in a given flowchart, we investigated an approach which identifies small local changes to input parameters that have the most global impact on the system. Firstly, the nodes are ordered according to their increasing contribution to the overall execution time. Considering the slowest ProcessNode, P, first, an attempt is made to establish whether the time taken is a consequence of the amount of data it receives, by halving the data given and comparing execution times. If input data is causing P's slow speed, the ProcessNode(s) which produced that input to P are re-prioritised higher than P in the ordering. Moreover, a local goal for each ProcessNode is assigned, which is either to reduce its execution time or the size of its output. Then, local reconfigurations consider incremental changes to numeric and optional parameters until the local goals, or failure, have been met. Any successful local reconfigurations are then applied to the global system and reported to the user if they achieve the overall goals. Multiple tests are used at each stage to confirm that the reported results are consistent.

We successfully applied this approach to the Twitter poetry generator, where it identified that the high average base execution time of 10 seconds was caused by the WordListCategoriser nodes processing a high number of tweets from the Twitter node. It applied an iterative process, which reduced the numRequired parameter by a given percentage for a pre-defined number of steps, noting each time that the node output yield was reduced, eventually settling on a numRequired setting of 63. It then tested this on the global system and found that this reduced the overall runtime to 630 milliseconds, whilst still successfully generating poems.
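For reference, the reflection-based yield measure used throughout this section can be sketched in a few lines. FloWr does this over Java fields; this Python analogue, which walks attributes and collections instead, is only illustrative, and the exact counting semantics are our assumption.

def yield_of(output):
    """Count the objects and sub-objects in a node's output, analogous
    to the reflection-based yield measure described above."""
    if isinstance(output, (list, tuple, set)):
        return sum(yield_of(item) for item in output)
    if hasattr(output, "__dict__") and vars(output):
        return sum(yield_of(value) for value in vars(output).values())
    return 1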
In a similar experiment, we optimised another poetry system which used Guardian newspaper articles as source material, as in (Colton, Goodwin, and Veale 2012). The optimisation method found that one node could be optimised by reducing its input size, which led to the altering of another node's input parameters, and a 40% reduction in overall execution time, while the flowchart still produced poems.

Flowchart Construction
We have investigated how to construct FloWr systems from scratch. Working in the context of producing poetic couplets, we tested a method which could generate a system with three to five nodes taken from these sets respectively: {Twitter, Guardian, TextReader}, {WordSenseCategoriser, SentimentCategoriser}, {TextRankKeyphraseExtractor, RegexPhraseExtractor}, {WordSenseCategoriser, SentimentCategoriser}, {FootprintMatcher, RhymeMatcher}. We used our experience of which nodes work well together to create this structure, and to specify a number of possible options for the input parameters. For some nodes, we were restrictive, e.g., we specified that the Guardian node should use a specific date range for selecting articles and always return the same number. For other nodes, we allowed FloWr to use any of the parameter values from the optional lists provided by the node developer. For the Twitter node, we chose five dictionary words randomly for queries, and TextReader was directed to use a set of Winston Churchill speech texts. Despite these limitations, there are still a huge number of possible combinations to explore. For example, there are 108 possible node combinations, 27,000 parameter combinations and over 261 million variable definition combinations. The size of this restricted subset makes a brute-force approach intractable, given that many nodes have execution times of over a second. Hence, we tried a depth-first search of all possible systems, choosing node combinations randomly and configuring each node with input parameters chosen at random from those allowed. Next, the method considers the possible data links between nodes by considering each pair in turn. The set of variables that could be defined in the scripting syntax for the earlier node in the system is compared with the input parameters for the following node. Only those links where the output variable type and the input parameter type match will be syntactically valid, and these are chosen from randomly and applied to the script.

Figure 5: An automatically generated rhyming couplet system.
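The type-matching step at the heart of this construction method can be sketched as follows. The node representation (dicts of named, typed outputs and inputs) is our own, purely illustrative device, not FloWr's internal structure.

import random

def wire(nodes):
    """Randomly link each earlier node's outputs to type-compatible
    input parameters of later nodes, as in the search described above."""
    links = []
    for i, src in enumerate(nodes):
        for dst in nodes[i + 1:]:
            options = [(out_var, in_param)
                       for out_var, out_type in src["outputs"].items()
                       for in_param, in_type in dst["inputs"].items()
                       if out_type == in_type]   # only type matches are valid
            if options:
                out_var, in_param = random.choice(options)
                links.append((src["name"], out_var, dst["name"], in_param))
    return links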
The RegexPhraseExtractor fails to appear due to limited input data: with the small amount of text available, there were no strings satisfying the regular expressions sought. We experimented with further restricting the types of nodes that could be selected. In particular, using information about the frequency of nodes in successful scripts from the first experiment, we managed to improve the yield of working scripts to 18.5% by allowing only WordSenseCategoriser nodes to be used for categorisation.

One particular (four-node) script caught our eye. It takes Churchill texts, extracts keyphrases, keeps only those where the first word has extreme sentiment, i.e., a sentiment score ≥ 2 or ≤ −2, then outputs pairs with the same footprint, such as: [great air battle:despairing men] and [greater efforts:greater ordeals]. The 52 poetic couplets that this script generated provided the starting point for a poem written by a collaborator: Russell Clark selected a subset of these pairs, then combined and ordered them into a piece entitled Churchill's War, which is shown in figure 6. The poem was one of four submitted for analysis by poetry experts as part of a BBC Radio 4 piece on Computational Creativity (Cox 2014), although a different poem was ultimately read out and analysed.

Churchill's War

Good many people, great differences
good many people: outstanding increase.
Great organisations, greater security
greater security: terrible position
Great combatants, brilliant actions
Great preponderance, greater efforts
Great air battle, despairing men
Great air battle, brilliant actions
Great Britain, good account
Great Britain, good reason
Great flow: Great war
Great flow: Good men
Chess proceeds, good reason
Chess proceeds: victory

Figure 6: A poem based upon the output from an automatically generated process for poetic couplet generation.

Conclusions and Future Work

The FloWr framework enables fairly rapid prototyping of flowcharts for creative systems. We presented here fundamental details of how code modules can be implemented and combined via scripts using a flowcharting front end. We presented flowcharts for producing poems, fictional ideas, tongue twisters and poetic couplets, which re-use nodes for retrieving, categorising, sorting, combining and analysing text. We have performed some experimentation to assess the potential for automating aspects of flowchart design, both to help users construct, vary and optimise flowcharts, and to highlight the potential for FloWr to automatically construct novel processes. The ultimate aim of this project is to provide an environment which encourages third-party ProcessNode and flowchart developers to contribute material from which FloWr can learn good practice for innovating in automatic process design. We have already started implementing functionality which enables FloWr to learn flowchart configurations which are likely to produce results. This has aspects in common with other knowledge-based system design projects, such as Rebuilder (Gomes et al. 2005). Ultimately, FloWr will reside on a server, constantly generating, testing and running novel system configurations in reaction to people uploading new ProcessNodes and scripts. We intend to have a large number of nodes covering a variety of different individual tasks in many domains. For instance, we have a variety of NLP nodes, e.g., for Porter stemming (Porter 1980), and we will be extending this to cover nodes for other tasks, such as tagging and chunking.
The first release of the FloWr framework, along with dozens of ProcessNodes and numerous flowcharts, is available at ccg.doc.gold.ac.uk/research/flowr. In future releases, we plan a number of improvements to the underlying framework, including much more automation in the system, given the promise shown for this in the experiments described here. The systems that can be implemented currently are quite limited, and we plan to introduce additional programmatic constructs, such as framework-level control of looping and ProcessNode-level control of conditionals. We will also implement useful functions, such as FloWr running a sub-flowchart repeatedly until it produces a particular yield for the rest of the flowchart, and translating variables, e.g., from ArrayList to String[], to increase flexibility. We will test different search techniques to tame the vast space of flowchart configurations, so that FloWr can reliably generate interesting novel flowcharts, and we will implement the optimisation and alteration routines we have experimented with as default functionalities. We also plan to implement more entire systems in FloWr; in particular, we expect The Painting Fool art program (Colton 2012) to eventually exist as a series of flowcharts in FloWr. Also, we have started to port the HR3 automated theory formation system (Colton 2014) to FloWr. We have experimented with HR3 to add adaptability to the Twitter poetry generation flowchart: using concept formation over a given set of tweets, HR3 can successfully find a linguistic pattern which links subsets of tweets; these can be extracted and turned into poem stanzas.

The flowchart in figure 2 is a creation in its own right. To some extent, the value of such flowcharts exists over and above the quality of the output they produce. That is, the way in which the flowchart constructs artefacts is an interesting subject in its own right. For reasons of improving autonomy, intentionality and innovation in computational systems, we believe that software which writes software, whether at code level or via useful abstractions such as flowcharts, should be a major focus in Computational Creativity research. Automated programming has been adopted, albeit in restricted ways, in highly successful areas of AI such as machine learning, and we believe there will be major benefits for the building of creative systems through the modelling of how to write software creatively.

Acknowledgments

This work has been supported by EPSRC Grant EP/J004049/1 (Computational Creativity Theory) and EC FP7 Grant 611560 (WHIM). We would like to thank Russell Clark for his help with the poetry generation flowcharts and for curating their output. We would also like to thank the anonymous reviewers for their helpful comments.

2014_41 !2014

New Developments in Culinary Computational Creativity

Nan Shao, Pavankumar Murali, Anshul Sheopuri
IBM TJ Watson Research Center, Yorktown Heights, NY

Abstract. In this paper, we report developments in the evaluation and generation processes in culinary computational creativity. In particular, we explore the personalization aspect of the quality and novelty assessment of newly created recipes. In addition, we argue that evaluation should be a part of the generation process, and we propose an optimization-based approach to the recipe creation problem. The experimental results show a more than 41% lift in the objective evaluation metrics when compared to a sampling approach to recipe creation.

1 Introduction

"My children have a preference for meat.
How do I create a healthy dish that will be enjoyed by them?" Can a computer help parents with such questions? The culinary domain is a new area for computational creativity, although "made up a recipe" was listed as one of the 100 creative activities on the human creativity rating questionnaire developed by Torrance more than 50 years ago (Sawyer 2012). (Morris et al. 2012) discussed recipe creation restricted to soups, stews and chili. (Varshney et al. 2013) discussed evaluation (a work product assessor) motivated by neural, sensory and psychological aspects of human flavor perception, and proposed models for a culinary computational creativity system.

To answer questions like the one listed above, we consider two aspects of the problem: the personalization of dish evaluation, and the optimization of dish quality and novelty in a combinatorially complex creativity space. Our contributions to the culinary domain are as follows. First, creativity is only meaningful in the presence of a human audience or evaluator (Wiggins 2006), and humans are inherently different; therefore, we explore the personalization aspect of the evaluation metric for a creative artifact. In particular, we consider flavor preference and novelty evaluation of a newly created recipe. Second, we consider evaluation as part of the generation/search process and provide an optimization-based approach to the recipe creation problem. For the latter, we draw inspiration from the search mechanism that (Wiggins 2006) proposed for moving through the complex conceptual space. We hypothesize that our proposed methodological framework can be extended to other creative endeavors as well.

2 Personalization in Culinary Creation

We now turn to detailing a tractable approach for assessing personalized flavor preference and novelty. The approach is motivated by the science of human flavor perception, by technology for drawing information from the web, and by the work in (Varshney et al. 2013).

2.1 Flavor Preference

Flavor enhancement, balance and substitution are choices that we make to live a healthy life. Often, we may want to enhance the flavor of a favorite ingredient. However, we may need to balance the flavor of healthy but less palatable ingredients. Moreover, we may want to substitute red meat with a plant-based product to meet a dietary constraint and, at the same time, not lose the meaty flavor. In our work, we propose a methodology to address these personalized flavor preferences in a computational creativity system.

Knowledge of how humans perceive flavors is necessary to build a system that accurately estimates a human's evaluation of creativity. For this reason, (Varshney et al. 2013) proposed a model for pleasantness which correlates olfactory pleasantness with a dish's constituent ingredients and the flavor compounds in those ingredients, based on recent olfactory pleasantness studies (Haddad et al. 2010; Khan et al. 2007). The smell of food is a key contributor to flavor perception, which is, in turn, a property of the chemical compounds contained in the ingredients (Burdock 2009; Shepherd 2006). Therefore, a tractable step towards a data-driven model for flavor enhancement, balance and substitution is a model for odor similarity. For example, we could enhance the flavor of a featured ingredient by adding other foods with perceptually similar odors. Recent work has shown that the perceptual similarity of odorant-mixtures can be predicted (Snitz et al. 2013).
Consistent with the synthetic processing mechanism of the brain in olfaction, human perception groups many mono-molecular components into a singular, unified percept. Each odorant-mixture is modeled as a single vector made up of the structural and physicochemical descriptors of the mixture. The angle distance between two such vectors is a meaningful predictor of the perceptual similarity of the two odorant-mixtures. Therefore, given any two odorant-mixtures, we can predict a significant portion of their ensuing perceptual similarity. Since food ingredients contain several flavor compounds (Ahn et al. 2011), and dishes contain several ingredients, we can predict the flavor perceptual similarity and dissimilarity of a featured ingredient and a dish, to provide a quantitative measurement of how the dish enhances or balances the featured ingredient's flavor. We describe one approach here and show some results in Table 1, where the personal preference is to enhance the beef flavor of a stew. The formulation of the flavor enhancement approach is as follows:

    S_r = \frac{1}{n} \sum_{i=1}^{n} S_i,  where  S_i = 100 \cdot \Pr(D > d_i),

where the recipe enhancement score S_r, ranging from 0 to 100, is the average of the ingredient scores S_i of the ingredients in the recipe, and n is the number of ingredients in the recipe. The ingredient score S_i, which is correlated with the angle distance d_i between the given ingredient and the featured ingredient (beef), is 100 multiplied by the probability that the angle distance in food, D, is greater than the calculated angle distance d_i. The flavor compound constituents of food ingredients can be found in (Ahn et al. 2011), and the aforementioned probability can be calculated from the empirical distribution of paired ingredients' angle distances. While the compound concentration in each ingredient should ideally be taken into account, the lack of systematic data prevents us from exploring its impact in this exercise.

Table 1: Enhancement score of beef stew

Ingredient Combination List                                     Enhancement Score
beef, cabbage, mushroom, potato, mint, sage, bacon, butter      82
beef, mushroom, shellfish, sage, garlic, ginger, butter         64

We note that there may be other ways to calculate the flavor preference score, such as taking the minimum or maximum of the ingredient scores instead of the mean; the goodness of the approach is open to empirical validation. The key idea, using the scientific study of human flavor perception to build a computational creativity system, is a valid step towards building human-level evaluation models.

2.2 Personalized Novelty Assessment

Creativity is only meaningful when there is a human observer, and each observer's world view, culture, life experience and social network are different, so perceptions of novelty, which are heavily influenced by these factors, are inherently different. A parsnip dish may be common to a European consumer, but novel to a Chinese consumer. Therefore, we need a personalized novelty assessment specific to a targeted observer or a targeted social group. Bayesian surprise has been proposed to quantify the perceived novelty of a newly created artifact (Varshney et al. 2013). This measure captures the change in the observer's belief about known artifacts after observing the newly created artifact, where the belief is characterized by a probability distribution over artifacts. The larger the change, the more surprising, and hence the more novel, the newly created artifact is.
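Returning briefly to the enhancement score of Section 2.1, the following sketch shows one way the calculation might be realised: an angle distance between descriptor vectors, an exceedance probability Pr(D > d) estimated from an empirical sample of paired-ingredient distances, and the average score S_r. All vectors and numbers here are hypothetical; real descriptor data would come from sources such as (Ahn et al. 2011).

import java.util.Arrays;

// Sketch of the flavor enhancement score of Section 2.1, under assumed data.
public class EnhancementScore {

    /** Angle distance between two descriptor vectors. */
    static double angleDistance(double[] u, double[] v) {
        double dot = 0, nu = 0, nv = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            nu += u[i] * u[i];
            nv += v[i] * v[i];
        }
        double cos = dot / (Math.sqrt(nu) * Math.sqrt(nv));
        return Math.acos(Math.max(-1.0, Math.min(1.0, cos)));
    }

    /** Pr(D > d), estimated from an empirical sample of angle distances. */
    static double exceedanceProbability(double[] empirical, double d) {
        long count = Arrays.stream(empirical).filter(x -> x > d).count();
        return (double) count / empirical.length;
    }

    /** S_r = (1/n) * sum_i S_i, with S_i = 100 * Pr(D > d_i). */
    static double recipeScore(double[] distancesToFeatured, double[] empirical) {
        double sum = 0.0;
        for (double d : distancesToFeatured) {
            sum += 100.0 * exceedanceProbability(empirical, d);
        }
        return sum / distancesToFeatured.length;
    }

    public static void main(String[] args) {
        // Hypothetical angle distances of recipe ingredients to "beef", and a
        // hypothetical empirical distribution of paired-ingredient distances.
        double[] empirical = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9};
        double[] recipe = {0.15, 0.35, 0.55};
        System.out.printf("Enhancement score: %.1f%n",
                recipeScore(recipe, empirical));
    }
}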
We adopt the use of Bayesian surprise for personalized novelty assessment, and propose to use Internet activity and social media to construct a personalized set of artifacts known to a given individual or social group. Then, we calculate a personalized surprise score for the newly created artifact. For example, we can learn the recipes and ingredients known to an individual from websites such as Pinterest and allrecipes.com, by gathering recipes posted, reviewed and pinned by the individual and her neighborhood in the social network. We denote the frequency of artifact a at time t known to individual p as f_a(p, t). The weighted frequency \tilde{f}_a(p, t) of artifacts known to the individual can be calculated by incorporating social proximity and temporal proximity:

    \tilde{f}_a(p, t) = \sum_{t' \le t} \sum_{p'} w_T(t, t') \, w_S(p', p) \, f_a(p', t'),

where w_T(t, t') is a temporal proximity weight and w_S(p', p) is a social proximity weight.
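Assuming the reconstruction of the weighted-frequency formula above (a double sum over earlier times t' <= t and individuals p'), the calculation might look like the sketch below. The exponential temporal decay and the social proximity matrix are placeholder choices, not specified in the paper.

// Sketch of the weighted frequency f~_a(p, t), under assumed weight functions.
public class WeightedFrequency {

    /** Temporal proximity weight; exponential decay is an assumed form. */
    static double wT(int t, int tPrime) {
        return Math.exp(-0.1 * (t - tPrime));
    }

    /** Social proximity weight between individuals p' and p (assumed form). */
    static double wS(int pPrime, int p, double[][] proximity) {
        return proximity[pPrime][p];
    }

    /** f~_a(p,t) = sum over t' <= t and all p' of wT(t,t') wS(p',p) f_a(p',t'). */
    static double weightedFrequency(int p, int t,
                                    double[][] rawFrequency,  // [person][time]
                                    double[][] proximity) {   // [p'][p]
        double total = 0.0;
        for (int tPrime = 0; tPrime <= t; tPrime++) {
            for (int pPrime = 0; pPrime < rawFrequency.length; pPrime++) {
                total += wT(t, tPrime) * wS(pPrime, p, proximity)
                        * rawFrequency[pPrime][tPrime];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two individuals observed over three time steps (hypothetical data).
        double[][] freq = {{1, 0, 2}, {0, 3, 1}};
        double[][] prox = {{1.0, 0.5}, {0.5, 1.0}};
        System.out.println(weightedFrequency(0, 2, freq, prox));
    }
}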