FIGURE8: A Novel System for Generating and Evaluating Figurative Language

Sarah Harmon
Computer Science Department
University of California, Santa Cruz
Santa Cruz, CA 95064 USA
smharmon@ucsc.edu

Abstract

Similes are easily obtained from web-driven and case-based reasoning approaches. Still, generating thoughtful figurative descriptions with meaningful relation to narrative context and author style has not yet been fully explored. In this paper, the author prepares the foundation for a computational model which can achieve this level of aesthetic complexity. This paper also introduces and evaluates a possible architecture for generating and ranking figurative comparisons on par with humans: the FIGURE8 system.

Introduction

Figurative language is embedded within and intimately connected to our cultures, behaviors, and models of the world. In fact, humans use figurative language so often that we seldom realize it (Lakoff and Johnson 1980); still, its utility for communication is clear. Using metaphors and similes, one can relate the unfamiliar, or the tenor, in terms of the familiar, or vehicle (Richards 1980). In Figure 1, for example, "moon" is the vehicle for "garden", the tenor. Attributes of the moon, such as its brilliance, are used to describe the beauty of the garden. Prior to the comparison, the garden's appearance is unknown (is it beautiful and luminous, or neglected and overgrown?). The simile helps to resolve this ambiguity and provide the reader with a clearer picture of the scene.

Comparison gives us the ability to delicately express irony and sarcasm ("clear as mud"), exaggeration ("that man was as tall as a giraffe"), and emotion ("my heart was a sinking ship"). With such tools, we can explain how we feel, what kinds of people we are, and what experiences we have had. Further, metaphors give color to dry speech and are understood faster than literal equivalents (Gibbs and Nagaoka 1985); this is likely due to their appeal to common previous experiences and memories.

For the purpose of this paper, we will consider two styles of figurative language: conventional (common analogies used in daily language, such as "I see what you mean") and creative (original comparisons that call attention to themselves as figures of speech, such as "Fear is a slinking cat I find / Beneath the lilacs of my mind" (Tunnell 1977)). Each type can provide value, although previous work on computational generation of figurative language has primarily focused on understanding and reconstructing conventional metaphors and similes. Clichés (e.g., "fast as lightning") are arguably useful when fast, informal communication is required between a computer and a human, and such phrases can be learned via web query (Veale and Hao 2007a).

Generating creative comparisons on par with human authors is a much more difficult challenge. A conventional metaphor is considered "good" if many others have used it before, but uniqueness and aesthetic qualities are critical in generating a strong creative metaphor. For instance, several aesthetic properties, such as syllable counts, phonetics, stressed syllable position, rhyme, and alliteration, have been identified as "obvious" criteria for making creative poetic lines sound good, despite the fact that these "do not translate well into precise generative rules" (Gervás, Hervás, and Robinson 2007). While creative generators for figurative language exist, few address this concept of what makes for a high-quality metaphor or simile.
I will describe a system, FIGURE8, which contains a novel underlying model of what defines creative, high-quality figurative comparisons, and which evaluates its own output based on these rules.

Related Work

Modern research in creativity has generally defined a creative system as one that generates novel, context-appropriate output (Rothenberg and Hausman 1976; Sawyer 2012). Within the context of creative natural language generation, a third criterion has been noted: a creative system must generate context-appropriate knowledge outside of its pre-existing knowledge base (Pérez y Pérez and Sharples 2004).

Several computational systems exist which attempt to meet this benchmark. ASPERA, for instance, combines case-based reasoning with intelligent adaptation of examples from corpora (Gervás 2000). Psychological theories have further informed the art of generating figurative language, resulting in more advanced and thoughtful systems. Notably, Brown (Ortony 1993) and Glucksberg (Glucksberg 2001) have argued that categorization is inherent to metaphor. As a consequence, the concept of property-based concept mapping has inspired metaphor generation approaches, and has been cited as the best method for producing robust, scalable, and useful metaphors (Hervás et al. 2007; Veale and Hao 2007a).

One must also consider how to develop an appropriate knowledge base without substantial manual authoring. Previous exemplary work in metaphor generation has emphasized the power of using the web to establish example cases of valid comparisons (Veale and Hao 2007a; 2007b). However, these systems merely generate large amounts of potentially creative descriptions, and cannot distinguish between original and poor-quality comparisons (Veale and Hao 2007a). Further, they often ignore context, sentence construction, and aesthetics in the generation process, resulting in less evocative and meaningful language.

FIGURE8 is a system that uses a web-driven approach to form a preliminary knowledge base of nouns and their properties. The system is provided with a model of the current world and an entity in the world to be described. A suitable vehicle is selected from the knowledge base, and the comparison between the two nouns is clarified by obtaining an understanding, via corpora search, of what these nouns can do and how they can be described. Sentence completion occurs by intelligent adaptation of a case library of valid grammar constructions. Finally, the comparison is ranked by the system based on semantic, prosodic, and knowledge-based qualities. In this way, FIGURE8 simulates the human authoring process of revision by generating many vehicle choices and linguistic variations for a single tenor, and choosing the best among them as its favorite. While FIGURE8 does not claim to have a comprehensive set of rules (for example, it does not consider phonetics in its evaluation of description quality), it provides a novel foundation for an intelligent figurative language generation and assessment system.

Approach

Prior work has established that a strong creative metaphor is not only comprehensible (Tourangeau 1981), novel (Camac and Glucksberg 1984), and context-appropriate (Harwood and Verbrugge 1977; Tversky 1977; Gildea and Glucksberg 1983), but surprising (Tourangeau 1981). The following sections will illustrate how FIGURE8 considers these properties when generating metaphors and similes. A block diagram of the generation process is shown in Figure 3.
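The paper does not include FIGURE8's code, but the generate-and-rank loop described above can be summarized in a short, self-contained Python sketch. All names here (score, describe, the template strings) are illustrative stand-ins rather than FIGURE8's actual API, and the scoring terms are placeholders for the clarity, novelty, aptness, unpredictability, and prosody measures detailed in the following sections.

```python
# Illustrative sketch of the generate-and-rank loop; not FIGURE8's real API.

def score(sentence, tenor, vehicle):
    # Placeholders for the five measures described in the sections below
    # (clarity, novelty, aptness, unpredictability, prosody).
    measures = [0.0, 0.0, 0.0, 0.0, 0.0]
    return sum(measures)

def describe(tenor, vehicles, templates):
    # "Brainstorm" many vehicle choices and sentence variants for one tenor,
    # then keep the highest-ranked variant, simulating authorial revision.
    candidates = []
    for vehicle in vehicles:
        for template in templates:
            sentence = template.format(tenor=tenor, vehicle=vehicle)
            candidates.append((score(sentence, tenor, vehicle), sentence))
    return max(candidates)[1]

print(describe("garden", ["moon", "sea"],
               ["The {tenor} was a {vehicle}.",
                "The {tenor} gleamed like a {vehicle}."]))
```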
Clarity

A strong metaphor must have an understandable, accurate link between tenor and vehicle. A vehicle is thus only considered acceptable if it has properties in common with the tenor. Further, associating the tenor with the capacities and known manifestations of the vehicle should enhance the clarity of the description. In the FIGURE8 system, these associations are found by mining existing literary corpora (Hart 2014) for instances of the vehicle and using NLTK's part-of-speech tagging to identify associations (e.g., refer to Figure 2). This procedure enables the system to use words commonly associated with the vehicle to develop a fresh relation to the tenor. For example, if we were to compare a teacher to a horse, FIGURE8 may now be able to reason that the teacher would prance or trot into the room. In this way, a sentence can be generated by only implicitly referring to the vehicle ("The teacher pranced into the room" vs. "The teacher was a wild horse, prancing into the room"). Common verbs, such as forms of "to be", were culled from the generated list of associations because, as all nouns have the capability to exist and be, such verbs do not lend clarity to the comparison.

Granted, the word chosen to relate to the tenor may not make sense (especially in the case of verbs), destroying the very clarity it was meant to enhance. FIGURE8 thus performs a web query using Python's urllib module to ensure that others have associated the chosen word with the tenor before. If a previous association has not been made, the metaphor is ranked lower in terms of estimated clarity. This evaluation measure ensures that nonsensical descriptions, such as "The turtle darkened like a blue ocean", are given a lower ranking overall.
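As a rough illustration of this association-mining step, the following sketch uses NLTK's part-of-speech tagger to collect verbs that immediately follow mentions of a vehicle noun, discarding copulas. The adjacency heuristic and the function name are assumptions made for illustration; FIGURE8's actual corpus handling and its nsubj-style verb linking (see Figure 2) are not reproduced here.

```python
# A sketch of the association-mining step under the assumptions stated above.
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

COPULAS = {"is", "was", "are", "were", "be", "been", "being", "am"}

def verbs_for_vehicle(text, vehicle):
    associations = set()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for i, (word, tag) in enumerate(tagged[:-1]):
            next_word, next_tag = tagged[i + 1]
            # Crude approximation of an nsubj link: the vehicle noun
            # immediately followed by a (non-copular) verb.
            if word.lower() == vehicle and next_tag.startswith("VB"):
                if next_word.lower() not in COPULAS:
                    associations.add(next_word.lower())
    return associations

print(verbs_for_vehicle(
    "The horse pranced into the yard. The horse was tired.", "horse"))
# expected: {'pranced'}  ("was" is culled as a copula)
```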
Novelty

Clichés are frowned upon by expert authors; as Salvador Dalí once said, "The first man to compare the cheeks of a young woman to a rose was obviously a poet; the first to repeat it was possibly an idiot" (1968). For computer-generated text, it is thus reasonable to expect that a quality metaphor is a fresh comparison. In the FIGURE8 system, each metaphor is checked against an existing knowledge base of comparisons (Friedman 1996), and all generations are ranked based on their similarity to conventional metaphors in this database.

Aptness

Ideally, a strong metaphor will fit the context within which it lives. For usage in a narrative context, the FIGURE8 system can be passed a model of a simple world of objects and character models, and incorporate these appropriately into its eventual output along with a prepositional phrase generation module. Additionally, one may ask FIGURE8 to generate ironic comparisons, such as those produced by a sarcastic character when speaking. Irony is achieved by selecting for properties with the exact opposite meanings, in accordance with prior work (Veale and Hao 2007b). The FIGURE8 system also endeavors to match a given context during sentence completion, which will be described in a later section.

Unpredictability

Metaphors are perceived as cleverer when the vehicle and tenor contain similarities, but the respective domains of these terms are distinct (Tourangeau 1981). A description is thus ranked as more surprising when the words are not very conceptually similar and contain fewer properties in common. With the assumption that they share at least one property in common, the chosen metaphor components are ranked by querying the UMBC Semantic Similarity System (Han et al. 2013). The degree to which the vehicle and tenor share major categories is also considered by using a function similar to WordNet's lexname query. This check is needed because if one or more major categories are shared, the metaphor is considerably less surprising. For instance, "the strawberry is a pomegranate" is considered a poor metaphor because strawberry and pomegranate are contained within a major category: fruits. Such a description may be produced by a web-based generator (for instance, the online MIT-licensed Metaphorgy system (Groff-Palermo and Lawson 2013) produces "My strawberry is a Phaeacian cherry"), but will be given a low ranking by FIGURE8.

Figure 1: Example of a highly ranked output sentence by FIGURE8. Here, the tenor, vehicle, and associated phrases are garden, moon, lit up, and pale. The nouns garden and moon not only have low semantic similarity, but do not share a major category. Likening a garden to a moon is also not a cliché comparison, lending to the description's potential novelty.

Figure 2: Example of how FIGURE8 discovers and associates a verb with a chosen vehicle, using text from The Count of Monte Cristo and a part-of-speech parsing module similar to the Stanford Parser (Socher et al. 2013). Here, the nsubj label refers to a link between a verb ("deceived") and a noun phrase (in this case, the vehicle "world"). The remaining labels in the figure represent the part-of-speech tags.
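A minimal sketch of the major-category check follows, assuming WordNet's lexicographer files ("lexnames") as the notion of a major category; the UMBC semantic similarity query is omitted, and the expected outputs in the comments reflect typical WordNet data rather than verified FIGURE8 behavior.

```python
# A sketch of the category-overlap check, assuming WordNet lexnames
# stand in for "major categories".
from nltk.corpus import wordnet as wn  # assumes the wordnet corpus is installed

def share_major_category(tenor, vehicle):
    tenor_cats = {s.lexname() for s in wn.synsets(tenor, pos=wn.NOUN)}
    vehicle_cats = {s.lexname() for s in wn.synsets(vehicle, pos=wn.NOUN)}
    return bool(tenor_cats & vehicle_cats)

# Sharing a category (e.g., noun.plant) marks a comparison as unsurprising.
print(share_major_category("strawberry", "pomegranate"))  # likely True
print(share_major_category("garden", "moon"))             # likely False
```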
Prosody

The prosody of a metaphor can be defined as the rhythmic, tonal, and aesthetic qualities that distinguish one metaphor from another. Descriptions are ranked highly if their prosody is of consistent and high quality. For instance, consider the following similes:

(1) The serpent stretched into the horizon, like a deserted desert.
(2) The snake extended into the horizon, like an abandoned desert.

Although alliteration and assonance can be used beautifully in figurative language, the high similarity of consecutive words in (1) may be distracting. Example (2) depicts the same imagery, but uses words of greater distance in terms of consecutive string similarity. At present, FIGURE8 computes string similarity via Python's difflib to evaluate the prosody of its outputs. Using difflib's SequenceMatcher, one can determine a value indicating the degree of similarity between two input strings, ranging from 0 (no similarity) to 1 (identical strings). FIGURE8 is thus able to quantify the string similarity of consecutive words, and it ranks descriptions lower if many consecutive string similarity values exceed 0.7, which was deemed an appropriate threshold by the author. Consecutive words are also checked for alliteration and assonance, which are considered positive qualities by FIGURE8.
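The consecutive-word similarity check can be sketched directly with difflib, using the 0.7 threshold given above; the simple whitespace tokenization is an assumption, and the alliteration and assonance bonus is omitted.

```python
# A minimal sketch of the consecutive-word similarity penalty.
from difflib import SequenceMatcher

def distracting_pairs(sentence, threshold=0.7):
    words = sentence.lower().split()
    # SequenceMatcher.ratio() returns 0 (no similarity) to 1 (identical).
    return [(a, b) for a, b in zip(words, words[1:])
            if SequenceMatcher(None, a, b).ratio() > threshold]

print(distracting_pairs("like a deserted desert"))    # [('deserted', 'desert')]
print(distracting_pairs("like an abandoned desert"))  # []
```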
Sentence Completion

Automated metaphor identification in text has been thoroughly explored (Neuman et al. 2013; Steen et al. 2010) and, as such, FIGURE8 has been provided with a case library of appropriate sentence constructions for metaphor and simile. By following the procedure of imaginative recall (Turner 1992), FIGURE8 first attempts to fit the provided context of the situation to an exact, pre-existing solution. If no solution exists, FIGURE8 searches its memory, solves the problem for a similar case, and adapts that solution to the provided context. As an illustration: if FIGURE8 notes that other authors have used the phrase "to the barn", it should recognize the barn as a noun denoting a man-made object via WordNet. Similarly, a "chair" is a man-made object, and thus FIGURE8 may decide to replace "barn" with "chair" when told that a chair exists in the current narrative context. This adaptive process enables FIGURE8 to match its constructions to any provided context and complete statements creatively.
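A sketch of this adaptation step is shown below, assuming WordNet hypernym paths as the test for "man-made object"; the template sentence and helper names are hypothetical, not drawn from FIGURE8's actual case library.

```python
# A sketch of case adaptation: a stored construction mentioning "barn" can be
# reused for "chair" because both nouns descend from artifact.n.01 in WordNet.
from nltk.corpus import wordnet as wn  # assumes the wordnet corpus is installed

ARTIFACT = wn.synset("artifact.n.01")

def is_man_made(noun):
    return any(ARTIFACT in path
               for s in wn.synsets(noun, pos=wn.NOUN)
               for path in s.hypernym_paths())

def adapt(template, old_noun, new_noun):
    # Reuse a known-valid construction only if the substitution preserves
    # the semantic class the construction relied on.
    if is_man_made(old_noun) and is_man_made(new_noun):
        return template.replace(old_noun, new_noun)
    return None

print(adapt("She fled to the barn.", "barn", "chair"))
# expected: "She fled to the chair."
```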
Evaluation

Little research, if any, has worked towards developing a model of what makes a high-quality computer-generated metaphor. Although there is no standard method to evaluate computationally generated figurative descriptions, one reasonable way to judge would seem to be agreement with human ratings. This can be assessed by asking humans to rank descriptions generated by the FIGURE8 algorithm, and determining whether the majority agree with FIGURE8's own ranking. A pilot study indicated that providing each description with additional context would make the ranking process too time-consuming for participants. Thus, functions to enhance aptness were not included when generating outputs to be evaluated in the full-scale study.

Figure 3: Block diagram of the FIGURE8 generation system. If no world model is given, a tenor is selected at random from the noun-property database. A vehicle is then selected with at least one property in common with the tenor. The Clarity Enhancer module requests verbs and adjectives associated with the vehicle from mined literary corpora. Finally, the sentence is completed by performing imaginative recall with known valid sentence constructions for metaphor identified from literary corpora.

Method

One hundred participants (73 female, 27 male) were recruited via Amazon's Mechanical Turk. Each participant viewed a series of five sentences at a time, and was asked to rank the similes by how understandable they were (clarity) and by how much they, as individuals, enjoyed the comparison (likability). Each set of five sentences contained the same tenor, and the sets were originally generated and ranked by FIGURE8. The sets were not hand-selected by the author; that is, the first eleven sets FIGURE8 generated and ranked were used in the study.

Results

Human preferences were determined by following the majority criterion. As seen in Figures 4 and 5, human clarity ratings were often positively correlated with overall quality ratings, and this correlation was confirmed with Spearman analyses. Overall, FIGURE8's top result for clarity and overall quality generally agreed with the human rankings for each of the eleven sets. FIGURE8 exactly matched the first ranking 46% of the time for clarity and likability. Further, it matched either the first or second ranking 82% and 100% of the time for the clarity and likability categories, respectively. Examples of how FIGURE8 matched human ratings are shown in Tables 1, 2, and 3.

Table 1: Comparison 1 of FIGURE8 and human rankings for clarity and overall quality. In this set, FIGURE8 was asked to generate and rank figurative descriptions given "pearl" as the tenor. Human clarity and likability rankings were found to be highly correlated (ρ = 0.684). Spearman analyses also indicated positive correlations between human and FIGURE8 rankings (clarity: ρ = 0.872; quality: ρ = 0.821).

Table 2: Comparison 2 of FIGURE8 and human rankings for sentences of tenor "snow". Human clarity and likability rankings were found to be positively correlated (ρ = 0.763). Spearman correlation analysis suggested that FIGURE8 clarity rankings were positively associated with human clarity rankings (ρ = 0.872), but no significant association was found between likability rankings in this case (ρ = -0.359).

Table 3: A third comparison of FIGURE8 and human rankings for sentences of tenor "queen". In addition to showing first-choice rankings, this table displays human rankings when considering first and second choices. That is, "the queen stands like a strong castle" was ranked as either first or second by the majority of respondents. In both cases, human clarity and likability rankings were found to be positively correlated (ρ > 0.9). Spearman analyses also suggested for both cases that FIGURE8 and human rankings for clarity and likability were positively correlated with high significance (ρ > 0.7).

Figure 4: First-choice rankings for the generated set of sentences using pearl as the tenor. Although some disparities existed, the majority of respondents generally agreed upon which sentence was the most understandable.

Figure 5: Clarity rankings for the generated set of sentences using snow as the tenor. Participants rated what FIGURE8 considered the most unsurprising metaphor as the most clear, but there was no highly significant consensus regarding the most likable description.
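The Spearman agreement analyses reported above can be reproduced with SciPy along the following lines; the rankings shown are illustrative placeholders, not the study's data.

```python
# A sketch of the rank-agreement analysis; the input rankings are invented.
from scipy.stats import spearmanr

figure8_rank = [1, 2, 3, 4, 5]  # system's ranking of one five-sentence set
human_rank   = [1, 3, 2, 4, 5]  # illustrative majority ranking from raters
rho, p_value = spearmanr(figure8_rank, human_rank)
print(f"rho = {rho:.3f}, p = {p_value:.3f}")
```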
Discussion and Future Work

In this paper, the author has introduced the FIGURE8 system as a novel tool for generating and evaluating creative figurative descriptions. FIGURE8's assessments are grounded in psychological models of metaphor comprehension, and have thus far been found to adequately match human rankings when those rankings were agreed upon.

Participants in the evaluation portion were not told that the descriptions were generated by a computer. Only two comments were made about checking sentences for validity prior to including them in the study, and one regarding how painful it was to rank "bad poetry". Most participants, however, enjoyed the task and provided positive feedback about their experience ("cool hit", "super fun", "I love this"). It is conceivable that task enjoyment affected user responses, but controlling for explicit indication of task enjoyment yielded no significant difference in the results. Controlling for gender also did not reveal significantly different outcomes.

Interestingly, for roughly half (50-60% per set) of the participants, how much they liked a figurative description was directly correlated with how well they understood it. The most highly ranked phrases for clarity were also often ranked first for likability, and the Spearman coefficient was used to confirm these positive associations. This was a surprising finding, because more variation and subjectivity was expected for these ratings. Discrepancies between human and FIGURE8 likability rankings, such as in Table 2, could potentially be explained by a human tendency to prefer metaphors containing words of positive sentiment value. However, more analysis is required to confirm this idea, and further study is needed to evaluate how qualities of language are weighted across general and expert populations. Judging from participant comments, it is also possible that some people may like metaphors primarily based on qualities other than clarity (such as prosody, sentiment, or whimsy). If these groups could be automatically identified, perhaps future computer-produced descriptions could adapt to generate more personalized descriptions for the optimum enjoyment of the reader.

While FIGURE8 is able to rank its figurative descriptions over various measures of quality, how well its output compares with human-authored descriptions was not assessed. The fact that most participants in the evaluation did not question the source of the texts is a promising sign that the system presented here generates human-like output. Regardless, its present constructions can be automatically assigned rankings on par with human evaluations. It is assumed that as the quality of FIGURE8's generations increases, it will be able to extract the best output from the results of its "brainstorming". Future research should build upon this foundation and work towards evaluating computer-generated descriptions in terms of aptness, prosody, and unpredictability. When machines are fully able to grasp the subtleties and aesthetics of figurative language, we as humans will be able to relate to them as never before.

References

Camac, M. K., and Glucksberg, S. 1984. Metaphors do not use associations between concepts, they are used to create them. Journal of Psycholinguistic Research 13(6):443-455.

Friedman, S. M. 1996. Cliché Finder. Retrieved 2 Mar 2015 from http://www.westegg.com/cliche/.

Gervás, P.; Hervás, R.; and Robinson, J. R. 2007. Difficulties and challenges in automatic poem generation: Five years of research at UCM. e-poetry 2007.

Gervás, P. 2000. An expert system for the composition of formal Spanish poetry. Journal of Knowledge-Based Systems 14:200-201.

Gibbs, R. W., and Nagaoka, A. 1985. Getting the hang of American slang: Studies on understanding and remembering slang metaphors. Language and Speech 28(2):177-194.

Gildea, P., and Glucksberg, S. 1983. On understanding metaphor: The role of context. Journal of Verbal Learning and Verbal Behavior 22:577-590.

Glucksberg, S., ed. 2001. Understanding Figurative Language: From Metaphors to Idioms. Oxford: Oxford University Press.

Groff-Palermo, S., and Lawson, J. 2013. Metaphorgy: Metaphor generator. Retrieved 21 Dec 2014 from http://www.metaphor.gy/.

Han, L.; Kashyap, A. L.; Finin, T.; Mayfield, J.; and Weese, J. 2013. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proc. 2nd Joint Conf. on Lexical and Computational Semantics, Association for Computational Linguistics.

Hart, M. 2014. Free ebooks - Project Gutenberg. Gutenberg.org. http://www.gutenberg.org/.

Harwood, D. L., and Verbrugge, R. R. 1977. Metaphor and the asymmetry of similarity. Paper presented at the annual meeting of the American Psychological Association, San Francisco.

Hervás, R.; Costa, R. P.; Costa, H.; Gervás, P.; and Pereira, F. C. 2007. Enrichment of automatically generated texts using metaphor. MICAI 2007, LNAI 4827, 944-954.

Lakoff, G., and Johnson, M., eds. 1980. Metaphors We Live By. Chicago, IL: University of Chicago Press.

Neuman, Y.; Assaf, D.; Cohen, Y.; Last, M.; Argamon, S.; Howard, N.; and Frieder, O. 2013. Metaphor identification in large texts corpora. PLoS ONE 8(4):e62343. doi:10.1371/journal.pone.0062343.

Ortony, A., ed. 1993. Metaphor and Thought. Cambridge University Press.

Pérez y Pérez, R., and Sharples, M. 2004. Three computer-based models of storytelling: BRUTUS, MINSTREL and MEXICA. Knowledge-Based Systems 17(1):15-29.

Richards, I. A., ed. 1980. The Philosophy of Rhetoric. Oxford: Oxford University Press.

Rothenberg, A., and Hausman, C. R., eds. 1976. The Creative Question. Durham, NC: Duke University Press.

Sawyer, R. K. 2012. Explaining Creativity: The Science of Human Innovation. Oxford University Press.

Socher, R.; Bauer, J.; Manning, C. D.; and Ng, A. Y. 2013. Parsing with compositional vector grammars. In Proceedings of ACL 2013.

Steen, G. J.; Dorst, A. G.; Herrmann, J. B.; Kaal, A.; Krennmayr, T.; and Pasma, T. 2010. A Method for Linguistic Metaphor Identification: From MIP to MIPVU. Amsterdam: John Benjamins.

Tourangeau, R. 1981. Aptness in metaphor. Cognitive Psychology 13(1):27-55.

Tunnell, S. 1977. The Quotable Women, 1800-1975. Corwin Books.

Turner, S. 1992. Minstrel: A Computer Model of Creativity and Storytelling. Technical Report CSD-920057, Ph.D. Thesis, Computer Science Department, University of California, Los Angeles, CA.

Tversky, A. 1977. Features of similarity. Psychological Review 84:327-352.

Veale, T., and Hao, Y. 2007a. Comprehending and generating apt metaphors: A web-driven, case-based approach to figurative language. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), 1471-1476. Vancouver, British Columbia: AAAI Press.

Veale, T., and Hao, Y. 2007b. Learning to understand figurative language: From similes to metaphors to irony. In Proceedings of Cog Sci, 683-688.