Towards a Mixed Evaluation Approach for Computational Narrative Systems

Jichen Zhu
Drexel University
Philadelphia, PA 19104 USA
jichen.zhu@drexel.edu

Abstract

Evaluation is one of the major open problems in computational creativity research. Existing evaluation methods, focusing either on system performance or on user interaction, do not fully capture the important aspects of these systems as cultural artifacts. In this position paper, we examine existing evaluation methods in the area of computational narrative, and identify several important properties of stories and reading that have so far been overlooked in empirical studies. Our preliminary work recognizes empirical literary studies as a valuable resource for developing a more balanced evaluation approach for computational narrative systems.

Introduction

Evaluation is one of the major open problems in computational creativity research. A set of well-designed evaluation methods is not only instrumental in informing the development of better creative computational systems, but also helps to articulate overarching research directions for the field as a whole. However, research in creative systems has encountered tremendous difficulties in defining suitable evaluation methods and metrics, both at the level of individual systems and across systems. A recent survey of 75 creative systems shows that only slightly more than half of the related publications give details on evaluation; among those, there is a lack of consensus on both the aim of evaluation and the suitable evaluation criteria (Jordanous 2011).

Traditionally, methods for evaluating intelligent computational systems have mainly been developed in two areas: artificial intelligence (AI) and human-computer interaction (HCI). Following the scientific/engineering tradition, evaluation in AI typically relies on quantitative methods to measure the system's performance against a certain benchmark (e.g., system performance, algorithmic complexity, and the expressivity of knowledge representation). A salient example is the measure of "classification accuracy" in machine learning, where new algorithms are evaluated by being compared to standard ones over the same sets of data. Whereas the AI community is primarily concerned with the operation of the system itself, HCI concentrates on the interaction between the user and the system. Borrowing from psychology, human factors, and other related fields, HCI has developed a set of quantitative and qualitative user study methods to understand the usability of a system along such principles as learnability, flexibility, and robustness (Dix et al. 2003).

Although these existing approaches offer useful insights into creative systems as functional and useful products, they do not fully capture a crucial property of creative systems: they are, and they produce, cultural artifacts such as stories, music, and paintings. In these domains, there has not been an established tradition of formal evaluation. When artistic expression and system building are combined, evaluation becomes an issue. As Gervás observes in the context of computational narrative, "[b]ecause the issue of what should be valued in a story is unclear, research implementations tend to sidestep it, generally omitting systematic evaluation in favor of the presentation of hand-picked star examples of system output as means of system validation" (2009).
We argue that the difficulty of establishing an evaluation methodology in computational creativity research reflects the cultural clash between scientific/engineering and humanities/arts practices. In line with Snow's notion of the two cultures (1964), researchers working at the intersection of the two communities have observed the conflict of different and sometimes opposing value systems and axiomatic assumptions (Mateas 2001; Sengers 1998; Manovich 2001; Harrell 2006; Zhu and Harrell 2011). One of these differences concerns what Simon Penny (2007) calls the "ontological status of the artifact" in electronic media arts practice versus computer science research. For an artwork, the effectiveness of the immediate sensorial effect of the artifact is the primary criterion for success. As a result, most if not all effort is focused on the persuasiveness of the experience, which is built on specificity and complexity. In computer science, the situation is reversed. The artifact functions as a "proof of concept" and hence its presentation can be overlooked; the real work is inherently abstract and theoretical. These differences, Penny argues, illustrate that the insistence upon "alphanumeric abstraction," logical rationality, and the desire for generalizability in science is fundamentally at odds with the affective power of artwork. In the context of evaluation, this conflict takes the form of a clash between the productivity- and value-based methodologies adopted by the AI and HCI communities and the general resistance to empirical studies in the arts.

In this position paper, we present our initial work on developing a more balanced evaluation approach that takes into account both the system and the cultural aspects of creative systems, focusing on computational narrative systems and their output. Our work is not intended to replace the function of literary criticism and close reading with empirical studies and statistical analysis. Simplistic attempts to reproduce art as a scientific experiment without an in-depth understanding of the former's tradition and value systems are shortsighted (as discussed in Ian Horswill's panel presentation at the Fourth Workshop on Intelligent Narrative Technologies, Palo Alto, 2011) and counterproductive to the long-term goals of computational creativity research. At the same time, we also believe that evaluation is a critical process for informing the development of creative systems and for deepening the understanding of computational creativity. Therefore, more research and discussion about evaluation are needed. In the rest of the paper, we examine existing evaluation methods in the area of computational narrative and identify several important properties of stories and reading that have so far been overlooked in existing evaluation methods. Our preliminary work suggests that empirical literary studies can be a valuable resource for developing a more balanced evaluation approach for computational narrative systems.

Existing Work on Narrative Evaluation

Broadly speaking, discussions of evaluating creative systems have taken place at two levels. At the level of computational creativity in general, researchers have attempted to come up with domain-independent evaluation criteria to measure a system's level of creativity, both in terms of its process and its output. For example, Colton (2008) and Jordanous (2011) proposed standardized frameworks to empirically evaluate system creativity.
The importance of these approaches is that, in addition to evaluating specific systems, they also allow potential cross-domain comparison between systems. At the level of specific creative domains, evaluations are conducted to validate a particular creative system and its output in that domain. For instance, recent work by Vermeulen et al. (2011) in the IRIS project proposed a list of standardized, systematic assessment criteria for interactive storytelling systems using concepts that "play a key role in users' responses to interactive storytelling systems."

This section provides an overview of existing evaluation methods in the area of computational narrative. Our main focus is on the evaluation of story generation systems and their output, but some of our observations also apply to (non-generative) interactive digital storytelling systems. Recent examples of evaluating the latter type can be found in (Thue et al. 2011; Schoenau-Fog 2011). Although we do not specifically deal with high-level constructs such as 'novelty' and 'value,' we believe that more comprehensive evaluation criteria at the domain-specific level can indirectly contribute to the recognition and formulation of these high-level creativity constructs at the first level. Based on our survey of major text-based story generation systems, existing evaluation methods can be categorized into three broad approaches.

System Output Samples

As Gervás pointed out above, providing sample generated stories is one of the most common approaches for validating a system as well as the stories it generates. This approach dates back to the first story generation system, Tale-Spin (Meehan 1981), where sample stories (translated from the logical propositions generated by the system into natural language by the system author) are provided to demonstrate the system's capabilities. In addition to successful examples, Meehan also picked different types of "failure" stories to illustrate the algorithmic limitations of the system for future improvement. Similarly, many later computational narrative systems, such as BRUTUS (Bringsjord and Ferrucci 2000) and ASPERA (Gervás 2000), use selected system output for validation. Besides the lack of established, specific evaluation metrics, the reason for the wide appeal of this approach is that it aligns with the tradition in literary and art practice where the final artifact should stand on its own without formal evaluation. However, simply showing "successful" and/or "interesting" output without explicitly stating the system author's selection criteria can be problematic. Some recent work in this approach has attempted to make the selection process more transparent. For example, in the evaluation of the GRIOT system, Harrell (2006) evaluates the generated poems based on the quality and novelty of the metaphors they invoke. When the system generates "my world was so small and heavy," the author evaluates it in terms of the metaphor it evokes, "Life is a Burden." Similarly, the Riu system (Ontañón and Zhu 2011) automatically assesses the generated stories by measuring the semantic distances of the analogies in the stories, based on the WordNet knowledge base.
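To make this kind of automatic scoring concrete, the following Python sketch computes a simple WordNet path-based similarity over the concept pairs of an analogy. It is an illustration only, not Riu's actual metric: the use of NLTK, the noun-only lookup, and the averaging over hypothetical concept pairs are all assumptions for the example.

# Hedged sketch: scoring a story analogy by WordNet semantic similarity.
# Illustrative only; the published Riu metric may be computed differently.
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def concept_similarity(word_a, word_b):
    """Best path similarity between any noun senses of two words (0.0 if none)."""
    scores = []
    for sa in wn.synsets(word_a, pos=wn.NOUN):
        for sb in wn.synsets(word_b, pos=wn.NOUN):
            sim = sa.path_similarity(sb)
            if sim is not None:
                scores.append(sim)
    return max(scores) if scores else 0.0

def analogy_score(pairs):
    """Average similarity over the (source, target) concept pairs of an analogy."""
    if not pairs:
        return 0.0
    return sum(concept_similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical analogy drawn between two story scenes.
pairs = [("dog", "wolf"), ("cage", "forest")]
print(analogy_score(pairs))  # higher values indicate a semantically closer analogy

Path similarity is only one of several WordNet-based measures; the point is simply that grounding the assessment in a knowledge base makes the selection criteria explicit and repeatable.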
Evaluating the System's Process

The second approach is to evaluate the system primarily based on its underlying algorithmic process. Among the three evaluation approaches, this one is the most aligned with traditional AI evaluation methods. Cognitive systems often use this approach to show that the system's underlying processes are cognitively sound. For instance, the evaluation of the Universe system (Lebowitz 1985) included fragments of the system's reasoning trace, along with the corresponding story output, intended to illustrate the system's capability to expand its plot-fragment library by generalizing from given example stories. Although the sample output and the process are relatively simple compared to those of the previous approach, Lebowitz intends to show, especially through the system's processes, that learning is a necessary condition for creativity.

In a more complex example, the Minstrel system (Turner 1993), presented as a model of the creative process and storytelling, is evaluated in two ways. First, Turner evaluates the system by comparing it to related work in psychology, creativity, and storytelling. Minstrel's process is contrasted with existing AI models of creativity both in the similar domain of narrative (e.g., Tale-Spin and Universe) and in different ones (e.g., AM (Lenat 1976)). Second, Minstrel is empirically studied in terms of its plausibility and its quality as a test bed for evaluating different hypotheses about creativity. Specifically, plausibility is evaluated based on 1) the quantity of possible output stories, by testing the system in different domains, and 2) the quality of output stories, through a series of user studies (detailed in the next section). In the evaluation of the "test bed" criterion, Turner examines why particular TRAMs (i.e., problem-solving strategies) were added or removed, in order to demonstrate that one can experiment with different models of creativity. For instance, to test its model of "boredom," defined in terms of how many repeated elements appear in the stories, Minstrel was asked to generate stories about the same topic four times. The differences and similarities between these stories were then analyzed to evaluate how boring they are.
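A crude way to operationalize this notion of boredom, offered here only as an illustration and not as Turner's actual measure, is to score how much lexical content each new story repeats from the stories generated before it:

# Hedged sketch: a repetition-based "boredom" score for a batch of stories
# generated on the same topic. Minstrel's own analysis of repeated elements
# is more structured than this word-level proxy.
import re

def content_words(story):
    """Lowercased word tokens, ignoring very short function-like words."""
    return {w for w in re.findall(r"[a-z]+", story.lower()) if len(w) > 3}

def boredom_scores(stories):
    """For each story, the fraction of its content words already seen earlier."""
    seen = set()
    scores = []
    for story in stories:
        words = content_words(story)
        overlap = len(words & seen) / len(words) if words else 0.0
        scores.append(overlap)
        seen |= words
    return scores

stories = [
    "Lancelot rode into the forest and fought a dragon.",
    "Lancelot rode into the forest and fought a troll.",
    "A knight rode into the forest and fought a dragon.",
    "Lancelot rode into the forest and fought a dragon again.",
]
print(boredom_scores(stories))  # later stories score higher when they repeat earlier ones

A more faithful version would compare story-level elements such as scenes or schemas rather than surface words, but the shape of the measure, repetition relative to what has already been generated, stays the same.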
User Studies

Evaluating the system's process alone, however, does not provide insight into the quality of the output. For systems that are more geared towards seeing narrative as a goal in its own right, user studies provide a way to assess the output story without relying solely on the author's own intuition. As a result, user studies have been increasingly adopted both as a standalone evaluation method and as a complement to other approaches.

For example, the MEXICA system (Pérez y Pérez and Sharples 2001) is evaluated through an Internet survey. Users rated seven stories by answering a set of 5-point Likert-scale questions covering five factors (i.e., coherence, narrative structure, content, suspense, and overall experience). Among these seven stories, four were generated by MEXICA using different system configurations (with or without certain modules). Two stories were generated by other computational narrative systems (i.e., GESTER and MINSTREL). The last story was written by a human author using "computer-story language." The scores each story received are used to determine MEXICA's level of "computerised creativity" (c-creativity) in reference to human writers and other similar systems.

In a more complex example, in addition to the methods mentioned above, the stories generated by Minstrel are evaluated through a series of independent user studies. In the first user study, users were given the generated stories, without being told that they were generated by a computer, and were then asked to answer questions regarding their impressions of the author and the stories. In the second study, a different group of users repeated the above test, except that the generated stories had been rewritten by a human writer for better presentation, with improved grammar and more polished prose. In the third study, the users were presented with an unrelated story written by a 12-year-old and asked to answer the same set of questions.

User studies of narrative systems do not always adopt some form of the Turing Test. In the Fabulist system (Riedl 2004), the system author conducted two quantitative evaluations without using human writers as a benchmark. The first study evaluates plot coherence, measured on the assumption that unimportant sentences decrease plot coherence: a group of users independently rates the importance of each sentence in the generated story, and hence the coherence of the plot. Second, character believability is evaluated by asking users to rate the difference in characters' motivations in stories generated by two configurations of the system.
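Under that assumption, a minimal way to turn such ratings into a single number, sketched here as a general idea rather than Riedl's exact formula, is to average the per-sentence importance ratings across raters and report the mean rating together with the share of sentences judged important:

# Hedged sketch: aggregating per-sentence importance ratings into a
# plot-coherence score. The published Fabulist analysis may aggregate
# the ratings differently; the threshold below is an assumption.
from statistics import mean

def plot_coherence(ratings_by_rater, threshold=3.0):
    """ratings_by_rater: one list of 1-5 importance ratings per rater,
    aligned by sentence. Returns (mean importance, fraction of sentences
    whose mean rating reaches the threshold)."""
    per_sentence = [mean(sentence_ratings)
                    for sentence_ratings in zip(*ratings_by_rater)]
    overall = mean(per_sentence)
    important = sum(1 for r in per_sentence if r >= threshold) / len(per_sentence)
    return overall, important

# Three hypothetical raters scoring a five-sentence generated story.
ratings = [
    [5, 4, 2, 5, 1],
    [4, 4, 1, 5, 2],
    [5, 3, 2, 4, 1],
]
print(plot_coherence(ratings))  # low values suggest many "unimportant" sentences

Character believability could be summarized in the same spirit, for instance by comparing mean motivation ratings between the two system configurations.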
What is Missing

Computational narrative is still at an early stage, both in terms of the depth and the breadth of the narrative content. This is especially true when we compare these generated stories with what we typically conceive of as literary text produced by human authors. In this regard, the methods described in the previous section are arguably adequate for the current state of these systems. As argued above, however, evaluation methods play an important role not only in assessing existing systems, but also in informing what kinds of future systems should be built. Waiting for narrative systems to mature before starting to develop suitable evaluation criteria is therefore detrimental to the research community. As computational narrative research moves forward, a set of more comprehensive evaluation methods can help to reduce the gap between computer-generated stories and traditional literature.

Our position is that many important lessons from literary criticism and communication theory are by and large overlooked in computational narrative. We argue that they can be instrumental in developing evaluation methods that focus not only on the algorithmic and usability aspects of narrative systems, but also on the expressiveness of the generated stories as cultural artifacts. Below is our preliminary work in identifying some crucial elements that are missing from many existing evaluation methods. It is not intended as a comprehensive list, but rather as an initial step towards incorporating fundamental knowledge and concerns from related fields in the arts and the humanities.

Different Modes of Reading

Reading is a complex activity. Depending on the setting, the purpose of the reading, and the background of the reader, different aspects of the text are highlighted. Vipond and Hunt (1984) distinguished among point-driven, story-driven, and information-driven orientations for reading. As shown by recent studies in reader-response theory (Miall and Kuiken 1994), ordinary readers typically adopt the story-driven approach, that is, they read for plot. They contemplate what characters are doing, experience the stylistic qualities of the writing, and reflect on the feelings that the story has evoked. This is the mode we adopt when we read for pleasure. By contrast, the point-driven orientation is the foundation for literary criticism. Experts perform informed close reading, a complex act of interpretation at the linguistic, semantic, structural, and cultural levels, in order to understand the "point" of the plot, setting, dialogue, and so on. Point-driven reading assumes that the text is a purposeful act of communication between the author and the reader, and that the "points" of the story have to be constructed through the reader's careful examination of the text. Finally, in the information-driven orientation, a reader is more concerned with extracting specific knowledge from the text. We adopt this orientation while, for example, following a recipe or checking facts in an encyclopedia. Information-driven reading places a strong emphasis on the coherence and informativeness of the text. This orientation is less common in computational narrative.

Different reading orientations place different demands on evaluation methods. As story-driven reading is primarily concerned with creating a "lived-through experience" for the reader, compatible evaluation needs to focus on the immersiveness of the story world. In computational narrative, most existing evaluation criteria presume the story-driven reading orientation and center on the interestingness, presence, and engagement of the stories (e.g., plot coherence and character believability). Additionally, this orientation requires the participants in the evaluation to be close to an "average reader." A point-driven evaluation requires participants, usually experts, to perform more in-depth reading of the text beyond the surface plot. The effectiveness of different literary techniques, such as thematic structures, linguistic patterns, and points of view, can then be evaluated in ways similar to traditional literary criticism. To the best of our knowledge, there have been no attempts at point-driven evaluation in the context of computational narrative. There are many complex reasons for this. Some may argue that computational narrative, at its current stage, is too simple for this level of close reading. However, work in electronic literature (e-lit) has demonstrated that less algorithmically complex systems can still produce rich meanings. Establishing these evaluation criteria would help to develop a wider range of computational narrative.

Authorial Intention

In contrast to the tradition of literary criticism, the evaluation of computational narrative systems has by and large ignored the intention of the authors. If we subscribe to the assumption that storytelling is a form of communication between the author and the reader, authorial intention should play a role in evaluating how effective these stories are. For instance, a user's report of unpleasantness may be positive or even desirable if the system author intends to use her stories to challenge the reader's belief system, in ways similar to Duchamp's Fountain. A more balanced evaluation needs to differentiate this scenario from unpleasantness caused either by a poorly written story or by an unintuitive user interface. Similarly, intentional ambiguity in a story can be a powerful device, leaving something undetermined in order to open up multiple possible meanings. In the history of literature, intentionally ambiguous works such as Henry James's 1898 novella The Turn of the Screw have triggered many distinct interpretations and vigorous debates.
Mixed Methods

A large percentage of the evaluations we surveyed gravitate towards quantitative methods, with qualitative methods as a supplement, if they are used at all. Through surveys and experiments, numerical data is collected and then analyzed statistically to provide an average user response. Although such data has the clear advantage of being relatively easy to collect and analyze, it filters out the specificity and contextualization that are crucial to cultural artifacts.

Several research projects have attempted to address this issue. Mehta et al. (2007) devised an empirical study for the Façade system, which was intended by its authors to evoke a rich exchange of meanings. Mehta et al. acknowledge that the standard quantitative criteria in the conversational systems research community (e.g., task success rate, turn correction ratio, concept accuracy, and elapsed time) are not adequate because they assume a task-based philosophy, in which conversational interaction is framed as a simple exchange of clear, well-defined meanings. As a result, they made a deliberate choice to use more in-depth but less statistically generalizable ethnographic methods to study a small group of users' perceptions and interpretations of their conversations with non-player characters. Using video recording and retrospective interviews, their study found that participants created elaborate back-stories to make sense of character reactions and to fill in the gaps left by AI failures, an insight difficult to capture with purely quantitative methods.

The limitation of quantitative methods is echoed in Höök, Sengers, and Andersson's user study of their digital art project (Höök, Sengers, and Andersson 2003). They observed, "[g]rossly speaking, the major conflict between artistic and HCI perspectives on user interaction is that art is inherently subjective, while HCI evaluation, with a science and engineering inheritance, has traditionally strived to be objective. While HCI evaluation is often approached as an impersonal and rigorous test of the effects of a device, artists tend to think of their system as a medium through which they can express their ideas to the user and provoke them to think and behave in new ways." In response, their interpretive methods (open-ended interviews) focus on giving the artists a grounded feeling for how the interactive system was interpreted and whether their message was communicated. Despite the sentiment against user studies in the interactive arts community, some artists involved in the project acknowledged that laboratory evaluations can help artists uncover problems in interaction design.

Because of these limitations, we believe that a mixed-methods approach may be more suitable for evaluating computational narrative output. In addition to closed-ended questions and surveys, qualitative methods such as phenomenology, grounded theory, ethnography, and case studies can better capture the plurality of meanings interpreted by different readers and the complexity of such readings. In literary studies, a group of researchers has started developing methods to empirically study readers' responses to literature. Given the field's predisposition towards point-driven interpretation, these methods offer a good example of balancing expert interpretation against ordinary readers' responses to and experience of the stories under evaluation. For example, Miall (2006) identified four kinds of empirical literary studies. First, there are studies that manipulate a literary text to isolate a particular effect.
Second, there are studies that use an intact text, in which the researchers hypothesize that intrinsic features of the text influence the reader; instead of manipulating a text, each text itself provides a naturally varying level of foregrounding, from high to low. A third kind of study involves the comparison of two or more texts. Fourth, readers are asked to think aloud about a text during or after reading it. All of these can be further explored and potentially incorporated into the evaluation of computational narrative systems.

Conclusion

In this position paper, we discussed the challenge of designing evaluation methods for creative systems due to their dual status. Focusing on the area of computational narrative, we surveyed existing evaluation approaches for story generation systems and identified crucial aspects of computational narrative, as a potential form of cultural artifact, that have so far been downplayed. Penny warned us of the danger of the "unquestioned axiomatic acceptance of the concept of generality as being a virtue in computational practice especially when that axiomatic assumption is unquestioningly applied in realms where it may not be relevant" (Penny 2007). We suggest that work in empirical literary studies can offer valuable insights for developing more interdisciplinary and more balanced evaluation methods.

References

Bringsjord, S., and Ferrucci, D. A. 2000. Artificial Intelligence and Literary Creativity: Inside the Mind of BRUTUS, a Storytelling Machine. Hillsdale, NJ: Lawrence Erlbaum.

Colton, S. 2008. Creativity versus the perception of creativity in computational systems. In Proceedings of the AAAI 2008 Spring Symposium on Creative Intelligent Systems. AAAI Press.

Dix, A.; Finlay, J.; Abowd, G.; and Beale, R. 2003. Human-Computer Interaction. Edinburgh Gate, England: Prentice Hall.

Gervás, P. 2000. An expert system for the composition of formal Spanish poetry. Journal of Knowledge-Based Systems 14:200–1.

Gervás, P. 2009. Computational approaches to storytelling and creativity. AI Magazine 30(3):49–62.

Harrell, D. F. 2006. Walking blues changes undersea: Imaginative narrative in interactive poetry generation with the GRIOT system. In Liu, H., and Mihalcea, R., eds., AAAI 2006 Workshop in Computational Aesthetics: Artificial Intelligence Approaches to Happiness and Beauty, 61–69. Boston, MA: AAAI Press.

Höök, K.; Sengers, P.; and Andersson, G. 2003. Sense and sensibility: Evaluation and interactive art. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 241–248.

Jordanous, A. 2011. Evaluating evaluation: Assessing progress in computational creativity research. In Proceedings of the Second International Conference on Computational Creativity (ICCC-11), 102–107.

Lebowitz, M. 1985. Story-telling as planning and learning. Poetics 14(6):483–502.

Lenat, D. B. 1976. AM: An Artificial Intelligence Approach to Discovery in Mathematics as Heuristic Search. Ph.D. Dissertation, Stanford University.

Manovich, L. 2001. Post-media aesthetics. Available at http://www.manovich.net/docs/post media aesthetics1.doc.

Mateas, M. 2001. Expressive AI: A hybrid art and science practice. Leonardo 34(2):147–153.

Meehan, J. 1981. Tale-Spin. In Riesbeck, C. K., ed., Inside Computer Understanding: Five Programs Plus Miniatures. New Haven, CT: Lawrence Erlbaum Associates.

Mehta, M.; Dow, S.; Mateas, M.; and MacIntyre, B. 2007. Evaluating a conversation-centered interactive drama.
In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, 8:1–8:8.

Miall, D. S., and Kuiken, D. 1994. Foregrounding, defamiliarization, and affect: Response to literary stories. Poetics 22:389–407.

Miall, D. S. 2006. Literary Reading: Empirical and Theoretical Studies. New York: Peter Lang.

Ontañón, S., and Zhu, J. 2011. On the role of domain knowledge in analogy-based story generation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI-11), 1717–1722.

Penny, S. 2007. Experience and abstraction: The arts and the logic of machines. In Proceedings of PerthDAC 2007: 7th Digital Arts and Culture Conference.

Pérez y Pérez, R., and Sharples, M. 2001. MEXICA: A computer model of a cognitive account of creative writing. Journal of Experimental & Theoretical Artificial Intelligence 13(2):119–139.

Riedl, M. 2004. Narrative Generation: Balancing Plot and Character. Ph.D. Dissertation, North Carolina State University.

Schoenau-Fog, H. 2011. Hooked! Evaluating engagement as continuation desire in interactive narratives. In Proceedings of the Fourth International Conference on Interactive Digital Storytelling (ICIDS 2011), 219–230.

Sengers, P. 1998. Anti-Boxology: Agent Design in Cultural Context. Ph.D. Dissertation, Carnegie Mellon University.

Snow, C. P. 1964. The Two Cultures. New York: Mentor Books.

Thue, D.; Bulitko, V.; Spetch, M.; and Romanuik, T. 2011. A computational model of perceived agency in video games. In Proceedings of the Seventh Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), 91–96.

Turner, S. R. 1993. Minstrel: A Computer Model of Creativity and Storytelling. Ph.D. Dissertation, University of California, Los Angeles, Los Angeles, CA, USA.

Vermeulen, I.; Roth, C.; Vorderer, P.; and Klimmt, C. 2011. Measuring user responses to interactive stories: Towards a standardized assessment tool. In Proceedings of the Fourth International Conference on Interactive Digital Storytelling (ICIDS 2011), 38–43.

Vipond, D., and Hunt, R. A. 1984. Point-driven understanding: Pragmatic and cognitive dimensions of literary reading. Poetics 13:261–277.

Zhu, J., and Harrell, D. F. 2011. Navigating the two cultures: A critical approach to AI-based literary practice. Singapore: World Scientific. 222–246.