Automatic evaluation of punning riddle template extraction

Try Agustini and Ruli Manurung
Faculty of Computer Science, Universitas Indonesia
Depok 16424, Indonesia
try.agustini@gmail.com and maruli@cs.ui.ac.id

Abstract

This paper reports an empirical study that automatically evaluates the ability of T-PEG (Hong and Ong 2009) to extract joke templates, by providing it with a corpus of punning riddles produced by another system, STANDUP (Manurung et al. 2008). This setup allows us to compare the extracted templates against the underlying data structures used by STANDUP in generating the corpus. In our setup, T-PEG is modified with a generalization component that clusters extracted templates based on structural similarity. These clusters are then compared against the underlying rules used by STANDUP to measure how well T-PEG is able to induce the schemas that STANDUP used to generate the jokes. Whilst far from conclusive, an overall precision of 0.61 and recall of 0.763 suggest that T-PEG is able to extract some salient information regarding the underlying lexical relationships found within a punning riddle.

Background

The automatic construction of jokes, specifically punning riddles, as artefacts of linguistic creativity has received considerable attention in recent years; see (Ritchie 2005) for a good overview. Compared to other forms of linguistic creativity, such as stories and poems, punning riddles obey more formal structures and are hence more amenable to automated construction. Most existing systems, such as JAPE (Binsted 1996) and STANDUP (Manurung et al. 2008), work from a predefined set of rules, often called schemas, which form the foundation of a joke and arguably encode the humorous knowledge that makes the resulting text recognizably a joke. The schemas are typically handcrafted by the researchers based on analysis and observation of collections of exemplar jokes such as punning riddles.

T-PEG (Hong and Ong 2009) is a system that works along similar lines, but with a crucial difference: the schemas, or in T-PEG terms the ‘templates’, are automatically extracted from a corpus of exemplar punning riddles. Although (Hong and Ong 2009) report a manual evaluation of the extracted templates, it is subjective in nature and limited in scope. In this paper we present our work in automatically evaluating the template extraction functionality of T-PEG by providing it with example jokes produced by STANDUP. This allows a more extensive empirical study, because we have access to the underlying data structures used by STANDUP in generating the samples.

The rest of the paper is structured as follows. We first provide a brief overview of automatic punning riddle generation, describing the mechanisms of STANDUP and T-PEG, before discussing our proposed setup for automatically evaluating the extracted templates. We then present the results and a discussion thereof.

Punning riddle generation

STANDUP

Riddle generation in systems such as JAPE and STANDUP consists of several stages, where at each stage a particular kind of rule is selected and instantiated, i.e. schemas, description rules, and templates. Instantiation is performed by matching against a lexical database. STANDUP utilizes various lexical resources, among others Unisyn (http://www.cstr.ed.ac.uk/projects/unisyn) for phonetic information and WordNet (Fellbaum 1998) for lexical semantic information.
A schema is composed of a header, which specifies a symbolic label and its input parameters; a set of lexical preconditions, which specify phonetic, syntactic, and semantic relations that must hold between key lexical items; and a question and answer specification, which determines the output for the next stage (Manurung et al. 2008). For example, the newelan2 schema is defined as follows:

• Header: newelan2(NP, A, B, HomB)
• Lexical preconditions: nouncompound(NP, A, B), homophone(B, HomB), noun(HomB)
• Question specification: shareproperties(NP, HomB)
• Answer specification: phrase(A, HomB)

It states that a punning riddle can be generated if a set of four lexical items, i.e. NP, A, B, and HomB, can be found within a lexical database such that A and B form the compound noun NP, B is a homophone of HomB, and HomB is a noun. For example, in the following worked example taken from (Manurung et al. 2008), an instantiation in which NP = computer screen, A = computer, B = screen, and HomB = scream (homophony can be generalized to include pairs of words whose phonetic similarity is above a certain threshold) could give rise, after the two further stages, to a riddle such as “What do you call a shout with a window? A computer scream.”

Once a schema has been instantiated as above, the question and answer specifications are non-deterministically matched against a set of description rules, which encode linguistic variations that are warranted given the schema instantiation. These rules have a structure similar to schemas, in that they have a header, some preconditions, and an output expression called the template specifier. For example, one description rule is:

• Header: shareproperties(X, Y)
• Preconditions: meronym(X, MerX), synonym(Y, SynY)
• Template specifier: merHyp(MerX, SynY)

In the example joke given above, the question specification shareproperties(computer screen, scream) would match the header for this rule, where the values (computer screen, scream) would be bound to the local variables X and Y of the rule. Subsequently, the preconditions check further lexical relations to determine whether the rule is applicable. They may also instantiate additional variables, e.g. MerX and SynY; in the example above, MerX is bound to window, a meronym of computer screen, and SynY is bound to shout, a synonym of scream. The answer specification phrase(A, HomB) trivially concatenates the instantiations, e.g. computer scream.

Finally, all these instantiations are passed on to the template-filling stage, where the template specifier merHyp(MerX, SynY) is non-deterministically matched against a set of canned text such as “What do you call a * with a * ?”, where * indicates a slot to be filled with the instantiated words.
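To make the schema-instantiation step concrete, the following sketch represents the newelan2 schema declaratively and enumerates bindings against a toy lexicon. It is an illustration only, not STANDUP's actual implementation; the dictionaries and the helper function are invented for this example, standing in for the Unisyn and WordNet lookups used by the real system.

```python
# Illustrative sketch only: not STANDUP's actual code. The toy lexicon below stands in
# for the Unisyn/WordNet lookups performed by the real system.

# The newelan2 schema as a declarative rule: header, preconditions, and output specs.
NEWELAN2 = {
    "header": ("newelan2", ("NP", "A", "B", "HomB")),
    "preconditions": [("nouncompound", "NP", "A", "B"),
                      ("homophone", "B", "HomB"),
                      ("noun", "HomB")],
    "question": ("shareproperties", "NP", "HomB"),
    "answer": ("phrase", "A", "HomB"),
}

# Toy lexical database.
COMPOUNDS = {("computer", "screen"): "computer screen"}   # nouncompound(NP, A, B)
HOMOPHONES = {"screen": ["scream"]}                        # homophone(B, HomB)
NOUNS = {"computer", "screen", "scream"}                   # noun(HomB)

def instantiate_newelan2():
    """Enumerate variable bindings that satisfy the newelan2 lexical preconditions."""
    for (a, b), np in COMPOUNDS.items():
        for homb in HOMOPHONES.get(b, []):
            if homb in NOUNS:
                yield {"NP": np, "A": a, "B": b, "HomB": homb}

if __name__ == "__main__":
    for binding in instantiate_newelan2():
        print(binding)
        # {'NP': 'computer screen', 'A': 'computer', 'B': 'screen', 'HomB': 'scream'}
```

Each such binding would then be handed to the description-rule and template-filling stages described above.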
T-PEG

From the preceding section, we can see that the roles of schemas, description rules, and templates are crucial in defining the humorous effect of the resulting riddle. In both JAPE and STANDUP, these rules were manually handcrafted; STANDUP, for instance, has 11 schemas. (Hong and Ong 2009) present T-PEG (Template-based Pun Extractor and Generator), a system that generates punning riddles in a manner similar to JAPE and STANDUP, i.e. working from a set of symbolic rules that define a punning riddle, it instantiates certain key variables by selecting appropriate sets of words from a combination of lexical resources. The crucial difference is that whereas the symbolic rules used by JAPE and STANDUP were manually crafted, T-PEG relies on templates that are automatically extracted from a collection of exemplar jokes.

Given a sample punning riddle, T-PEG constructs a template by first identifying nouns, verbs, and adjectives with the help of a part-of-speech tagger and replacing them with what they term regular variables. Additionally, similar-sound variables are identified as words that are homophones of regular variables. Regular variables follow a naming convention where Xn indicates the n-th word in the question part and Yn indicates the n-th word in the answer part. Similar-sound variables are indicated by appending a dash and a number. All pairs of variables are then checked against a lexical resource to detect whether any semantic relations hold between them. The lexical resources used by T-PEG are Unisyn, WordNet, and ConceptNet (Liu and Singh 2004). For example, given the joke “How is a window like a headache? They are both panes”, T-PEG can extract the template “How is a <X3> like a <X6>? They are both <Y3>”, where the following word relations must hold:

• Y3-0 SoundsLike Y3
• X3 ConceptuallyRelatedTo Y3
• Y3 ConceptuallyRelatedTo X3
• Y3 PartOf X3
• X6 ConceptuallyRelatedTo Y3-0
• X6 IsA Y3-0
• Y3-0 ConceptuallyRelatedTo X6

From the example above we can see that the notion of a template in T-PEG is equivalent to the conflation of a schema, description rule, and template in STANDUP. The constructed templates are then filtered using a graph-connectedness heuristic: treating the variables as nodes and the word relationships as edges, a template must form a connected graph to be deemed valid. Once a template has been extracted, generation proceeds in a similar fashion to STANDUP.

In this paper we are less interested in the generation aspect of T-PEG, as it is largely similar to JAPE and STANDUP, and more interested in its ability to automatically learn or extract templates. Specifically, we are interested in assessing the ability of T-PEG to correctly identify the necessary and sufficient conditions for generating punning riddles. (Hong and Ong 2009) report a manual evaluation in which a subset of the extracted templates was assessed by a linguist, whose job was to determine whether the extracted templates were able to capture the ‘essential word relationships in a pun’. The evaluation criteria are based on the number of incorrect relationships identified by the linguist, including missing relationships, extra relationships, and incorrect word pairings. A scoring system from 1 to 5 is used, where 5 means there is no incorrect relationship, 4 means there is one incorrect relationship, and so on. They report an average score of 4.0 out of a maximum of 5.
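The extraction process just described can be summarized in the following sketch. It is a simplification rather than T-PEG's actual code: the part-of-speech tagging, homophone lookup, and relation checks against Unisyn, WordNet, and ConceptNet are passed in as stub functions, and every whitespace-separated token is counted when numbering variables.

```python
import itertools
from collections import defaultdict

# Simplified sketch of the T-PEG extraction loop described above (not the real system).

def extract_template(question, answer, is_content_word, homophones_of, relations_between):
    """Variabilize content words, add similar-sound variables, and collect word relations."""
    variables = {}
    for prefix, words in (("X", question.split()), ("Y", answer.split())):
        for i, word in enumerate(words, start=1):
            if is_content_word(word):
                variables[f"{prefix}{i}"] = word        # regular variables, e.g. X3, Y3
    for name, word in list(variables.items()):
        for j, similar in enumerate(homophones_of(word)):
            variables[f"{name}-{j}"] = similar          # similar-sound variables, e.g. Y3-0
    relations = [(a, rel, b)                            # check every ordered variable pair
                 for a, b in itertools.permutations(variables, 2)
                 for rel in relations_between(variables[a], variables[b])]
    return variables, relations

def is_connected(variables, relations):
    """T-PEG's validity filter: variables and relations must form a connected graph."""
    if not variables:
        return False
    graph = defaultdict(set)
    for a, _, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    seen, stack = set(), [next(iter(variables))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node] - seen)
    return seen >= set(variables)
```

The connectedness filter mirrors the heuristic above: a variable that participates in no word relationship leaves the graph disconnected, and the candidate template is discarded.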
Automatic evaluation of template extraction

There are two issues concerning the manual evaluation of template extraction presented in (Hong and Ong 2009). Firstly, we believe this evaluation is rather subjective. Although punning riddles are relatively simple and straightforward to analyse, the linguist was not the original author of the jokes, and thus there is room for misinterpretation or incorrect emphasis. Furthermore, relying on the manual judgment of a linguist is both time-consuming and costly. Secondly, the manual evaluation reported in (Hong and Ong 2009) was carried out on 27 templates generated from 27 jokes, which is a rather small sample from which to draw any conclusions.

Our observation is that if one had access to a large corpus of punning riddles that had somehow been annotated with the ‘correct’ word relationships that underlie each joke, one could assess T-PEG’s template extraction functionality by comparing the resulting templates against the reference word relationships. Unfortunately, we know of no such annotated resource. However, we can use an existing punning riddle generator such as STANDUP to produce an approximation of such a resource, since we can access the underlying data structure of a generated punning riddle: in STANDUP’s joke generation module, the JokeGraph object of a generated punning riddle provides full access to the underlying lexical relationships.

Another issue we attempt to address is the fact that T-PEG makes no attempt to generalize the extracted templates. Given fifty exemplar punning riddles, it will attempt to construct fifty templates. Hong and Ong state that this is beneficial ‘to increase coverage’. However, we contend that if we are interested in building systems that computationally model the mechanisms of linguistic humor, coverage is not enough. A creative generative system should be able to generate artefacts from a limited set of symbolic rules. Thus, T-PEG should be able to carry out some abstraction over the extracted templates, to yield a set of highly productive patterns. These two goals form the rationale of our proposed setup, which we discuss in the next section.

Proposed setup

As discussed above, the purpose of our experiment is to automatically evaluate the ability of T-PEG to correctly extract the templates that underlie a collection of punning riddles. The proposed setup is as follows. Firstly, STANDUP is used to generate a large number of punning riddles. For each riddle, we note the actual rules used by STANDUP to generate it; these are used during the evaluation phase. The riddles are then given to T-PEG, which yields a template for each riddle. These templates are organized into clusters using agglomerative clustering, where the similarity between templates is computed with a structural similarity metric based on the semantic similarity evaluation function presented in (Manurung, Ritchie, and Thompson 2012). We then apply a simple majority rule to label the clusters, and evaluate the template extraction process using the widely-used notions of precision and recall. To achieve this, T-PEG first had to be modified by replacing its lexical and conceptual resources with those used in STANDUP, thus ensuring that the template extraction module would be able to identify the original lexical relationships in STANDUP.

Template clustering

The agglomerative clustering process starts with all templates belonging to singleton clusters. The distance of each cluster to all other clusters is then computed, where the distance between two clusters is defined as the average distance between each pair of elements contained within the two clusters, also known as average linkage clustering. The two clusters with the shortest distance are then merged. This process is repeated until k clusters remain, where k is provided as an input parameter.
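A minimal sketch of this procedure is given below. Here `distance` stands in for the structure-mapping-based template distance described in the next subsection, and the toy numeric example at the end is only there to show the merging behaviour.

```python
import itertools

def average_linkage(c1, c2, distance):
    """Average pairwise distance between the members of two clusters."""
    return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerative_cluster(templates, distance, k):
    """Start from singleton clusters; repeatedly merge the closest pair until k remain."""
    clusters = [[t] for t in templates]
    while len(clusters) > k:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], distance))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

if __name__ == "__main__":
    # Toy demonstration with numbers standing in for templates and |a - b| as the distance.
    print(agglomerative_cluster([1, 2, 9, 10, 20], distance=lambda a, b: abs(a - b), k=3))
    # [[1, 2], [9, 10], [20]]
```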
In defining the notion of distance between two templates, we turn to the structure-mapping work of (Love 2000) and (Falkenhainer, Forbus, and Gentner 1989), which defines a computational model of semantic similarity in terms of conceptual and structural similarity. Structural similarity measures the degree of isomorphism between two complex expressions, while conceptual similarity measures the relatedness between two concepts. More specifically, we use the semantic similarity evaluation function of (Manurung, Ritchie, and Thompson 2012), which implements a greedy algorithm based on Gentner’s structure-mapping theory (Falkenhainer, Forbus, and Gentner 1989). It takes two complex expressions, in our case two T-PEG extracted templates, and attempts to ‘align’ them in an optimal manner. It then applies a scoring function based on Love’s computational model of similarity (Love 2000) over various aspects of the alignment. This function yields a distance of zero between two conceptually and structurally identical templates, and larger distances for increasingly different template pairs.

Cluster labeling

We then automatically label the clusters using a simple majority rule. First, we define the underlying schema of a template to be the label of the STANDUP schema that was used to generate the punning riddle from which the template was extracted. A cluster is then automatically labeled by identifying the underlying schemas of all its member templates and choosing the schema that produced the majority of templates within that cluster. If several underlying schemas produced the same number of templates in a cluster, one is selected at random. For example, if a cluster contains 10 templates whose underlying schema is lotus and 6 templates whose underlying schema is newelan1, then that cluster is labeled as representing the lotus schema.

Precision and recall

Using these cluster labels, we can compute measures corresponding to the widely-used notions of precision and recall in pattern recognition. In classification tasks, these measures are defined as follows:

Precision = tp / (tp + fp)
Recall = tp / (tp + fn)

where in our case, given a cluster c with label l, tp (true positives) is the number of extracted templates in c whose underlying schema is l, fp (false positives) is the number of templates in c whose underlying schema is not l, and fn (false negatives) is the number of templates not in c whose underlying schema is l. Precision and recall can be computed for each cluster, or as an aggregate measure over all resulting clusters.

Experimental setup

The experimental setup is as follows. Firstly, STANDUP is used to generate 20 jokes for each of 10 schemas, namely bazaar, lotus, doublepun, gingernutpun, rhyminglotus, newelan1, newelan2, phonsub, poscomp, and negcomp, resulting in a collection of 200 exemplar jokes. These jokes are then analysed by T-PEG, which yields 200 joke templates. We then apply agglomerative clustering until 10 clusters are formed (since 10 STANDUP schemas are used). Our hypothesis is that, for T-PEG to be deemed successful in extracting templates, it should be able to correctly organize the 200 templates into 10 clusters that correspond to the 10 STANDUP schemas. The precision and recall metrics provide quantitative measures of how well this is achieved.
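The labeling and per-cluster evaluation can be sketched as follows, assuming each extracted template is represented simply by the name of the STANDUP schema that generated its source joke (random tie-breaking for the majority vote is omitted). The example at the end reproduces the first row of Table 2 below.

```python
from collections import Counter

def label_cluster(cluster_schemas):
    """Majority rule: label the cluster with its most frequent underlying schema."""
    return Counter(cluster_schemas).most_common(1)[0][0]

def precision_recall(cluster_schemas, label, all_schemas):
    tp = sum(1 for s in cluster_schemas if s == label)   # members whose schema matches the label
    fp = len(cluster_schemas) - tp                       # members from other schemas
    fn = all_schemas.count(label) - tp                   # the label's templates placed elsewhere
    return tp / (tp + fp), tp / (tp + fn)

if __name__ == "__main__":
    schemas = ["bazaar", "lotus", "doublepun", "gingernutpun", "rhyminglotus",
               "newelan1", "newelan2", "phonsub", "poscomp", "negcomp"]
    all_templates = [s for s in schemas for _ in range(20)]   # 20 jokes per schema
    cluster = ["rhyminglotus"] * 20 + ["bazaar"] * 2          # clusters 1 and 6 of Table 1, merged
    label = label_cluster(cluster)
    print(label, precision_recall(cluster, label, all_templates))
    # rhyminglotus (0.909..., 1.0) -- the first row of Table 2
```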
Results and discussion

Table 1 shows the result of applying agglomerative clustering to the 200 templates until 10 clusters remain. The first column indicates the cluster number. The second and third columns specify the cluster membership, indicating the number of templates with a given underlying STANDUP schema found within that cluster. The fourth column indicates the cluster size, and the last column the label assigned to that cluster using the majority rule described above. For example, cluster 1 contains 21 templates, 19 of which have rhyminglotus as their underlying schema and 2 of which have bazaar as their underlying schema; accordingly, it is assigned the label rhyminglotus.

No.  Schema         Amount  Total  Label
1    rhyminglotus       19     21  rhyminglotus
     bazaar              2
2    bazaar             11     11  bazaar
3    bazaar              3      3  bazaar
4    lotus              20     89  lotus
     newelan1           19
     gingernutpun       17
     newelan2           16
     doublepun          13
     bazaar              4
5    doublepun           7     14  doublepun
     newelan2            4
     gingernutpun        3
6    rhyminglotus        1      1  rhyminglotus
7    newelan1            1      1  newelan1
8    phonsub            20     20  phonsub
9    poscomp            20     20  poscomp
10   negcomp            20     20  negcomp

Table 1: Results of agglomerative clustering

Table 2 shows the precision and recall values computed for the clustering results. The first two columns indicate the cluster numbers and assigned labels, corresponding to the information in Table 1, and the last two columns indicate the precision and recall values computed for each cluster. The final row gives the overall precision and recall; this aggregate takes the different cluster sizes into account as weights. Note that clusters 1 and 6 are collapsed because they were both labeled rhyminglotus, and similarly clusters 2 and 3 because they were both labeled bazaar.

No.      Label          Precision       Recall
1 & 6    rhyminglotus   20/22 = 0.91    20/20 = 1
2 & 3    bazaar         14/14 = 1       14/20 = 0.7
4        lotus          20/89 = 0.225   20/20 = 1
5        doublepun       7/14 = 0.5      7/20 = 0.35
7        newelan1         1/1 = 1        1/20 = 0.05
8        phonsub        20/20 = 1       20/20 = 1
9        poscomp        20/20 = 1       20/20 = 1
10       negcomp        20/20 = 1       20/20 = 1
Overall                 0.61            0.763

Table 2: Precision and recall measures

Finally, Table 3 shows the confusion matrix of how the templates are classified according to their underlying schema. The rows indicate the original underlying schemas, and the columns indicate the cluster labels (dashes indicate zero counts). For example, of the 20 templates extracted from punning riddles generated using the bazaar schema, 14 are correctly found within a cluster labeled bazaar, 2 are found in a cluster labeled rhyminglotus, and 4 are found in a cluster labeled lotus.

Schema        bazaar  rhyminglotus  lotus  doublepun  gingernutpun  newelan1  newelan2  phonsub  poscomp  negcomp
bazaar            14             2      4          -             -         -         -        -        -        -
rhyminglotus       -            20      -          -             -         -         -        -        -        -
lotus              -             -     20          -             -         -         -        -        -        -
doublepun          -             -     13          7             -         -         -        -        -        -
gingernutpun       -             -     17          3             -         -         -        -        -        -
newelan1           -             -     19          -             -         1         -        -        -        -
newelan2           -             -     16          4             -         -         -        -        -        -
phonsub            -             -      -          -             -         -         -       20        -        -
poscomp            -             -      -          -             -         -         -        -       20        -
negcomp            -             -      -          -             -         -         -        -        -       20

Table 3: Confusion matrix of clustering results

From these results, we can see that only templates with phonsub, poscomp, and negcomp as their underlying schemas are perfectly identified. Templates with the underlying schemas rhyminglotus and lotus are correctly clustered together, but their clusters suffer some impurities, with templates from other schemas also being deemed to belong to them. Most notably, the cluster labeled lotus contains a very large number of templates from other schemas, namely bazaar, doublepun, gingernutpun, newelan1, and newelan2. The cluster labeled newelan1 contains only one template, and no clusters were labeled gingernutpun or newelan2. A purely random baseline, in which the 200 extracted templates are randomly assigned to 10 clusters, would yield an expected precision and recall of 0.1. Whilst this is an artificially low baseline, the overall precision of 0.61 and recall of 0.763 suggest that T-PEG is able to extract some salient information regarding the underlying lexical relationships of a pun.
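As a consistency check, the overall row of Table 2 can be reproduced from the per-cluster counts. The short sketch below assumes the aggregate figures are micro-averages over the labeled clusters (the unlabeled gingernutpun and newelan2 schemas therefore do not contribute false negatives):

```python
# (label, true positives, cluster size) for each merged cluster; every schema has 20 templates.
clusters = [("rhyminglotus", 20, 22), ("bazaar", 14, 14), ("lotus", 20, 89),
            ("doublepun", 7, 14), ("newelan1", 1, 1), ("phonsub", 20, 20),
            ("poscomp", 20, 20), ("negcomp", 20, 20)]

tp = sum(t for _, t, _ in clusters)                      # 122 correctly clustered templates
precision = tp / sum(size for _, _, size in clusters)    # 122 / 200 = 0.61
recall = tp / (20 * len(clusters))                       # 122 / 160 = 0.7625, reported as 0.763
print(precision, recall)
```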
However, certain underlying schemas are very difficult to distinguish from each other. Upon further analysis, we can see that the problems arise from the fact that the templates extracted by T-PEG conflate the various rules in STANDUP, i.e. schemas, description rules, and canned text templates. To illustrate the issues, observe the following two jokes, both generated using the lotus schema, and their resulting extracted templates.

Joke 1: What do you call a cross between a firearm and a first step? A piece initiative.

The resulting template is “What do you call a cross between a <X8> and a <X11> <X12>? A <Y1> <Y2>”, with the following word relationships:

• IsCompoundNoun(X11, X12)
• IsCompoundNoun(Y1-0, Y2)
• Synonym(X8, Y1)
• Synonym(Y2, X11:X12)
• Hypernym(X11:X12, Y1-0:Y2)
• Homophone(Y1-0, Y1)

From the joke we can see that the instantiations for the regular variables are X8 = firearm, X11 = first, X12 = step, Y1 = piece, and Y2 = initiative. Furthermore, the similar-sound variable Y1-0 is bound to peace, because in WordNet “peace initiative” is an instance of an initiative. Thus, the word relationships state that “first step” and “peace initiative” are compound nouns, firearm is a synonym of piece, initiative is a synonym of “first step”, which in turn is a hypernym of “peace initiative”, and lastly, peace and piece are homophones.

Joke 2: What do you call a cross between an l and a correspondent? A litre writer.

The resulting template is “What do you call a cross between an <X8> and a <X11>? A <Y1> <Y2>”, with the following word relationships:

• IsCompoundNoun(Y1-0, Y2)
• Synonym(X8, Y1)
• Synonym(X11, Y1-0:Y2)
• Homophone(Y1-0, Y1)

From the joke, we can see that the instantiations for the regular variables are X8 = l, X11 = correspondent, Y1 = litre, and Y2 = writer. Furthermore, the similar-sound variable Y1-0 is bound to letter, because in WordNet “letter writer” is a synonym of “correspondent”. Thus, the word relationships state that “letter writer” is a compound noun, l is a synonym of litre, correspondent is a synonym of letter writer, and litre and letter are deemed to be homophones.

From these two jokes and their resulting extracted templates, we can make several observations. Firstly, T-PEG correctly extracts the core lexical preconditions of the lotus schema, in that the ‘punchline’ must contain a compound noun Y1-0:Y2 (“peace initiative” and “letter writer”, respectively) whose first word is replaced with a homophone, Y1 (piece in the first joke, litre in the second). However, the two jokes use different description rules for the question part: whereas the former joke uses a synonym (firearm) and a hypernym (first step) to describe the punchline, the latter uses two synonyms, namely l and correspondent. Since T-PEG makes no distinction between word relationships arising from schemas and those arising from description rules, the choice of description rule, which is a somewhat trivial linguistic variation, leads the agglomerative clustering to falsely conclude that two jokes from different underlying schemas in fact use the same pattern. Additionally, T-PEG extracts ‘noisy’ word relationships that play no part in the joke construction: whereas the word relationships of the extracted template for the second joke capture exactly the necessary and sufficient conditions, in the former joke nothing hinges on the fact that “first step” is a compound noun, nor that initiative is a synonym of “first step”. Such extraneous word relationships further pull the templates into the wrong clusters.
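One concrete way to exploit the redundancy among templates that land in the same cluster (the direction discussed in the following paragraph and in the summary) is to intersect their relation sets. The sketch below does this for the two lotus templates above; it is only illustrative, since in general the variable names of different templates would first have to be aligned.

```python
# Relation sets of the two lotus templates above, written as (relation, arg1, arg2).
joke1 = {("IsCompoundNoun", "X11", "X12"), ("IsCompoundNoun", "Y1-0", "Y2"),
         ("Synonym", "X8", "Y1"), ("Synonym", "Y2", "X11:X12"),
         ("Hypernym", "X11:X12", "Y1-0:Y2"), ("Homophone", "Y1-0", "Y1")}
joke2 = {("IsCompoundNoun", "Y1-0", "Y2"), ("Synonym", "X8", "Y1"),
         ("Synonym", "X11", "Y1-0:Y2"), ("Homophone", "Y1-0", "Y1")}

core = joke1 & joke2
print(core)
# The intersection contains IsCompoundNoun(Y1-0, Y2), Homophone(Y1-0, Y1), and
# Synonym(X8, Y1): the differing description-rule relations and the extraneous
# "first step" relations are discarded, leaving the compound-noun/homophone core of
# lotus plus the one synonym link that both jokes happen to share.
```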
From this experiment, we can conclude that although T-PEG may be successful at learning some joke templates from sample punning riddles, it still makes faulty assertions as to the constraints that specify what makes a riddle ‘work’ as a joke. However, we speculate that much can be done to repair such errors. The data redundancy contained within the specific templates that are clustered together, for instance, can be further analysed to derive the core relationships at the heart of the punning riddle structure. This is an avenue of future work that we intend to explore.

Summary

The manually constructed rules of STANDUP are specifically designed to maximize generative power whilst retaining a fairly limited set of symbolic rules. T-PEG, on the other hand, tries to address the issue of having to handcraft rules by automatically extracting templates from example jokes. However, the evaluation of this functionality has so far been fairly limited, given the difficulty and cost of manual evaluation. This paper has presented a fairly novel methodology for automatically evaluating the performance of one creative system (T-PEG) by using another creative system (STANDUP) to produce sample data with complete underlying annotations to compare against. Although the results are far from conclusive, they corroborate the manual evaluation in (Hong and Ong 2009), which claims that T-PEG was successful in extracting templates from sample jokes. The experiment also sheds more light on where T-PEG is still lacking in its ability to extract the underlying generative rules of punning riddles.

The benefit of our setup is thus twofold: besides enabling automatic evaluation, the template clustering process proposed in this work attempts to address the generalizability of T-PEG. It is not sufficient to say that a huge number of templates will ensure maximum coverage. For a generative system to be deemed creative, it should be able to generate a high ratio of good quality artefacts from a limited set of symbolic rules. By breaking down its patterns into schemas, description rules, and templates, and by stating the conditions under which they can be composed, STANDUP is able to produce a very wide range of different jokes, and in some sense can “explain” the craftsmanship behind its joke production facilities, as each individual component represents a specific function of the joke. T-PEG’s templates, on the other hand, are monolithic structures that cannot be broken down into their functional components. This distinction is to be expected, given that the rules found within STANDUP were manually constructed, whereas T-PEG’s rules are automatically extracted. Nevertheless, this points towards a promising direction of future work in the automatic extraction of rules from creative artefacts. As stated in the previous section, we believe that the template clustering process opens up this possibility, i.e. by further leveraging the data redundancy contained within the resulting clusters.
Further still, given that the clustering process makes use of distances between templates and clusters, one could imagine a technique that selects the template closest to the centroid of its cluster as the most representative template for further generation. Finally, we would like to explore more sophisticated generalization techniques that would enable us to tease out the distinctions between component rules such as schemas, description rules, and canned text templates.

Acknowledgments

We would like to thank the creators of the STANDUP and T-PEG systems for making their software available for further experimentation and development. We would also like to thank the anonymous reviewers for their feedback and suggestions.

References

Binsted, K. 1996. Machine Humour: An Implemented Model of Puns. Ph.D. Dissertation, Department of Artificial Intelligence, University of Edinburgh, Edinburgh, UK.

Falkenhainer, B.; Forbus, K. D.; and Gentner, D. 1989. The structure-mapping engine: Algorithm and examples. Artificial Intelligence 41:1–63.

Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Hong, B. A., and Ong, E. 2009. Automatically extracting word relationships as templates for pun generation. In Proceedings of the 1st Workshop on Computational Approaches to Linguistic Creativity (CALC-09).

Liu, H., and Singh, P. 2004. ConceptNet: A practical commonsense reasoning toolkit. BT Technology Journal 22(4):221–226.

Love, B. C. 2000. A computational level theory of similarity. In Proceedings of the 22nd Annual Meeting of the Cognitive Science Society, 316–321.

Manurung, R.; Ritchie, G.; Pain, H.; Waller, A.; O’Mara, D.; and Black, R. 2008. The construction of a pun generator for language skills development. Applied Artificial Intelligence 22(9):841–869.

Manurung, R.; Ritchie, G.; and Thompson, H. 2012. Using genetic algorithms to create meaningful poetic text. Journal of Experimental & Theoretical Artificial Intelligence 24(1):43–64.

Ritchie, G. 2005. Computational mechanisms for pun generation. In Proceedings of the 10th European Natural Language Generation Workshop, 125–132.