Experimental Results from a Rational Reconstruction of MINSTREL

Brandon Tearse, Peter Mawhorter, Michael Mateas, Noah Wardrip-Fruin
Department of Computer Science
University of California, Santa Cruz
Santa Cruz, CA 95064
{batman, pmawhorter, michaelm, nwf}@soe.ucsc.edu

Abstract

This paper presents results from a rational reconstruction project aimed at exploring the creative potential of Scott Turner's 1993 MINSTREL system. In particular, we investigate the properties of Turner's original system of Transform-Recall-Adapt Methods (TRAMs) and analyze the performance of an alternate TRAM application strategy. To estimate the creativity of our reconstructed system under various conditions, we measure both the variance of the output space and the ratio of sensible to nonsensical results (as determined by hand-labeling). Together, these metrics give us insight into the creativity of the algorithm as originally constructed and allow us to measure the changes that our modifications induce.

Introduction

Scott Turner's 1993 MINSTREL system (Turner 1994) is considered a high-water mark for computational story generation. The system was based on the notion of imaginative recall: the idea that stories can be built by adapting fragments from other stories and jigsawing them together into something new. Turner argued that MINSTREL was a computer simulation of human creativity and that the Transform-Recall-Adapt Methods (TRAMs) were the cornerstone of that creativity, but he never fully evaluated his system. This paper formally evaluates the creativity of MINSTREL's TRAM system.

Previous work by Tearse et al. discussed an initial reimplementation of MINSTREL (Tearse, Mateas, and Wardrip-Fruin 2010); in this paper we detail experiments that shed light on the creativity exhibited by TRAMs in MINSTREL REMIXED. We have continued the rational reconstruction of the system and have attempted to measure creativity according to Turner's original definition. Turner states that "Whether or not something is creative depends on the number and quality of its differences from similar works" (Turner 1994), so our goal is to measure the variety and quality of the TRAM system's output relative to its input (the stories that it knows). Specifically, we measure the expected amount of variance between results and whether those results are sensible or nonsensical. Although the diversity of results is not a direct measure of variance relative to similar works, as diversity increases so does the likelihood that the system generates more creative results; diversity (along with quality) is therefore an appropriate proxy for the creativity of the system. Beyond measuring the creativity of the original system, we also modify the system in order to shed light on the trade-offs it makes between quality and diversity. In running these experiments, we aim to validate the performance of MINSTREL as a creative system relative to Turner's definition of creativity.

Related Work

Rational Reconstruction

Rational reconstruction is undertaken to investigate the inner workings of a system, ideally identifying the differences between implementation details and core processes. Projects such as that of Musen et al. (Musen, Gennari, and Wong 1995) have successfully used rational reconstruction to better understand the fundamental concepts of a system.
A partial reconstruction of MINSTREL has also been performed (Peinado and Gervas 2006), in which MINSTREL's knowledge representation was recreated in W3C's OWL. While this did a good job of demonstrating that the knowledge representation could be successfully recast, without the full system in place it could not be used to investigate other aspects of MINSTREL.

CBR-Based Storytelling Systems

Case-Based Reasoning (CBR) is a popular approach to building intelligent systems, and MINSTREL's TRAM search system is based on it. While most CBR systems try to match input to a library and then adapt the responses into useful results, MINSTREL goes a step further by transforming its query in order to locate matches that are further afield, which in turn increases its creative options. Other systems have used enhanced CBR to develop stories or character actions (Fairclough and Cunningham 2003; Gervas et al. 2005). While these systems are all interesting in their own right, Turner in particular made claims about his system's creative output that can now be investigated.

Creativity

The idea of computational creativity is an area of interest for computer science, in part because it is closely tied to the notion of artificial intelligence. Unfortunately, creativity is difficult to define, and although Boden and Ritchie have both presented guidelines and measurements (Boden 2004; Ritchie 2007), their concepts remain difficult to implement. Given the nature of our specific system and the general agreement among Boden, Ritchie, and Turner on the importance of variety and quality, we decided to measure creativity based on Turner's suggestion (Turner 1994), looking at the variety and quality of the results. By using Turner's original definition, we are also measuring the system by its own standards, so to speak: our results will have direct bearing on whether Turner's system should be viewed as a success on his terms. Although measuring the number of possible outputs and the quality of those outputs (by measuring both result-set size and sense-to-nonsense ratios) is useful, we also found it interesting to cluster the results in a manner similar to Smith et al. (2011) in order to get a better picture of actual result differences.

Method

Rational Reconstruction

As a rational reconstruction, the ultimate goal of our project is twofold: to recreate MINSTREL in a form that can be used by others, and to investigate the design choices that went into MINSTREL in order to learn about the creative potential of the system. To achieve these goals, we have created MINSTREL REMIXED, a Scala codebase that reimplements the functionality of the original MINSTREL. Working from Turner's 1993 dissertation, we have tried to create a faithful reproduction of the original while introducing modularity. This modularity has allowed us to explore alternatives to several of Turner's design choices, thus better characterizing the trade-offs faced by the system.

TRAMs

The TRAM system performs all of the fine-grained editing in MINSTREL REMIXED. TRAMs are small bundles of operations designed to help recall information from the story library. A TRAM is invoked with a query: a graph of nodes describing some story fragment, expressed in MINSTREL's graph-based story representation. The TRAM transforms the query, finds matches in the story library, and adapts one of those matches before returning it.
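As a purely illustrative sketch (not the actual MINSTREL REMIXED code), the following Scala fragment shows one way such a transform-recall-adapt cycle could be organized, including the recursive, depth-limited search described below. The names QueryGraph, Tram, and TramSystem, and the placeholder graph matcher, are assumptions introduced here for exposition.

// Hypothetical sketch of a TRAM cycle; names do not correspond to the
// real MINSTREL REMIXED codebase.

// A query is a graph of story nodes (States, Goals, Acts, ...), here
// flattened to node id -> attribute map for brevity.
case class QueryGraph(nodes: Map[String, Map[String, String]])

trait Tram {
  // Loosen or alter the query (e.g. generalize a constraint).
  def transform(query: QueryGraph): QueryGraph
  // Map a recalled fragment back so that it satisfies the original query.
  def adapt(original: QueryGraph, recalled: QueryGraph): QueryGraph
}

class TramSystem(library: Seq[QueryGraph], trams: Seq[Tram]) {
  private val rng = new scala.util.Random()

  // Recursively transform the query until the library yields a match,
  // then adapt the match back up the chain of transformations.
  def search(query: QueryGraph, depth: Int): Option[QueryGraph] =
    library.find(matches(query, _)) match {
      case Some(fragment) => Some(fragment)
      case None if depth > 0 =>
        val tram = trams(rng.nextInt(trams.length)) // uniform random selection
        val transformed = tram.transform(query)
        search(transformed, depth - 1).map(tram.adapt(query, _))
      case None => None
    }

  // Placeholder matcher: a real matcher unifies graph structure rather
  // than comparing node ids and attributes directly.
  private def matches(query: QueryGraph, fragment: QueryGraph): Boolean =
    query.nodes.forall { case (id, attrs) =>
      fragment.nodes.get(id).exists { node =>
        attrs.forall { case (k, v) => node.get(k).contains(v) }
      }
    }
}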
Of course, the process of finding matches in the library may require further use of TRAMs; the TRAM system has the task of choosing which TRAMs to use during this recursive querying process. An example of TRAMs in action, using one of Turner's original King Arthur stories, is as follows: a graph is passed in requiring a knight, John, to die by the sword. By transforming this query with a TRAM called GeneralizeConstraint, we might end up with a query in which something dies in a fight. If this then matched a story about another knight, Frances, who kills an ogre, the TRAM system could replace the ogre with John the knight and return a fragment about a duel between John and Frances in which John dies. The creative power of TRAMs ultimately comes from their ability to find cases in the story library that aren't easily recognizable as applicable.

In the original MINSTREL, queries that fail to return results are transformed and resubmitted, which leads to a random sequence of transformations before a result is located. In MINSTREL REMIXED, however, we have implemented a weighted random TRAM selection scheme. This allows both the original behavior and a more targeted, weight-based TRAM application to be used. The targeted TRAM approach requires fewer alterations to reach a result and thus yields more similarity between the original query and the eventual result.

To provide a concrete illustration of the TRAM system and its ability to produce varying results, we can look at other ways for John the knight to die. If we start with a fragment in which John is required to die, alter it with one TRAM to require an ambiguous health change rather than death, and then apply another which replaces John with a generic person, the resulting query matches the story fragment shown in Figure 1. The generalized John's health-changing action matches Princess Peach, who uses a potion to hurt (rather than kill) herself. Upon adaptation back to the original requirements, the resulting fragment is that John commits suicide, killing himself with a potion (Figure 1 shows the process of generalization, recall, and adaptation from left to right).

Figure 1: An example of the transform and adaptation process.

Measuring Creativity

Many notions of creativity focus on the differences between system input and output. In particular, a system can be considered creative when it produces outputs that are significantly different from the given input (Pease and Winterstein 2001). Of course, the ability to generate a great variety of nonsensical outputs is not particularly impressive: variety must be balanced against quality. Unfortunately, both story quality and "significant difference" are difficult to measure in the realm of stories. In domains with less dense semantic information, perceived difference between artifacts is usually related to measurable qualities of those artifacts (for example, pitch and duration in music are directly measurable and affect the perception of a piece). Given measurable qualities that relate to difference, a distance function can be used to cluster the output of a generator, and the resulting clustering will characterize how much meaningful variety that generator can produce. For stories, however, even measuring significant difference is difficult, let alone constructing a distance metric between stories.
Measurable qualities, like word differences or story length, have unpredictable influences on the difference between two stories (for example, the same story could be told twice with very different overall lengths). More generally, every measurable aspect of a story has some relevance to the story, but changing less-relevant aspects of the story can result in insignificant differences. The problem of deciding which details of a story are relevant enough is itself a difficult unsolved AI problem; creating a computational distance metric over stories that corresponds to human perceptions would be a difficult task.

Besides characterizing the variety of our output, we hope to measure its quality. Ideally, each generated story could be assigned a quality value, and a result set could then be evaluated in terms of the quality of the results it contains. Of course, computationally evaluating story quality is also an open problem, so for both variety and quality we are forced to rely on estimates. To estimate the overall quality of a result set, we hand-label each distinct result in the set as either sensible or nonsensical. By using a binary labeling, we arrive at a relatively concrete metric over our set of results (an integral scale would be subject to more noise during labeling and would require a subjective weighting function to be comparable between tests). To estimate the variety of our result space, we measure the expected variation between a pair of results and use that as an estimate of the rate at which novel artifacts will be generated. The higher this estimate, the greater the variety of artifacts we expect for a fixed-size result set. Although we do not know exactly how variety among generated artifacts (measured using differences at the symbolic level, e.g., a princess is used instead of a knight) affects the variety of the results in terms of significant differences, we can assume that the two are correlated: the more varied the results in terms of raw symbol differences, the more likely it is that they contain stories that are significantly different.

Figure 2: The story space divided into creative groups.

Figure 2 is a demonstration of our estimation method for result-set variance. Within the entire space (all possible stories), each point represents a particular story (along with minor variants of that story, such as stories that substitute one particular weapon for another in a fight scene). These points in turn are grouped into sets, each of which is composed of stories that are perceptually similar (so significant differences exist between these story groups, but not within them). The actual groupings are unknown (because automatically measuring significant difference between stories isn't yet possible), but we can measure the number of different particular stories (individual points) that our generator can be expected to produce. Although the exact correspondence is unknown, it is clear that the greater the variety of particular stories our system generates, the more story categories it can be expected to create examples of (in this case, if our variance measure were to increase from four to five by adding a random story, there would be a two-in-five chance that the increase would also increase the creativity of the system by including a story from group 3).
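These estimates can be computed directly from a batch of generated results. The Scala sketch below shows one plausible way to do so; the Result type, the flattening of results into field-value maps, and the function names are assumptions introduced for illustration rather than the actual MINSTREL REMIXED code.

// Hypothetical sketch of the result-set estimates described above.
object ResultMetrics {
  // A result flattened to a map of field -> symbol, e.g. "actor" -> "Shnookie".
  type Result = Map[String, String]

  // Expected number of differing fields between two randomly chosen results.
  def expectedDifferences(results: Seq[Result]): Double = {
    val pairDiffs = for {
      (a, i) <- results.zipWithIndex
      b      <- results.drop(i + 1)
    } yield (a.keySet union b.keySet).count(k => a.get(k) != b.get(k)).toDouble
    if (pairDiffs.isEmpty) 0.0 else pairDiffs.sum / pairDiffs.size
  }

  // Sense-to-nonsense (s/ns) ratio, given a hand-built sensibility labeling.
  def senseToNonsenseRatio(results: Seq[Result], isSensible: Result => Boolean): Double = {
    val (sense, nonsense) = results.partition(isSensible)
    if (nonsense.isEmpty) Double.PositiveInfinity
    else sense.size.toDouble / nonsense.size
  }
}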
Our measure of story variance is thus a proxy for the number of creative categories that our system will produce examples of, which in turn (along with our measure of the sensibility of the results) reflects the creativity of the system.

Experimental Setup

To measure the creativity of MINSTREL's TRAMs, we performed a variety of experiments, each of which involved a single query. For each experimental condition, we ran five runs of 1000 repeated queries. We then calculated averages for each condition and computed four metrics. First, we tallied the number of unique results (by collapsing exact duplicates). Next, we computed the number of sensible versus nonsensical results using a hand-built mapping from results to sensibility that covered all results. Finally, we computed both the probability that a pair of queries under the given experimental conditions would have at least one difference and the expected number of differences between such a pair. In addition to these measures, we computed separate sense-to-nonsense ratios (s/ns) for just the unique results. Using these numbers, we are able to characterize the creativity of MINSTREL under our various conditions by comparing them to the baseline.

Figure 3: Our default query.

Each experiment varied one of five underlying variables: the depth limit used to find results (how far TRAMs can recurse), the query supplied, the story library used, the set of TRAMs available, and the weights on the TRAMs in use. For the base case, we used a search depth of five, a default test query, our full story library, our full set of TRAMs, and uniform TRAM weights. Our default query consisted of an unspecified State node connected to a Goal node, which connected to an Act node and then another State that linked back to the Goal. In terms of links, the first state motivated the goal, which planned the act, which intended the second state, which achieved the goal (Figure 3 shows this structure). In the Act node, we specified "Shnookie the Princess" as the actor, and in the second State node, we specified a type of "Health" and a value of "Dead". Beyond these constraints, the nodes were completely unspecified, so our query corresponded to the instruction: "Tell me a story in which Shnookie the princess kills something."

Our story library contained sixteen hand-authored stories, ranging from simple ("PrincessAndPotion", in which a princess drinks a potion and becomes injured) to fairly complex (our largest story contained 26 nodes and included eight different nouns). On average, our stories contain 11.5 nodes (Goals, Acts, States, and Beliefs) and include four nouns. For these tests, we used a limited set of TRAMs focused on Act nodes (hence the structure of our query). To test the full range of TRAMs, we would need to engage MINSTREL's author-level planning system in order to perform multiple lookups over a single query, which would make it difficult to make statements about the TRAM system in isolation. Given our limited query, we used a total of seven TRAMs: GeneralizeConstraint, GeneralizeActor, GeneralizeRole, LimitedRecall, RecallActs, IntentionSwitch, and SimilarOutcomes.

Results

To get a sense of the TRAM system's creativity, we can look at our baseline result. We find that our measure of expected variance is about 7 (6.90), while our measure of quality is near 0.5 (0.544).
In other words, if we were to submit two queries using these parameters, we'd expect about seven fields in which the results would differ, and only half as many sensible results as nonsensical ones on average. These parameters are not desirable for producing stories, because they yield too many nonsensical results (only about 35% (35.2%) of total results were sensible), but for testing the system the mix of sense and nonsense allows us to observe how changes promote one or the other. The expectation of seven differences between two random results is encouraging: it indicates that there is variety in the output space. Looking at the total number of stories, we can see that 1000 trials produce an average of about 70 (68.8) unique stories, about 21 (21.0) of which will be sensible. Among unique results, the s/ns ratio is around 0.4 (0.439). The fact that this ratio is higher among total results than among unique results indicates that sampling among the unique results is biased towards sensible ones: sensible stories are repeated more often than nonsensical ones. Our baseline sets a high bar for variance, but exhibits lackluster quality.

Figure 4: Total sensible versus nonsensical results.

Looking at the results as a whole, we can see some significant differences between the various cases. In Figures 4 and 5, our baseline query is shown under the heading Control. The D2, D3, D4, and D6 headings correspond to TRAM depth limits of 2, 3, 4, and 6 respectively (the baseline and all other conditions used a depth of 5). In terms of the proportion of total results which are sensible, the depth doesn't seem to make a difference. Additionally, depths 3, 4, 5, and 6 are all roughly equivalent in terms of the total number of responses generated, both sensible and nonsensical. Depth 2 does generate significantly fewer results, however, although the s/ns ratio is still approximately the same as in the other runs. We can hypothesize that although a significant number of possible stories require at least three TRAMs to reach from our test query, the distribution of these depth-3 stories in terms of sense and nonsense is roughly the same as the distribution of results at depth 2. Based on this idea, we graphed the actual distribution of results across TRAM depth for the D6 test (shown in Figure 7, which also shows results for the Weight case). This confirmed our hypothesis: most results have depth 2, many others have depth 1 or 3, but after depth 3 the number of results falls off sharply. Looking at just the unique results (Figure 8), we can see an even more marked bias towards lower depths, with extremely few results at depths 5 and 6. Comparing Figures 7 and 8, we see that many of the results at depths 5 and 6 are almost certainly identical to results at lower depths, unless the deeper unique results are repeated much more often; examining the log files in detail confirms this. In terms of the MINSTREL system, then, it appears that some TRAMs might have no effect on the result of a search, possibly because other TRAMs later reverse their effects. Of course, these TRAMs do have an impact on the overall distribution of results, even if they sometimes don't affect individual searches, and the TRAMs are only sometimes ineffectual.

After experimenting with the TRAM depth, we ran an alternate query to explore how much our results might depend on the query.
The alternate query had the same graph structure, but stipulated that the goal and motivating state nodes had type "Health" and that the actor of the act was a troll rather than a princess. Additionally, the state node in the alternate query was completely blank. As Figures 4 and 5 show, this alternate query generated fewer unique results, but had much better s/ns ratios in both the unique results (0.947) and the total results (3.103). Given our TRAMs and story library, this alternate query is apparently more restrictive, but also makes it easier for the system to find sensible results. As might be expected, this increased story quality comes with a decrease in story variance: Figure 6 shows that the expected number of differences between two results using the alternate query falls to about 3.5 (approximately half of the 7 differences expected from the baseline). The query test shows that our example query should not be taken as a good estimate of the average query: there is a lot of variation between queries, and we have no reason to expect that our default query is representative. This test also exhibits the fundamental trade-off of our system: more constrained queries increase the s/ns ratio, but at the cost of reduced variation. To overcome this trade-off, the system would have to either cull nonsensical results or simply generate variety that does not include nonsense.

Figure 5: Unique sensible versus nonsensical results.

After varying the query, we next tested the effect of library size. We constructed a smaller story library by randomly removing eight stories from our default library and then re-ran the original query (the results are listed under Story in our graphs). This smaller story set significantly reduced the number of unique stories generated, while at the same time depressing the s/ns ratio. The expected number of differences remained high, but removing stories clearly degrades the system: MINSTREL has trouble finding parsimonious story fragments during TRAM searches, and as a result it more often resorts to nonsensical matches. Interestingly, though, the decrease in the number of unique results was not proportional to the decrease in the number of stories (although the decrease in the number of sensible unique results was). This implies that nonsensical stories are easier to generate than sensible ones.

Our next test was the most promising: we took the two TRAMs that create the most liberal changes (SimilarOutcomes and GeneralizeConstraint) and removed them. The results clearly show how important TRAMs are: the sense-to-nonsense ratio was drastically increased for both total and unique results. At the same time, they show even more clearly that nonsense is the usual price for variation: although the majority of results were now sensible, the variation within these results was reduced. The TRAM case has an expected-differences value close to 3, compared to the original 7. Effectively, even though these liberal TRAMs often create nonsense, they are also key in creating enough variation to generate creative results.

For our final trial, we implemented an alternate search method that we hoped would provide a compromise between the chaos of the more liberal TRAMs and the boring results generated without them. Rather than using uniform random TRAM selection during search, we used weighted random selection and biased the selection towards less-liberal TRAMs.
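As an illustration only, the Scala sketch below shows one way such weighted random selection could be implemented; the function and the particular weights assigned to each named TRAM are assumptions for exposition, not the values or code used in MINSTREL REMIXED.

import scala.util.Random

// Hypothetical sketch of weighted random TRAM selection.
object WeightedTramSelection {
  def pickWeighted[A](weighted: Seq[(A, Double)], rng: Random): A = {
    val total = weighted.map(_._2).sum
    val roll  = rng.nextDouble() * total
    // Running totals of the weights; select the first entry whose
    // cumulative weight covers the roll.
    val cumulative = weighted.map(_._2).scanLeft(0.0)(_ + _).tail
    weighted.zip(cumulative)
      .collectFirst { case ((tram, _), acc) if roll < acc => tram }
      .getOrElse(weighted.last._1) // guard against floating-point edge cases
  }

  // Illustrative weights: liberal TRAMs get low weights so they are chosen
  // less often, while more specific TRAMs are favored.
  val tramWeights: Seq[(String, Double)] = Seq(
    "SimilarOutcomes"      -> 2.0,
    "GeneralizeConstraint" -> 3.0,
    "GeneralizeActor"      -> 5.0,
    "GeneralizeRole"       -> 5.0,
    "RecallActs"           -> 10.0
  )

  val chosenTram: String = pickWeighted(tramWeights, new Random())
}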
We gave the two liberal TRAMs that had been removed in the TRAM case weights of 2 and 3, most of the TRAMs got a weight of 5, and two of the more specific TRAMs got weights of 8 and 10. Given these weights, the expected variation was maintained, but the s/ns ratios decreased for both total and unique results. The number of unique results decreased as well. Figure 7 shows that among all results, results found after only a single TRAM application significantly increased, at the expense of results that used more than four TRAMs. Interestingly, the distribution of unique results among depths (seen in Figure 8) did not fundamentally change. Essentially, our weighting scheme did not help find new results, but instead biased the total results towards shallow unique results, some of which were nonsensical. This result demonstrates that even minor changes to the search method can negatively impact the results (as opposed to simply favoring either variety or quality). The fact that TRAM weights can significantly impact the result set also suggests that a principled approach to selecting TRAM weights could potentially enhance the system.

Conclusions

To evaluate the creativity of MINSTREL's TRAM system, we adopted formal measures of variety and quality and used them to systematically investigate the effect of MINSTREL's parameters and design choices on the creativity of the system. In addition to indicating measurable creativity, we were pleased to notice that a number of the story fragments were interesting in their own right (for example, one of our stories involved a group of possessed townsfolk: the princess wants to injure them because they are possessed, but ends up killing them).

Figure 6: The expected number of fields that will differ between two results.

Figure 7: TRAM depths for the D6 and Weight trials.

Figure 8: Unique depths for the D6 and Weight trials.

We feel that our main finding is that there is a demonstrable trade-off between the variety and the quality of the results: conditions that increase the quality of the results come with corresponding decreases in their variety. No single configuration among our tests emerged as superior, but given a preference for either variance or sensibility, we have enough data to recommend a set of parameters. By measuring both variance and quality, we were able to effectively track the creative properties of the system and observe both expected and unexpected changes under various conditions. Although our measurement of creativity is not direct, our data-driven approach has allowed us to make very specific statements about the way the system operates, and to discover nuances of the system that would not be uncovered with a more general creativity assessment.

Future Work

The next step in the evolution of MINSTREL REMIXED is to implement Turner's author-level planning system. Creativity in MINSTREL is not restricted to the TRAM system alone: the way that author-level plans use the TRAM system, and the interplay between the various author-level plans, gives rise to sensible creative results (for example, there are author-level plans that check the consistency of the story, which can be helpful when the TRAM system comes up with odd results). Once author-level planning is implemented, a more holistic study of MINSTREL's creativity could be produced.
To do this, more modern measures of creativity, such as cognitively inspired methods (Riedl and Young 2006) and non-automated evaluative frameworks (Ritchie 2007), should be investigated to determine which measures would be applicable to the full system's output. We may also decide to revisit our TRAM weighting system. Although the weights that we chose for this experiment resulted in poor performance, armed with metrics for variance and quality we could optimize the TRAM weights. This process would provide further information about how each TRAM contributes to variation and to sensibility.

Acknowledgements

This work was supported in part by the National Science Foundation, grants IIS-0747522 and IIS-1048385. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Boden, M. 2004. The Creative Mind. London: Routledge, 2nd edition.

Fairclough, C., and Cunningham, P. 2003. A multiplayer case based story engine. Technical report, Department of Computer Science, Trinity College Dublin, Dublin.

Gervas, P.; Diaz-Agudo, B.; Peinado, F.; and Hervas, R. 2005. Story plot generation based on CBR. Knowledge-Based Systems 18(4-5):235–242.

Musen, M. A.; Gennari, J. H.; and Wong, W. W. 1995. A rational reconstruction of INTERNIST-I using PROTEGE-II. In Proc. of the Annual Symposium on Computer Applications in Medical Care, 289–293.

Pease, A., and Winterstein, D. 2001. Evaluating machine creativity. In Proc. of the ICCBR Workshop on Creative Systems, 129–137.

Peinado, F., and Gervas, P. 2006. Minstrel Reloaded: From the Magic of Lisp to the Formal Semantics of OWL. Lecture Notes in Computer Science 4326:93–97.

Riedl, M., and Young, M. 2006. Story planning as exploratory creativity: Techniques for expanding the narrative search space. New Generation Computing 24(3):303–323.

Ritchie, G. 2007. Some empirical criteria for attributing creativity to a computer program. Minds and Machines 17:67–99. doi:10.1007/s11023-007-9066-2.

Smith, G.; Whitehead, J.; Mateas, M.; Treanor, M.; March, J.; and Cha, M. 2011. Launchpad: A rhythm-based level generator for 2D platformers. To appear in IEEE Transactions on Computational Intelligence and AI in Games.

Tearse, B.; Mateas, M.; and Wardrip-Fruin, N. 2010. MINSTREL Remixed: A rational reconstruction. In Proc. of the INT III Workshop. Monterey, CA: ACM.

Turner, S. R. 1994. The Creative Process: A Computer Model of Storytelling and Creativity. Hillsdale, New Jersey: Lawrence Erlbaum Associates.