Evaluating Musical Metacreation in a Live Performance Context

Arne Eigenfeldt, Contemporary Arts, Simon Fraser University, Vancouver, BC, Canada, arne_e@sfu.ca
Adam Burnett, Cognitive Science, Simon Fraser University, Burnaby, BC, Canada, ajb14@sfu.ca
Philippe Pasquier, Interactive Arts and Technology, Simon Fraser University, Surrey, BC, Canada, pasquier@sfu.ca

Abstract

We present an evaluation study of several musical metacreations. An audience that attended a public concert of music performed by string quartet, percussion, and Disklavier was asked to participate in a study to determine the music's success: 46 complete surveys were returned. Ten compositions, by two composer/programmers, were created by five different software systems. For purposes of validation, two of these works were human-composed, while a third was computer-assisted; the audience was not informed which compositions were human-composed. We briefly discuss the different systems, and present the artistic intent of each work, the methodology used in gathering audience responses, and the interpreted results of our analyses.

Introduction

The Musical Metacreation project (http://www.metacreation.net/) is an ongoing research collaboration between scientists and composer/musicians at Simon Fraser University that explores the theory and practice of metacreation – the notion of developing software that demonstrates creative behaviour (Whitelaw 2004). The objectives include not only developing software, but producing and presenting artistic works that use the software, and validating their musical success. The research team includes a composer of acoustic and electroacoustic music who has created music composition and performance systems for over twenty years, an artificial intelligence researcher whose specialty includes multi-agent systems and cognitive modeling (and who is himself a creative artist in the field of computer music, sound design, audio and media arts), and several research assistants who are composers and/or scientists.

The field of musical metacreation revolves around two central tasks:

• The composition task: the aim of this task is to produce music in the form of a symbolic representation, often a musical score. If the system takes existing compositions as input, it is said to be corpus-based.
• The interpretation task: given some symbolic musical notation, this task consists of generating an acoustic signal.

Sometimes, these two tasks coincide. For example, in electroacoustic music (in which we include electronica), an acoustic signal is directly generated as the output of the composition task. In the case of improvised music, composition and interpretation can be seen to happen simultaneously. The systems described in this paper, along with their evaluation, all address the composition task.

The creative systems produced by our research team have already been described in conference proceedings and journals, while the music produced has been presented in public concerts and festivals. On the surface, therefore, we could state that our work has already been validated; however, there are deeper issues involved that we discuss in this paper.
In considering how a metacreative system might be validated, there are at least five potential viewpoints that can be considered:

1. The designer: the designer of the system accepts the output as artistically valid;
2. The audience: the work is presented publicly, and the audience accepts the work;
3. The academic experts: the system is described in a technical peer-reviewed paper and accepted for conference or journal publication;
4. The domain experts: the system receives critical attention through the media or from non-academic artists via demonstration;
5. Controlled experiments: the system is validated through scientifically accepted empirical methods, using statistical analysis of the results in order to accept or reject the hypotheses made about the system.

In the first instance, any artwork created by a human, and publicly presented, presumably requires the artist to consider it complete, successful, and representative of the artist's aesthetic vision. Similarly, metacreative works have, so far and to our knowledge, reflected the artistic sentiment of their designers. According to this viewpoint, the system evaluation is made directly by the designer. In our case, our metacreative systems have produced works that we find artistically interesting.

The second viewpoint reflects an artist's desire to share their work with the public. Whether the audience accepts, appreciates, or enjoys the work is, unfortunately, often difficult to ascertain, as many audiences will politely applaud any work. One could include more quantitative measures, such as audience counts, album sales, or online downloads.

The third case involves peer review, albeit of a description of the system in technical terms. A different criterion is in place, one dependent less upon the artistic output, and more upon the technical contribution of the system in its novelty and usefulness. Often, the evaluation is also an evaluation of the originality and soundness of the process encoded in the system with regard to the computational creativity literature (Colton 2008).

In the fourth case, both metacreation software and its output can be discussed in the media. Journalists and critics are different from the regular audience, in that their opinion will be further diffused to the audience: this may influence the audience's judgment, and the work can gain or lose notoriety as a consequence.

Lastly, empirical quantitative or qualitative validation studies can be undertaken that involve methods long supported by the research community for generating knowledge within the hard and soft sciences. While the computational creativity literature has started investigating these (Pearce and Wiggins 2001; Pease et al. 2001; Ritchie 2007; Jordanous 2011), a great deal remains to be done.

While most previous work regarding the evaluation of musical metacreation (and of computationally creative software in general, for that matter) has focused on dimensions 1, 3, and 5, this paper presents an experimental study realized in the context of the public presentation of artworks in a concert setting (mixing dimensions 2 and 5). Moreover, there are very few instances of evaluation studies that consider more than one metacreative system at a time; our study is a comparative study of five different systems for computer-generated or computer-assisted composition. The remainder of this paper discusses our evaluation study, the results we obtained, and the questions that were raised.
We first describe the different software systems involved, as well as the artistic intent of the compositions produced. We then present the methodology used in gathering audience responses to the compositions, as well as the results garnered from these responses. Finally, we posit our conclusions, as well as potential future work in this area.

Description

The public presentation of the metacreative software systems described in this paper took place as a public concert in December 2011. The audience included members of the general public, as well as some students of the first and third authors. Ten compositions, by two composers, were performed by a professional string quartet, a percussionist, and a Disklavier (a mechanized piano equipped to interpret MIDI input). The music was produced by five different software systems designed and coded individually by the two composers. For comparison purposes, two of the pieces were composed without software – in other words, composed entirely by a human – and a third was computer-assisted. The audience was informed beforehand that at least two of the works were human-composed, but was not told which pieces these were; the program notes, however, made it rather obvious that fundatio and experiri were, at most, computer-assisted. See Table 1 for a list of compositions.

The Systems and Compositions

In Equilibrio was generated by a real-time multi-agent system, described in (Eigenfeldt 2009b). The system is concerned with agent interaction and negotiation toward an integrated melodic, harmonic, and rhythmic framework; its final output consists of MIDI events. The generated MIDI data was sent to a Yamaha Disklavier; no effort was made to disguise the fact that the performance was by a mechanical musical instrument. Along with the Disklavier and some high-level performance control by the composer, this system was responsible for both the "live" composition and its interpretation.

One of the Above consists of three movements for solo percussion. The music is notated by a system described in (Eigenfeldt and Pasquier 2012a). This system uses multiple evolutionary algorithms, including genetic algorithms, to control how a population of musical motives is presented in time, and how it is combined with other populations of motives. Intended for solo percussionist, the composition is a concentrated investigation into the development of rhythmic motives. Each movement of the composition was presented separately, and treated as a unique composition within the evaluation. One additional movement, composed with the same intentions as the other three in this series, is human-composed (for reasons discussed in the Evaluation Study section).

Dead Slow / Look Left is a notated composition for string quartet and percussion, generated by a system that employs the harmonic generation algorithm described in (Eigenfeldt and Pasquier 2010). The composition consists of a continuous overlapping harmonic progression generated using a harmonic analysis of 87 compositions by Pat Metheny, and a third-order Markov model based upon this analysis. In this corpus-based system, durations, dynamics, playing style, range, and harmonic spread were determined using patterns generated by a genetic algorithm. These continuous harmonies were interrupted by contrapuntal sections that interpret tendency masks (Truax 1991), which define such parameters as sequence length, number of instruments, subdivisions, playing style, number of playing styles, dynamics, and the number of gestures in a section.
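To illustrate the kind of corpus-based harmonic generation used in Dead Slow / Look Left, the following Python sketch trains and samples a third-order Markov model over chord symbols. It is a minimal illustration only: the toy corpus, chord labels, and function names are our own assumptions and do not reflect the actual implementation described in (Eigenfeldt and Pasquier 2010), which constrains selection and layers genetically generated patterns on top of the progression.

import random
from collections import defaultdict

# Minimal sketch: a third-order Markov model over chord symbols.
# The toy corpus and chord labels below are illustrative assumptions only;
# the actual system was trained on an analysis of 87 Pat Metheny pieces.

def train(progressions, order=3):
    """Count continuations for each context of `order` consecutive chords."""
    table = defaultdict(lambda: defaultdict(int))
    for prog in progressions:
        for i in range(len(prog) - order):
            context = tuple(prog[i:i + order])
            table[context][prog[i + order]] += 1
    return table

def generate(table, seed, length=16):
    """Extend a seed by weighted choice among observed continuations."""
    order = len(seed)
    out = list(seed)
    for _ in range(length - order):
        context = tuple(out[-order:])
        choices = table.get(context)
        if not choices:                      # unseen context: restart from the seed
            context = tuple(seed)
            choices = table[context]
        chords, counts = zip(*choices.items())
        out.append(random.choices(chords, weights=counts)[0])
    return out

if __name__ == "__main__":
    corpus = [
        ["Dm7", "G7", "Cmaj7", "Am7", "Dm7", "G7", "Cmaj7", "Cmaj7"],
        ["Am7", "Dm7", "G7", "Cmaj7", "Fmaj7", "Bm7b5", "E7", "Am7"],
    ]
    model = train(corpus)
    print(generate(model, seed=("Dm7", "G7", "Cmaj7")))

In the real system, the transition statistics come from the full Metheny analysis rather than a two-progression toy corpus, and the resulting harmonies are only one layer of the final score.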
Other, Previously was generated by a system described generally in (Eigenfeldt 2009a), while the composition itself is described more fully in (Eigenfeldt 2012b). A corpus of MIDI files – in this case 16 measures of the traditional Javanese ensemble composition Ladrang Wilugeng – was analysed, and generative rules regarding rhythmic construction were derived from the corpus. These rules were used by a genetic operator to create a population of ever-evolving melodies and rhythms that the system reassembled in a multi-agent environment over a rotating harmonic field. The real-time output was transcribed in a music notation program, and performed by string quartet. The end result is a piece of notated music that reflects many of the tendencies of the original corpus material, without direct quotation. The composer's role was limited to dynamic markings, orchestration, and assembling sections.

Gradual was generated by an extension of the system used to generate One of the Above, with an additional module to control pitch aspects integrated into the system. The final output was a notated work for marimba, violin, and Disklavier. While the system achieved the composition on its own, the interpretation was mixed: humans played the marimba and violin, while the system operated the Disklavier.

Table 1. Individual composition engagement score means (out of 5), by listener experience level. Standard deviations appear in parentheses. [c] = computer-composed, [h] = human-composed, [c-a] = computer-assisted.

 #   Composition                Instrumentation                 Expert        Novice        Combined
 1   In Equilibrio [c]          Disklavier                      3.17 (0.99)   2.71 (1.23)   2.90 (1.14)
 2   One of the Above #1 [h]    Solo percussion                 4.00 (1.00)   3.36 (1.19)   3.67 (1.13)
 3   Dead Slow / Look Left [c]  String quartet and percussion   4.16 (0.90)   3.08 (1.15)   3.51 (1.16)
 4   One of the Above #2 [c]    Solo percussion                 3.68 (0.67)   3.16 (1.07)   3.42 (0.93)
 5   fundatio [h]               String quartet                  4.29 (0.80)   4.24 (0.83)   4.24 (0.81)
 6   experiri [c-a]             String quartet                  4.47 (0.61)   4.36 (0.86)   4.40 (0.76)
 7   One of the Above #3 [c]    Solo percussion                 3.39 (0.76)   3.12 (1.20)   3.22 (1.04)
 8   Other, Previously [c]      String quartet                  4.31 (0.75)   4.50 (0.59)   4.40 (0.66)
 9   One of the Above #4 [c]    Solo percussion                 3.63 (1.16)   2.71 (1.00)   3.10 (1.16)
 10  Gradual [c]                Violin, marimba, Disklavier     4.05 (0.85)   3.88 (0.95)   3.93 (0.89)

fundatio and experiri were created by composer and software designer James Maxwell, with the help of his generative composition software, which rests on a cognitive model of music learning and production. This software, ManuScore, is partially described in (Maxwell et al. 2009, 2011). ManuScore is a notation-based, interactive music composition environment. It is not a purely generative system, but rather one that allows the composer to load a corpus and proceed with their usual compositional process while receiving recommendations from the system of possible continuations, as suggested by the model. fundatio was written using the commercial music notation software Sibelius, following the compositional process used by the composer for many years, while experiri was written using ManuScore. Although this latter work remains clearly human-composed, the formal development of the music, and much of the melodic material used, were both directly influenced by the software.
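As a concrete, if deliberately simplified, illustration of the evolutionary approach behind One of the Above, Other, Previously, and Gradual, the following Python sketch evolves a small population of 16-step rhythmic motives toward a target onset density. The fitness measure, operators, and parameter values here are hypothetical assumptions made for the sake of illustration; the actual systems evolve multiple populations of motives using far richer, corpus-derived criteria (Eigenfeldt and Pasquier 2012a; Eigenfeldt 2012b).

import random

# Hypothetical sketch of a genetic algorithm evolving 16-step rhythmic motives.
# A motive is a list of 0/1 onsets; fitness rewards closeness to a target onset
# density (here assumed, for illustration, to come from corpus analysis).

STEPS = 16
TARGET_DENSITY = 0.4375   # e.g. 7 onsets per 16 steps, an assumed corpus statistic

def random_motive():
    return [random.randint(0, 1) for _ in range(STEPS)]

def fitness(motive):
    density = sum(motive) / STEPS
    return 1.0 - abs(density - TARGET_DENSITY)     # higher is better

def crossover(a, b):
    cut = random.randrange(1, STEPS)               # single-point crossover
    return a[:cut] + b[cut:]

def mutate(motive, rate=0.05):
    return [(1 - bit) if random.random() < rate else bit for bit in motive]

def evolve(pop_size=40, generations=100):
    population = [random_motive() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # truncation selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

if __name__ == "__main__":
    print(evolve())

The point of the sketch is only the overall loop of selection, crossover, and mutation over a population of motives, not the musical criteria themselves.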
Performances of the compositions can be viewed online:

In Equilibrio: http://youtu.be/x5fIdHbqEhY
Other, Previously: http://youtu.be/gaQfyhOiRio
One of the Above #2: http://youtu.be/gAIjQOiMG54
One of the Above #3: http://youtu.be/bUYr7T7DKGs
One of the Above #4: http://youtu.be/cQNQKinbJ-s
Gradual: http://youtu.be/HZ2_Pr35KyU
experiri: http://youtu.be/Gr5E7UVUoE8
fundatio: http://youtu.be/rNXt8b-kLMQ

Evaluation Study

The public concert was meant to serve two purposes: firstly, to present the artworks of the metacreative systems to the public, and secondly, to explore the idea of conducting evaluation in concert settings. The opportunity for serious validation prompted the first composer to write an additional work separate from the metacreative systems, but with the same musical goal. The purpose was not to fool the audience into guessing which piece was not composed by a machine, but rather to add human-composed material to the comparative study. While we hope that audiences will, one day, accept machine-generated music without bias, Moffat and Kelly (2006) suggest this is not yet occurring. In our case, given three works for solo percussionist, composed in a particularly modernist style, it would be difficult to ascertain whether an audience's appreciation – or lack thereof – was due to the musical style, the restricted timbral palette, the lack of melodic and harmonic material, or any failings of the metacreative system. The additional, human-composed piece exhibited the same aspects, yet was written by the system designer rather than by the system. If the audience's ratings of the human-composed piece were statistically similar to those of the metacreative works, it would suggest that the audience's preferences were based upon style, rather than upon musical creativity and/or quality.

Methods

Participants were 46 audience members from the general public (rather than only students) who attended a paid concert put on by Simon Fraser University. A program distributed to each audience member explicitly indicated that “machine-composed and machine-assisted musical compositions” would be performed. Each audience member also received an evaluation card on which they were encouraged to provide feedback. Audience members were asked to indicate, on a Likert scale from 1 to 5, their level of familiarity with contemporary music, followed by ten similar 5-point Likert scales regarding how “engaging” they found each piece to be. Audience members were also asked to indicate which three pieces they felt were the most directly human-composed. Audience members were also given space to write in their own comments. See Table 1.

Hypotheses

We hypothesized that the machine-generated and computer-assisted works were sufficiently similar in quality and style to the human-composed pieces that audience members would show no preference for the timbrally similar human-composed pieces (null hypothesis). Any such preference would be indicated by audience members' ratings of how “engaging” they found each piece.

Analysis

In order to avoid the alpha inflation that arises from multiple comparisons, statistical tests were made using post-hoc Bonferroni-corrected alpha levels of .005 (.05/10). For part of the analysis, the 46 audience members were divided into novice and expert groups depending on the score they indicated for the “familiarity with contemporary music” question.
The “novice” group consisted of audience members who gave a score of 1, 2, or 3 out of 5 on the familiarity scale (N = 25). The “expert” group consisted of the remaining audience members, who gave a 4 or 5 (N = 19). Two audience members failed to provide a familiarity score, so their data were excluded from group comparisons.

The audience did not seem to discriminate among the percussion pieces. A comparison of the average engagement scores for the human-composed solo percussion piece One of the Above #1 (M = 3.59, SD = 1.15) with the average scores for the machine-composed One of the Above #2 through #4 (M = 3.28, SD = 1.02) was not significant (t(44) = 1.43, p = .16), leaving us unable to suggest that participants were able to discriminate between the human- and machine-composed percussion pieces.

The audience did not “recognize” which piece was not computer-made. Assuming participants would find human-composed pieces more engaging, participants' engagement ratings of the individual pieces were interpreted as an indication of whether participants could implicitly distinguish human-composed from machine-composed pieces. Tests comparing expert listeners' engagement scores for the human-composed One of the Above #1 (M = 4.00, SD = 1.00) against the machine-composed alternatives (M = 3.57, SD = 0.88) were not significant (t(18) = 1.68, p = .11). Similarly, novice listeners' scores for One of the Above #1 (M = 3.33, SD = 1.20) compared to the alternatives (M = 3.01, SD = 1.08) demonstrated no significant preference for the human-composed piece (t(23) = 0.96, p = .34). Comparisons between the expert listeners' engagement ratings for the two string quartet pieces, the human-composed fundatio (M = 4.29, SD = 0.81) and the machine-assisted experiri (M = 4.47, SD = 0.61), were non-significant (t(18) = 1.00, p = .33). Novice ratings for fundatio (M = 4.24, SD = 0.83) and experiri (M = 4.36, SD = 0.86) were similarly non-significant (t(24) = 0.72, p = .48). The audience thus also failed to discern between the computer-assisted composition made using ManuScore and the human-made composition by the same composer. Together, these results do not support the hypothesis that audience members were able to implicitly pick out which pieces were human-composed.

There was no difference between expert and novice choices. To determine whether audience members' ability to explicitly pick out the human-composed pieces depends on their familiarity with contemporary music, a chi-square test compared novice and expert listeners' three “most directly human-composed” choices. The result of this test was non-significant (χ²(9, N = 113) = 14.17, p = .51). This result fails to support the hypothesis that expert and novice listeners differ in their ability to explicitly discriminate human-composed pieces from machine-composed pieces.

Discussion

In addition to the above results, several further remarks can be made. Overall, the evaluation results were encouraging, showing both a rather high level of engagement from the audience and a good range, with engagement means varying from 2.7/5 to 4.5/5. The audience did not discern computer-composed from human-composed material, which seems to reflect well on the five systems presented above. More precisely, this might simply mean that the systems were successful in portraying the goals, aesthetics, and styles of the two composers who developed them.
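For readers who wish to see how the reported tests fit together computationally, the following Python sketch reassembles the analysis pipeline using scipy and synthetic stand-in ratings. Only the Bonferroni threshold (.05/10 = .005) and the families of tests (t-tests on engagement scores, a chi-square test on the explicit choices) are taken from the text above; the use of paired t-tests, the data layout, and every number the sketch produces are our own assumptions, not the survey data.

import numpy as np
from scipy import stats

# Hedged sketch of the analysis pipeline, using synthetic placeholder ratings.
# The data below are random stand-ins, not the survey data.

rng = np.random.default_rng(0)
n_listeners = 44
human = rng.integers(1, 6, n_listeners).astype(float)          # ratings of the human-composed piece
machine = rng.integers(1, 6, (n_listeners, 3)).mean(axis=1)    # mean rating over three machine-composed pieces

res = stats.ttest_rel(human, machine)                          # paired comparison per listener (assumed design)
alpha = 0.05 / 10                                              # Bonferroni correction over ten comparisons
print(f"t({n_listeners - 1}) = {res.statistic:.2f}, p = {res.pvalue:.3f}, "
      f"significant at corrected alpha: {res.pvalue < alpha}")

# Explicit "most directly human-composed" choices: a chi-square test on a
# novice-by-expert contingency table of counts per piece (synthetic counts here).
counts = rng.integers(1, 10, (2, 10))                          # rows: novice, expert; columns: pieces 1-10
chi2, p_chi, dof, _ = stats.chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p_chi:.3f}")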
One further general observation is that, while an evaluation in a concert setting allows us to capture the audience's reaction to musical output in its “natural” presentation environment, it also introduces many variables that take us outside the usual controlled experimental setting, and the experimental protocol is more difficult to follow. On the other hand, controlled experiments are not the traditional setting in which a musical artwork is presented, and this in itself introduces a number of biases into that type of evaluation. While these biases are well known, and solutions exist to circumvent them, our goal was to conduct an evaluation study in a live concert setting. We were concerned that conducting an evaluation in a concert setting might disrupt the audience's appreciation of the artwork. To our surprise, this did not seem to be the case, and the feedback forms were well received. The whole process triggered a longer-than-expected question-and-answer session at the end of the show, and very few audience members left before the end of the Q&A session.

Conclusions and Future Work

Finally, the whole process shed some light on the difficulty of evaluating computational creativity (and creativity in general). Artificial intelligence addresses the problem of emulating intelligence by having the computer achieve tasks that would require intelligence if achieved by humans. These tasks are usually formalized as well-formed problems. Rational problem solving is then evaluated by comparison to some optimal solution. If the optimal solution is theoretical and not attainable, optimization and approximation techniques can be used to get closer to the optimum, or at least to improve the quality of the solution according to some metrics. Computational creativity is faced with the dilemma that, while creative behaviour is intelligent behaviour, such notions of optimality are not defined. It is often unclear which metrics should be used to track progress in the area. As demonstrated by this paper, this is at least an issue for the evaluation of composition systems. Musical success is subjective in nature; this is why we resorted to a comparative study capturing relative levels of success, rather than absolute ones. In the absence of formal metrics, we used human subjects to evaluate musical metacreation. However, creativity is a process (Boden 2003). When evaluating a musical composition system, one particularly challenging aspect is that the system is capable of generating numerous pieces, with possibly varying levels of success: designing methodologies to measure that variability is an inherent challenge of the area. This is especially true when one has to use human subjects, since obtaining average relative evaluations of a system's average production makes the experimental design particularly challenging.

To our knowledge, this paper is the first to report on an evaluation experiment of machine-generated material conducted in a real-world public setting. Besides the findings presented above, the research instrument discussed here is a contribution in itself. As the systems presented are musical metacreations, validation and evaluation of such systems' output is itself a relatively novel and challenging research area. Our future work will continue to investigate and evaluate methodologies for doing so.
Meanwhile, the paper also raises a number of concerns and questions that will likely need further consideration in future work.

Acknowledgements

This research was funded by a grant from the Canada Council for the Arts and by the Natural Sciences and Engineering Research Council of Canada.

References

Boden, M. 2003. The Creative Mind: Myths and Mechanisms (2nd ed.). Routledge.

Colton, S. 2008. Creativity versus the Perception of Creativity in Computational Systems. In Creative Intelligent Systems: Papers from the AAAI Spring Symposium.

Eigenfeldt, A. 2009a. The Evolution of Evolutionary Software: Intelligent Rhythm Generation in Kinetic Engine. In Applications of Evolutionary Computing. Berlin: Springer.

Eigenfeldt, A. 2009b. Multi-Agency and Realtime Composition: In Equilibrio. eContact! 11.4, Toronto Electroacoustic Symposium 2009. http://cec.concordia.ca/econtact/11_4/

Eigenfeldt, A., and Pasquier, P. 2010. Realtime Generation of Harmonic Progressions Using Constrained Markov Selection. In Proceedings of the First International Conference on Computational Creativity, Lisbon.

Eigenfeldt, A., and Pasquier, P. 2012a. Populations of Populations: Composing with Multiple Evolutionary Algorithms. In P. Machado, J. Romero, and A. Carballal (Eds.), EvoMUSART 2012, LNCS 7247, 72–83. Heidelberg: Springer.

Eigenfeldt, A. 2012b. Corpus-based Recombinant Composition Using a Genetic Algorithm. Soft Computing – A Fusion of Foundations, Methodologies and Applications, Springer special issue on evolutionary music, forthcoming.

Jordanous, A. 2011. Evaluating Evaluation: Assessing Progress in Computational Creativity. In Proceedings of the Second International Conference on Computational Creativity, Mexico City.

Maxwell, J., Pasquier, P., and Eigenfeldt, A. 2009. Hierarchical Sequential Memory for Music: A Cognitive Model. In Proceedings of the International Society for Music Information Retrieval Conference, Kobe.

Maxwell, J., Pasquier, P., and Eigenfeldt, A. 2011. The Closure-based Cueing Model: Cognitively-Inspired Learning and Generation of Musical Sequences. In Proceedings of the 8th Sound and Music Computing Conference, Padova.

Moffat, D., and Kelly, M. 2006. An Investigation into People's Bias Against Computational Creativity in Music Composition. In Third Joint Workshop on Computational Creativity, Riva del Garda.

Pearce, M., and Wiggins, G. 2001. Towards a Framework for the Evaluation of Machine Compositions. In Proceedings of the AISB'01 Symposium on Artificial Intelligence and Creativity in Arts and Science, 22–32. Brighton: SSAISB.

Pease, A., Winterstein, D., and Colton, S. 2001. Evaluating Machine Creativity. In Workshop on Creative Systems, 4th International Conference on Case-Based Reasoning, Vancouver, 56–61.

Ritchie, G. 2007. Some Empirical Criteria for Attributing Creativity to a Computer Program. Minds & Machines 17: 67–99.

Truax, B. 1991. Capturing Musical Knowledge in Software Systems. Interface 20(3–4): 217–233.

Whitelaw, M. 2004. Metacreation: Art and Artificial Life. Cambridge, MA: MIT Press.