Validation of Harmonic Progression Generator Using Classical Music

Adam Burnett, Cognitive Science, Simon Fraser University, Burnaby, BC, Canada, ajb14@sfu.ca
Evon Khor, Cognitive Science, Simon Fraser University, Burnaby, BC, Canada, ewk@sfu.ca
Philippe Pasquier, Interactive Arts and Technology, Simon Fraser University, Surrey, BC, Canada, pasquier@sfu.ca
Arne Eigenfeldt, Contemporary Arts, Simon Fraser University, Vancouver, BC, Canada, arne_e@sfu.ca

Abstract

We evaluate the output of a Markov model-based harmonic progression generator, a classic model for corpus-based computational creativity. 87 participants performed a discrimination task, classifying 20 musical excerpts as either human-composed or computer-composed. Each participant's level of confidence in their choice was also recorded. Results indicated that while overall performance was above what would be expected from random guessing, further analysis revealed that this was due to the human-composed pieces being much easier to identify than the computer-composed pieces. Assessed separately, participants were unable to identify computer-composed pieces above chance levels. We suggest improvements to the experimental design that could be implemented in future evaluations.

Introduction

The trouble with evaluating artistic creativity is that it is difficult to establish objective criteria with which to judge the resulting creative artifact. This problem is compounded when the source of the artifact is itself creative software. Computational creativity results in creations of creations, or metacreations (Whitelaw 2004), that differ from the artifacts we are used to encountering. As they are produced by machines that vary in their level of autonomy and in the amount of user interaction they require in order to function, we must keep in mind a different set of considerations when we evaluate the resulting pieces. The evaluation of aesthetics in metacreations is a fixture in the computational creativity literature (Eigenfeldt and Pasquier 2010; Eigenfeldt and Pasquier 2011; Pease, Winterstein, and Colton 2001; Stamp, Isenberg, and Carpendale 2007; Wallraven, Cunningham, Rigau, Feixas, and Sbert 2009). Whereas grading intelligence is a straightforward matter of assessing how closely or quickly an individual can reach the optimum solution to a formally specifiable problem, there is usually no clear "goal" or "problem" that needs to be solved in a creative artwork; it often exists for its own sake, and to be enjoyed. When judging creative works such as music, it often comes down to relying on the subjective impressions of experts in the relevant domain (music critics, musicologists) or to quantifying subjective impressions in some measurable way (album sales, concert attendance). Relying on entirely subjective measures is undesirable because it is not sufficient to simply test whether human listeners find computer-generated music creative, enjoyable, or emotional: the mind is capable of finding patterns, design, and intention in random noise, of deriving pleasure from the beauty of living organisms and ecosystems that were "designed" by the unguided, unintelligent processes of evolution, and sometimes even from randomness itself. To fairly and accurately assess the quality of computer-generated music, we must devise some sort of objective means to do so, even if that means indirectly measuring an effect of that creativity rather than directly measuring the creativity itself.
In the following, we describe previous attempts to evaluate computer-generated music, and then present our evaluation of the harmonic progression generator developed by Arne Eigenfeldt and Philippe Pasquier (2010).

Background

A problem is inevitably encountered when one tries to merge a scientific discipline concerned with objectivity, like Artificial Intelligence, with the subjectivity inherent in the creation, appreciation, and evaluation of art. This is the challenge for anyone proposing ways to evaluate machine creativity. As noted by Spector and Alpern (1994), there is no universally agreed-upon theory of aesthetic value within the artistic community. How then do we know when we have a computational artist? Approaches to this problem generally fall into two main camps. The first advocates a reliance on human judgements, particularly of the art world and the museum-going public, by holding art shows and gathering feedback. However, this requires considerable time and resources and is not always practical or reliable. The other approach recommends the creation and application of codified, formalized evaluation criteria with which to judge a computational artist's creations. This method is especially popular in computer music evaluation, as many forms of music can be formulated to follow a rule system. There are three problems with this, however: 1) many existing formulations are "dead forms", which would penalize works in more contemporary genres for which detailed formalizations have not yet been established; 2) it is not evident that strict adherence to the rules of a particular art form or genre indicates aesthetic value, as meeting this criterion might indicate nothing more than aesthetically mediocre, boring, formulaic work; and 3) many formulations are rigid and, once established, may not lend themselves to generalization across genres, essentially punishing novel works for being too original, even if they are of high quality. Alternatively, Colton (2008) suggests that, rather than focusing on the input and output, how a creative work is produced is critical to its being perceived as creative. Colton asks us to consider the question of whether we label works as "creative" based on their quality, or whether we determine the quality of works based on how creative we found the process that generated them to be. Colton notes that in painting, audiences are concerned with the process that led to the final product, and that this affects their enjoyment of the piece. In fact, it is noted that often the actual aesthetic quality of an artwork has little to do with how creative the work is perceived to be (consider Duchamp's Fountain, which was nothing more than a urinal). One conundrum which follows from this approach is that when too little is known about the process, we cannot evaluate whether or not it is creative, and when too much is known about the process, it is regarded as too mechanical and deterministic, leaving no room for "creativity" to be exercised. Colton presents a model of art appreciation, proposing that there are three judgements that consumers make about the creative process when determining how much they like a piece. These are: 1) the perceived effort required during the process, 2) the ingenuity of the process, and 3) the skill needed to carry out the process.
From these, Colton derived the Creative Tripod, which defines the three properties a system must possess in order to be judged as creative: skill, appreciation, and imagination. Without skill, nothing can be produced; without appreciation, nothing of value can be produced; and without imagination, nothing original can be produced. The tripod analogy highlights the need for all three properties to be present in order for the label of "creative" to stand. Whereas Colton directs attention to the process, Ritchie (2007) de-emphasizes the internal processes and favours focusing solely on the output. Ritchie argues that when assessing creative artifacts, we should be faithful to the traditional use of the word "creativity", which is tied to subjective human judgements. This, combined with the stance that we should only evaluate observable creative behaviours, levels the playing field and allows us to assess both human-produced and machine-produced creative works fairly and without bias. Ritchie warns that considering both the artifact and the process would introduce a fatal circularity: we would be left arguing that an artifact is creative because the process that produced it was creative, and that we know the process that produced it was creative because the artifact it produced was creative.

Discrimination Tasks

Pearce, Meredith, and Wiggins (2002) define four motivations for developing generative music systems: 1) to implement them as tools for personal use and/or 2) for general compositional use, 3) to provide theories of musical style, and 4) to provide cognitive theories of processes in compositional expertise. These four motivations can be collapsed into two general categories: the first is to use generative music systems as creative tools to produce original music; the other is to use these systems to model theories of musical style and cognition. We will not be discussing the latter category any further here. The problem of evaluating creativity mirrors a similar problem that befell early artificial intelligence researchers: how do we evaluate machine intelligence? It was difficult to say whether or not a machine could ever be said to think or demonstrate intelligence because there was little agreement on what those words would mean in that context. Today we face the same problem with machine creativity, unable to agree unanimously on what is meant by "creativity" in the question: how do we evaluate machine creativity? Alan Turing (1950) famously suggested a way to tackle the problem. He had us consider a party game (the "imitation game") where a judge tries to determine which of two unseen players is pretending to be a woman; it is the job of the man to fool the judge by responding in the way he thinks a woman would, and it is the job of the woman to assist the judge in exposing him. With that in mind, Turing suggested that instead of trying to answer the impossible question "Can machines think?", we should reformulate the question into something we can answer: are there imaginable digital computers which would do well in the imitation game? That is, could a computer program ever be designed that could successfully convince a judge that they were conversing with a human? This hypothetical procedure came to be known as the Turing Test. How this approach might be adapted for the problem of machine creativity is apparent: rather than asking whether or not a composition system is creative, ask whether it does well in the "imitation game".
A popular method of evaluating generative music systems is to run a Turing-style test on the system's output (Boden 2010). This involves comparing computer-generated compositions to human-generated compositions through participant evaluation of the various pieces. Ariza (2009) cautions against the use of the term "musical Turing Test", since the intelligence of a generative music system cannot be determined by evaluating its output. The Turing Test has underlying assumptions on which it builds its criteria for machine intelligence: humans have minds, and natural language is sufficient to represent the mind; thus, if a machine is indistinguishable from a human in discourse, then it too has a mind. Joseph Weizenbaum's ELIZA is an early example of a system suited for the Turing Test. The ELIZA system was able to fool human interrogators; however, can we say that ELIZA is intelligent? John Searle's Chinese Room Argument suggests that a system is able to fool its interrogators without knowing anything about what it is doing; having a façade of intelligence does not mean that the system is actually doing anything intelligent. To test the outputs of generative music systems, it is possible to tweak the criteria of the Turing Test to accommodate the evaluation of musical outputs. Instead of a text-based medium, sound symbols or forms are used. The interrogator is replaced by a critic who may or may not interact with the system. Harnad (2000) labels tests of these sorts as toy Tests (tTs) instead of Turing Tests (TTs). In the Musical Output toy Test (MOtT), the critic is presented with musical pieces from two composer-agents. One of these agents is human, and the other, of course, is a machine. Based only on these works, the critic must attempt to distinguish the human from the machine. Caution must be taken when interpreting the results of this type of test. For one, the criteria listeners use when trying to discern between machine- and human-composed pieces need to be asked about explicitly, and even then, listeners may not be fully cognisant of their own decision-making processes. Second, musical judgements are influenced by any number of factors and can vary greatly from individual to individual. Furthermore, it is important to keep in mind that these tests are surveys of musical judgement and not of whether the system has thought or intelligence. Boden (2010) cites David Cope's Experiments in Musical Intelligence (EMI) system, which generated music in the style of the music contained in a supplied database, and notes how listeners with some musical experience had difficulty determining whether the pieces it produced were human-composed or not. However, those with more extensive familiarity with, for example, Mozart were more readily able to distinguish true Mozart-composed pieces from the EMI-composed Mozart-esque pieces, though they were unable to tell the difference between the EMI-composed pieces and human-composed pieces that were both meant to mimic Mozart's style. Even when EMI failed to perfectly mimic the intended style, it was still able to produce pieces that demonstrated proper compositional technique. Performance in a Turing Test thus varies heavily depending on exactly what is being tested (the ability to mimic a particular composer? to follow compositional conventions? to produce interesting melodies?).
Another obstacle for Turing Tests is that they require the cooperation of the human judges. People have been known to retract their praise for computer-generated works upon learning of their synthetic source, protesting that it is requisite of art to have been produced by a human being possessing all the faculties that enable one to express and communicate human emotion and experience. Some have refused to even give audience to a creative work knowing that it was produced by a machine, as happened to David Cope when debuting EMI. The reason cited is the combination of the belief that art requires creativity and the belief that computers cannot be creative, which together preclude computers from creating art. This prejudice will prevent some from ever accepting the results of a Turing Test, even if it is deemed internally successful. Pearce and Wiggins (2001) proposed an objective framework for evaluating computer-generated musical compositions which, as they themselves point out, elicits comparisons to the Turing Test. The framework was developed in response to problems they identified with previous attempts to evaluate music composed by computer systems. They distinguished two kinds of evaluation: the critic and what we could call the evaluation proper. The critic is part of the music-generating system itself and helps guide the development of the composition by evaluating the intermediate products of the system. The evaluation proper is what we are mainly concerned with here, and is unfortunately the more elusive of the two: it is the process and methods of establishing whether the compositions produced by the system satisfy the specified conditions of creativity. Pearce and Wiggins highlight the necessity of objective measurements when evaluating machine creativity, as an objective approach to evaluation would be consistent with standard scientific investigation. Empirical science carries a respectable weight, and if a creative system could survive the sort of rigorous testing expected in scientific domains, then the results would be far more compelling than the wishy-washy, subjective evaluations seen elsewhere. The existing evaluation methods Pearce and Wiggins reviewed are criticized for failing to conform to these standards of objectivity, in part due to the presence of programmers' bias in the critic algorithms embedded in a number of the systems they discussed. They also note that subjective impressions are very imprecise and potentially unreliable: it is difficult to ensure that a group of human evaluators is following the same criteria. In reaction to these shortcomings, Pearce and Wiggins lay out a method of evaluating composition systems that maintains objective integrity. In order to be objective, a specific compositional aim must be explicitly established beforehand: if the goal is to mimic the style of a specific Baroque composer, the system should not receive a positive evaluation score because it is able to generate a realistic progression of 20th century jazz chords. To eliminate programmer bias, the critic should be derived from patterns extracted from a data set of existing compositions using a machine learning algorithm. Once music is composed that is able to satisfy the critic, an evaluation using human participants is conducted. The participants should be played both music composed by the system and music from the data set itself, and then tested to see if they are able to distinguish the two.
Having participants simply indicate whether they think a piece of music is computer-composed or not frees us from the subjective question of whether the computer-generated work is creative, enjoyable, or emotional, and instead allows us to home in on the objective question of whether or not humans are able to tell that the works are computer-composed. Reformulating the question from "Can this system produce creative works?" to "Can this system produce works indistinguishable from those of the human composers in the data set?" creates a predictable, testable, and, perhaps most importantly, clearly refutable claim, as would be expected from a rigorous scientific experiment.

Experiment

As in the framework of Pearce and Wiggins (2001), no attempt was made to deceive the judges/participants about the nature of the experiment: participants were explicitly informed that they were comparing human- and machine-composed music. Along with this candid approach, our experiment resembled the framework described by Pearce and Wiggins in many additional ways. Our experiment differs, however, in that we compare the performance of musicians and nonmusicians, and go beyond offering a simple binary choice between machine-composed and human-composed by providing a 4-point scale that reflects both the participants' choice and their confidence in that choice. In our study, we aim to evaluate the quality, and particularly the robustness, of the harmonic progression generator developed by Arne Eigenfeldt and Philippe Pasquier (2010). We sought to determine how successful the program is at generating harmonic progressions of the same quality and style as human composers from traditional classical style periods. To do this, we had two groups of human participants, varying in their musical fluency, attempt to distinguish musical excerpts generated by the program from those written by human composers.

System Description

The system we are evaluating uses a third-order Markov model to derive harmonic progressions from a supplied corpus (Eigenfeldt and Pasquier 2010). This allows for versatility, as the particular rules of a style period or genre do not have to be hard-coded into the program. Instead, the appearance of the rules emerges from the reliance on the corpus to guide the generation of progressions. By forgoing set rules, the system is not biased toward only producing progressions that follow traditional harmonic and voice-leading conventions, but can just as competently function within the harmonic freedom of 20th century music if provided with a sufficiently rich corpus. In contrast to many other music generators, the system was designed to function in response to user requests. The user is given the ability to specify the number of bars to be generated, a target bass line, the level and variation of harmonic complexity, and the voice-leading tension of the generated chords. These vectors help select the best candidate among the generated Markov conditional probability distributions of chord transitions. The system is written in MaxMSP and is available on its first author's webpage (http://www.sfu.ca/~eigenfel/arne/main.html). A full outline of the system is provided in Eigenfeldt and Pasquier (2010). The corpus which serves as input to the system consists of pre-processed MIDI files: all musical content is reduced to a sequence of chords (and their durations), with controller data indicating the beginnings of phrases and cadences.
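The evaluated system itself runs in MaxMSP, so the following Python sketch is only an illustration of the core mechanism described above: a third-order Markov model whose transition counts come entirely from the corpus, with a weighting hook standing in for the user-controlled complexity and tension vectors. The chord encoding, function names, and the weighting hook are assumptions made for this example, not the authors' implementation.

```python
# Illustrative sketch only: the evaluated system is written in MaxMSP and
# selects candidates using user-controlled complexity and voice-leading
# tension vectors; the chord labels, function names, and the `weight` hook
# below are assumptions made for this example.
import random
from collections import defaultdict

def build_third_order_model(progressions):
    """Count chord continuations conditioned on the three preceding chords."""
    model = defaultdict(lambda: defaultdict(int))
    for chords in progressions:
        for i in range(len(chords) - 3):
            context = tuple(chords[i:i + 3])
            model[context][chords[i + 3]] += 1
    return model

def generate(model, seed, length, weight=None):
    """Extend a three-chord seed drawn from the corpus to `length` chords.

    `weight(context, candidate)` is a hypothetical stand-in for the system's
    complexity/tension vectors: it rescales each candidate's corpus-derived
    transition count before sampling.
    """
    chords = list(seed)
    while len(chords) < length:
        context = tuple(chords[-3:])
        candidates = model.get(context)
        if not candidates:  # dead end: this context never continues in the corpus
            break
        options = list(candidates.items())
        scores = [count * (weight(context, c) if weight else 1.0)
                  for c, count in options]
        chords.append(random.choices([c for c, _ in options], weights=scores)[0])
    return chords

# Toy usage with Roman-numeral labels (not the system's actual chord encoding).
corpus = [["I", "vi", "ii", "V", "I", "IV", "V", "I"],
          ["I", "IV", "ii", "V", "vi", "IV", "V", "I"]]
model = build_third_order_model(corpus)
print(generate(model, seed=("I", "vi", "ii"), length=8))
```

Because every context and continuation is taken from the corpus, no stylistic rules are hard-coded; supplying a different corpus changes the harmonic vocabulary without changing the code, which is the versatility the authors describe.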
Participants

The participants were recruited from Simon Fraser University and the University of British Columbia. Participation was incentivized by offering four $50 prizes to be distributed upon completion of the study. To increase the resolution of our test of the harmonic progression generator, we compared the performance of two independent groups: musicians and nonmusicians. Much like a spoken language, well-written music is constrained by and emerges from conventions, rules, and patterns. If one were to validate a spoken or written language-generating program using human participants as judges, it would clearly be necessary for the participants to be fluent in the target language. For this reason, we considered it critical to an accurate validation to test for a difference between musicians and nonmusicians in this task. For the purpose of this experiment, only those with formal training in classical musical analysis were deemed "musicians"; mere proficiency with an instrument did not suffice. While instrumentalists are indisputably "musicians", we were exclusively interested in those students who have spent time studying and analyzing classical music scores and may have developed an ability to identify unusual harmonic choices and other errors that might arise in a machine-generated composition. Therefore, we decided that the musician group would consist of students who have received two or more years of classical musical training at a post-secondary institution. To ensure sufficient group size, we also admitted those who have received at least 5th grade certification from the Royal Conservatory of Music (or equivalent). The group consisting of laypeople (nonmusicians) was screened during the survey to ensure their musical naïveté.

Music Selection

Corpus

For our study, we used harmonic progressions derived from a corpus of classical music (a mixture of Classical and Romantic style periods). Only chords already present in the pieces that made up the corpus found their way into the generated excerpts. We presented ten excerpts of music generated by the system and ten from classical pieces adapted to match the non-melodic presentation style of the computer-generated pieces. The following is a list of the pieces that made up the corpus from which the computer-generated pieces were derived. The corpus is divided into five sections, each containing five to six pieces from the same composer to ensure consistency of style. We have tried to maintain consistency of form in our selections as well. Two of the pieces from each section were included in the survey as the human-composed pieces (marked in bold), and two progressions were generated from each section using the harmonic progression generator. As we are evaluating the quality of progressions produced by the system rather than determining the limits of its functionality, the setup described here corresponds to a typical use of the system.

Frédéric Chopin: Nocturne in Eb Major, Op. 9, No. 2; Nocturne in F# Major, Op. 15, No. 2; Nocturne in G Minor, Op. 15, No. 3; Nocturne in Db Major, Op. 27, No. 2; Nocturne in F Major, Op. 55, No. 1.

Antonín Dvořák: Humoresque; Legend; Slavonic Dance No. 1; Slavonic Dance No. 2; Symphony No. 9 "From The New World", Second Movement; Valse Gracieuse.
Johannes Brahms: Symphony No. 1 in C Minor, 3rd Movement; Symphony No. 2 in D, 3rd Movement; Symphony No. 3 in F, 2nd Movement; Symphony No. 3 in F, 3rd Movement; Symphony No. 4 in E Minor, 3rd Movement; Hungarian Dance No. 5.

Felix Mendelssohn: Consolation; If With All Your Hearts; Spinning Song; O Rest In The Lord; Scherzo in E Minor; Venetian Boating Song (from Songs Without Words).

Robert Schumann: About Strange Lands and People, Träumerei (from Scenes from Childhood); The Happy Farmer (from Album for the Young); Piano Concerto in A Minor; The Wild Horseman; Arabesque.

Processing

The original Turing Test was not an assessment of a machine's ability to mimic speech, and neither was our experiment a test of the system's ability to creatively interpret and audibly produce music like a performer, but merely of its ability to compose it. Therefore, all pieces used in the experiment were "performed" and recorded using Kontakt Player (Native Instruments) and Cakewalk Sonar 4 (Cakewalk). The system we evaluated requires pre-processing of the items in the data set: as the system is concerned only with analyzing and generating harmonic progressions, it was designed to receive as input MIDI files that conform to a specific format of block chords in closed position with the bass note separately specified. In the name of efficiency, rather than manually analyzing the harmony in our chosen classical pieces, we utilized "The Real Little Classical Fake Book" (Hal Leonard Corp. 1993), a large collection of classical themes transcribed for piano, and simply discarded the melodic line and sequenced the harmonies and harmonic rhythms into MIDI files using the chord symbol realization plugin (which generates notes from chord symbols) for the Sibelius scorewriting software (Sibelius). As we planned to test both musicians and nonmusicians, we recognized it was important that the human-composed pieces we chose be unfamiliar enough to reduce the likelihood that either group would recognize their harmonic structure. Though we imagine that the elimination of the melodic line alone sufficiently obscured the identity of the pieces (as would transposition to a different key and changes in tempo), care was taken to exclude pieces that derive most of their notoriety from their harmonic sequences. Determining which computer-generated pieces would make it into the final test was done by selecting those that most closely conformed to pre-specified criteria. To ensure the feasibility of the study, it was decided to restrict the length of the generated harmonic sequences to around eight bars and to ensure that each progression ended on the tonic or dominant chord (regardless of the chord that preceded it), or in a cadence. The first four bars of one of the computer-generated pieces are presented in Figure 1. Note that the system encodes rhythm and can generate more than one chord per bar.

Figure 1. Four-bar example of a "Brahms-inspired" excerpt.
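As a concrete illustration of the selection step just described, the short sketch below filters candidate progressions against the stated criteria (roughly eight bars, ending on the tonic or dominant). The excerpts were in fact selected by the authors against their pre-specified criteria; the chord labels, data layout, and function name here are hypothetical.

```python
# Hypothetical filter mirroring the stated selection criteria; the actual
# selection was made by the researchers, and the `bars`/`chords` fields and
# Roman-numeral labels are assumptions for this example.
def acceptable(excerpt, target_bars=8, tolerance=1):
    """True if the candidate is roughly eight bars long and ends on the
    tonic or dominant."""
    final_chord = excerpt["chords"][-1]
    ends_well = final_chord in {"I", "i", "V"}
    return abs(excerpt["bars"] - target_bars) <= tolerance and ends_well

candidates = [
    {"chords": ["I", "vi", "ii", "V", "I"], "bars": 8},   # kept
    {"chords": ["I", "IV", "ii", "vi"], "bars": 12},      # rejected: length and ending
]
selected = [c for c in candidates if acceptable(c)]
```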
Procedure

87 participants, a mix of students and faculty from Simon Fraser University and the University of British Columbia, were provided a URL to an online survey (http://magnum-interactive.com/metacreation). The survey was built using Drupal (Drupal) and a number of modules enabling audio playback and time tracking (to ensure that the participants listened to each musical excerpt in full at least once). Participants were presented with a consent page indicating that their consent would be assumed by their completing the survey. They were then directed to a screen inquiring about their musical training (to enable us to assign them to the correct experimental group), followed by instructions detailing how to complete the survey. The instructions were upfront about the purpose and methods of the experiment; participants were informed that they would hear a mix of human-composed music and machine-composed music. No deceptive protocols were used. Participants were presented one piece of music at a time, in a pre-established random order (limitations of the implementation prevented us from having each participant experience a unique ordering of questions). After listening, they were asked to rate the likely composer of each piece on a 4-point scale, with 1 being "definitely human", 2 "probably human", 3 "probably computer", and 4 "definitely computer". We decided to avoid using an odd-numbered scale for two reasons. First, we wanted to discourage participants from disengaging from the task and choosing a neutral rating of 3 throughout. If participants lack certainty, this will be reflected in a greater proportion of "probably X" responses. We wanted to encourage participants to provide their best guess instead of defaulting to a safe "don't know" choice, as it has been shown in other perceptual judgement tasks that participants underestimate their ability and that, when forced to make a choice, they often choose the appropriate response (Brown 1910). Gilljam and Granberg (1993) suggest that the presence of "don't know" options in questionnaires might encourage even those with definite opinions to choose the more cautious response. Poe et al. (1988) found that, for questions testing factual knowledge, there is little difference in the responses on questionnaires with and without a "don't know" option, but that excluding a neutral choice resulted in more usable data. A study by Alwin and Krosnick (1991) also found that including a "don't know" option did not improve the reliability of the results. Second, the 4-point scale allowed us to reserve the ability to collapse the data into a binary human/computer choice, as well as to compare the frequencies of "definitely X" and "probably X" selections between musicians and nonmusicians in later statistical analysis. Following the questionnaire, participants were thanked for their participation and asked to indicate whether or not they recognized any of the progressions (and, if so, to specify what they thought they were), and what strategies (if any) they used to determine whether the pieces were human- or computer-composed. Participants were then directed to a separate website where they provided their email address. Here they could indicate whether they wanted to be contacted about the results of the experiment and/or be entered into the prize draw. As this section was separate from the survey proper, it prevented us from matching survey answers to identifiable e-mail addresses, preserving anonymity. We hypothesized that if the harmonic progression generator is capable of creating music of a quality and style similar to that of human composers, the performance of the two groups should be similar to what would result from random guessing (null hypothesis).
If there is a detectable difference between the computer-generated progressions and the human-composed originals (that is, if it is possible to distinguish the two), we should expect the nonmusicians to perform close to or slightly above chance levels, and the musicians to outperform the nonmusicians, further from chance levels in the direction of correct classification. We used one-sample t-tests to compare musicians and nonmusicians to chance levels, and two-sample t-tests to compare the mean scores of the groups. Tests were conducted using a Bonferroni-corrected alpha level of 0.005 (0.05/10). Comparisons were also made between the two groups' confidence in their choices, as derived from their proportions of 1s and 4s compared to 2s and 3s on the 4-point scale.

Analysis of Data

Performance

Participants were given four options when indicating their level of musical experience. They could specify that they had completed at least 2 years of a Bachelor's degree in music (Bachelor's), had achieved 5th grade certification from the Royal Conservatory of Music or equivalent (Royal Cons.), had some unspecified formal musical training (Some), or had no training (None). Our original "musician" category collapsed the data from the Bachelor's and Royal Cons. groups together, while the "nonmusicians" are made up of participants from the Some and None groups. Table 1 shows the different groups' overall performance on the discrimination task.

Experience      mean           t-score   p         df
musicians       11.92 (3.27)   2.633     0.0164    19
nonmusicians    11.62 (2.46)   5.386     < .0001   66

Table 1. Mean number of correct answers out of 20, t-score (compared to chance), significance level, and degrees of freedom. Standard deviations appear in parentheses.

As a test of our first hypothesis, one-sample t-tests were used to compare performance to chance levels (given that the questions were binary, 10 correct answers out of 20 is the expected mean under chance). The results were not as anticipated: nonmusicians did significantly better than chance, leaving us unable to retain the null hypothesis (that the quality and style of the computer-generated pieces are indistinguishable from the quality and style of the human-composed pieces). These data appear at first glance to point in the opposite direction of what we expected. Further analysis, however, revealed that by looking only at participants' total scores we had overlooked an interesting pattern buried in the data. Inspired by a comparable analysis conducted in Pearce and Wiggins (2001), we analyzed scores on identifying human-composed pieces independently from scores on identifying computer-composed pieces, and a much different picture of the results emerged. Table 2 shows the results of this analysis.

Experience          mean           t-score   p         df
musicians (H)       6.550 (1.82)   3.808     0.0012    19
musicians (C)       5.300 (1.92)   0.698     0.4936    19
nonmusicians (H)    6.477 (1.51)   8.003     < .0001   66
nonmusicians (C)    5.089 (1.99)   0.369     0.7136    66

Table 2. Results broken down by group and compositional source. (H) = human-composed; (C) = computer-composed.

When tested with one-sample t-tests, this analysis shows that while participants were able to classify human-composed pieces (H) well above chance levels (5 out of 10), their performance in identifying computer-composed pieces (C) was not significantly different from chance levels. Human-composed pieces were much more easily identifiable as human-composed than computer-composed pieces were identifiable as computer-composed. However, we still failed to see any statistically significant difference between the scores of the musician and nonmusician groups.
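For concreteness, the scoring and tests reported in Tables 1 and 2 can be outlined in code. The sketch below is not our analysis script: it assumes a participants-by-excerpts matrix of 4-point responses, a boolean vector marking the computer-composed excerpts, and per-participant group labels, with scipy supplying the t-tests; the variable names and data layout are illustrative.

```python
# Analysis sketch under an assumed data layout (not the authors' scripts).
import numpy as np
from scipy import stats

ALPHA = 0.05 / 10  # Bonferroni-corrected significance threshold used in the paper

def analyse(ratings, is_computer, group_labels):
    """ratings: (participants x 20) array on the 4-point scale
    (1 = definitely human ... 4 = definitely computer).
    is_computer: boolean array of length 20 marking computer-composed excerpts.
    group_labels: numpy array of 'musicians'/'nonmusicians' labels (exactly two groups)."""
    # Collapse the 4-point scale to a binary guess (1-2 = human, 3-4 = computer)
    # and score it against the true source of each excerpt.
    correct = ((ratings >= 3) == is_computer).astype(int)
    for group in np.unique(group_labels):
        rows = correct[group_labels == group]
        scores = {
            "overall (chance = 10/20)": (rows.sum(axis=1), 10),
            "human-composed (chance = 5/10)": (rows[:, ~is_computer].sum(axis=1), 5),
            "computer-composed (chance = 5/10)": (rows[:, is_computer].sum(axis=1), 5),
        }
        for label, (s, chance) in scores.items():
            t, p = stats.ttest_1samp(s, chance)  # one-sample t-test against chance
            print(f"{group:13s} {label}: mean={s.mean():.3f} (sd={s.std(ddof=1):.2f}), "
                  f"t={t:.3f}, p={p:.4f}, significant={p < ALPHA}")
    # Two-sample t-test comparing the two groups' overall scores.
    musicians, nonmusicians = (correct[group_labels == g].sum(axis=1)
                               for g in np.unique(group_labels))
    print(stats.ttest_ind(musicians, nonmusicians))
```

The per-source breakdown in the loop corresponds to the (H) and (C) rows of Table 2, and the overall scores to Table 1.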
Confidence

We also measured the level of confidence participants experienced for each question in the discrimination task. Confidence for each question was determined by assigning two points for "definitely" answers and one point for "probably" answers. A percentage was calculated from this score and the maximum possible score (thus a score of 100 would mean that the participant gave a "definitely" answer on every question). Average group confidence scores are shown in Table 3.

Experience          mean            n
musicians (H)       67.25 (11.06)   20
musicians (C)       64.00 (12.84)   20
nonmusicians (H)    67.46 (12.83)   67
nonmusicians (C)    63.06 (11.18)   67

Table 3. Confidence scores by group and compositional source.

Comparisons between the groups' confidence were not statistically significant (there was no difference in confidence between experts and laymen), but if the group means do hint at a general tendency, they would indicate that participants are more confident about their answers when classifying human-composed pieces. This would be consistent with the performance analysis, which indicates that participants are likely to correctly identify these pieces as human-composed.

Strategies

In the written response section of the survey, only three of the 87 participants ventured guesses about the identity of the progressions, and only one correctly identified that they had heard a progression taken from a Chopin piece (though they did not specify which piece). Otherwise, participants seemed to rely on a number of factors to help them identify the pieces' compositional source. Participants classifying human-composed pieces indicated that they listened for qualities such as depth, clarity, complexity, feeling, life, regularity in rhythm, consonance, variety of dynamics, fluidity, subtlety, repetition, pleasantness, simplicity, and logic. When trying to identify computer-composed pieces, participants indicated that they listened for repetition, simplicity, increased speed, odd resolutions, invariable rhythm, dissonance, lack of feeling, symmetry, rigidity, formality, awkwardness, logic, choppiness, static dynamics, and disorganization. Interestingly, a number of these properties overlap: participants trying to identify both human- and computer-composed pieces claimed to be listening for simplicity and logic, and participants within each condition were often looking for opposite properties to help identify the same source.

Conclusion

Participants listened to a series of 20 harmonic progressions and indicated whether they thought each was human- or computer-generated, along with a rating of their confidence for each choice. We hypothesized that participants would not perform significantly better than chance at this task. Overall, participants did discriminate between the human-composed material and the progressions generated by the system. However, examining the results in more detail revealed something unexpected. When looking at participant responses to trials containing computer-composed progressions in isolation, it was found that participants were not capable of identifying the pieces generated by the harmonic progression system as computer-composed. Surprisingly, participants were nonetheless capable of identifying the human-composed pieces above chance levels.
These results suggest that humans have a "natural" tendency to correctly recognize human-generated content. This would explain why our validation test failed. Further study would be needed to generalize this last finding. We believe that this tendency could be of interest to the computational creativity community as well as to the cognitive sciences in general. There are a number of changes to our experimental design that would be worth attempting in follow-up studies. The group sizes in the present experiment were quite heterogeneous, and the results suggest that a larger number of participants qualifying for the Bachelor's group could provide us with valuable data. We would also likely benefit from randomizing the presentation order of the musical excerpts or offering a more extensive "practice" section in future experiments. The collective results of all participants, plotted against time, gave a Pearson's correlation of r = 0.54, suggesting a significant practice effect. Another concern is that we were not explicit enough when explaining our procedure. A number of participants tried to "outsmart" us and listen for superficial clues in the recordings, such as whether a real or synthesized piano was used, which evidently led them astray, as both human-composed and computer-composed excerpts were created using the same equipment. We may also want to increase the duration of the excerpts, as eight-bar phrases may be too short for listeners to realistically gauge authorship. We might consider abandoning the candid approach and instead employ an experimental paradigm that relies on deception, as was done in Levisohn and Pasquier's evaluation of BeatBender (2008). This would rid us of the complications that arose from participants trying to over-dissect the musical excerpts for clues, and allow us to test for a larger range of properties. A limitation of our study is that it only asked participants to rate whether the pieces were human- or computer-generated; what, one could wonder, does this tell us about how successful the system was at being creative? If we modify the design and add additional criteria that participants could listen for (e.g., ratings of naturalness, enjoyability, and complexity, as was done in the evaluation of BeatBender), it might tell us something more detailed about the differences between human-generated and computer-generated music. Looking beyond the dichotomy of subjective judgements versus formalized criteria, there are arguably five levels of validation for artistic metacreations: the academic forum (whether the paper describing the creative system is accepted or not); controlled evaluation (experiments such as those described in this paper); and feedback from journalists and critics, peers (artists from that community), and audiences. No evaluation study to our knowledge has attempted to cover all five of these levels. In future studies, we may consider rectifying this by adopting a methodology that encompasses all of these dimensions, enhancing the validity of and confidence in our conclusions.

Acknowledgements

Thanks to Arne Eigenfeldt and David Mesiha for taking the time to give us a demonstration of the software and for providing troubleshooting correspondence.

References

Alwin, D. F., and Krosnick, J. A. 1991. The reliability of survey attitude measurement: The influence of question and respondent attributes. Sociological Methods & Research 20(1), 139–181.
Ariza, C. 2009. The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems. Computer Music Journal 33(2), 48–70.

Boden, M. 2010. The Turing test and artistic creativity. Kybernetes 39(3), 409–413.

Brown, W. 1910. The Judgment of Distance. Publications in Psychology 1, 1–71.

Cakewalk. http://www.cakewalk.com.

Colton, S. 2008. Creativity versus the perception of creativity in computational systems. In Creative Intelligent Systems: Papers from the AAAI Spring Symposium, 2008. Menlo Park, CA: AAAI Press. 14–20.

Drupal. http://drupal.org.

Eigenfeldt, A., and Pasquier, P. 2010. Realtime Generation of Harmonic Progressions Using Controlled Markov Selection. In Proceedings of the First International Conference on Computational Creativity (ICCCX), Lisbon, Portugal. ACM Press. 16–25.

Eigenfeldt, A., and Pasquier, P. 2011. Negotiated Content: Generative Soundscape Composition by Autonomous Musical Agents in Coming Together: Freesound. In Proceedings of the Second International Conference on Computational Creativity, México City, México. ACM Press. 27–32.

Gilljam, M., and Granberg, D. 1993. Should we take "don't know" for an answer? Public Opinion Quarterly 57, 348–357.

Harnad, S. 2000. Minds, Machines and Turing. Journal of Logic, Language and Information 9(4), 425–445.

Hal Leonard Corp. 1993. The Real Little Classical Fake Book. Milwaukee, WI: Hal Leonard Publishing Corporation.

Levisohn, A., and Pasquier, P. 2008. BeatBender: Subsumption Architecture for Rhythm Generation. In Proceedings of the ACM International Conference on Advances in Computer Entertainment (ACE 2008), Yokohama, Japan. ACM Press. 51–58.

Native Instruments. http://www.native-instruments.com.

Pearce, M., Meredith, D., and Wiggins, G. 2002. Motivations and Methodologies for Automation of the Compositional Process. Musicae Scientiae 6(2), 119–147.

Pearce, M., and Wiggins, G. 2001. Towards a framework for the evaluation of machine compositions. In Proceedings of the AISB'01 Symposium on Artificial Intelligence and Creativity in Arts and Science. Brighton: SSAISB. 22–32.

Pease, A., Winterstein, D., and Colton, S. 2001. Evaluating machine creativity. In Workshop on Creative Systems, 4th International Conference on Case-Based Reasoning (ICCBR-01), Vancouver, British Columbia, Canada, 30 July–2 August. 56–61.

Poe, G. S., Seeman, I., McLauglin, J., Mehl, E., and Dietz, M. 1988. "Don't know" boxes in factual questions in a mail questionnaire: Effects on level and quality of response. Public Opinion Quarterly 52, 212–222.

Ritchie, G. 2007. Some Empirical Criteria for Attributing Creativity to a Computer Program. Minds & Machines 17, 67–99.

Sibelius. http://www.sibelius.com.

Spector, L., and Alpern, A. 1994. Criticism, Culture, and the Automatic Generation of Artworks. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, Washington, USA. AAAI Press/MIT Press. 3–8.

Stamp, A., Isenberg, T., and Carpendale, M.S.T. 2007. A Case Study from the Point of View of Aesthetics: A Dialogue Between an Artist and a Computer Scientist. In Proceedings of Computational Aesthetics in Graphics, Visualization, and Imaging 2007 (CAe 2007), Banff, Alberta, Canada. Aire-la-Ville, Switzerland: Eurographics Association. 129–134.

Turing, A. 1950. Computing Machinery and Intelligence. Mind 59(236), 433–460.
Wallraven, C., Cunningham, D., Rigau, J., Feixas, M., and Sbert, M. 2009. Aesthetic appraisal of art: from eye movements to computers. In Computational Aesthetics 2009: Eurographics Workshop on Computational Aesthetics in Graphics, Visualization and Imaging. 137–144.

Whitelaw, M. 2004. Metacreation: Art and Artificial Life. Cambridge, MA: MIT Press.