Development of Techniques for the
Computational Modelling of Harmony
Raymond Whorley, Geraint Wiggins, Christophe Rhodes, and Marcus Pearce
Centre for Cognition, Computation and Culture
Goldsmiths, University of London
New Cross, London SE14 6NW, UK.
Wellcome Laboratory of Neurobiology
University College London
London WC1E 6BT, UK.
{r.whorley,g.wiggins,c.rhodes}@gold.ac.uk
marcus.pearce@ucl.ac.uk
Abstract. This research is concerned with the development of representational
and modelling techniques employed in the construction of
statistical models of four-part harmony. Multiple viewpoint systems have
been chosen to represent both surface and underlying musical structure,
and it is this framework, along with Prediction by Partial Match (PPM),
which will be developed during this work. Two versions of the framework
are described, starting with the strictest possible application of multiple
viewpoints and PPM, and then extending and generalising a little. Some
implementation details are reported, as are some preliminary results.
1 Introduction
The problem we are attempting to solve by computational means is this: given a
soprano part, add alto, tenor and bass such that the whole is pleasing to the ear.
This is not as easy as it might initially appear, as there aremany rules of harmony
to be followed, which have arisen out of composers’ common practice. Rather
than providing the computer with rules [1], however, we wish to investigate the
process of learning such rules. The idea is to write a program which allows the
computer to learn for itself how to harmonise in a particular style, by creating
a model of harmony from a corpus of existing music in that style. In our view,
however, present techniques are not sufficiently well developed for models to
generate stylistically convincing harmonisations (or even consistently competent
harmony) from both a subjective and an analytical point of view; although Allan
and Williams [2] have demonstrated the potential of this sort of approach.
A means of representing music which, when combined with machine learn-
ing and modelling techniques, shows particular promise, is multiple viewpoint
systems [3]. This framework allows us to model different aspects of the music,
and then combine the individual predictions of these models to give an overall
prediction. Our research aims to make a theoretical contribution to the field
of computational creativity in the domain of music by extending the multiple
viewpoint framework in order to cope with the complexities of harmony, such
11
that improved computational models of four-part harmonisation can be created.
This is not merely an application to harmony of the framework as it stands. This
paper is concerned with two versions of the framework, beginning with a very
strict application, and then extending and generalising a little.
2 Brief Description of Multiple Viewpoint Systems
and Their Evaluation
See Table 1 for a list of basic and derived viewpoints (not exhaustive) and their
meanings. Basic types are the fundamental attributes that are predicted, such as
cpitch and dur. Derived types such as cpint and dur-ratio are derived from,
and can therefore predict, basic types (in this case cpitch and dur respectively).
Threaded types are defined only at certain positions in a sequence, determined
by Boolean test viewpoints such as tactus; for example, (cpitch ⊖ tactus)
has a defined cpitch value only on tactus beats (i.e., the main beats in a bar).
A linked type, or product type, is the conjunction of two or more viewpoints; for
example, dur-ratio ⊗ cpint is able to predict both dur and cpitch. See also
[3] for more details.
Table 1. Basic and derived viewpoint types (not exhaustive).
Viewpoint Meaning Viewpoint Meaning
dur duration of event barlength number of time units in a bar
cont event continuation, or not phrase event at start or end of phrase
cpitch chromatic pitch piece event at start or end of piece
ioi difference in start-time contour descending, level, ascending
posinbar position of event in the bar cpintfref pitch interval from tonic
metre metrical importance of event inscale event in major scale, or not
cpint sequential pitch interval dur-ratio sequential duration ratio
fib on first beat of bar, or not liph last event in phrase, or not
tactus event on tactus pulse, or not fip first event in piece, or not
fiph first event in phrase, or not
N-gram Models are Markov models employing sub-sequences of n symbols.
The probability of the n
th symbol, the prediction, depends only upon the previous
n − 1 symbols, the context. The number of symbols in the context is the order
of the model. See [5] for more details.
What we call a viewpoint model is a weighted combination of various orders
of n-gram model of a particular viewpoint type. The n-gram models can be com-
bined by, for example, Prediction by Partial Match (PPM) [6]. PPM makes use
of a sequence of models, which we call a back-off sequence, for context matching
and the construction of complete prediction probability distributions. The back-
off sequence begins with the highest order model, proceeds to the second-highest
order, and so on. An escape method determines prediction probabilities at each
stage in the sequence.
12
A multiple viewpoint system comprises more than one viewpoint. The predic-
tion probability distributions of the individual viewpoint models are combined
by employing a weighted arithmetic or geometric [10] combination technique.
See [7] for more information.
Conklin [7] introduced the idea of using a combination of a long-term model
(LTM), which is a general model of a style derived from a corpus, and a short-
term model (STM), which is constructed as a piece of music is being predicted or
generated. The latter aims to capture musical structure particular to that piece.
An information-theoretic measure, cross-entropy, is used to guide the con-
struction of models, evaluate them, and compare generated harmonisations. The
model assigning the lowest cross-entropy to a set of test data is likely to be the
most accurate model of the data. See [5] for more details.
3 Development of the Multiple Viewpoint
and PPM Frameworks
Version 1: Strict Application of Multiple Viewpoints and PPM The starting
point for the definition of the strictest possible application of viewpoints is the
formation of vertical viewpoint elements [8]. An example of such an element is
{69, 64, 61, 57}, where all of the values are from the domain of the same view-
point, and all of the parts (soprano, alto, tenor and bass) are represented. This
method reduces the entire set of parallel sequences to a single sequence, thus
allowing an unchanged application of the multiple viewpoint framework, includ-
ing its use of PPM. Only those elements containing the given soprano note are
allowed in the prediction probability distribution, however. This is the base-level
model, to be developed with the aim of substantially improving performance.
Version 2: Dividing the Harmonisation Task into Sub-tasks In this version, it
is hypothesised that predicting all unknown symbols in a vertical viewpoint
element (as in version 1) at the same time is neither necessary nor desirable.
It is anticipated that by dividing the overall harmonisation task into a number
of sub-tasks [2] [9], each modelled by its own multiple viewpoint system, an
increase in performance can be achieved. For example, given a soprano line, the
first sub-task might be to generate the entire bass line. This version allows us
to experiment with different arrangements of sub-tasks. For example, having
generated the bass line, is it better to generate the alto and tenor lines together,
or one before the other? As in version 1, vertical viewpoint elements are restricted
to using the same viewpoint for each part. The difference is that not all of the
parts are now necessarily represented in a vertical viewpoint element.
4 Implementation
At present, the corpus comprises fifty major key hymn tunes, and the test data
five, harmonised as in [4].
The Lisp implementation of version 1 is capable of predicting or generating
the attributes dur (note duration), cont (note continuation, which is the part
of an already sounding note which continues to be heard when a new note is
13
sounded) and cpitch (chromatic pitch) for the alto, tenor and bass parts, given
the soprano. More than forty viewpoints have been implemented, and any link
between two viewpoints which is capable of predicting dur, cont or cpitch
is allowed. A modification of the feature selection algorithm described in [10],
which involves ten-fold cross-validation of the corpus, is used to optimise multiple
viewpoint systems for the long-term model alone, the short-term model alone, or
for both together (in which case the same system is used for both). The maximum
order of the n-gram models can be varied, as can the method of combining
prediction probability distributions, which are initially created using PPM with
escape method C. Parameters (biases) affecting the weighting of distributions
during combination can also be varied.
Version 2 extends version 1, and is implemented as described in Section 3.
5 Preliminary Results
Table 2 shows the lowest cross-entropy version 1 multiple viewpoint systems
found so far for prediction of dur, cont and cpitch. These are for a combination
of long-term and short-term models (LTM and STM, with a cross-entropy of
4.46 bits per event), LTM only (with a cross-entropy of 4.54 bits per event), and
STM only (with a cross-entropy of 6.20 bits per event), using weighted geometric
combination. This confirms the findings of previous research, for example that
of Pearce [10], that using both LTM and STM results in a lower cross-entropy
than the use of either of them alone. What is particularly interesting, however,
is the fact that the STM system does not share a single viewpoint with the
LTM + STM system, and has only one viewpoint in common with the LTM
system; this is in stark contrast with the substantial overlap between the LTM
+ STM system and the LTM system. This prompted us to try using two different
multiple viewpoint systems together, one optimised for the LTM and the other
separately optimised for the STM; but with a cross-entropy of 4.51 bits per
event, this turned out to be not as good a model as LS in Table 2.
For prediction of cpitch only, the best version 1 LTM system found so far
results in a cross-entropy of 3.29 bits per event. By comparison, the best version
2 LTM system found so far predicts the bass first (1.70 bits per prediction),
followed by the alto and tenor together (1.55 bits per prediction), giving a total
cross-entropy of 3.25 bits per event. For prediction of cpitch only, then, version
2 appears to be very slightly better than version 1. It is worth noting that the
best version 2 system reflects the usual human approach to harmonisation: bass
first, followed by alto and tenor together.
6 Conclusions and Future Work
We have described two versions of the multiple viewpoint framework and PPM,
motivated by our aim to take account of the complexities of four-part harmony.
The preliminary results weakly indicate that version 2 is better than version 1 for
the prediction of cpitch only. They also suggest the perhaps counter-intuitive
conclusion that optimising the LTM and STM together leads to a better model
than optimising them separately. This latter result opens interesting routes for
14
Table 2. Best version 1 multiple viewpoint systems (predicting dur, cont and cpitch)
for LTM + STM (LS), LTM only (L) and STM only (S).
Viewpoint LS L S Viewpoint LS L S
cont ⊗ cpint × × (cpintfref ⊖ fiph) ⊗ piece ×
cont ⊗ (cpintfref ⊖ tactus) × × cpitch × ×
dur ⊗ (cpintfref ⊖ liph) × × dur-ratio ⊗ (ioi ⊖ fib) ×
cont ⊗ metre × × dur-ratio ⊗ phrase ×
dur ⊗ posinbar × × dur ⊗ cont ×
cpintfref × × cont ⊗ (cpitch ⊖ tactus) ×
dur ⊗ liph × × inscale ×
(cpintfref ⊖ liph) × × contour ×
(cpintfref ⊖ fiph) ⊗ fip × × cpitch ⊗ tactus ×
cpint ⊗ cpintfref × cpitch ⊗ (cpintfref ⊖ liph) ×
(cpintfref ⊖ fib) × inscale ⊗ barlength ×
cont ⊗ (cpintfref ⊖ liph) × cpitch ⊗ (cpintfref ⊖ fiph) ×
further work. Finally, using the LTM alone is less good still; and the STM alone
is, as expected, by far the least good model.
In the immediate future, we intend to implement other versions which push
the development of the multiple viewpoint/PPM framework further.
<references_biblio/>
References
1. Ebcio˘glu, K.: An Expert System for Harmonizing Four-Part Chorales. Computer
Music Journal, 12(3), 43–51 (1988)
2. Allan, M., Williams, C.K.I.: Harmonising Chorales by Probabilistic Inference. In:
L.K. Saul, Y. Weiss, L. Bottou, editors, Advances in Neural Information Processing
Systems, vol. 17. MIT Press (2005)
3. Conklin, D.,Witten, I.H.: Multiple Viewpoint Systems for Music Prediction. Journal
of New Music Research, 24(1), 51–73 (1995)
4. Vaughan Williams, R., editor. The English Hymnal. Oxford University Press (1933)
5. Manning, C.D., Sch¨utze, H.: Foundations of Statistical Natural Language Processing.
MIT Press (1999)
6. Cleary, J.G., Witten, I.H.: Data Compression Using Adaptive Coding and Partial
String Matching. IEEE Trans Communications, COM-32(4), 396–402 (1984)
7. Conklin, D.: Prediction and Entropy of Music. Master’s Thesis, Department of Computer
Science, University of Calgary, Canada (1990).
8. Conklin, D.: Representation and Discovery of Vertical Patterns in Music. In: C.
Anagnostopoulou, M. Ferrand, A. Smaill, editors, Music and Artificial Intelligence:
Proc. ICMAI 2002, LNAI, vol. 2445, pp. 32–42. Springer-Verlag (2002)
9. Hild, H., Feulner, J., Menzel,W.: Harmonet: A Neural Net for Harmonizing Chorales
in the Style of J.S. Bach. In: R.P. Lippmann, J.E. Moody, D.S. Touretzky, editors,
Advances in Neural Information Processing Systems, vol. 4, pp. 267–274. Morgan
Kaufmann (1992)
10. Pearce, M.T.: The Construction and Evaluation of Statistical Models of Melodic
Structure in Music Perception and Composition. Ph.D. Thesis, Department of Computing,
City University, London (2005)
15