Autonomously Creating Quality Images
David Norton, Derrall Heath and Dan Ventura
Computer Science Department
Brigham Young University
Provo, UT 84602 USA
dnorton@byu.edu, dheath@byu.edu, ventura@cs.byu.edu
Abstract
Creativity is an important part of human intelligence, and it
is difficult to quantify (or even qualify) creativity in an intelligent
system. Recently it has been suggested that quality,
novelty, and typicality are essential properties of a creative
system. We describe and demonstrate a computational system
(called DARCI) that is designed to eventually produce
images in a creative manner. In this paper, we focus on quality
and show, through experimentation and statistical analysis,
that DARCI is beginning to be able to produce images
with quality comparable to those produced by humans.
Introduction
DARCI (Digital Artist Communicating Intention) is a computer
system designed to eventually create visual art in order
to convey intention and meaning to the viewer. Currently,
DARCI can automatically render a given image to match an
accompanying list of adjectives. This ability is the foundation
of a visual language for DARCI to communicate with
an audience—an important element of creative expression
in the visual arts. DARCI is part of ongoing research that is
exploring the perception of creativity in an artificial system.
Measuring creativity both quantitatively and qualitatively
is a difficult challenge. Ritchie describes quality, novelty,
and typicality as being essential in ascribing creativity to a
system (2007). Ritchie defines quality as the extent to which
the artefact is a high quality example of its genre. In this paper,
we focus on quality, and show that DARCI is beginning
to be able to produce quality artefacts comparable to human
artists given the same resources.
DARCI’s design has two main components: the image appreciation
component, and the image creation component.
The image appreciation component is designed to allow
DARCI to learn to evaluate it’s own artwork according to
various descriptive words. This ability to assess these qualities
in an image guides the image creation component. The
image creation component uses evolutionary mechanisms to
create artefacts and the appreciation component serves as
part of the fitness function.
We briefly describe the main components of DARCI and
how they work together to produce artefacts. We then
present several images that DARCI has created and describe
an experiment in which we compare DARCI’s images with
ones made by humans. Finally, we discuss how the results
show that DARCI is becoming comparable to humans in
producing quality artefacts.
Image Appreciation
It has been argued that the ability to appreciate and evaluate
its own artefacts is necessary for a system to be considered
creative (Colton 2008). In order for DARCI to appreciate
art, it must first acquire some basic understanding of art. For
example, in order for DARCI to appreciate an image that is
dark and gloomy, DARCI must first understand the concepts
dark and gloomy. To do this, DARCI must learn to associate
images with artistic descriptions.
Image Features Before DARCI can form associations between
images and descriptive words, appropriate image features
for the task must be extracted from the image. Significant
research has been done in the area of image feature
extraction (Gevers and Smeulders 2000; Datta et al. 2006;
Li and Chen 2009; Wang, Yu, and Jiang 2006; King ;
Wang and He 2008), and we have culled 102 image features
from this. These are low-level features that can be coarsely
classified as treating one the following image characteristics:
color, light, texture, and shape.
Artistic Descriptions As an initial step, the artistic descriptions
that DARCI can learn are limited to lists of adjectives.
We use WordNet’s (Fellbaum 1998) database of
adjectives to give us a large, yet finite, set of descriptive labels.
In WordNet, each word belongs to a synset of one or
more words that share the same meaning. If a word has multiple
meanings, then it can be found in multiple synsets. To
collect training data, we have created a public website for
training DARCI (http://darci.cs.byu.edu). From this website,
users are presented with a random image and asked to
provide adjectives that describe the image. Additionally, for
each image presented to the user, DARCI lists seven adjectives
that it associates with the image. The user is allowed
to flag those labels that are not accurate. This creates strictly
negative examples of those synsets, which is important for
learning.
Another program for creatively generating visual art,
NodeBox, is also dependent on semantic networks such as
WordNet. The NodeBox project takes the use of semantic
networks even further by using a more elaborate database
they created called “Perception” (De Smedt, De Bleser, and
Proceedings of the Second International Conference on Computational Creativity 10
Nijs 2010). However, unlike DARCI, NodeBox does not
have a strong learning component. In the future, we hope to
expand DARCI by using more sophisticated semantic networks,
perhaps even “Perception” itself.
Learning Method In order to make the association between
image features and synsets, we use a collection of
artificial neural networks (ANNs) that we call appreciation
networks. There is an appreciation network for each synset
that has a sufficient amount of training data. As we incrementally
accumulate more data, new neural networks can
be dynamically added to the collection to accommodate the
new synsets. Currently, there are 211 appreciation networks.
This means that DARCI essentially “knows” 211 synsets.
For more details on our learning method, image features,
and use of synsets, the reader is referred to earlier work describing
DARCI (Norton, Heath, and Ventura 2010).
Image Creation
DARCI uses a evolutionary mechanism to render images according
to given synsets, and this mechanism operates in two
modes. The initial mode, which we call practice mode, operates
by exploring the space of image filters that will render
any image according to a single specific synset. For this
mode, DARCI creates and maintains a separate gene pool
for each synset that the system knows. The second mode,
called commission mode, operates by exploring the space of
image filters that will render a specific image according to
a specified list of synsets. There is no restriction on synset
combinations; in fact, incoherent combinations can produce
unexpected and interesting results as we will demonstrate
later. For commission mode, users prescribe the image and
list of synsets that they wish DARCI to render—in other
words, they “commission” DARCI. For each commission,
DARCI creates a unique gene pool that terminates once the
commission is complete. For both modes, the evolutionary
mechanism functions as follows.
The genotypes that comprise each gene pool are lists of
filters, and their accompanying parameters, for processing
an image. Many of these filters are similar to those found
in Adobe Photoshop and other image editing software. Others
come from a series of 1000 filters Simon Colton discovered
using his own evolutionary mechanism (Colton et
al. 2010). Colton’s set of filters, called Filter Feast, is divided
into categories of aesthetic effect that were discovered
by exploring combinations of very basic filters within a tree
structure. We have treated Colton’s filters as if each category
were a unique filter with a single parameter that specifies
the specific filter within the category to use. Figure 1 gives
an example of a genotype and its effect on a sample image.
There are a total of sixty-one traditional filters that we selected
for DARCI to use and a total of thirty-one categories
of filters from Filter Feast, making ninety-two filters available
for each genotype. We selected traditional filters that
were easily accessible, diverse, fast, and that didn’t incorporate
alpha values (since our feature extraction techniques
cannot yet process alpha values).
The fitness function for the evolutionary mechanism can
be expressed by the following equation:
Figure 1: Sample genotype (list of image filters with parameters)
and its effect on an image. “Ripple” and “Weave” are
the names of two (of ninety-two) possible filters.
Fitness(g) = λAA(g) + λI I(g) (1)
where g is an image artefact and A : G → [0, 1] and I :
G → [0, 1] are two metrics: appreciation and interest. These
compute a real-valued score for an image artefact (here, G
represents the set of all image artefacts). λA + λI = 1, and
for now, λA = λI = 0.5.
Both metrics used in the fitness function are applied to
the phenotype (the image that results when each genotype
is applied to a source image). The fitness of every phenotype
within a generation of the evolutionary mechanism is
determined using the same source image; but, the source image
used from generation to generation depends upon which
mode the system uses. In commission mode, the source image
is the same from generation to generation, while in practice
mode the source image for each generation is randomly
selected from DARCI’s growing image database.
The appreciation metric A is computed as the (weighted
sum) of the output(s) of the appropriate appreciation network(s),
producing a single (normalized) value:
A(g) = X
w∈C
αwnetw(g) (2)
where C is the set of synsets to be portrayed, netw(·) is the
output of the appreciation network for synset w,
P
w αw =
1, and αw = 1/|C| (though this can, of course, be changed
to weight synsets unequally).
The interest metric I penalizes phenotypes that are either
too different from the source image, or are too similar. This
metric is useful for producing images that meet our definition
of imaginative; however, the interest metric is currently
too simplistic to do more than prevent extreme cases. The
metric begins by tallying the number, n, of image analysis
features that have similar values between the two images
(i.e. that fall within a specified distance of each other). This
can be expressed with the following equation:
n =
X
i
d0.3 − |F
S
i − F
P
i
|e (3)
Proceedings of the Second International Conference on Computational Creativity 11
Number of Sub-Populations 8
Size of Sub-Populations 15
Crossover Rate 0.4
Filter Mutation Rate 0.03
Parameter Mutation Rate 0.1
Migration Rate 0.2
Migration Frequency 0.1
Tournament Selection Rate 0.75
Initial Genotype Length 2 to 4 filters
Table 1: Parameters used for the evolutionary mechanism.
F
S
i
represents feature i of the source image and F
P
i
represents
feature i of the phenotype. Note that all features are
normalized to the range [0...1], so the ceiling function above
returns either 0 or 1. The value 0.3 was chosen empirically.
The interest metric is calculated using n as follows:
I(g) = 1 −



τd−n
τd
if n < τd
n−τs
|F |−τs
if n > τs
0 if τd ≤ n ≤ τs
(4)
τd and τs are constants that correspond to the threshold for
determining, respectively, when a phenotype is too different
from or too similar to the source image. The values τd = 20
and τs = 57 were used here. |F| is the total number of
features analyzed, in our case 102.
Fitness-based tournament selection determines those
genotypes that propagate to the next generation and those
genotypes that participate in crossover. One-point “cut and
splice” crossover is used to allow for variable length offspring.
Crossover is accomplished in two stages: the first
occurs at the filter level, so that the two genomes swap an
integer number of filters; the second occurs at the parameter
level, so that filters on either side of the cut point swap an
integer number of parameters. By necessity, parameter list
length is preserved for each filter. Table 1 shows the parameter
settings used.
Mutation also occurs at two levels. Filter mutation is a
wholesale change of filter (discrete values), while parameter
mutation is a change in parameter values for a filter (continuous
values). When filter mutation occurs, either a single filter
within a genotype changes or a new filter is added. When
a parameter mutation occurs, anywhere from one to all of
the parameters for a single filter in a genotype are changed.
The degree of this change, ∆fi
, for each parameter, i, is
determined by one of the following two equations chosen
randomly with equal probability:
∆fi = (1 − fi) · rand 
0,
(|f| + 1) − |∆f|
|f|

(5)
∆fi = −fi
· rand 
0,
(|f| + 1) − |∆f|
|f|

(6)
Here, |f| is the total number of parameters in the mutating
filter, |∆f| is the number of changing parameters in the mutating
filter, and rand(x, y) is a function that uniformly selects
a real value between x and y.
Because there are potentially many ideal filter configurations
for modeling any given synset, we have implemented
sub-populations within each gene pool. This allows the evolutionary
mechanism to converge to multiple solutions, all of
which could be different and valid. The migration frequency
controls the probability that a migration will occur at a given
epoch, while the migration rate refers to the percentage of
each sub-population that migrates. Migrating genomes are
selected uniform randomly, with the exception that the most
fit genotype per sub-population is not allowed to migrate.
Migration destination is also selected uniform randomly, except
that sub-population size balancing is enforced.
Practice gene pools are initialized with random genotypes,
while commission gene pools are initialized with the
most fit genotypes from the practice gene pools corresponding
to the requested synsets. This allows commissions to become
more efficient as DARCI practices known synsets. It
also provides a mechanism for balancing permanence (artist
memory) with growth (artistic progression).
Methods and Results
The evaluation of artefacts is very subjective, making an
evaluation of DARCI non-trivial. Furthermore, the quality
of the artefacts that DARCI produces can be judged based
on two distinct criteria: how well the artefacts portray the
synsets dictated by a commission, and how well the artefacts
demonstrate artistic skill. Depending on the synsets
in question, the first criterion can be considered less subjective
than the second. For example, if the synset blue, as in
the color blue, were chosen, the degree to which an artefact
possesses the color blue could be measured quite objectively.
As less simple/concrete synsets are applied, this criterion becomes
increasingly subjective; however, we argue that it will
never be more subjective than a general assessment of artistic
merit. For this reason, we have chosen to focus on the
first criterion of quality and relegate the second criterion to
an interesting side note in this paper.
Despite focusing on the first criterion of quality, we want
to eventually move in the direction of artistic analysis of
DARCI’s artefacts. Thus, we have selected three synsets
that, while dictating some expected traits within an image,
also prescribe subjective features within an image. The
synsets we have selected are “fiery” as in like or suggestive
of fire, “happy” as in enjoying or showing or marked by joy
or pleasure, and “lonely” as in lacking companions or companionship.
These synsets are well represented in DARCI’s
database and are distinct in meaning.
Because there is always a subjective component in determining
whether an image can be described by a given adjective,
the most objective way that we can evaluate such
quality is through a combination of many personal opinions.
For this reason, we designed a survey in which people
rank DARCI’s artefacts, alongside several other artefacts,
with respect to how well the images reflect particular
adjectives. For this survey we selected three images
on which to test DARCI’s rendering of the aforementioned
synsets. The images we selected are shown in Figure 2. We
chose photographs in order to accentuate the impact of the
non-photorealistic rendering tools available to DARCI. The
Proceedings of the Second International Conference on Computational Creativity 12
(a) Image A (b) Image B (c) Image C
Figure 2: The three source images used to evaluate the quality
of DARCI’s artefacts.
three photographs explore the light vs. dark, chromatic vs.
monochromatic, and close vs. distant spectrums.
For each photograph and for each synset we commissioned
DARCI to produce an image that portrays the synset;
we also commissioned DARCI to produce a variation of
each image that portrays all three synsets simultaneously in
order to demonstrate the effect of combining synsets with
disjointed meaning (this results in a total of 3 × 3 + 3 = 12
images). For comparison, we collected three additional sets
of 12 homologous images: a set chosen by us from a collection
of images created by DARCI, a set commissioned to
human artists, and a set chosen by us from a collection of
randomly generated images.
For the set created by DARCI, we allowed DARCI to
practice the three synsets for eight hours a piece, and then
gave the system sixteen hours to complete each commission.
For every commission, DARCI chose the single image with
the highest fitness as the result of the commission.
In addition, for each commission, DARCI saved the top
five unique images (those with the highest fitness) encountered
within each sub-population, for a total of forty images.
From these, we chose the single image we thought best portrayed
the commission target synset(s). (We made this selection
with no knowledge of DARCI’s fitness values for the
40 images, and, in particular, we did not know which of the
images DARCI ranked highest and selected as the result of
its commission.) This image we selected represents a close
cooperation between DARCI and DARCI’s programmers—
or, looked at another way, the use of DARCI as a tool rather
than as an autonomous agent.
A third set of images was created by human volunteer
artists, who were restricted to a toolset similar to that used by
DARCI (i.e. image filters) and were skilled with programs
(e.g. Photoshop) using this toolset.
Each image in the final set was chosen from a set of 40
randomly generated images, each of which was generated
using 1 − 8 of the same filters available to DARCI. In order
to ensure a reasonable image, and to provide a point of
comparison between random filter generation and DARCI’s
evolutionary mechanism, we chose the one image (out of 40)
that we thought best portrayed the synset in question.
In summary, we acquired four images for every synsetimage
combination. One was DARCI’s most fit artefact
(DARCI), one was our choice out of DARCI’s top artefacts
(Coop), one was produced by a human (Human), and
one was our choice out of randomly filtered images (Best
Random). Representative examples of some of the twelve
Figure 3: The average ranking for each synset across images
A, B, and C for each of the four artefact sources: DARCI,
Human, Coop, and Best Random. “triple” refers to the artefacts
rendered with all three synsets. These results were obtained
from 42 volunteers. Lower rank is better.
Human DARCI Coop Best Random
Average Rank 2.4067 2.7282 2.3194 2.5456
Table 2: The average ranking of the four artefact sources
across all image-synset combinations. These results were
obtained from 42 volunteers. Lower rank is better.
synset-image combinations can be found in Figures 4-7. In
the online survey, volunteers were instructed to rank the four
images for each synset-image combination according to how
well they portrayed the synset(s) in question. In addition, we
asked the volunteers to indicate which images they liked regardless
of adjective compatibility. This additional question
was added to stress to the volunteers the fact that the ranking
was to be independant of personal preference for the images.
We obtained a total of forty-two survey responses.
The results of this survey are encapsulated in Figure 3
and Table 2. Figure 3 shows the average ranking for each
synset across all three images for each of the four artefact
sources just summarized: DARCI, Human, Coop, and Best
Random (the lower the rank, the better). Table 2 shows the
average ranking of each of the four artefact sources across all
synsets and images. Table 3 shows which pairs of datapoints
in the aforementioned figures are statistically significant—
such pairs are denoted with an asterisk.
fiery happy lonely triple all synsets
Human/DARCI 0.501 * 1.000 * *
Human/Coop * * 0.155 * 0.217
Human/Best Random 0.286 * 0.326 0.339 0.0540
DARCI/Coop * * 0.132 * *
DARCI/Best Random 0.691 * 0.298 * *
Coop/Best Random * 1.000 0.640 * *
* p-value < 0.01
Table 3: Results of t-Test comparing all binary combinations
of image sources for each synset. The “all synsets” column
refers to Table 2. The other columns refer to Figure 3.
Proceedings of the Second International Conference on Computational Creativity 13
(a) DARCI (b) Best Random (c) Coop (d) Human
Figure 4: Image A rendered “fiery”—ranked left-to-right
from most to least fiery.
(a) Human (b) Coop (c) DARCI (d) Best Random
Figure 5: Image A rendered “happy”—ranked left-to-right
from most to least happy.
Discussion
By looking at Table 2, we see that DARCI functioning autonomously
does perform the worst of the artefact sources,
but not dramatically so. Furthermore, Table 3 indicates that
human performance was not distinguishable, in the statistical
significance sense, from the performance of DARCI in
cooperation with humans nor from the performance of humans
choosing the best random image. These results suggest
that, overall, volunteers are not strongly preferring one
artefact source over another.
When looking at performance over individual synsets
(Figure 3), we see that a more distinct preference is given to
certain artefact sources over others. But, even in these cases,
the source given preference varies from synset to synset.
Looking at Figure 3, the clearest distinction between sources
is between the human and autonomous DARCI when rendering
“happy” images. In this case humans clearly outperform
DARCI. However, in the case of “fiery” images,
DARCI performs statistically the same as humans. When in
cooperation with humans, DARCI significantly outperforms
solo humans in both “fiery” images and images combining
all three synsets. In the case of “lonely” images, none of
the artefact sources perform statistically different from one
another. Volunteers prefer human creations for “happy” images
and they prefer DARCI-human collaborations for both
“fiery” images and images combining “fiery”, “happy”, and
“lonely”.
If we look even more specifically at the individual synsetimage
pairs, we find that all artefact sources are top ranked
for some of the pairings. Autonomous DARCI is top ranked
for “fiery” image A and “lonely” image C; the best-ofrandom
source is top ranked for “happy” image C, “lonely”
image A, and “triple” image A; DARCI in cooperation with
humans is ranked top for “fiery” image B, “fiery” image C,
and “triple” image B; humans creating solo are ranked top
for “happy” image A, “happy” image C, “lonely” image B,
and “triple” image C. The rankings for the most substantial
successes of each artefact source are shown in Figures 4-7.
While DARCI’s solo artefacts often rank on par with hu-
(a) Best Random (b) Coop (c) Human (d) DARCI
Figure 6: Image B rendered “happy”—ranked left-to-right
from most to least happy.
(a) Coop (b) Human (c) DARCI (d) Best Random
Figure 7: Image B rendered “fiery”, “happy”, and
“lonely”—ranked left-to-right from most to least fiery,
happy, and lonely.
man artefacts, the best random artefacts do as well. Furthermore,
these partially random artefacts are sometimes ranked
better than DARCI’s. If these were totally randomly generated
artefacts, then this would be an area of concern. It turns
out, however, that given the number of random images from
which we selected, it is fairly common to encounter at least
one image that (at least to some extent) satisfies the demands
of the synset in question. Taking into account Ritchie’s proposal
that the proportion of high quality artefacts produced
should be correlated with creativity (Ritchie 2007), and observing
DARCI’s top forty artefacts, it becomes clear that
DARCI is accomplishing something better than random image
generation. Figure 8 shows the 40 images DARCI chose
to save while rendering image A as “fiery”, while, for comparison,
Figure 9 shows the 40 random images generated for
the same task. While we did not empirically determine the
proportion of images in these sets that are “fiery”, it is apparent
that significantly more images are “fiery” in Figure 8
than in Figure 9.
Conclusions
If we assume that the human artists commissioned to produce
artefacts for this research did indeed produce renderings
that portray the synsets, then we conclude that, given
the same toolset, DARCI can also produce renderings that
portray them. This is a compulsory assumption since by the
nature of art, the only way DARCI can be evaluated as an
artist, is in comparison to other (human) artists. While on the
whole, at this point people tend to favor human solo works
over DARCI’s solo works, the differences are not substantial
or consistent enough to warrant a different conclusion. Furthermore,
the collaboration between DARCI and human’s
was frequently favored over human solo artefacts. This indicates
the potential for DARCI to be used as a tool to augment
the creative process of human artists.
Only three synsets were tested in this experiment. However,
these synsets are representative of the meaning that
we want DARCI to be able to incorporate into artefacts to
facilitate visual communication with an audience. DARCI
Proceedings of the Second International Conference on Computational Creativity 14
Figure 8: The top five “fiery” renderings for all eight subpopulations
discovered by DARCI for image A (not ordered
by fitness).
has sufficient data, ergo sufficient appreciation to perform
similarly on many more synsets. We are currently updating
DARCI so that the system can perform commissions online
while using any known synsets. This will allow us to further
observe DARCI’s capacity for rendering.
In future work regarding the evaluation DARCI, we will
be exploring Ritchie’s other criteria for creativity: namely
novelty and typicality. In addition, we will explore the artistic
side of quality, rather than the strictly pragmatic one explored
in this research (i.e. the degree to which synsets were
incorporated into the artefacts).
Acknowledgments
Warm thanks to Simon Colton for providing us with Filter
Feast image filters that were included in DARCI’s toolset.
This material is based upon work supported by the National
Science Foundation under Grant No. IIS-0856089.
<references_biblio/>
References
Colton, S.; Gow, J.; Torres, P.; and Cairns, P. 2010. Experiments
in objet trouve browsing. ´ Proceedings of the International
Conference on Computational Creativity 238–247.
Colton, S. 2008. Creativity versus the perception of creativity
in computational systems. Creative Intelligent Systems:
Papers from the AAAI Spring Symposium 14–20.
Datta, R.; Joshi, D.; Li, J.; and Wang, J. Z. 2006. Studying
aesthetics in photographic images using a computational approach.
Lecture Notes in Computer Science 3953:288–301.
Figure 9: The forty randomly generated renderings of image
A for the synset “fiery”.
De Smedt, T.; De Bleser, F.; and Nijs, L. 2010. Nodebox 2.
Proceedings of the 16th International Symposium on Electronic
Art (ISEA 2010).
Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical
Database. The MIT Press.
Gevers, T., and Smeulders, A. 2000. Combining color and
shape invariant features for image retrieval. IEEE Transactions
on Image Processing 9:102–119.
King, I. Distributed content-based visual information
retrieval system on peer-to-pear(p2p) network.
http://appsrv.cse.cuhk.edu.hk/~miplab/discovir/.
Li, C., and Chen, T. 2009. Aesthetic visual quality assessment
of paintings. IEEE Journal of Selected Topics in Signal
Processing 3:236–252.
Norton, D.; Heath, D.; and Ventura, D. 2010. Establishing
appreciation in a creative system. Proceedings of the International
Conference on Computational Creativity.
Ritchie, G. 2007. Some empirical criteria for attributing creativity
to a computer program. Minds and Machines 17:67–
99.
Wang, W.-N., and He, Q. 2008. A survey on emotional
semantic image retrieval. Proceedings of the International
Conference on Image Processing.
Wang, W.-N.; Yu, Y.-L.; and Jiang, S.-M. 2006. Image retrieval
by emotional semantics: A study of emotional space
and feature extraction. IEEE International Conference on
Systems, Man, and Cybernetics 4:3534–3539.
Proceedings of the Second International Conference on Computational Creativity 15