Jordan Bryan

AI for art

By definition, music is confined to travel through the audio domain. Once it reaches an observer, though, it is free to speak through other sensory channels.

Music evokes imagery: colors, shapes and motion. For some people, this imagery is incredibly vivid and fantastical. For most people – myself included – the imagery is more pedestrian. For instance if I am listening to a Miles Davis record, I perceive the bell of a trumpet, some essence of its luster, and perhaps a flickering silhouette of Davis himself.

Though the subject matter of these images is obvious, they are compelling because the mind renders them imperfectly. These imperfections lift the evocations of music into a higher reality than the one captured by a camera. Close your eyes and stare into the teeming darkness behind your eyelids. This is the world of musical imagery, where instruments melt into one another, and forms materialize and dissolve in a cyclic hallucination.

The goal of this project was to approximate that world.


Duke University’s +DATASCIENCE initiative recently closed the submission period for its first AI for Art Competition. I put forward an entry that used a convolutional variational autoencoder to generate video clips of different instruments being played by a stationary figure.

The autoencoder was trained on a corpus of frames extracted from videos of a figure playing the instruments listed below:

The video clip below was generated according to a sampling scheme that chooses a random point in the low-dimensional space, lingers there for a random amount of time according to an auto-regressive lag-1 Markov process, and then moves to another randomly sampled point. Several frames are rendered along the straight line (probably not ideal) between sampled points to make the transitions between regions of the space appear somewhat smooth.

The video frames the model generates do not correspond directly to any of the frames used to train the model. They are the autoencoder’s “imaginings” of new frames that resemble those used in training. The submission to the AI for Art Competition consisted of an extended version of the clip above, accompanied by audio from Jimi Hendrix’s “Manic Depression.”

One obvious shortcoming of this work is the fact that audio does not factor into the model at all. The task of trying to learn a joint representation of audio/visual signals seemed daunting given a tight time frame. However, an ideal visualization algorithm would have such a representation, whereby audio signals could be used as input to generate corresponding images.


  1. The code to train the convolutional variational autoencoder was adapted from code found in this GitHub repository
  2. The code to join static frames output by the audoencoder into a .mp4 file was taken and adapted from
  3. Video frames used as training data were taken with iPhone 8 and trimmed using iMovie