CMU-ML-16-103
Machine Learning Department
School of Computer Science, Carnegie Mellon University



CMU-ML-16-103

Expressive Collaborative Music Performance via Machine Learning

Gus (Guangyu) Xia

August 2016

Ph.D. Thesis

CMU-ML-16-103.pdf


Keywords: Expressive performance, human-computer interaction, automatic accompaniment, spectral learning, humanoid robot


Techniques of Artificial Intelligence and Human-Computer Interaction have empowered computer music systems with the ability to perform with humans via a wide spectrum of applications. However, musical interaction between humans and machines is still far less musical than the interaction between humans since most systems lack any representation or capability of musical expression. This thesis contributes various techniques, especially machine-learning algorithms, to create artificial musicians that perform expressively and collaboratively with humans. The current system focuses on three aspects of expression in human-computer collaborative performance: 1) expressive timing and dynamics, 2) basic improvisation techniques, and 3) facial and body gestures. Timing and dynamics are the two most fundamental aspects of musical expression and also the main focus of this thesis. We model the expression of different musicians as co-evolving time series. Based on this representation, we develop a set of algorithms, including a sophisticated spectral learning method, to discover regularities of expressive musical interaction from rehearsals. Given a learned model, an artificial performer generates its own musical expression by interacting with a human performer given a predefined score. The results show that, with a small number of rehearsals, we can successfully apply machine learning to generate more expressive and human-like collaborative performance than the baseline automatic accompaniment algorithm. This is the first application of spectral learning in the field of music.

Besides expressive timing and dynamics, we consider some basic improvisation techniques where musicians have the freedom to interpret pitches and rhythms. We developed a model that trains a different set of parameters for each individual measure and focus on the prediction of the number of chords and the number of notes per chord. Given the model prediction, an improvised score is decoded using nearest-neighbor search, which selects the training example whose parameters are closest to the estimation. Our result shows that our model generates more musical, interactive, and natural collaborative improvisation than a reasonable baseline based on mean estimation.

Although not conventionally considered to be "music", body and facial movements are also important aspects of musical expression. We study body and facial expressions using a humanoid saxophonist robot. We contribute the first algorithm to enable a robot to perform an accompaniment for a musician and react to human performance with gestural and facial expression. The current system uses rule-based performance-motion mapping and separates robot motions into three groups: finger motions, body movements, and eyebrow movements. We also conduct the first subjective evaluation of the joint effect of automatic accompaniment and robot expression. Our result shows robot embodiment and expression enable more musical, interactive, and engaging human-computer collaborative performance.

147 pages

Thesis Committee:
Roger Dannenberg (Chair)
Geoff Gordon
Larry Wasserman
Arshia Cont (INRIA/CNRS)

Manuela M. Veloso, Head, Machine Learning Department
Andrew W. Moore, Dean, School of Computer Science


SCS Technical Report Collection
School of Computer Science