IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 3, JUNE 2012 747

Learn2Dance: Learning Statistical Music-to-Dance Mappings for Choreography Synthesis

Ferda Ofli, Member, IEEE, Engin Erzin, Senior Member, IEEE, Yücel Yemez, Member, IEEE, and A. Murat Tekalp, Fellow, IEEE

Abstract—We propose a novel framework for learning many-to-many statistical mappings from musical measures to dance figures towards generating plausible music-driven dance choreographies. We obtain music-to-dance mappings through use of four statistical models: 1) musical measure models, representing a many-to-one relation, each of which associates different melody patterns to a given dance figure via a hidden Markov model (HMM); 2) exchangeable figures model, which captures the diversity in a dance performance through a one-to-many relation, extracted by unsupervised clustering of musical measure segments based on melodic similarity; 3) figure transition model, which captures the intrinsic dependencies of dance figure sequences via an n-gram model; 4) dance figure models, capturing the variations in the way particular dance figures are performed, by modeling the motion trajectory of each dance figure via an HMM. Based on the first three of these statistical mappings, we define a discrete HMM and synthesize alternative dance figure sequences by employing a modified Viterbi algorithm. The motion parameters of the dance figures in the synthesized choreography are then computed using the dance figure models. Finally, the generated motion parameters are animated synchronously with the musical audio using a 3-D character model. Objective and subjective evaluation results demonstrate that the proposed framework is able to produce compelling music-driven choreographies.

Index Terms—Automatic dance choreography creation, multimodal dance modeling, music-driven dance performance synthesis and animation, music-to-dance mapping, musical measure clustering.

I. INTRODUCTION

CHOREOGRAPHY is the art of arranging dance movements for performance. Choreographers tailor sequences of body movements to music in order to embody or express ideas and emotions in the form of a dance performance. Therefore, dance is closely bound to music in its structural course, artistic expression, and interpretation. Specifically, the rhythm and expression of body movements in a dance performance are

Manuscript received December 07, 2010; revised August 25, 2011 and November 17, 2011; accepted December 12, 2011. Date of publication December 23, 2011; date of current version May 11, 2012. This work was supported by TUBITAK under project EEEAG-106E201 and COST2102 action. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Daniel Gatica-Perez. F. Ofli is with the Tele-Immersion Group, Electrical Engineering and Computer Sciences Department, College of Engineering, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]). E. Erzin, Y. Yemez, and A. M. Tekalp are with the Multimedia, Vision and Graphics Laboratory, College of Engineering, Koç University, Sariyer, Istanbul 34450, Turkey (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2011.2181492

in synchrony with those of the music, and hence, the metric orders in the course of music and dance structure coincide, as Reynolds states in [1]. In order to successfully establish the contextual bond as well as the structural synchrony between dance motion and the accompanying music, choreographers tend to thoughtfully design dance motion sequences for a given piece of music by utilizing a repertoire of choreographies. Based on this common practice of choreographers, our goal in this study is to build a framework for automatic creation of dance choreographies in synchrony with the accompanying music, as if they were arranged by a choreographer, through learning many-to-many statistical mappings from music to dance. We note that the term choreography generally refers to spatial formation (circle, line, square, couples, etc.), plastic aspects of movement (types of steps, gestures, posture, grasps, etc.), and progression in space (floor patterns), whereas in this study, we use the term choreography in the sense of composition, i.e., the arrangement of the dance motion sequence.

A. Related Work

Music-driven dance animation schemes require, as a first step, structural analysis of the accompanying music signal, which includes beat and tempo tracking, measure analysis, and rhythm and melody detection. There exists extensive research in the literature on structural music analysis. Gao and Lee [2] for instance propose an adaptive learning approach to analyze music tempo and beat based on maximum a posteriori (MAP) estimation. Ellis [3] describes a dynamic programming solution for beat tracking by finding the best-scoring set of beat times that reflect the estimated global tempo of music. An extensive evaluation of audio beat tracking and music tempo extraction algorithms, which were included in MIREX’06, can be found in [4]. There are also some recent studies on the open problem of automatic musical meter detection [5], [6]. In the last decade, chromatic scale features have become popular in musical audio analysis, especially in music information retrieval, since introduced by Fujishima [7]. Lee and Slaney [8] describe a method for automatic chord recognition from audio using hidden Markov models (HMMs) through supervised learning over chroma features. Ellis and Poliner [9] propose a cross-correlation based cover song identification system with chroma features and dynamic programming beat tracking. In a very recent work, Kim et al. [10] calculate the second order statistics to form dynamic chroma feature vectors in modeling harmony structures for classical music opus identification.

Human body motion analysis/synthesis, as a unimodal problem, has also been extensively studied in the literature in many different contexts. Bregler et al. [11] for example describe

a body motion recognition approach that incorporates low-level probabilistic constraints extracted from image sequences of articulated gestures into high-level manifold and HMM-based representations. In order to synthesize data-driven body motion, Arikan and Forsyth [12], and Kovar et al. [13] propose motion graphs representing allowable transitions between poses, to identify a sequence of smoothly transiting motion segments. Li et al. [14] segment body motions into textons, each of which is modeled by a linear dynamical system, in order to synthesize human body motion in a manner statistically similar to the original motion capture data by considering the likelihood of switching from one texton to the next. Brand and Hertzmann [15] study the motion “style” transfer problem, which involves intensive motion feature analysis and learning motion patterns via HMMs from a highly varied set of motion capture sequences. Min et al. [16] present a generative human motion model for synthesis of personalized human motion styles by constructing a multilinear motion model that provides explicit parametrized representation of human motion in terms of “style” and “identity” factors. Ruiz and Vachon [17] specifically work on dance body motion, and perform analysis of dance figures in a chain of simple steps using HMMs to perform automatic recognition of basic movements in contemporary dance.

A parallel track of literature can be found in the domain of speech-driven gesture synthesis and animation. The earliest works in this domain study speaker lip animation [18], [19]. In [18], Bregler et al. use morphing of mouth regions to re-sync the existing footage to a new soundtrack. Chen shows in [19] that lip reading a speaker yields higher speech recognition rates and provides better synchronization of speech with lip movements for more natural lip animations. Later a large body of work extends in the direction of synthesizing facial expressions along with lip movements to create more natural face animations [20]–[22]. Most of these studies adopt variations of hidden Markov models to represent the relationship between speech and facial gestures. The most recent studies in this domain aim at synthesizing not only facial gestures but also head, hand, and other body gestures for creating more realistic speaker animations [23]–[25]. For instance, Sargin et al. develop a framework for joint analysis of prosody and head gestures using parallel branch HMM structures to synthesize prosody-driven head gesture animations [23]. Levine et al. introduce gesture controllers for animating the body language of avatars controlled online with the prosody of the input speech by training a specialized conditional random field [25].

Designing a music-driven automatic dance animation system,

on the other hand, is a relatively more recent problem involving several open research challenges. There is actually little work in the literature on multimodal dance analysis and synthesis, and most of the existing studies focus solely on the aspect of synchronization between a musical piece and the corresponding dance animation. Cardle et al. [26] for instance synchronize motion to music by locally modifying motion parameters using perceptual music cues, whereas Lee and Lee [27] employ dynamic programming to modify the timing of both music and motion via time-scaling the music and time-warping the motion. Synchronization-based methods cannot however (actually do not aim to) generate new dance motion sequences. In this sense, the works in [28]–[31] present more elaborate dance analysis and synthesis schemes, all of which follow basically the same similarity-based framework: They first investigate rhythmical and/or emotional similarities between the audio segments of a given input music signal and the available dance motion segments, and then, based on these similarities, synthesize an optimal motion sequence on a motion transition graph using dynamic programming.

In our earlier work [32], we have addressed the statistical learning problem in a multimodal dance analysis scheme by building a correlation model between music and dance. The correlation model was based upon the confusion matrix of co-occurring motion and music patterns extracted via unsupervised temporal segmentation, and hence, was not complex enough to handle realistic scenarios. Later in [33], we have described an automatic music-driven dance animation scheme based on supervised modeling of music and dance figures. However, the considered dance scenario was very simplistic, where a dance performance was assumed to have only a single dance figure to be synchronized with the musical beat. In this current paper, we propose a complete framework, based on [34], for modeling, analysis, annotation, and synthesis of multimodal dance performances, which can handle complex and realistic scenarios. Specifically, we focus on learning statistical mappings, which are in general many-to-many, between musical measure patterns and dance figure patterns for music-driven dance choreography animation.

B. Contributions

An open challenge in music-driven dance animation is due to the fact that, for most dance categories, such as ballroom and folk dances, the relationship between music and dance primitives usually exhibits a many-to-many mapping pattern. As discussed in the related work section, the previous methods proposed in the literature for music-driven dance animation, whether similarity-based [28]–[30] or synchronization-based [26], [27], do not address this challenge. They are all deterministic methods and do not involve any true dance learning process (other than building motion transition graphs); therefore, they cannot capture the many-to-many relationship existing between music and dance, always producing a single optimal motion sequence given the same input music signal. In this paper we address this open challenge by modeling the many-to-many characteristics via a statistical framework. In this respect, our primary contributions are 1) choreography analysis: automatic learning of many-to-many mapping patterns from a collection of dance performances in a multimodal statistical framework, and 2) choreography synthesis: automatic synthesis of alternative dance choreographies that are coherent to a given music signal, using these many-to-many mapping patterns.

For choreography analysis, we introduce two statistical

models: one capturing a many-to-one and the other capturing a one-to-many mapping from musical primitives to dance primitives. The former model learns different melody patterns associated with each dance primitive. The latter model learns the group of candidate dance primitives that can be replaced with one another without causing an artifact in the choreography. To further consolidate the coherence and the quality of

the synthesized dance choreographies, we introduce a third model to capture the intrinsic dependencies of dance primitives and to preserve the implicit structure existing in the continuum of dance motion. Combining the aforementioned three models, we present a modified Viterbi algorithm to generate coherent and enriched sequences of dance primitives for plausible choreography synthesis.

The organization of the paper is as follows: Section II first gives an overview of our music-driven dance animation system, and then describes briefly the feature extraction modules. We present the proposed multimodal choreography analysis and synthesis framework, hence our primary contribution, in Section III. The problem of character animation for visualization of synthesized dance choreographies is addressed in Section IV. Section V presents the experiments and results, and finally, Section VI gives concluding remarks and discusses possible applications of the proposed framework.

II. SYSTEM OVERVIEW AND FEATURE EXTRACTION

Our music-driven dance animation scheme is musical measure based, hence we regard musical measures as the music primitives. A measure is the smallest compositional unit of music that corresponds to a time segment, which is defined as the number of beats in a given duration. We define a dance figure as the dance motion trajectory corresponding to a single measure segment. Dance figures are taken as the dance primitives.

The overall system, as depicted in Fig. 1, comprises three parts: analysis, synthesis, and animation. Audiovisual data preparation and feature extraction modules are common to both analysis and synthesis parts. An audiovisual dance database can be pictured as a collection of measures and dance figures aligned in two parallel streams: music stream and dance stream. Fig. 2 illustrates a sample music-dance stream extracted from our folk dance database. In the data preparation module, the input music stream is segmented by an expert into its units, i.e., musical measures. We use to denote the measure segment at frame . Measure segment boundaries are then used by the expert to define the motion units, i.e., dance figures. We use to denote the dance figure segment corresponding to measure at frame . The expert also assigns each dance figure a figure label to indicate the type of the dance motion. The collection of forms the set of candidate dance figures, i.e., , where is the number of distinct dance figure labels that exist in the audiovisual dance database. The resulting sequence of dance figure labels is regarded as the original (reference) choreography, i.e., , where and is the number of musical measure segments. The feature extraction modules compute the dance motion features and music chroma features for each and , respectively.

We assume that the relation between music and dance primitives in a dance performance has a many-to-many mapping pattern. That is, a particular dance primitive (dance figure) can be accompanied by different music primitives (measures) in a dance performance. Conversely, a particular musical measure can correspond to different dance figures. Our choreography analysis and synthesis framework respects the many-to-many

Fig. 1. Block diagram of the overall multimodal dance performance analysis-synthesis framework.

Fig. 2. Audiovisual dance database is a collection of dance figure-musical measure pairs. Recall that a measure is the smallest compositional unit of music that corresponds to a time segment, which is defined as the number of beats in a given duration. On the other hand, we define a dance figure as the dance motion trajectory corresponding to a single measure segment. Hence, by definition, the boundaries of the dance figure segments coincide with the boundaries of the musical measure segments, which is in conformity with Reynolds’ work [1].

nature of the relationship between music and dance primitives by learning two separate statistical models: one capturing a many-to-one and the other capturing a one-to-many mapping from musical measures to dance figures. The former model learns different melody patterns associated with each dance figure. For this purpose, music chroma features are used to train a hidden Markov model for each dance figure label to create the set of musical measure models . The latter model, i.e., the model capturing a one-to-many relation from musical measures to dance figures, learns the group of candidate dance figures that can be replaced with one another without causing an artifact in the dance performance (choreography). We call such a model the exchangeable figures model . Music chroma features are used to cluster measure segments according to the harmonic similarity between different measure segments. Based on these measure clusters, we determine the group of dance figures that are accompanied by the musical measures with similar harmonic content. We then create the exchangeable figures model based on such dance figure groups. While the former model is designed to keep the underlying correlations between musical measures and dance figures as intact as possible, the latter model is useful for allowing acceptable (or desirable) variations in the dance choreography by offering various possibilities in the choice of dance figures that reflect the diversity in a dance performance (choreography). To further consolidate the coherence and quality of the synthesized dance choreography, we introduce a third model, i.e., the figure transition model , to capture the intrinsic dependencies of the dance figures and to preserve the implicit structure existing in the continuum of dance motion. The figure transition model

basically appraises the figure-to-figure transition relations by computing n-gram probabilities from the audiovisual dance database. The choreography synthesis makes use of these three models, namely, , , and , to determine the output dance figure sequence (i.e., choreography), taking music chroma features as input, which are extracted from a test music signal. Here, , where and is the number of musical measure segments. Specifically, the choreography synthesis module employs a modified Viterbi decoding on a discrete HMM, which is constructed by the musical measure models, , and the figure transition model, , to determine the sequence of dance figures subject to the exchangeable figures model .

In the analysis part of the animation model, the dance motion features are used to train a hidden Markov model for each dance figure label to construct the set of dance figure models . Eventually, the body posture parameters corresponding to each dance figure in the synthesized choreography are generated using the dance figure models to animate a 3-D character.

A. Music Feature Extraction

Unlike speech, music consists of a sequence of tones whose frequencies are defined. Moreover, musical melody is a rhythmical succession of single tones in different patterns. In this study, we model the melodic pattern in each measure segment with tone-related features using temporal statistical models, i.e., HMMs. We extract tone-related chroma features to characterize the melodic/harmonic content of music. In order to represent the chroma scale, we project the entire spectrum onto 12 bins corresponding to the 12 distinct semi-tones of the musical octave. Theoretically, the frequency of the th note in the th octave is defined as , where the pitch of the C0 note is Hz based on Shepard’s helix model in [35] and , . In this study, we extract the chroma features of 60 semi-tones for (over 5 octaves from the C4 note to the B8 note).

We extract chroma features similar to the well-known mel-frequency cepstral coefficient (MFCC) computation [36] by applying cepstral analysis to the semitone spectral energies. Hence our chroma features capture information of fluctuations of semitone spectral energies. We center the triangular energy windows at the locations of the semi-tone frequencies, , at different octaves for and . Then, we compute the first 12 DCT coefficients of the logarithmic semitone spectral energy vector, that constitute the chromatic scale cepstral coefficient (CSCC) feature set, , for the music frame . We also compute the first and second time derivatives of these 12 CSCC features, using the following regression formula:

(1)

The music feature vector, , is then formed by including the first and second time derivatives:

(2)

Each , therefore, corresponds to the sequence of music feature vectors that fall into the measure segment . Specifically, is a matrix of CSCC features in the form

(3)

where is the number of audio frames in measure segment.
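
For concreteness, the following is a minimal, illustrative sketch of how CSCC-style features and their time derivatives could be computed, assuming the triangular semitone windows, log energies, and 12 DCT coefficients described above; the frame length, hop size, and regression window are assumptions, not the authors' exact settings.

```python
# Hedged sketch: CSCC-like chroma-cepstral features plus delta / delta-delta terms.
import numpy as np
from scipy.fftpack import dct

def semitone_freqs(first_midi=60, n_semitones=60):
    # MIDI note 60 is C4; 60 consecutive semitones cover C4..B8
    midi = np.arange(first_midi, first_midi + n_semitones)
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)

def cscc_features(audio, sr, frame_len=2048, hop=512, n_coef=12):
    """Triangular semitone energy windows, log energies, first 12 DCT coefficients per frame."""
    freqs = semitone_freqs()
    fft_freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    lower = freqs * 2.0 ** (-1.0 / 12.0)      # windows reach the neighboring semitones
    upper = freqs * 2.0 ** (1.0 / 12.0)
    rise = (fft_freqs[None, :] - lower[:, None]) / (freqs - lower)[:, None]
    fall = (upper[:, None] - fft_freqs[None, :]) / (upper - freqs)[:, None]
    filt = np.maximum(0.0, np.minimum(rise, fall))        # (60, n_bins) triangular bank
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        spec = np.abs(np.fft.rfft(audio[start:start + frame_len] * window)) ** 2
        log_e = np.log(filt @ spec + 1e-10)               # 60 log semitone energies
        feats.append(dct(log_e, norm="ortho")[:n_coef])   # first 12 CSCCs
    return np.asarray(feats)

def deltas(c, width=2):
    """Regression-based time derivatives (assumed form of the regression in Eq. (1))."""
    pad = np.pad(c, ((width, width), (0, 0)), mode="edge")
    num = sum(t * (pad[width + t:width + t + len(c)] - pad[width - t:width - t + len(c)])
              for t in range(1, width + 1))
    return num / (2.0 * sum(t * t for t in range(1, width + 1)))

def music_feature_vectors(audio, sr):
    c = cscc_features(audio, sr)
    return np.hstack([c, deltas(c), deltas(deltas(c))])   # 36-dimensional vector per frame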

B. Motion Feature Extraction

We acquire multiview recordings of a dancing actor for each dance figure in the audiovisual dance database using 8 synchronized cameras. We then employ a motion capture technique for tracking the 3-D positions of the joints of the body based on the markers’ 2-D projections on each camera’s image plane, using the color information of the markers [33]. The resulting set of 3-D points is used to fit a skeleton structure to the 3-D motion capture data. Through this skeleton structure, we solve the necessary inverse kinematics equations and calculate accurately the set of Euler angles for each joint in its local frame as well as the global translation and rotation of the skeleton structure for the motion trajectory defined by the input 3-D motion capture data. We prefer joint angles as our dance motion features due to their widespread usage in the human body motion analysis-synthesis and 3-D character animation literature. We compute 66 angular values associated with 27 key joints of the body as well as 6 values for the global rotation and translation of the body, which leads to a dance motion feature vector of dimension 72 for each dance motion frame . However, angular features are generally discontinuous at boundary values due to their 2π-periodic nature, and this situation causes a problem in training statistical models to capture the temporal dynamics of a sequence of angular features. Therefore, instead of using the static set of Euler angles , we use their first and second differences computed with the following difference equation:

(4)

where the resulting discontinuities are eliminated by the following conditional update:

$\Delta\theta_t \leftarrow \begin{cases} \Delta\theta_t - 2\pi, & \text{if } \Delta\theta_t > \pi \\ \Delta\theta_t + 2\pi, & \text{if } \Delta\theta_t < -\pi \\ \Delta\theta_t, & \text{otherwise} \end{cases}$ (5)

Then the 144-dimensional dynamic motion feature vector is formed as

(6)

Hence, each is a sequence of motion feature vectors that fall into dance figure segment while training temporal models of motion trajectories associated with each dance figure label . That is, is a matrix of body motion feature values in the form

(7)

where is the number of dance motion frames within dance motion segment . We also calculate the mean trajectory for each dance figure label , namely , by calculating for each

motion feature an average value over all instances (realizations) of the dance figures labeled as . These mean trajectories ( ) are required later in choreography animation since each dance figure model captures only the temporal dynamics of the first and second differences of the Euler angles of the key joints associated with the dance figure label .
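
As an illustration of the dynamic motion features described above, the sketch below computes wrapped first and second differences of per-frame Euler angles and a simple mean trajectory per figure; the 72-dimensional static layout follows the text, while the resampling scheme and the wrapping of the second differences are assumptions.

```python
# Hedged sketch of the dynamic motion features and mean trajectories described above.
import numpy as np

def wrapped_diff(angles):
    """First difference of angle trajectories (frames x dims), wrapped into (-pi, pi]."""
    d = np.diff(angles, axis=0, prepend=angles[:1])
    d = np.where(d > np.pi, d - 2.0 * np.pi, d)
    d = np.where(d < -np.pi, d + 2.0 * np.pi, d)
    return d

def dynamic_motion_features(euler_angles):
    """euler_angles: (n_frames, 72) static posture parameters (66 joint angles + 6 global
    DOFs, as in the text) -> (n_frames, 144) first- and second-difference features."""
    d1 = wrapped_diff(euler_angles)
    d2 = wrapped_diff(d1)          # wrapping the second differences too is an assumption
    return np.hstack([d1, d2])

def mean_trajectory(realizations, n_frames=60):
    """Average several realizations of one dance figure after resampling them to a
    common length; this plays the role of the mean trajectory used in animation."""
    resampled = []
    for r in realizations:                                 # r: (len_i, 72)
        t_src = np.linspace(0.0, 1.0, len(r))
        t_dst = np.linspace(0.0, 1.0, n_frames)
        cols = [np.interp(t_dst, t_src, r[:, k]) for k in range(r.shape[1])]
        resampled.append(np.stack(cols, axis=1))
    return np.mean(resampled, axis=0)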

III. CHOREOGRAPHY MODELING

In this section we present the proposed choreography analysis-synthesis framework. We first describe our choreography analysis procedure which involves statistical modeling of music-to-choreography mapping. We then explain how this statistical modeling is used for choreography synthesis.

A. Multimodal Choreography Analysis

We perform choreography analysis through three statistical models, which together define a many-to-many mapping from musical measures to dance figures. These choreography models are: 1) musical measure models , which capture many-to-one mappings from musical measures to dance figures using HMMs; 2) exchangeable figures model , which captures one-to-many mappings from musical measures to dance figures, and hence, represents the subjective nature of the dance choreography with possibilities in the choice of dance figures and in their organization; and 3) figure transition model , which captures the intrinsic dependencies of dance figures. We note that these three choreography models constitute the “choreography analysis” block in Fig. 1.

1) Musical Measure Models ( ): In a dance performance, musical measures that correspond to the same dance figure may exhibit variations and are usually a collection of different melodic patterns. That is, different melodic patterns can accompany the same dance figure, displaying a many-to-one mapping relation from musical measures to dance figures. We capture this many-to-one mapping by employing HMMs to identify and model the melodic patterns corresponding to each dance figure. Specifically, we train an HMM over the collection of measures co-occurring with the dance figure , using the musical measure CSCC features, . Hence, we train an HMM for each dance figure in the dance performance. We define left-to-right HMM structures with for , where is the transition probability from state to state . The transitions from state to account for the differences in measure durations. Emission distributions of the chroma-based music features are modeled by Gaussian mixture density functions with diagonal covariance matrices in each state of . The use of Gaussian mixture density in enables us to capture different melodic patterns that correspond to a particular dance figure. We denote the collection of musical measure models as , i.e., . Musical measure models provide a tool to capture the many-to-one part of the many-to-many musical measure to dance figure mapping problem.
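
A minimal sketch of how such per-figure measure models could be trained is given below, using the hmmlearn library as an assumed tool: one left-to-right GMM-HMM with diagonal covariances per dance figure label, fit on the CSCC matrices of the measures co-occurring with that figure. The state and mixture counts are illustrative, not the authors' settings.

```python
# Hedged sketch: one left-to-right GMM-HMM per dance figure label.
import numpy as np
from hmmlearn.hmm import GMMHMM

def left_to_right(n_states):
    start = np.zeros(n_states); start[0] = 1.0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5                              # self-transitions absorb duration differences
        trans[i, min(i + 1, n_states - 1)] += 0.5      # forward transition (last state self-loops)
    return start, trans

def train_measure_models(measures_per_figure, n_states=5, n_mix=3):
    """measures_per_figure: {figure_label: [CSCC matrix (n_frames x 36), ...]}"""
    models = {}
    for label, measures in measures_per_figure.items():
        X = np.vstack(measures)
        lengths = [len(m) for m in measures]
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag",
                     n_iter=50, init_params="mcw")     # keep our start/transition structure
        hmm.startprob_, hmm.transmat_ = left_to_right(n_states)
        hmm.fit(X, lengths)
        models[label] = hmm
    return models

def measure_log_likelihoods(models, measure_features):
    """Acoustic score of one measure segment under every figure's measure model."""
    return {label: m.score(measure_features) for label, m in models.items()}
```

These per-figure scores are what the synthesis stage later uses as emission probabilities on the discrete choreography HMM.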

2) Exchangeable Figures Model ( ): In a dance performance, it is possible that several distinct dance figures can be performed equally well along with a particular musical measure pattern, exhibiting a one-to-many mapping relation from musical measures to dance figures [1]. To represent this one-to-many mapping relation, we introduce the notion of exchangeable figure groups, each containing a collection of dance figures that can be replaced with one another without causing an artifact in a dance performance. To learn exchangeable figure groups, we cluster the measure segments in each musical piece with respect to their melodic similarities. The melodic similarity between two different measure segments and is computed as the local match score obtained from dynamic time warping (DTW) [37] of the chroma-based feature matrices and , corresponding to and , respectively, in the musical piece . Then, based on the melodic similarity scores between pairs of musical measure segments in , we form an affinity matrix , where if , and . Finally, we apply the spectral clustering algorithm described in [38] over to cluster the measure segments in . The spectral clustering algorithm in [38] assumes that the number of clusters is known a priori and employs the k-means clustering algorithm [39]. Since we do not know the number of clusters a priori, we measure the “quality” of the partition in the resulting clusters using the internal indexes, silhouettes [40], to determine the appropriate number of clusters. The silhouette value for each point is a measure of how similar that point is to the points in its own cluster compared to the points in the other clusters, and ranges from −1 to 1. Averaging over all the silhouette values, we compute the overall quality of the clustering for a range of cluster numbers and pick the one that results in the highest silhouette value.

We perform separate clustering for each musical piece in order to increase the accuracy of musical measure clustering since similar measure patterns are likely to occur in the same musical piece rather than spread among different musical pieces. Once we obtain clusters of measures in all musical pieces, we can then use all of the measure clusters in all musical pieces to determine the exchangeable figures group for each dance figure by collecting the dance figure labels that co-appear with in any of the resulting clusters. Note that a particular dance figure can appear in more than one musical piece (see also Fig. 5). Based on the exchangeable figure groups, we define the exchangeable figures model as an indicator

random variable:

$E(f_i, f_j) = \begin{cases} 1, & \text{if } f_j \text{ is in the exchangeable figure group of } f_i \\ 0, & \text{otherwise} \end{cases}$ (8)

where is the exchangeable figure group associated with the dance figure . The collection of for all dance figure labels in gives us the exchangeable figures model .

The notion of exchangeable figures is the key to reflect the

subjective nature of the dance choreography with possibilities in the choice of dance figures and their organization throughout the choreography estimation process. The use of the exchangeable figures model allows us to create a different artistic dance performance content each time we estimate a dance choreography.
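
The clustering step behind the exchangeable figures model could look roughly like the sketch below: DTW distances between the CSCC matrices of measure pairs within one musical piece, an affinity matrix, spectral clustering with the cluster count picked by the mean silhouette value, and figure labels grouped by cluster co-occurrence. The DTW cost, the affinity scaling, and the scikit-learn calls are assumptions for illustration.

```python
# Hedged sketch of per-piece measure clustering and exchangeable figure groups.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def dtw_distance(A, B):
    """Plain DTW over frame-wise Euclidean cost between two feature matrices."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf); D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def cluster_measures(measures, k_range=range(2, 8)):
    """measures: list of CSCC feature matrices from one musical piece."""
    n = len(measures)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(measures[i], measures[j])
    affinity = np.exp(-dist / (dist.std() + 1e-9))      # similarity from DTW distances
    best_labels, best_score = None, -1.0
    for k in k_range:
        if k >= n:
            break
        labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                    random_state=0).fit_predict(affinity)
        score = silhouette_score(dist, labels, metric="precomputed")
        if score > best_score:
            best_labels, best_score = labels, score      # keep the best silhouette value
    return best_labels

def exchangeable_groups(cluster_labels, figure_labels):
    """Figures whose measures fall in the same cluster are treated as mutually exchangeable."""
    groups = {}
    for c in set(cluster_labels):
        figs = {f for cl, f in zip(cluster_labels, figure_labels) if cl == c}
        for f in figs:
            groups.setdefault(f, set()).update(figs)
    return groups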

3) Figure Transition Model ( ): The figure transition model is built to capture the intrinsic dependencies of the dance figure sequences within the context of dance choreographies. The intrinsic dependencies of the choreography are defined with figure-to-figure transition probabilities. The figure-to-figure transition probability density functions

are modeled in n-gram language models, where the probability of the dance figure at given the dance figure sequence at , , , , i.e., , defines the n-gram dance language model. This model provides a number of rules that specify the structure of a dance choreography. For instance, a dance figure that never appears after a particular sequence of dance figures in the training video does not appear in the synthesized choreography either. We can also enforce a dance figure to always follow a particular sequence of dance figures if it is also the case in the training video with the help of the n-gram dance language model.
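
In practice, such a transition model can be estimated by counting consecutive figure labels in the training choreographies, as in the sketch below; the add-one smoothing is an illustrative choice, since the text only states that n-gram probabilities are computed from the database.

```python
# Hedged sketch: bigram figure-to-figure transition probabilities from label sequences.
from collections import defaultdict

def bigram_transition_model(choreographies, figure_labels):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in choreographies:                        # each seq is a list of figure labels
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in figure_labels:
        total = sum(counts[prev].values()) + len(figure_labels)   # add-one smoothing
        model[prev] = {nxt: (counts[prev][nxt] + 1) / total for nxt in figure_labels}
    return model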

B. Multimodal Choreography Synthesis

We formulate the choreography synthesis problem as estimating a dance figure sequence from a sequence of musical measures. The core of the choreography synthesis is defined as a Viterbi decoding process on a discrete HMM, which is constructed by the musical measure models, , and the figure transition model, . Furthermore, the exchangeable figures model, , is utilized to introduce acceptable variations into the Viterbi decoding process to enrich the synthesized choreography. Based on this Viterbi decoding process, we define three separate choreography synthesis scenarios: 1) single best path, 2) likely path, and 3) exchangeable path, which are explained in detail in the following subsections. We note that all these choreography synthesis scenarios are multimodal, that is, they depend on the joint statistical models of music and choreography, and they have different refinements and contributions to enrich the choreography synthesis.

Besides these three synthesis scenarios, we investigate two

other reference (baseline) choreography synthesis scenarios: one using only the musical measure models to map each measure segment in the test musical piece to a dance figure label (which we refer to as acoustic-only choreography), and another one using only the figure transition model (which we refer to as figure-only choreography). The acoustic-only choreography corresponds to a synthesis scenario in which only the correlations between musical measures and dance figure labels are taken into account, but the correlations between consecutive figures are ignored. In contrast to the acoustic-only choreography, the figure-only choreography scenario predicts the dance figure for the next measure segment only according to figure-to-figure transition probabilities, which are modeled as bigram probabilities of , by discarding the correlations between musical measures and dance figures. Note that the figure-only synthesis can be regarded as a synchronization-based technique discussed in Section I-A, such as the ones proposed in [26] and [27]. In figure-only synthesis, the dance figure sequence is generated randomly by only respecting figure-to-figure transition probabilities to ensure visual continuity of the resulting character animation. The dance figure sequences resulting from these two baseline scenarios constitute reference choreographies that help us comparatively assess the contributions of our choreography analysis-synthesis framework.

1) Single Best Path Synthesis: We construct a discrete HMM, , using the musical measure models, , and the figure transition model, . In the figure transition model , the figure-to-

Fig. 3. Lattice structure of the discrete HMM .

figure transition probability distributions are computed with bigram models. This choice of model is due to the scale of the choreography database that we use in training, and a much larger database can be used to model higher order n-gram dance language models. The discrete HMM, , is defined with the following parameters:
• is the number of time frames (measure segments). For each time frame (measure), the choreography synthesis process outputs exactly one dance figure label. Recall that we denote the individual dance figures as and individual measures as for .

• is the number of distinct dance figure labels, i.e., , where . Dance figure labels are the outputs of the process being modeled.

• is the dance figure transition probability distribution with elements defined as

(9)

where the elements, , are the bigram probabilities from the figure transition model and they satisfy .

• is the dance figure emission distribution for measure . Elements of are defined using the musical measure models as

(10)

• is the initial dance figure distribution, where

(11)

The discrete HMM constructs a lattice structure, say , as given in Fig. 3. The proposed choreography synthesis can be formulated as finding a path through the lattice . Assuming a uniform initial figure distribution , the single best path synthesis scenario decodes the Viterbi path along the lattice to estimate the synthesized figure sequence . The Viterbi algorithm for finding the single best path synthesis can be summarized as follows:
1) Initialization:

(12)

2) Recursion: For

(13)

(14)

3) Termination:

(15)

4) Path (dance figure sequence) backtracking:

(16)

Here represents the partial likelihood score of performing the dance figure at frame , and is used to keep track of the best path retrieving the dance figure sequence. The path

decodes the resulting dance figure label sequence as the desired output choreography. Note that the resulting dance choreography is unique for the single best path synthesis scenario since it is the Viterbi path along the lattice .
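
A compact sketch of this single-best-path decoding is given below, working in the log domain; the transition matrix holds the bigram figure-to-figure probabilities, the emission matrix holds the per-measure acoustic log-likelihoods from the measure models, and the uniform initial distribution follows the text. The log-domain formulation and variable names are implementation choices, not the paper's notation.

```python
# Hedged sketch: Viterbi decoding of the single best dance figure sequence.
import numpy as np

def viterbi_best_path(log_A, log_B):
    """log_A: (K, K) figure transition log-probs; log_B: (T, K) measure log-likelihoods."""
    T, K = log_B.shape
    delta = np.full((T, K), -np.inf)
    psi = np.zeros((T, K), dtype=int)
    delta[0] = -np.log(K) + log_B[0]                 # uniform initial figure distribution
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # (K, K): from figure i to figure j
        psi[t] = np.argmax(scores, axis=0)           # best predecessor per figure
        delta[t] = scores[psi[t], np.arange(K)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))             # termination
    for t in range(T - 2, -1, -1):                   # backtracking
        path[t] = psi[t + 1][path[t + 1]]
    return path, float(np.max(delta[-1]))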

2) Likely Path Synthesis: In the second synthesis scenario, we find a likely path along in which we follow one of the likely partial paths in lieu of following the partial path that has the highest partial likelihood score at each time frame. The likely path synthesis is expected to create variations along the single best path synthesis, which results in an enriched set of synthesized choreography sequences with high likelihood scores. We modify the recursion step of the Viterbi algorithm to define the likely path synthesis:

Recursion: For

(17)

(18)

(19)

where and are the figure indices with the top two partial path scores and returns randomly one of the arguments with uniform distribution. Note that corresponds to in (14). The likely path scenario is expected to synthesize different dance choreographies since it propagates randomly among the top two ranking transitions at each time frame. This intricately introduces variation into the choreography synthesis process.
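
The sketch below illustrates one way this modified recursion could be implemented: at every frame, each figure keeps, with equal probability, one of its two best-scoring predecessors instead of always the single best. Applying the top-two choice per target figure is an interpretation of the description above, not necessarily the authors' exact formulation.

```python
# Hedged sketch: likely-path variant of the Viterbi recursion.
import numpy as np

def viterbi_likely_path(log_A, log_B, rng=None):
    rng = rng or np.random.default_rng()
    T, K = log_B.shape
    delta = np.full((T, K), -np.inf)
    psi = np.zeros((T, K), dtype=int)
    delta[0] = -np.log(K) + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A
        top2 = np.argsort(scores, axis=0)[-2:]           # two best predecessors per figure
        psi[t] = top2[rng.integers(0, 2, size=K), np.arange(K)]
        delta[t] = scores[psi[t], np.arange(K)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path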

3) Exchangeable Path Synthesis: In this scenario, we find an exchangeable path by letting the exchangeable figures model replace and update the single best path. Unlike the likely path synthesis, the exchangeable path scenario introduces random variations to the single best path that respect one-to-many mappings from musical measures to dance figures as defined in . The exchangeable path synthesis is implemented with the following procedure:

1) Compute the single best path synthesis for a given musical measure sequence and set the measure segment index .

2) The figure at measure segment is replaced with another figure from its exchangeable figure group

(20)

where returns randomly one of the arguments according to the distribution of acoustic scores

of the dance figures .

3) The rest of the figure sequence, , is updated by determining a new single best path using the Viterbi algorithm.

4) The steps 2) and 3) are repeated for measure segments.

The exchangeable path synthesis yields an alternative path by modifying the single best path in the context of the exchangeable figures model. Its key difference from the likely path is that the collection of the candidate dance figures that can replace a particular dance figure in the choreography, say , is constrained with the dance figures for which the exchangeable figures model yields 1. Hence, it is expected to introduce more acceptable variations into the synthesized choreography than the likely path.
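
The four-step procedure above could be realized roughly as follows: decode the single best path, then walk through the measures, resample each figure from its exchangeable group in proportion to its acoustic score, and re-decode the remaining frames with the new figure fixed. The helper Viterbi routine and the score-weighted sampling are assumptions used to keep the sketch self-contained.

```python
# Hedged sketch: exchangeable-path synthesis on top of single-best-path decoding.
import numpy as np

def _viterbi(log_pi, log_A, log_B):
    T, K = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(K)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path

def exchangeable_path(log_A, log_B, groups, rng=None):
    """groups[j]: figure indices exchangeable with figure j (the exchangeable figures model)."""
    rng = rng or np.random.default_rng()
    T, K = log_B.shape
    path = _viterbi(np.full(K, -np.log(K)), log_A, log_B)     # step 1: single best path
    for t in range(T):                                        # steps 2-4: replace and re-decode
        candidates = list(groups.get(path[t], [path[t]]))
        w = np.exp(log_B[t, candidates] - np.max(log_B[t, candidates]))
        path[t] = rng.choice(candidates, p=w / w.sum())       # acoustic-score-weighted choice
        if t + 1 < T:
            # decode the remaining frames with the new figure at t as the fixed predecessor
            path[t + 1:] = _viterbi(log_A[path[t]], log_A, log_B[t + 1:])
    return path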

IV. ANIMATION MODELING

In this section, we address character animation of dance figures to visualize and evaluate the proposed choreography analysis and synthesis framework. First we define a dance figure model, which captures the variations in the way particular dance figures are performed, by modeling the motion trajectory of a dance figure via an HMM. Then we define a character animation system, which generates a sequence of dance motion features from a given sequence of dance figures, so as to animate a 3-D character model.

A. Dance Figure Models ( )

The way a dancer performs a particular dance figure may exhibit variations in time in a dance performance. Therefore, it is important to model the temporal statistics of each dance figure to capture the variations in the dance performance. Note that these models will also capture the personalized dance figure patterns of a dancer. We use the set of motion features to train an HMM, , for each dance figure label to capture the dynamic behavior of the dancing body. Since a dance figure typically contains a well-defined sequence of body movements, we employ a left-to-right HMM structure (i.e., for , where is the transition probability from state to state in ) to model each dance figure. Emission distributions of motion parameters are modeled by a Gaussian density function with full covariance matrix in each state of . We denote the collection of dance figure models as , i.e., .

B. Character Animation

The synthesized choreography (i.e., ) specifies the label sequence of dance figures to be performed with each measure segment whose duration is known beforehand in the proposed framework. The body posture parameters corresponding to each dance figure in the synthesized choreography are then generated such that they fit to the statistical dance figure models .

model for the dance figure , we first determine the numberof dance motion frames required for the given segment dura-tion. Next, we distribute the required number of motion framesamong the states of the dance figure model according to

the expected state occupancy duration:

(21)

Fig. 4. Plots compare a synthesized trajectory with two sample trajectories, as well as with the mean trajectory, for different motion features from different dance figures in the database. The expected state durations, all associated with the HMM structure trained for the same dance figures, are also displayed in the plot with horizontal solid lines. The corresponding angular values of each horizontal solid line are the means of the Gaussian distributions associated with the particular motion feature in each state of the HMM structure trained for the given dance figure.

where is the expected duration in state , is the self-state-transition probability for state (assuming ), and is the number of states in .

In order to avoid generation of noisy parameters, we first increase the time resolution of the dance motion by oversampling the dance motion model. That is, we generate parameters for a multiple of , say , where is an integer scale factor. Then, we generate the body motion parameters along the states of according to the distribution of motion frames to these states, using the corresponding Gaussian distribution at each state. To reverse the effect of oversampling, we perform a downsampling by that eventually yields smoother state transitions, and hence, more realistic parameter generation that avoids motion jerkiness.
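
The generation step could be sketched as below: the frames required for a segment are split among the HMM states in proportion to their expected occupancies, posture differences are sampled from each state's Gaussian at an oversampled rate, decimated back, and combined with the figure's mean trajectory. The oversampling factor, the use of 1/(1 - a_ii) as the expected state duration, and the simple decimation are assumptions consistent with the description, not the authors' exact procedure.

```python
# Hedged sketch: posture parameter generation from a dance figure HMM.
import numpy as np

def generate_figure_motion(means, covs, self_trans, n_frames, mean_traj, scale=4, rng=None):
    """means/covs: per-state Gaussian parameters of the figure HMM over the difference
    features; self_trans: per-state self-transition probabilities a_ii;
    mean_traj: (n_frames, n_angles) mean angle trajectory of the figure."""
    rng = rng or np.random.default_rng()
    expected = 1.0 / (1.0 - np.clip(self_trans, 0.0, 0.999))        # expected state durations
    per_state = np.round(expected / expected.sum() * n_frames * scale).astype(int)
    sampled = [rng.multivariate_normal(m, c, size=n)
               for m, c, n in zip(means, covs, per_state) if n > 0]
    diffs = np.vstack(sampled)[::scale]                             # downsample for smoothness
    if len(diffs) < n_frames:                                       # pad if rounding fell short
        diffs = np.vstack([diffs, np.repeat(diffs[-1:], n_frames - len(diffs), axis=0)])
    diffs = diffs[:n_frames]
    first_diffs = diffs[:, :mean_traj.shape[1]]                     # keep the first-difference block
    return mean_traj + first_diffs    # per the text: sum generated differences with the mean trajectory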

The dance figure models are trained over the first and second differences of the Euler angles of the joints, which are defined in Section II-B. Therefore, to obtain the final set of body posture parameters for a dance figure , we simply need to sum the generated first differences with the mean trajectory associated with , i.e., . Each plot in Fig. 4 depicts a synthesized trajectory against two sample trajectories for one of the motion features from a dance figure in the database along with the mean trajectory associated with the same motion feature from the same dance figure. The lengths of the horizontal solid lines represent the expected state durations (in terms of the

Fig. 5. Distribution of dance figures to musical pieces is visualized using an image plot. Columns of the plot represent the dance figures ( ) whereas the rows represent the musical pieces ( ) in the database. Note that and are dropped in the figure for clarity of the representation. Consequently, each cell in the plot indicates how many times a particular dance figure is performed with the corresponding musical piece. Cells with different gray values in a column imply that the same dance figure can be accompanied with different musical pieces. Similarly, cells with different gray values in a row imply that different dance figures can be performed with the same musical piece.

number of frames), and the corresponding angular values represent the means of the Gaussian distributions, associated with the particular motion feature in each state of the HMM structure trained for the particular dance figure. In these plots, the two sample trajectories exemplify the temporal variations between different realizations of the same dance figure. Deriving from the trained dance figure HMMs, the synthesized dance figure trajectories mimic the underlying temporal dynamics of a given dance figure. Hence, modeling the temporal variations using HMMs for each dance figure allows us to synthesize more realistic and personalized dance motion trajectories.

After repeating the described procedure for each dance figure

in the synthesized choreography, the body posture parameters at the dance figure boundaries are smoothed via cubic interpolation within a -neighborhood of each dance figure boundary in order to generate smoother figure-to-figure transitions.

We note that the use of HMMs for dance figure synthesis provides us with the ability of introducing random variations in the synthesized body motion patterns for each dance figure. These variations make the synthesis results look more natural due to the fact that humans perform slightly varying dance figures at different times for the same dance performance.

V. EXPERIMENTS AND RESULTS

We investigate the effectiveness of our choreography analysis and synthesis framework using the Turkish folk dance, Kasik.1

The Kasik database consists of 20 dance performances with 20 different musical pieces with a total duration of 36 min. There are 31 different dance figures (i.e., ) and a total of 1258 musical measure segments (i.e., ). Fig. 5 shows the distribution of dance figures to different musical pieces where each column represents a dance figure label and each row represents a musical piece . Hence, entries with different colors in a column indicate that the same figure can be performed with

1Kasik means spoon in English. The dance is named so, since the dancers clap spoons while dancing.

Fig. 6. Matrix plot demonstrates the assessment levels associated with any pair of dance figures in the database. Assessment levels are indicated with different colors. The pairs of dance figures that fall into assessment level in a row correspond to the group of exchangeable dance figures for that particular dance figure. For instance, by looking at the first row, one can say that the dance figure is exchangeable with the dance figures , , , , and . In other words, { , , , , } is the group of exchangeable figures for , i.e., .

different melodic patterns whereas entries with different colors in a row indicate that different dance figures can be performed with the same melodic pattern. Therefore, Fig. 5 can be seen as a means to provide evidence for our basic assumption that there is a many-to-many relationship between dance figures and musical measures.

We follow a 5-fold cross-validation procedure in the experimental evaluations. We train musical measure models with four-fifths of the musical audio data in the analysis part and use these musical measure models in the process of choreography estimation for the remaining one-fifth of the musical audio data in the synthesis part. We repeat this procedure five times, each time using different parts of the musical audio data for training and testing. This way, we synthesize a new dance choreography for the entire musical audio data.

A. Objective Evaluation Results

We define the following four assessment levels to evaluate each dance figure label in the synthesized figure sequence , compared to the respective figure label in the original dance choreography , assigned by the expert:
• (Exact-match): is marked as if matches .
• (X-match): is marked as if does not match , but it is in ’s exchangeable figure group ; i.e., .
• (Song-match): is marked as if neither matches nor is in ; but, and are performed within the same musical piece; i.e., .
• (No-match): is marked as if it is not marked as one of through .

Fig. 6 displays all assessment levels associated with any possible pairing of dance figures in a single matrix. Note that the matrix in Fig. 6 defines a distance metric on dance figure pairs by

TABLE I
AVERAGE PENALTY SCORES (APS) OF VARIOUS CHOREOGRAPHY SYNTHESIS SCENARIOS

Fig. 7. Percentage of figures that fall into each assessment level for the proposed five different synthesis scenarios.

mapping the four assessment levels through into penalty scores from 0 to 3, respectively. Hence in this distance metric, low penalty scores indicate desirable choreography synthesis results. Average penalty scores are reported to measure the “goodness” (coherence) of the resulting dance choreography.
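
The scoring itself reduces to a small amount of bookkeeping, sketched below under the assumption that the exchangeable groups and the figure-to-musical-piece assignments are available as plain dictionaries.

```python
# Hedged sketch: assessment levels and the average penalty score (APS).
def assessment_level(synth, ref, exchangeable, pieces_of):
    """exchangeable[f]: figures exchangeable with f; pieces_of[f]: musical pieces f appears in."""
    if synth == ref:
        return 0                                               # exact-match
    if synth in exchangeable.get(ref, ()):
        return 1                                               # X-match
    if pieces_of.get(synth, set()) & pieces_of.get(ref, set()):
        return 2                                               # song-match: co-occur in some musical piece
    return 3                                                   # no-match

def average_penalty_score(synth_seq, ref_seq, exchangeable, pieces_of):
    levels = [assessment_level(s, r, exchangeable, pieces_of) for s, r in zip(synth_seq, ref_seq)]
    return sum(levels) / len(levels)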

Recall that we propose three alternative choreography synthesis scenarios together with the two other reference choreography synthesis techniques, as discussed in Section III-B. The average penalty scores of these five choreography synthesis scenarios are given in Table I. Furthermore, the distribution of the number of figures that fall into each assessment level for all synthesis scenarios is given in Fig. 7. The average penalty scores for the reference acoustic-only and figure-only choreography synthesis scenarios are 0.82 and 2.07, respectively. The high average penalty score of the figure-only choreography is mainly due to the randomness in this synthesis technique. Hence, the unimodal learning phase of the figure-only choreography synthesis, which takes into account only the figure-to-figure transition probabilities, does not carry sufficient information to automatically generate dance choreographies which are similar to the training Kasik database. On the other hand, the acoustic-only choreography, which employs the musical measure models for synthesis, attains a lower average penalty score. We also note that the average dance figure recognition rate using the musical measure models through five-fold cross validation is obtained as 49.05%. This indicates that the musical measure models learn the correlation between measures and figures. However, the acoustic-only choreography synthesis that depends only on the correlation of measures and figures fails to sustain motion continuity at figure transition boundaries. This creates visually unacceptable character animation for the acoustic-only choreography.


The proposed single best path synthesis scenario refines the acoustic-only choreography with the inclusion of the figure transition model, which defines a bigram model for figure-to-figure transitions. The average penalty score of the single best path synthesis scenario is 0.56, which is the smallest average penalty score among all scenarios. This is expected, since the single best path synthesis generates the optimal Viterbi path along the multimodal lattice structure. The likely path synthesis introduces variation into the single best path synthesis. The average penalty score of the likely path synthesis increases to 0.91. This increase is an indication of the variation introduced to the optimal Viterbi path. The exchangeable path synthesis refines the likely path synthesis by introducing more acceptable random variations to the single best path synthesis. Recall that the random variations of the exchangeable path synthesis depend on the exchangeable figures model. The average penalty score of the exchangeable path synthesis is 0.63, which is an improvement compared to the likely path synthesis.

In Fig. 7, we observe that, among all the assessment levels, the X-match and Song-match levels are indicators of the diversity of alternative dance figure choreographies rather than error indicators, whereas No-match figures indicate an error in the dance choreography synthesis process. In this context, we observe that only 31% of the figure-only choreography and only 83% of the acoustic-only choreography fall into the first three assessment levels. On the other hand, using the mapping obtained by our framework increases this ratio to 94% and 92% for the single best path and the exchangeable path synthesis scenarios, respectively. The percentage drops to 80% for the likely path synthesis scenario, yet it is still a high percentage of the entire dance sequence.

B. Subjective Evaluation Results

We performed a subjective A/B comparison test using the music-driven dance animations to measure the opinions of the audience on the coherence of the synthesized dance choreographies with the accompanying music. During the test, the subjects were asked to indicate their preference for each given A/B test pair of synthesized dance animation segments on a scale of (−2; −1; 0; 1; 2), where the scale values correspond to strongly prefer A, prefer A, no preference, prefer B, and strongly prefer B, respectively. We compared dance animation segments from five different choreographies, namely, the original, single best path, likely path, exchangeable path, and figure-only choreographies. We therefore had ten possible pairings of different dance choreographies, e.g., original versus single best path, or likely path versus figure-only. For each possible pair of choreographies, we used short audio segments from three different musical pieces from the Kasik database to synthesize three A/B pairs of dance animation video clips for the respective dance choreographies. This yielded a total of 30 A/B pairs of dance animation segments. We also included one A/B pair of dance animation segments pairing each choreography with itself, i.e., original versus original, etc., in order to test whether the subjects were careful enough to show almost no preference over the five such possible self-pairings of the aforementioned choreographies. As a result, we extracted 35 short segments from the audiovisual database, where each segment was approximately 15 s long. We picked at most two non-overlapping segments from each musical piece in order to cover the audiovisual database fully in the subjective A/B comparison test.

TABLE II
SUBJECTIVE A/B PAIR COMPARISON TEST RESULTS

The subjective tests were performed with 18 subjects. The average preference scores for all comparison sets are presented in Table II. Note that the rows and the columns of Table II correspond to A and B of the A/B pairs, respectively. Also, the average preference scores that tend to favor B are given in bold to ease visual inspection. The first observation is that the animations for the original choreography and for the choreographies resulting from the proposed three synthesis scenarios (i.e., the single best path, likely path, and exchangeable path choreographies) are preferred over the animations for the figure-only choreography. We also note that the likely path and exchangeable path choreography animations are strongly preferred over the figure-only choreography animation. This observation is evidence that the audience is generally appealed by variations in the dance choreography as long as the overall choreography is coherent with the accompanying music. Hence, the likely path and the exchangeable path synthesis emerge as the most preferable scenarios in the subjective tests, and they manage to create alternative choreographies that are coherent and appealing to the audience.
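For reference, the average preference scores in Table II amount to a simple per-pairing mean of the subjects' ratings. A minimal sketch of this bookkeeping follows; it is our own illustration and the data layout (tuples of choreography labels and a rating) is assumed.

```python
from collections import defaultdict

def average_preference_scores(responses):
    """Average A/B preference ratings per choreography pairing.

    responses: iterable of (choreo_a, choreo_b, rating) tuples, where rating is
    on the scale (-2, -1, 0, 1, 2) = strongly prefer A ... strongly prefer B.
    Returns a dict mapping (choreo_a, choreo_b) to the mean rating, i.e. one
    value per cell of Table II.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for choreo_a, choreo_b, rating in responses:
        totals[(choreo_a, choreo_b)] += rating
        counts[(choreo_a, choreo_b)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}
```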

C. Discussion

The objective evaluations carried out using the assessment levels in Section V-A indicate that the most successful choreography synthesis scenarios are the single best path and the exchangeable path scenarios. We note that the exchangeable path synthesis as well as the likely path synthesis can be seen as variations or refinements of the single best path synthesis approach. On the other hand, according to the subjective evaluations presented in Section V-B, the most preferable scenarios are the likely path and the exchangeable path. Hence the exchangeable path synthesis, being among the top two according to both objective and subjective evaluations, can be regarded as our best synthesis approach. Note also that the exchangeable path synthesis includes all the statistical models defined in Section III.

The other two scenarios (acoustic-only and figure-only) are mainly used to demonstrate the effectiveness of the musical measure models and the figure transition model; hence they both serve as reference synthesis methods. The acoustic-only choreography, however, depends only on the correlations between measures and figures, and therefore fails to sustain motion continuity at figure-to-figure boundaries, which is indispensable to create realistic animations.


Hence we have chosen the figure-only synthesis as the baseline method to compare with our best result, i.e., with the exchangeable path synthesis, and have prepared a demo video (submitted as supplemental material) that compares side by side the animations resulting from these two synthesis scenarios for two musical pieces in the Kasik database.

We have also prepared another video which demonstrates the likely path, the exchangeable path, and the single best path synthesis results with respect to sample original dance performances. The demo video starts with a long excerpt from the original and the synthesized choreographies driven by a musical piece that is available in the Kasik database. The long excerpt is followed by several short excerpts from the original and the synthesized choreographies driven by musical pieces that are available in the Kasik database. The demo is concluded with two long excerpts from the synthesized choreographies driven by two musical pieces that are not available in the Kasik database. Both demo videos are also available online [41].

VI. CONCLUSIONS

We have described a novel framework for music-driven dance choreography synthesis and animation. For this purpose, we construct a many-to-many statistical mapping from musical measures to dance figures based on the correlations between dance figures and musical measures as well as the correlations between successive dance figures in terms of figure-to-figure transition probabilities. We then use this mapping to synthesize a music-driven sequence of dance figure labels via a constraint-based dynamic programming procedure. With the help of the exchangeable figures notion, the proposed framework is able to yield a variety of different dance figure sequences. These output sequences of dance figures can be considered as alternative dance choreographies that are in synchrony with the driving music signal. The subjective evaluation tests indicate that the resulting music-driven dance choreographies are plausible and compelling to the audience. To further evaluate the synthesis results, we have also devised an objective assessment scheme that measures the "goodness" of a synthesized dance choreography with respect to the original choreography.

Although we have demonstrated our framework on a folk dance database, the proposed music-driven dance animation method can also be applied to other dance genres such as ballroom dances, Latin dances, and hip hop, as long as the dance performance is musical measure-based (i.e., the metric orders in the course of music and dance structure coincide [1]) and the dance database contains a sufficient amount of data to train the statistical models employed in our framework. One possible source of problems in our current framework is dance genres which do not have a well-defined set of dance figures. Hip hop, which is in fact a highly measure-based dance genre, is a good example of this. In such cases, figure annotation may become a very tedious task due to possibly very large variations in the way a particular movement (supposedly a dance figure) is performed. More importantly, if the dance genre does not have a well-defined set of dance figures, the 3-D motion capture data needed to train the dance figure models (described in Section IV-A) must be carefully prepared, since otherwise joining up individual figures smoothly during character animation can be very difficult and the continuity of the synthesized dance motion may not be guaranteed.

The performance of the proposed framework strongly depends on the quality, the complexity, and the size of the audiovisual dance database. For instance, higher-order n-gram models can be integrated in the presence of sufficient training data to better exploit the intrinsic dependencies of the dance figures. Currently we employ only bigrams (i.e., n = 2) to model the intrinsic dependencies of the dance figures. This choice of model is mainly due to the scale of the choreography database that we use in training. We also note that, in our experiments, the bigram statistics proved to be sufficient for the current state of the framework and for the particular Turkish folk dance genre, i.e., Kasik.
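As a simple illustration of this choice, figure-to-figure transition probabilities for a bigram model can be estimated by counting consecutive figure labels in the training choreographies, e.g., by maximum likelihood as sketched below. This is our own sketch; the paper's actual estimation and smoothing scheme may differ.

```python
from collections import Counter, defaultdict

def estimate_bigram_transitions(figure_sequences):
    """Maximum-likelihood bigram model P(next_figure | current_figure).

    figure_sequences: list of dance figure label sequences, one per training
    performance. Returns nested dicts: probs[f1][f2] = P(f2 | f1).
    """
    pair_counts = defaultdict(Counter)
    for seq in figure_sequences:
        for f1, f2 in zip(seq[:-1], seq[1:]):
            pair_counts[f1][f2] += 1
    probs = {}
    for f1, next_counts in pair_counts.items():
        total = sum(next_counts.values())
        probs[f1] = {f2: count / total for f2, count in next_counts.items()}
    return probs
```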

Our music-driven dance animation scheme currently supports only the single-dancer scenario, but the framework can be extended to handle multiple dancers as well by learning the correlations between the movements of the dancers through the use of additional statistical models, which would however increase the complexity of the overall learning process. Having multiple dancers would also increase the complexity of the animation, since the spatial positioning of the dancers relative to each other would then have to be taken into account.

The proposed framework currently requires expert input and musical transcription prior to the audiovisual feature extraction and modeling tasks. This tedious pre-processing could be eliminated by introducing automatic measure/dance figure segmentation capability into the framework. However, such automatic segmentation techniques are not yet available in the literature, and they seem likely to remain open research areas in the near future. In this study, HMM structures are used to model the dance motion trajectories, since they can represent variations among different realizations of dance figures in personalized dance performances. However, one can consider other methods, such as style machines, that would also represent stylistic variations associated with dance figures.

We define a dance figure as the dance motion trajectory corresponding to a single measure segment. The choice of measures as elementary music primitives simplifies the task of statistical modeling and allows us to use the powerful HMM framework for music-to-dance mapping. However, this choice can also be seen as a limiting assumption that ignores higher levels of semantics and correlations which might exist in a musical piece, such as chorus and verses. This current limitation of our framework could be addressed by using, for example, hierarchical statistical modeling tools and/or higher-order n-grams (with n > 2). Yet, modeling higher levels of semantics remains an open challenge for further research.

The proposed framework can trigger interdisciplinary studies with the collaboration of dance artists, choreographers, and computer scientists. Certainly, it has the potential of creating appealing applications, such as fast evaluation of dance choreographies, dance tutoring, entertainment, and, more importantly, digital preservation of folk dance heritage by safeguarding irreplaceable information that tends to perish. As a final remark, we think that the proposed framework can be modified to be used for other multimodal applications such as speech-driven facial expression or body gesture synthesis and animation.


REFERENCES

[1] W. C. Reynolds, "Foundations for the analysis of the structure and form of folk dance: A syllabus," Yearbook Int. Folk Music Council, vol. 6, pp. 115–135, 1974.

[2] S. Gao and C.-H. Lee, "An adaptive learning approach to music tempo and beat analysis," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2004, vol. 4, pp. 237–240.

[3] D. P. W. Ellis, "Beat tracking by dynamic programming," J. New Music Res., vol. 36, no. 1, pp. 51–60, 2007.

[4] M. F. McKinney, D. Moelants, M. E. P. Davies, and A. Klapuri, "Evaluation of audio beat tracking and music tempo extraction algorithms," J. New Music Res., vol. 36, no. 1, pp. 1–16, 2007.

[5] A. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 342–355, Jan. 2006.

[6] M. Gainza, "Automatic musical meter detection," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP 2009), 2009, pp. 329–332.

[7] T. Fujishima, "Realtime chord recognition of musical sound: A system using Common Lisp Music," in Proc. Int. Computer Music Conf., 1999, pp. 464–467.

[8] K. Lee and M. Slaney, "Automatic chord recognition from audio using a supervised HMM trained with audio-from-symbolic data," in Proc. 1st ACM Workshop Audio and Music Computing Multimedia (AMCMM '06), New York, 2006, pp. 11–20.

[9] D. Ellis and G. Poliner, "Identifying 'cover songs' with chroma features and dynamic programming beat tracking," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP 2007), 2007, vol. 4, pp. IV-1429–IV-1432.

[10] S. Kim, P. Georgiou, and S. Narayanan, "A robust harmony structure modeling scheme for classical music opus identification," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP 2009), 2009, pp. 1961–1964.

[11] C. Bregler, S. M. Omohundro, M. Covell, M. Slaney, S. Ahmad, D. A. Forsyth, and J. A. Feldman, "Probabilistic models of verbal and body gestures," in Computer Vision in Man-Machine Interfaces. Cambridge, U.K.: Cambridge Univ. Press, 1998, pp. 267–290.

[12] O. Arikan and D. A. Forsyth, "Interactive motion generation from examples," ACM Trans. Graph., vol. 21, no. 3, pp. 483–490, 2002.

[13] L. Kovar, M. Gleicher, and F. Pighin, "Motion graphs," ACM Trans. Graph., vol. 21, no. 3, pp. 473–482, 2002.

[14] Y. Li, T. Wang, and H.-Y. Shum, "Motion texture: A two-level statistical model for character motion synthesis," ACM Trans. Graph., vol. 21, no. 3, pp. 465–472, 2002.

[15] M. Brand and A. Hertzmann, "Style machines," in Proc. 27th Annu. Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '00), New York, 2000, pp. 183–192.

[16] J. Min, H. Liu, and J. Chai, "Synthesis and editing of personalized stylistic human motion," in Proc. 2010 ACM SIGGRAPH Symp. Interactive 3D Graphics and Games (I3D '10), New York, 2010, pp. 39–46.

[17] A. Ruiz and B. Vachon, "Three learning systems in the reconnaissance of basic movements in contemporary dance," in Proc. 5th Biannu. World Automation Congr., 2002, vol. 13, pp. 189–194.

[18] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: Driving visual speech with audio," in Proc. 24th Annu. Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '97), New York, 1997, pp. 353–360.

[19] T. Chen, "Audiovisual speech processing," IEEE Signal Process. Mag., vol. 18, no. 1, pp. 9–21, 2001.

[20] M. Brand, "Voice puppetry," in Proc. 26th Annu. Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '99), New York, 1999, pp. 21–28.

[21] Y. Li and H.-Y. Shum, "Learning dynamic audio-visual mapping with input-output hidden Markov models," IEEE Trans. Multimedia, vol. 8, no. 3, pp. 542–549, Jun. 2006.

[22] J. Xue, J. Borgstrom, J. Jiang, L. Bernstein, and A. Alwan, "Acoustically-driven talking face synthesis using dynamic Bayesian networks," in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2006, pp. 1165–1168.

[23] M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp, "Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 8, pp. 1330–1345, Aug. 2008.

[24] M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel, "Gesture modeling and animation based on a probabilistic re-creation of speaker style," ACM Trans. Graph., vol. 27, pp. 5:1–5:24, Mar. 2008.

[25] S. Levine, P. Krähenbühl, S. Thrun, and V. Koltun, "Gesture controllers," ACM Trans. Graph., vol. 29, pp. 124:1–124:11, Jul. 2010.

[26] M. Cardle, L. Barthe, S. Brooks, and P. Robinson, "Music-driven motion editing: Local motion transformations guided by music analysis," in Proc. Annu. Eurographics UK Conf., 2002, pp. 38–44.

[27] H. C. Lee and I. K. Lee, "Automatic synchronization of background music and motion in computer animation," Comput. Graph. Forum, vol. 24, pp. 353–361, 2005.

[28] T.-H. Kim, S. I. Park, and S. Y. Shin, "Rhythmic-motion synthesis based on motion-beat analysis," ACM Trans. Graph., vol. 22, no. 3, pp. 392–401, 2003.

[29] G. Alankus, A. A. Bayazit, and O. B. Bayazit, "Automated motion synthesis for dancing characters," Comput. Animat. Virtual Worlds, vol. 16, no. 3–4, pp. 259–271, 2005.

[30] T. Shiratori, A. Nakazawa, and K. Ikeuchi, "Dancing-to-music character animation," Comput. Graph. Forum, vol. 25, no. 3, pp. 449–458, 2006.

[31] J. W. Kim, H. Fouad, J. L. Sibert, and J. K. Hahn, "Perceptually motivated automatic dance motion generation for music," Comput. Animat. Virtual Worlds, vol. 20, no. 2–3, pp. 375–384, 2009.

[32] F. Ofli, Y. Demir, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multicamera audio-visual analysis of dance figures," in Proc. IEEE Int. Conf. Multimedia and Expo, 2007, pp. 1703–1706.

[33] F. Ofli, Y. Demir, E. Erzin, Y. Yemez, A. M. Tekalp, K. Balci, I. Kiziloglu, L. Akarun, C. Canton-Ferrer, J. Tilmanne, E. Bozkurt, and A. Erdem, "An audio-driven dancing avatar," J. Multimodal User Interfaces, vol. 2, no. 2, pp. 93–103, Sep. 2008.

[34] F. Ofli, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multi-modal analysis of dance performances for music-driven choreography synthesis," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 2466–2469.

[35] R. Shepard, "Circularity in judgements of relative pitch," J. Acoust. Soc. Amer., vol. 36, no. 12, 1964.

[36] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.

[37] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 1, pp. 43–49, Feb. 1978.

[38] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2001, vol. 14, pp. 849–856.

[39] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967, pp. 281–297.

[40] P. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, 1987.

[41] F. Ofli, E. Erzin, Y. Yemez, and A. M. Tekalp, Music-Driven Dance Choreography Synthesis Demo, 2011. [Online]. Available: http://mvgl.ku.edu.tr/Learn2Dance.

Ferda Ofli (S'07–M'11) received the B.Sc. degrees, both in electrical and electronics engineering and computer engineering, and the Ph.D. degree in electrical engineering from Koç University, Istanbul, Turkey, in 2005 and 2010, respectively. He is currently a postdoctoral researcher in the Tele-Immersion Group of the University of California at Berkeley, Berkeley, CA. His research interests span the areas of multimedia signal processing, computer vision, pattern recognition, and machine learning. He received the Graduate Studies Excellence award in 2010 for outstanding academic achievement at Koç University.


Engin Erzin (S'88–M'96–SM'06) received the B.Sc., M.Sc., and Ph.D. degrees from Bilkent University, Ankara, Turkey, in 1990, 1992, and 1995, respectively, all in electrical engineering. During 1995–1996, he was a postdoctoral fellow in the Signal Compression Laboratory, University of California, Santa Barbara. He joined Lucent Technologies in September 1996, and he was with the Consumer Products for one year as a Member of Technical Staff of the Global Wireless Products Group. From 1997 to 2001, he was with the Speech and Audio Technology Group of the Network Wireless Systems. Since January 2001, he has been with Koç University, Istanbul, Turkey. His research interests include speech signal processing, audio-visual signal processing, human-computer interaction, and pattern recognition. Dr. Erzin is serving as an Associate Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING (2010–2013).

Yücel Yemez (M'03) received the B.Sc. degree from Middle East Technical University, Ankara, Turkey, in 1989, and the M.Sc. and Ph.D. degrees from Boğaziçi University, Istanbul, Turkey, in 1992 and 1997, respectively, all in electrical engineering. From 1997 to 2000, he was a postdoctoral researcher in the Image and Signal Processing Department of Télécom Paris (ENST), Paris, France. Currently he is an Associate Professor of the Computer Engineering Department at Koç University, Istanbul. His research is focused on various fields of computer vision and graphics.

A. Murat Tekalp (S'80–M'84–SM'91–F'03) received the M.Sc. and Ph.D. degrees in electrical, computer, and systems engineering from Rensselaer Polytechnic Institute (RPI), Troy, NY, in 1982 and 1984, respectively. He was with Eastman Kodak Company, Rochester, NY, from December 1984 to June 1987, and with the University of Rochester from July 1987 to June 2005, where he was promoted to Distinguished University Professor. Since June 2001, he has been a Professor at Koç University, Istanbul, Turkey. His research interests are in the area of digital image and video processing, including video compression and streaming, motion-compensated video filtering for high resolution, video segmentation, content-based video analysis and summarization, 3DTV/video processing and compression, multicamera surveillance video processing, and protection of digital content. He authored the book Digital Video Processing (Englewood Cliffs, NJ: Prentice-Hall, 1995) and holds seven U.S. patents. His group contributed technology to the ISO/IEC MPEG-4 and MPEG-7 standards.

Dr. Tekalp was named Distinguished Lecturer by the IEEE Signal Processing Society in 1998 and was awarded a Fulbright Senior Scholarship in 1999. He received the TUBITAK Science Award (the highest scientific award in Turkey) in 2004. He chaired the IEEE Signal Processing Society Technical Committee on Image and Multidimensional Signal Processing (January 1996–December 1997). He served as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (1990 to 1992), the IEEE TRANSACTIONS ON IMAGE PROCESSING (1994 to 1996), and the Kluwer journal Multidimensional Systems and Signal Processing (1994 to 2002). He was an Area Editor for Graphical Models and Image Processing (1995 to 1998). He was also on the Editorial Board of the Academic Press journal Visual Communication and Image Representation (1995 to 2002). He was appointed as the Special Sessions Chair for the 1995 IEEE International Conference on Image Processing, the Technical Program Co-Chair for IEEE ICASSP 2000 in Istanbul, the General Chair of the IEEE International Conference on Image Processing (ICIP) in Rochester in 2002, and the Technical Program Co-Chair of EUSIPCO 2005 in Antalya, Turkey. He is the Founder and First Chairman of the Rochester Chapter of the IEEE Signal Processing Society. He was elected as the Chair of the Rochester Section of IEEE for 1994 to 1995. At present, he is the Editor-in-Chief of the EURASIP journal Signal Processing: Image Communication (Elsevier). He is serving as the Chairman of the Electronics and Informatics Group of the Turkish Science and Technology Foundation (TUBITAK) and as an independent expert to review projects for the European Commission.