Perceptually Motivated Automatic Dance Motion Generation for
Music
By Jae Woo Kim
B.S. in Physics, February 1991, Hankuk University of Foreign Studies
M.S. in Computer Science, February 1993, Hankuk University of Foreign Studies
A Dissertation Submitted to
the Faculty of The School of Engineering and Applied Science
of the George Washington University in partial satisfaction of the requirements
for the degree of Doctor of Science
May 17, 2009
Dissertation directed by
James K. Hahn
Professor of Computer Science
The goal of the research is to develop a method to generate a dance performance that
is perceptually matched to a given musical piece. The proposed method extracts musical
features by analyzing input audio streams while also extracting motion features from
motion data. A mapping is then performed between the two feature spaces by matching
the progressions of musical patterns and dance motion patterns and correlating the feature
values between the two. Finally, the input music is automatically transformed into a
dance performance through a process of dance motion recombination based on the
derived mapping. The process results in the generation of “natural” or realistic dance
motion that approximates a human performance for a given musical piece.
1.1 Motivation
Synthesizing realistic human motion is one of the most important research topics in
computer animation. Various methods such as keyframing, inverse kinematics, and
dynamic simulation have been used to synthesize human behavior. More recently, with
the advent of motion capture technology, motion capture and editing techniques have
become widely used in realistic human motion synthesis.
Synchronization of sound with motion is another essential problem in animation because sound plays an important role in computer animation. The process of creating computer generated animation usually begins with the creation of a visual animated sequence, followed by a sound editing process in which perceptually appropriate sound effects are manually added to the animation sequence. The sound design process is tedious, time consuming, and requires a high level of expertise to produce convincing results. Automatic sound effect synchronization has therefore been a crucial issue within the computer animation research community as well as in industry.
When dealing with music and human dance motion, the synchronization between
music and motion becomes even more important because dance motion has a much
stronger linkage to music than any other type of motion. However, synchronizing music
to dance motion is a very difficult problem due to the intricate relationship that exists
between music and motion in a dance performance. To date, little research has been done
on the problem of synchronization between dance motion and music [CAR02].
The synchronization problem is further compounded by the complexity inherent in both musical and human motion data. Musical sound contains a wealth of information such as pitch, timbre, rhythm, and harmony. Human motion data, on the other hand, is multidimensional, and of all human motions (e.g., walking, jumping, running), dance motion is the most complex. This complexity makes it difficult to analyze the data and to explore the relationship between the musical and dance motion features.
The nature of how dance is created and performed creates additional challenges. A dance performance is carefully choreographed by expert choreographers based on a given musical piece. This process demands a high degree of intelligence as well as much expertise, education, and experience. The problem is therefore not amenable to analytic or algorithmic models because of the aesthetic, perceptual, and psychological aspects involved.
In this dissertation, we develop a novel method to automatically generate synchronized dance motion that is perceptually matched to a given musical piece so that the resulting dance performance is convincing. The approach suggested in this dissertation can be applied to a number of application areas including film, TV commercials, virtual reality applications, computer games, and entertainment systems.

1.2 Problem Domain

The goal of this dissertation is to develop a solution to the problem of automatic dance motion generation where an arbitrary piece of music is an input into a system resulting in an animated dance performance. This approach mimics a choreographer's process where a musical piece is auditioned and analyzed to inspire the creation of a dance performance (figure 1.1).

Figure 1.1: Problem definition

The problem addressed by this work is inherently multidisciplinary, spanning a number of research areas (table 1.1).
Table 1.1: Relevant Research Areas

• Motion synthesis: generating human figure animation from given musical cues.
• Music visualization: rendering a given musical performance using visual constructs.
• Motion graph search: searching a motion graph constructed solely from human dance motion segments for a motion sequence that perceptually matches a given musical piece.
• Motion retrieval: retrieving a motion clip that matches the input musical cues.
• Music analysis and motion analysis: extracting musical and motion features that best describe the properties of music and motion and are useful in matching music to motion.
1.3 Proposed Solution
The proposed solution to the problem of dance motion generation consists of four
components – 1) music analysis, 2) motion analysis, 3) motion graph construction, and 4)
matching between musical and motion features. Musical features as well as motion
features are extracted from input music and a database of dance motion clips. Feature
vectors are then constructed from the extracted feature values. Musical features represent
the properties of the musical segments contained in the input music. Motion features
represent the postural and dynamic properties of the motion segments in motion
sequences. A matching process between musical features and motion features is
performed to create a perceptually matched dance performance for the given music
through a process of dance motion recombination; figure 1.2 depicts an overview of the
approach. The problem can be formulated as a search problem where a motion database
is searched for a perceptually optimal sequence of motion segments that can be
recombined into a dance performance based on an input musical piece.
Figure 1.2: Solution Overview
[Diagram: Music Data (.wav or .mp3) and a Motion Database feed Music Analysis and Motion Analysis, whose Musical Features and Motion Features meet in a Matching step.]
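To make this search formulation concrete, the sketch below mocks all four components in Python: feature extraction is stood in for by random vectors, the motion graph is fully connected, and matching is a greedy nearest-neighbor walk. Every name and the Euclidean matching criterion here are illustrative assumptions; the actual feature sets and matching criteria are described in Chapter 3.

```python
import numpy as np

# Toy sketch of the four-stage pipeline. Real feature extraction from
# audio and motion capture data is mocked with random vectors here.
rng = np.random.default_rng(0)

def analyze_music(n_segments, n_features=30):
    """Stand-in for music analysis: one feature vector per musical segment."""
    return rng.normal(size=(n_segments, n_features))

def analyze_motion(n_clips, n_features=30):
    """Stand-in for motion analysis: one feature vector per motion clip."""
    return rng.normal(size=(n_clips, n_features))

def build_motion_graph(n_clips):
    """Toy motion graph: every clip may transition to every other clip."""
    return {i: [j for j in range(n_clips) if j != i] for i in range(n_clips)}

def search_best_path(graph, music, motion):
    """Greedy search: for each musical segment, step to the neighboring
    motion clip whose feature vector is closest to the segment's."""
    path = [int(np.argmin(np.linalg.norm(motion - music[0], axis=1)))]
    for segment in music[1:]:
        neighbors = graph[path[-1]]
        dists = [np.linalg.norm(motion[j] - segment) for j in neighbors]
        path.append(neighbors[int(np.argmin(dists))])
    return path

music = analyze_music(8)          # 8 musical segments
motion = analyze_motion(20)       # 20 dance motion clips
graph = build_motion_graph(20)
print(search_best_path(graph, music, motion))  # indices of recombined clips
```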
Both music and dance performances can be considered to be sequences composed of
a finite number of patterns. Musical performances generally consist of a set of patterns or
themes that repeat throughout the musical piece. Similarly a progression of patterns or
themes exists in a dance performance. Current approaches to this problem have only
considered the local properties of musical and motion segments in finding an optimal
match. The global or thematic structures of the musical performance and the reconstructed dance motion are ignored. Consequently, dance motion produced using current approaches is not convincing.
In this dissertation, we propose a novel motion-to-music matching method that extracts thirty musical features from musical data as well as thirty-seven motion features from motion data. A matching process is then performed between the two
feature spaces considering the correspondence of the relative changes in both feature
spaces and the correlations between musical and motion features. Similarity matrices are
introduced to match the amounts of relative change in both feature spaces, and correlation coefficients are used to establish the correlations between musical and motion features by measuring the strength of correlation between each pair of features. By doing this, the progressions of musical and dance motion patterns, and the perceptual changes between consecutive musical and motion segments, are matched.
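A minimal numeric sketch of these two ingredients follows, assuming per-segment feature matrices are already in hand. The distance-based similarity matrix and plain Pearson correlations used here are simplifications for illustration, not the dissertation's exact formulation.

```python
import numpy as np

def self_similarity(features):
    """Similarity matrix: pairwise distances between segment feature
    vectors. Comparable structure in the music and motion matrices
    indicates matching progressions of patterns."""
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def relative_changes(features):
    """Magnitude of change between consecutive segments in a feature space."""
    return np.linalg.norm(np.diff(features, axis=0), axis=1)

def feature_correlations(music, motion):
    """Pearson correlation between every (musical, motion) feature pair,
    assuming both sequences contain the same number of segments."""
    m = (music - music.mean(0)) / music.std(0)
    q = (motion - motion.mean(0)) / motion.std(0)
    return m.T @ q / len(music)   # shape (n_music_feats, n_motion_feats)

rng = np.random.default_rng(1)
music = rng.normal(size=(16, 30))     # 16 segments x 30 musical features
motion = rng.normal(size=(16, 37))    # 16 segments x 37 motion features
print(self_similarity(music).shape)               # (16, 16)
print(feature_correlations(music, motion).shape)  # (30, 37)
```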
To demonstrate the effectiveness of the proposed approach, some measure of
perceived quality will have to be developed. Evaluating the perceived quality of an
animated dance performance using an analytic or algorithmic solution, however, is not
feasible due to the complexity of articulated human figure animation as well as the
perceptual, psychological, and aesthetic aspects that contribute to the perceived quality of
the generated dance performance. We therefore designed and carried out a user opinion
study to assess the perceived quality of the proposed approach.
1.4 Original Contributions
This dissertation will make several original contributions in the area of human figure
animation:
• A novel approach has been developed to address the problem of automatic dance
motion generation using a thematic analysis of music and dance motion.
• The perceptual relationship between musical and motion features has been
explored and established to match musical contents and motion contents.
• A set of motion features useful in describing the postural and dynamic properties of human motion has been developed. These features are useful not only for motion to music matching but also for many other applications such as motion analysis and motion retrieval.
• The proposed approach can be used in a number of application areas requiring a perceptually optimal mapping between two different media types. Examples include abstract animation, movie clip generation, texture generation from musical data, and automatic music or sound effect generation from motion data.
1.5 Document Organization
The remainder of this document is organized as follows: Chapter 2 reviews previous
work in the various related domains including music-to-motion matching, human motion
synthesis, music visualization, and speech animation. Chapter 3 describes the proposed approach in detail. Chapter 4 discusses the user opinion study of the proposed approach to motion-to-music synchronization. Chapter 5 concludes the dissertation and discusses future work.
Chapter 2 -- Related Work
In this chapter, we review previous work related to this dissertation. The related
research areas are divided into the following four categories: 1) human motion synthesis,
2) music visualization, 3) speech animation, and 4) dance motion generation from music.
Major issues regarding previous approaches in each area are described along with the
advantages and limitations of each approach.
2.1 Human Motion Synthesis
Human motion synthesis or articulated figure animation has been an important topic
in computer animation research. It has many application areas and much research has
been performed on creating animation of human behaviors. Some research efforts have focused on the appearance of the generated human motion, while others have focused on the physical correctness of the movements of the human body, depending on the application area
or requirements of the problem. Research in human motion synthesis can be divided into
several categories – keyframing, inverse kinematics, and dynamic simulation – in terms
of the solution methods used in generating animation.
Keyframing approaches have been widely used in classical animation. Animators
specify the keyframes of the animation and the in-between frames are generated
automatically using interpolation methods. The merit of this approach is that it gives animators full control over the human characters' movements. However, this approach requires much time and effort as well as a high degree of expertise. The inverse
kinematics approach allows animators to specify only the positions of end-effectors, such
as hands and feet, and all the joint angles are automatically obtained by applying the
inverse kinematics method to define the pose of each frame in the animation sequence.
Additional constraints can be defined to address the ambiguity that frequently occurs when using this approach; the ambiguity can be resolved by optimization methods using the given constraints. Inverse kinematics approaches require less time and effort than keyframing [CHU99], but the procedure is nonetheless tedious and time-consuming. Finally, dynamic simulation approaches solve dynamic constraint formulations to automatically produce physically correct motion. While this approach is the least labor intensive, it is very difficult to specify the constraints necessary to produce a desired motion, and the computational cost is very high [HOD95][HOD97].
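As an illustration of the in-betweening that keyframing systems automate, the sketch below linearly interpolates joint angles between the two keyframes that bracket a query time. This is a deliberately simplified, assumption-laden version; production systems interpolate with splines and quaternions rather than linear Euler angles.

```python
import numpy as np

def inbetween(key_times, key_poses, t):
    """Linearly interpolate joint angles (one row of Euler angles per
    keyframe) between the two keyframes bracketing time t."""
    i = np.clip(np.searchsorted(key_times, t) - 1, 0, len(key_times) - 2)
    u = (t - key_times[i]) / (key_times[i + 1] - key_times[i])
    return (1 - u) * key_poses[i] + u * key_poses[i + 1]

key_times = np.array([0.0, 0.5, 1.0])         # keyframe times in seconds
key_poses = np.array([[ 0.0,  10.0],          # two joints, three keyframes
                      [45.0, -20.0],
                      [90.0,   0.0]])
print(inbetween(key_times, key_poses, 0.25))  # -> [22.5, -5.0]
```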
Since the advent of motion capture technology, approaches using motion capture data have been widely used in generating realistic human motion. Human body movements generated using this approach are highly realistic because the motion capture data is recorded from real human actors' movements [BOD97][MOL96][OBR00]. Motion capture data can be reused by modifying the
captured motion data using various techniques such as signal processing, space-time
constraints, and displacement mapping. Motion editing techniques modify the motion
capture data to meet user-specified requirements or environmental constraints while
keeping the original quality of the motion [BRU95][WIT95][GLE97] [GLE98]. Motion
graphs have been used to generate new sequences of motion by stitching many small
clips of motion data according to the specifications and constraints given by users and the
environment. User specifications can be given as a set of key character poses, a path traveled by the character, or a reference motion specified by the user and captured by a video camera [ARI02][KOV02][LEE02][LI02].
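The motion graph idea can be sketched as follows: clips are nodes, an edge is added wherever the end pose of one clip is close enough to the start pose of another to blend, and a walk over the graph stitches clips into a new sequence. The pose distance and threshold here are illustrative assumptions; practical systems compare windows of frames and blend across each transition.

```python
import numpy as np

rng = np.random.default_rng(2)
# 10 toy clips: 30 frames x 20 degrees of freedom of random "poses".
clips = [rng.normal(size=(30, 20)) for _ in range(10)]

def pose_distance(a, b):
    return float(np.linalg.norm(a - b))

THRESHOLD = 6.5  # arbitrary; chosen so some transitions exist for toy data

# Edge i -> j if clip i ends near where clip j begins.
edges = {i: [j for j in range(len(clips))
             if j != i and pose_distance(clips[i][-1], clips[j][0]) < THRESHOLD]
         for i in range(len(clips))}

# A random walk over the graph yields a new, longer motion sequence.
walk, node = [], 0
for _ in range(5):
    walk.append(node)
    if not edges[node]:
        break
    node = int(rng.choice(edges[node]))
print(walk)  # indices of clips to stitch together
```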
While a variety of approaches have been developed for the specification of desired motion in human motion synthesis systems, little work has been done on the generation of human motion from musical cues extracted from a musical performance [SHI06a][SHI06b].
2.2 Music Visualization
Music visualization is the process of rendering music graphically by analyzing and
visually mapping the properties of the musical sound. Music visualization can be grouped
into two categories: one is visualization for analysis of musical contents and the other is
visualization for artistic expression.
There are two approaches to visualizing musical content for analysis purposes: one visualizes direct musical data and the other visualizes interpreted musical data. Here, direct musical data refers to information extracted directly from the musical data, such as pitch or onset time. Interpreted musical data, on the other hand, denotes higher level information extracted from musical sound, such as tempo or key. Misra et al. developed methods of direct music visualization in which they displayed 2D waveforms or spectrograms where the x-axis represented time and the y-axis represented the primary values of interest [MIS05]. Interpreted music visualization renders static or animated
imagery to represent structural characteristics, tonal contexts, tempo and loudness
variations of the given musical data. Researchers have used a number of approaches in
order to represent the interpreted music, including a 2D grid, a sequence of translucent arches, a toroid representation, and moving dots [COO01].
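A direct visualization of the kind described by Misra et al. can be computed with a short-time Fourier transform. The sketch below is a minimal version under assumed window parameters, producing the magnitude spectrogram whose x-axis is time and whose y-axis is frequency.

```python
import numpy as np

def spectrogram(signal, win=1024, hop=512):
    """Magnitude spectrogram via a windowed FFT over overlapping frames."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, n_frames)

sr = 22050
t = np.arange(sr) / sr                  # one second of audio
tone = np.sin(2 * np.pi * 440 * t)      # a 440 Hz test tone
S = spectrogram(tone)
print(S.shape, S[:, 0].argmax())        # peak bin ~ 440 * 1024 / 22050 ~ 20
```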
Approaches to music visualization for artistic expression establish a mapping between musical and visual content and, in some cases, enable interactivity between a musician and the generated visualization. In these approaches, music is visualized through responsive video imagery, virtual character behavior, and responsive virtual
environments. Ox developed a system that allows users to navigate virtual landscapes
generated by assigning geometric and color data to characteristics found within input
musical streams [OX02]. Oliver et al. developed a virtual environment containing both
real and abstract elements where user vocalizations result in both auditory and visual
feedback [OLI97]. Levin and Lieberman developed a system which generates glyphs to augment a live performer's vocalizations [LEV04]. Wayne Lytle developed a number of three dimensional animations based on MIDI events. In his work, MIDI data controls and animates preposterous-looking instruments whose movements respond pleasingly to every sound generated from the MIDI data [LYT].
Various commercial media players such as Winamp, Microsoft Media Player, and RealPlayer have the ability to generate animated imagery based on a piece of recorded music. They visualize music based on basic musical features such as loudness and frequency content.
The problem addressed by this dissertation can be viewed as an approach to music visualization using a human character's dance performance. In this regard, the proposed approach is closely related to the problem of music visualization for artistic expression. However, as discussed in this section, much of the research effort to date has focused on the problem of music visualization for analysis of musical content, while the problem of visualization for artistic expression has not been well studied.
Unfortunately, due to the complexity of the problem, current approaches in music
visualization are not applicable to the problem of automatic dance performance
generation from music. The information extracted from musical performance for the
purpose of musical analysis or artistic visualization is not sufficient for the problem of
dance motion generation. Current approaches generally extract a small number of musical
features for the purposes of visualization. Global or structural information regarding the
musical composition is not extracted. Therefore, current approaches do not provide
sufficient information for the generation of realistic dance performance.
2.3 Speech Animation
Synchronizing the lip movement of an animated character with speech is an important
topic in human facial animation. This problem is comparable to the motion-to-music matching problem because both deal with generating animation for given audio signals. Research efforts on the synchronization of lip movement with speech are concerned with several issues: (1) how to represent the facial model, (2) how to control the facial model, (3) how to label visemes, or the visual configurations, and (4) how to generate the motion sequence [COH93][BRE97].
Generic 3D mesh models, 3D scans, real images, or hand-drawn images have been
used as human facial models in this area [PAR72][LEW91][GUI94][MOR91]. The
control parameters are defined depending on the facial models used. Examples of control
parameters are three dimensional deformations or the labels attributed to specific facial
locations. Target utterances are matched with corresponding phoneme labels with which
associated visual configurations have been assigned. Those associated visual
configurations are used to generate the animated images in the synthesis phase. The
phoneme labels can be assigned manually or automatically.
Keyframing methods and physics-based methods have been used in synchronizing lip
movement with speech and more recently machine learning methods have been used as
well. With keyframing methods, the animator specifies particular keyframes and the system generates the intermediate frames to produce animated images [PAR74][PEA86][COH90]. With physics-based methods, physically based models are used to determine the mouth movement for a given initial condition and a set of forces acting on the facial model. To do this the facial model must include a representation of the underlying facial muscles and skin [WAT87][LEE95]. In machine learning methods, systems are trained on recorded and labeled data and then used to synthesize new motions [BRA99][BRO94].
The approaches used in speech animation are not directly applicable to the problem of
dance motion generation because the objectives of the two problems are different. The
objective of speech animation is to generate correct mouth movement that can realize the
appropriate phonemic targets, while the objective of dance motion generation is to generate perceptually matched dance movements that are aesthetically satisfying. The phonemic labeling approach used in speech animation systems, which assigns a predefined visual configuration to each phoneme, is not applicable to dance motion generation because a given set of musical data can have multiple visual configurations that are perceptually well matched.
2.4 Synchronization between Music and Motion
There have been some recent efforts addressing the problem of music to motion
synchronization. Those efforts are focused on establishing a correlation between musical
and motion features that represent the perceptual properties of music and motion
respectively. These efforts have addressed the problem using two approaches: one maps music to motion, and the other maps motion to music. Each approach is applicable to different application areas, but whatever the approach, the essence of the problem is the same.
Musical features that have been used in this research include both MIDI (Musical
Instrument Digital Interface) data as well as audio signal data. MIDI is a standardized
protocol for communication between electronic music devices as well as between those
devices and host computers. MIDI data contains information on event messages such as
the pitch and intensity of musical notes, control signals for sound generation parameters
such as volume, vibrato and panning, as well as clock signals to set the tempo. Because
MIDI data already contains much information about the music while audio signal data
does not, it is generally easier to extract musical features from MIDI data. Much of the
music to motion synchronization research has therefore utilized MIDI data as input due to
its simplicity. Most music, however, is not stored as MIDI data, but as either an analog or
digital signal. This limits the usefulness of approaches based on MIDI data. Approaches
using audio signal data are much more widely applicable.
We can divide research efforts in music to motion synchronization into the following
three categories: 1) Event-based matching, 2) Feature-based matching, and 3) Emotion-
based matching. In this section, we investigate these approaches and discuss their
strengths and limitations.
2.4.1 Event-Based Matching
Motion and music can be synchronized by matching events extracted from musical
data with events extracted from motion data. An event in the musical domain is defined as a point in time where a significant perceptual change occurs. Events in the musical domain include dominant drum beats and peak points in amplitude, while events in the motion domain include motion beats, footsteps, arm swings, sudden pauses, and jumps [ALA05][KIM03][LEE05][SAU07].
Sauer and Yang suggested a system for creating an animation that is synchronized to input music by matching musical events such as beat positions and dynamics (e.g., peaks and valleys of amplitude) with predefined actions. They also developed a script language used to define the mapping from musical events to motion events; users can easily define the mapping by editing a script file [SAU07].
Although their script allows users the flexibility to define a mapping scheme, their system has several limitations. First, the variety of movements their system can generate is limited by a small number of predefined movements. Second, the dance performance does not look realistic because it is generated by combining several simple actions. Lastly, the matching is undesirable because it is done manually, ignoring the correlation between music and dance.
Kim et al. suggested an approach for synthesizing synchronized motion from a set of
motion clips using an event matching approach for synchronization. In their approach,
new motion is synthesized by traversing a movement transition graph that is constructed
based on basic movements and acceptable transitions between them. Here, motion
capture data is segmented into small units of motion or basic movements based on motion
beat information extracted from the motion data [KIM03].
In order to synchronize motion to music, musical beats as well as motion beats are
first extracted followed by an incremental time-warping process that aligns the motion
beats and the musical beats so that the synthesized motion is synchronized to the input
sound. While this approach has demonstrated promising results for some classes of periodic motion, it does not perform well in generating novel choreography. This is due to the insufficient set of musical and motion features used in synchronizing motion with music.
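The alignment step can be sketched as a piecewise-linear time warp that maps each detected motion beat onto the corresponding musical beat and remaps every frame timestamp accordingly. This simplified batch version stands in for the incremental time-warping of Kim et al.; the beat times below are made up for illustration.

```python
import numpy as np

def warp_times(frame_times, motion_beats, music_beats):
    """Piecewise-linear warp sending each motion beat to its musical beat;
    frame timestamps are remapped so the motion lands on the music's beats."""
    return np.interp(frame_times, motion_beats, music_beats)

motion_beats = np.array([0.0, 0.9, 2.1, 3.0])  # beats detected in the motion
music_beats  = np.array([0.0, 1.0, 2.0, 3.0])  # beats detected in the music
frames = np.linspace(0.0, 3.0, 7)              # original frame timestamps
print(warp_times(frames, motion_beats, music_beats))
```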
Alankus et al. suggested an approach for synthesizing dance motion utilizing a
database of dance motion clips. In this approach new dance motion is synthesized
through a beat matching process where motion clips of dance moves are recombined in
order to match musical beats that are extracted from input music. Distinct dance moves
are delineated by a motion frame where a significant change in the direction of the
movement of some body part occurs.
In this approach, dance motion is synthesized by traversing a transition graph
consisting of dance motion segments – called dance figures – and acceptable transitions
between them. A matching algorithm traverses the transition graph using two different
algorithms – a fast greedy method and a genetic algorithm. The motion segments are time
warped so that they are aligned with musical beats extracted through a music analysis
process [ALA05].
Although this approach generates perceptually matched dance motions, it does not
create desirable results from a choreographic or aesthetic point of view. The small
number of musical and motion features used in this approach does not sufficiently
describe the properties of the music and dance motion.
Lee and Lee suggested an approach to generate background music from an animation
by matching feature points extracted from musical data and the corresponding feature
points extracted from motion data. Their approach is different from the above mentioned
approaches in that it maps an input motion sequence into a corresponding musical piece
[LEE05].
In this approach, an analysis process is carried out to extract feature points from both
music and motion sequences. Examples of feature points from music include: local peak
points of note volume and points where a note is played near a quarter note (or a note
played on the beat). Examples of feature points from motion include: foot falls and the
transition points of arm swings. Scores are assigned to feature points emphasizing the
more important features.
Various feature points from each source (music or motion) are merged together to
give an overall representation of the source and then the merged feature points of music
are aligned with the merged feature points of motion by time-scaling the music and time-
warping the motion using a dynamic programming algorithm.
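The dynamic-programming step can be sketched with a classic DTW-style recurrence over the two merged sequences of feature points. This toy version aligns bare timestamps and returns only the total cost, omitting the per-point importance scores and the music time-scaling of the actual method.

```python
import numpy as np

def alignment_cost(music_pts, motion_pts):
    """DTW-style dynamic programming over two feature-point sequences;
    D[i, j] is the best cost of aligning the first i and j points."""
    n, m = len(music_pts), len(motion_pts)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(music_pts[i - 1] - motion_pts[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

music_pts  = [0.0, 1.0, 2.0, 3.1]   # e.g., times of on-beat notes
motion_pts = [0.1, 0.9, 2.0, 3.0]   # e.g., times of footfalls
print(alignment_cost(music_pts, motion_pts))  # -> 0.3
```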
They also suggested a novel data structure – a Music Graph – which is used to
synthesize a music sequence for the given motion data. A Music Graph consists of
musical segments (nodes) and the acceptable transitions among them. Music is generated
by traversing the Music Graph based on the motion features of a given animation.
2.4.2 Feature-Based Matching
Another approach that has been used for music-to-motion synchronization is based on
the features of music and motion. Musical features and motion features are parameters
that describe the perceptual properties of music and motion. Events, on the other hand, represent the occurrence of changes in features in the time domain. One typical example of this approach is matching musical intensity with motion intensity [DOB03][SHI06a][SHI06b].
Dobrian and Bevilacqua suggested a method of mapping motion to music by dividing
an input motion sequence into small pieces of motion segments. Motion features such as
the time taken to traverse the segment, the total distance traversed over a segment, the
average speed while traveling along a segment, and the curviness of the path, are
extracted from each motion segment. Those features are then transformed into either MIDI parameters or control parameters for signal processing, based on a user specified mapping, in order to generate a musical sound track [DOB03]. Although this approach provides an intuitive and expressive way to generate a musical sound track, it does not provide a feasible approach for generating dance motion for music due to its manual mapping process between musical and motion features.
Shiratori et al. suggested a feature based method for synthesizing synchronized dance motion based on the rhythm and intensity of motion and music. The key idea here is that musical rhythm has a strong correlation with motion rhythm, while musical intensity, which represents the musical mood, has a strong correlation with motion intensity, which represents the strength of motion [SHI06a][SHI06b].
Their approach synthesizes a dance performance by searching a motion graph of
dance motion clips to select the best matching motion sequence in terms of rhythm and
intensity. This approach is predicated on the idea that rhythm and intensity provide a
sufficient level of information to generate both choreographically as well as aesthetically
convincing dance motion. However, given the complex and dynamic structure of music
and dance, it is unlikely that rhythm and intensity alone are sufficient for this purpose.
2.4.3 Emotion-based Matching
The psychological responses of audiences while auditioning music and dance performances are among the crucial issues in music and motion synchronization research. The emotional assessment of music and motion gestures has been addressed in the fields of music psychology, emotional intelligence, and human computer interaction (HCI). There
are also research efforts in music and motion synchronization that consider the emotional
responses of audiences to music and human motion [CAR02].
Cardle et al. suggested an approach for imbuing generated human motion with
emotional content based on input music. Their approach extracts musical features from
both MIDI data and the corresponding analog audio rendition to extract perceptually
significant musical features. The musical features are used to guide the motion editing
process to synthesize an animation synchronized with the given music [CAR02].
The animator interactively establishes a mapping between musical features and motion editing filters such that any musical feature can be mapped to any motion feature. The motion editing techniques used in their system include motion signal functions that can modify the generated motion to imbue it with a specific (happy, sad, angry) emotional mood.
Morioka et al. suggested an algorithm for synthesizing music that can appropriately express the emotion in a dance. Their algorithm extracts the emotional content from input dance motion and from a library of musical pieces. Music is then synthesized by selecting the musical piece that is emotionally best matched to the emotional content extracted from the dance motion. An eight state emotional space is used that includes solemn, sad, tender, serene, light, happy, exhilarated, and emphatic emotional states [MOR04].
This approach can achieve a convincing, emotionally inspired mapping between dance motion and a corresponding musical performance. However, because this approach only considers emotional state while ignoring all other aspects of dance motion and music (features and events), it is generally not useful for the dance performance generation problem.
2.4.4 Limitations of Previous Approaches
The research efforts that we investigated in this section have several important
limitations when considering the problem of automatic dance motion generation. They
are as follows:
• Not enough information from music and motion is used in the matching process
The approaches discussed in this section all utilize a limited number of features
extracted from human motion and music. The lack of richness in the feature set used
in those efforts limits the effectiveness of the mapping processes because not enough
information is used and thus the mappings are generally not convincing.
• No unified solution framework has been developed
Each approach focuses on one aspect of the motion to music matching problem. For
example, one approach only considers matching events detected from music and
motion while other approaches consider only matching features or emotional states
of the music and motion data. In order to produce convincing results, all aspects
(events, features, emotion) of motion and music should be considered in a unified
mapping framework.
• Global structure of music and motion is ignored in the matching process
Music and dance motion sequences consist of several patterns or themes which change and repeat over the performance. This global structure of music and dance motion provides critically important information that must be considered if convincing dance motion is to be generated. Current efforts, however, focus exclusively on the local matching process, ignoring the global structure of the music and motion.
Chapter 3 -- Dance Performance Generation
As we mentioned in Chapter 1, the problem addressed in this dissertation is the
automatic generation of human dance performances based on an arbitrary musical input.
The proposed solution to this problem consists of the following four components:
• Music analysis
• Motion analysis
• Motion graph construction
• Matching between musical contents and motion contents
We begin with an overview of the proposed solution and then discuss each component.
3.1 Solution Overview
A musical analysis process as well as a motion analysis process is first carried out to
extract useful information from both the input musical data and a set of dance motion
clips in a motion capture database consisting of a variety of recorded dance motions. The
musical analysis process extracts thirty musical features including beat, pitch, and timbre information. A motion analysis process is then carried out to extract thirty-seven motion features consisting of postural and dynamic properties of the motion. Finally, a mapping is
performed between the musical and motion feature sets using a novel mapping algorithm
that will be described below.
As depicted in figure 3.1, the proposed approach constructs a motion graph consisting of human dance motion capture data in a pre-processing phase. When the motion graph is constructed, motion feature vectors are calculated for each motion segment and stored in the corresponding nodes of the graph. At run time, a musical piece is fed into the system and a music analysis process obtains musical feature vectors by analyzing the input musical signal. Finally, a matching process between the musical feature vectors and motion feature vectors is performed by searching the motion graph to select an optimal path whose motion feature vector sequence best matches the sequence of the input musical piece. The search is performed based on a set of matching criteria described in section 3.4.1.

Figure 3.1: Work Flow of Dance Motion Generation System
[Diagram: Music Data (.wav, .mp3) feeds Music Analysis (Musical Feature Extraction), producing Musical Feature Vectors; the Motion Capture Database feeds Motion Graph Construction (Motion Segmentation, Motion Feature Extraction, Motion Transition), producing Motion Feature Vectors; both feed the Matching Algorithm (Matching Progressions of Patterns, Correlating Relative Sensations).]
3.2 Music Analysis
Much research on techniques that extract useful features from sound signals has been
done in the fields of speech signal processing, non-speech sound signal processing, and
musical signal processing. Linear Prediction Coefficients (LPC) and Mel Frequency
Cepstral Coefficients (MFCC) are used in speech synthesis and recognition and they can
also be useful features in representing musical signals [DAV80]. Sound features related
to the spectral shapes such as centroid, rolloff and flux have been used to perform content
based audio classification and retrieval. These features are also useful for representing the
timbre information of musical signals [WOL96]. Research on beat and tempo extraction for analyzing the rhythmic structure of music has also been done; beat tracking has been performed by estimating peaks and their strengths using autocorrelation techniques. Tzanetakis worked on musical feature extraction and genre classification of music, using thirty musical features in the analysis process to perform genre classification [TZA02]. In this research, we used the set of musical features defined in Tzanetakis's work.
The musical analysis process extracts thirty separate features categorized into three
parts: beat, pitch and timbre information. Musical analysis is carried out on each musical
segment. The size of each segment is based on musical beat information where each
segment consists of sixteen beats. To produce a perceptually appropriate mapping between music and motion, the extracted musical features have to correlate well with the listener's perception of the music. We used observational analysis to validate this correlation.
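A minimal sketch of this beat-based segmentation, assuming the beat times have already been detected, simply groups consecutive beats into sixteen-beat analysis segments:

```python
import numpy as np

def sixteen_beat_segments(beat_times):
    """Group detected beat times into sixteen-beat analysis segments,
    returning (start, end) times for each complete segment."""
    n_full = len(beat_times) // 16
    return [(beat_times[i * 16], beat_times[(i + 1) * 16 - 1])
            for i in range(n_full)]

beats = np.arange(0.0, 60.0, 0.5)        # beats at 120 bpm for one minute
print(sixteen_beat_segments(beats)[:2])  # [(0.0, 7.5), (8.0, 15.5)]
```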
3.2.1 Rhythm Information
Rhythm information such as the estimate of the main beat and its strength, the
regularity of the rhythm, the relation of the main beat to the subbeats, and the relative
strength of subbeats to the main beat are extracted to represent the rhythmic structure of
the music by using a beat detection algorithm. The input musical signal is decomposed
into frequency bands using filters and the signal’s envelope is extracted. Periodicity is
detected based on the extracted envelope using an autocorrelation algorithm. The
dominant peaks of the autocorrelation function correspond to the various periodicities of
the signal's envelope [TZA02]. These peaks are accumulated over the whole sound segment into a beat histogram where each bin corresponds to a peak lag. Table 3.1 shows the musical features related to rhythm information.
Table 3.1: Musical Features for Rhythm Information [TZA02]
A0, A1: Relative amplitudes (divided by the sum of amplitudes) of the first and second histogram peaks
RA: Ratio of the amplitude of the second peak to the amplitude of the first peak
P1, P2: Periods of the first and second peaks in bpm
SUM: Overall sum of the histogram (an indication of beat strength)
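The sketch below computes toy versions of these features from an onset-strength envelope using plain autocorrelation over a plausible tempo range. It is an assumption-laden simplification: the actual analysis first decomposes the signal into frequency bands, and the envelope rate and tempo bounds here are arbitrary.

```python
import numpy as np

def beat_features(envelope, sr):
    """Toy beat histogram features: the two strongest envelope
    periodicities yield values analogous to A0, A1, RA, P1, P2, SUM."""
    ac = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
    lags = np.arange(len(ac))
    lo, hi = int(sr * 60 / 200), int(sr * 60 / 40)  # 40-200 bpm lag range
    ac, lags = ac[lo:hi], lags[lo:hi]
    order = np.argsort(ac)[::-1]                    # strongest peaks first
    p1, p2 = lags[order[0]], lags[order[1]]
    total = ac.sum()
    return {"A0": ac[order[0]] / total, "A1": ac[order[1]] / total,
            "RA": ac[order[1]] / ac[order[0]],
            "P1": 60 * sr / p1, "P2": 60 * sr / p2, "SUM": total}

sr = 200                                # envelope sample rate in Hz
t = np.arange(10 * sr) / sr
pulses = (np.sin(2 * np.pi * 2 * t) > 0.99).astype(float)  # 120 bpm pulses
print(beat_features(pulses, sr))        # P1 ~ 120 bpm, P2 ~ 60 bpm
```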
3.2.2 Pitch Information
The procedure for computing pitch information from the input musical data is similar
to that of computing beat information. Both of the procedures are based on an
autocorrelation technique. The difference between the two procedures is that the pitch
detection algorithm analyzes shorter time windows that correspond to human pitch
perception. In this effort we use a multiple pitch detection algorithm described in
[TOL00]. In this algorithm, the signal is decomposed into two frequency bands (below
and above 1000 Hz) and amplitude envelopes are extracted for each frequency band. An
enhanced autocorrelation function is then computed so that the effect of integer multiples of the peak frequencies on multiple pitch detection is reduced [TZA02]. The features computed to represent pitch content are shown in table 3.2.
Table 3.2: Musical Features for Pitch Information [TZA02]
FA0: Amplitude of the maximum peak of the folded histogram. This corresponds to the most dominant pitch class of the song. For tonal music this peak will typically correspond to the tonic or dominant chord. This peak will be higher for songs that do not have many harmonic changes.
UP0: Period of the maximum peak of the unfolded histogram. This corresponds to the octave range of the dominant musical pitch of the song.
FP0: Period of the maximum peak of the folded histogram. This corresponds to the main pitch class of the song.
IPO1: Pitch interval between the two most prominent peaks of the folded histogram. This corresponds to the main tonal interval relation. For pieces with a simple harmonic structure this feature will have value 1 or -1, corresponding to a fifth or fourth interval (tonic-dominant).
SUM: The overall sum of the histogram. This feature is a measure of the strength of the pitch detection.
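The folded and unfolded histograms can be sketched as below, under the simplifying assumption that pitches have already been detected and quantized to MIDI note numbers (the actual system derives them from audio with the enhanced autocorrelation method above). FA0, UP0, and FP0 then fall directly out of the two histograms.

```python
import numpy as np

def pitch_histogram_features(midi_notes):
    """Toy FA0, UP0, FP0 from detected MIDI note numbers: the unfolded
    histogram keeps octave information, the folded one collapses notes
    into twelve pitch classes."""
    unfolded = np.bincount(midi_notes, minlength=128)
    folded = np.array([unfolded[c::12].sum() for c in range(12)])
    return {"FA0": folded.max() / folded.sum(),  # dominant pitch class strength
            "UP0": int(unfolded.argmax()),       # dominant pitch with octave
            "FP0": int(folded.argmax())}         # dominant pitch class (0 = C)

notes = np.array([60, 60, 64, 67, 60, 72, 67])  # C-major-flavored toy input
print(pitch_histogram_features(notes))          # FA0 ~ 0.57, UP0 60, FP0 0
```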
3.2.3 Timbre Information
Nineteen musical features which represent perceptual attributes of the musical timbre
of the input musical data are extracted. Those features include spectral shape features
such as spectral centroid, spectral rolloff, spectral flux, and Mel-Frequency Cepstral
Coefficients which are known to be useful in representing musical timbre. RMS and time
domain zero crossings are also extracted. Table 3.3 shows descriptions of musical
features for timbre information except for MFCC [TZA02].
Mel-Frequency Cepstral Coefficients (MFCC) have been widely used in speech recognition systems. It is known that the first five MFCC coefficients are useful in representing the characteristics of musical signals. MFCC coefficients are obtained by grouping and smoothing the FFT bins of the magnitude spectrum according to the perceptually motivated Mel-frequency scaling; a Discrete Cosine Transform is then performed to decorrelate the resulting feature vectors [TZA02].
Table 3.3: Spectral Shape Features and Other Features [TZA02]
Spectral centroid: A measure of the brightness of the musical segment
Spectral rolloff: A measure of the amount of the signal's energy concentrated in the lower frequencies
Spectral flux: A measure of the amount of local spectral change
RMS: A measure of the loudness of the signal
Zero crossings: A measure of the noisiness of the signal
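A per-frame sketch of these timbre features, computed directly from one windowed FFT frame, is given below. Spectral flux is omitted because it compares consecutive frames, and the 85% rolloff threshold is a common convention assumed here rather than taken from the dissertation.

```python
import numpy as np

def timbre_features(frame, sr):
    """Toy per-frame spectral centroid, rolloff, RMS, and zero crossings."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    centroid = (freqs * mag).sum() / mag.sum()             # brightness
    cum = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cum, 0.85 * cum[-1])]  # energy concentration
    rms = np.sqrt(np.mean(frame ** 2))                     # loudness
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2     # noisiness
    return centroid, rolloff, rms, zcr

sr = 22050
t = np.arange(1024) / sr
print(timbre_features(np.sin(2 * np.pi * 440 * t), sr))  # centroid near 440 Hz
```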
3.2.4 Musical Feature Vector
The analysis described above results in the generation of thirty musical features that
comprise the Musical Feature Vector. Table 3.4 shows all thirty musical features.