Perceptually Motivated Automatic Dance Motion Generation for
Music
By Jae Woo Kim
B.S. in Physics, February 1991, Hankuk University of Foreign Studies
M.S. in Computer Science, February 1993, Hankuk University of Foreign Studies
A Dissertation Submitted to
the Faculty of The School of Engineering and Applied Science
of the George Washington University in partial satisfaction of the requirements
for the degree of Doctor of Science
May 17, 2009
Dissertation directed by
James K. Hahn
Professor of Computer Science
The goal of the research is to develop a method to generate a dance performance that
is perceptually matched to a given musical piece. The proposed method extracts musical
features by analyzing input audio streams while also extracting motion features from
motion data. A mapping is then performed between the two feature spaces by matching
the progressions of musical patterns and dance motion patterns and correlating the feature
values between the two. Finally, the input music is automatically transformed into a
dance performance through a process of dance motion recombination based on the
derived mapping. The process results in the generation of “natural” or realistic dance
motion that approximates a human performance for a given musical piece.
1.1 Motivation
Synthesizing realistic human motion is one of the most important research topics in
computer animation. Various methods such as keyframing, inverse kinematics, and
dynamic simulation have been used to synthesize human behavior. More recently, with
the advent of motion capture technology, motion capture and editing techniques have
become widely used in realistic human motion synthesis.
Synchronization of sound with motion is another essential problem in animation because sound plays an important role in computer animation. The process of creating computer generated animation usually begins with the creation of a visual animated sequence, followed by a sound editing process in which perceptually appropriate sound effects are manually added to the animation sequence. The sound design process is tedious, time consuming, and requires a high level of expertise to produce convincing results. Automatic sound effect synchronization has therefore been a crucial issue within the computer animation research community as well as in industry.
When dealing with music and human dance motion, the synchronization between
music and motion becomes even more important because dance motion has a much
stronger linkage to music than any other type of motion. However, synchronizing music
to dance motion is a very difficult problem due to the intricate relationship that exists
between music and motion in a dance performance. To date, little research has been done
on the problem of synchronization between dance motion and music [CAR02].
The synchronization problem is further compounded by the complexity inherent in both musical and human motion data. Musical sound contains a wealth of information such as pitch, timbre, rhythm, and harmony. Human motion data, on the other hand, is multidimensional, and of all human motions (e.g., walking, jumping, running), dance motion is the most complex. This complexity makes it difficult to analyze the data and to explore the relationship between the musical and dance motion features.
The nature of how dance is created and performed creates additional challenges. A dance performance is carefully choreographed by expert choreographers based on a given musical piece. This process demands a high degree of intelligence as well as much expertise, education, and experience. The problem is therefore not amenable to analytic or algorithmic models because of the aesthetic, perceptual, and psychological aspects involved.
In this dissertation, we develop a novel method to automatically generate synchronized dance motion that is perceptually matched to a given musical piece so that the resulting dance performance is convincing. The approach suggested in this dissertation can be applied to a number of application areas including film, TV commercials, virtual reality applications, computer games, and entertainment systems.

1.2 Problem Domain

The goal of this dissertation is to develop a solution to the problem of automatic dance motion generation where an arbitrary piece of music is an input into a system resulting in an animated dance performance. This approach mimics a choreographer's process where a musical piece is auditioned and analyzed to inspire the creation of a dance performance (figure 1.1).

Figure 1.1: Problem definition

The problem addressed by this work is inherently multidisciplinary, spanning a number of research areas (table 1.1).
Table 1.1: Relevant Research Areas

• Motion synthesis: generating human figure animation from given musical cues.
• Music visualization: rendering a given musical performance using visual constructs.
• Motion graph search: searching a motion graph constructed solely from human dance motion segments for a motion sequence that perceptually matches a given musical piece.
• Motion retrieval: retrieving a motion clip that matches the input musical cues.
• Music analysis and motion analysis: extracting musical and motion features that best describe the properties of music and motion and are useful in matching music to motion.
1.3 Proposed Solution
The proposed solution to the problem of dance motion generation consists of four
components – 1) music analysis, 2) motion analysis, 3) motion graph construction, and 4)
matching between musical and motion features. Musical features as well as motion
features are extracted from input music and a database of dance motion clips. Feature
vectors are then constructed from the extracted feature values. Musical features represent
the properties of the musical segments contained in the input music. Motion features
represent the postural and dynamic properties of the motion segments in motion
sequences. A matching process between musical features and motion features is
performed to create a perceptually matched dance performance for the given music
through a process of dance motion recombination; figure 1.2 depicts an overview of the
approach. The problem can be formulated as a search problem where a motion database
is searched for a perceptually optimal sequence of motion segments that can be
recombined into a dance performance based on an input musical piece.
Figure 1.2: Solution Overview
[Diagram: Music Data (.wav or .mp3) and a Motion Database feed Music Analysis and Motion Analysis, whose Musical Features and Motion Features meet in a Matching step.]
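To make this search formulation concrete, the sketch below mocks all four components in Python: feature extraction is stood in for by random vectors, the motion graph is fully connected, and matching is a greedy nearest-neighbor walk. Every name and the Euclidean matching criterion here are illustrative assumptions; the actual feature sets and matching criteria are described in Chapter 3.

```python
import numpy as np

# Toy sketch of the four-stage pipeline. Real feature extraction from
# audio and motion capture data is mocked with random vectors here.
rng = np.random.default_rng(0)

def analyze_music(n_segments, n_features=30):
    """Stand-in for music analysis: one feature vector per musical segment."""
    return rng.normal(size=(n_segments, n_features))

def analyze_motion(n_clips, n_features=30):
    """Stand-in for motion analysis: one feature vector per motion clip."""
    return rng.normal(size=(n_clips, n_features))

def build_motion_graph(n_clips):
    """Toy motion graph: every clip may transition to every other clip."""
    return {i: [j for j in range(n_clips) if j != i] for i in range(n_clips)}

def search_best_path(graph, music, motion):
    """Greedy search: for each musical segment, step to the neighboring
    motion clip whose feature vector is closest to the segment's."""
    path = [int(np.argmin(np.linalg.norm(motion - music[0], axis=1)))]
    for segment in music[1:]:
        neighbors = graph[path[-1]]
        dists = [np.linalg.norm(motion[j] - segment) for j in neighbors]
        path.append(neighbors[int(np.argmin(dists))])
    return path

music = analyze_music(8)          # 8 musical segments
motion = analyze_motion(20)       # 20 dance motion clips
graph = build_motion_graph(20)
print(search_best_path(graph, music, motion))  # indices of recombined clips
```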
Both music and dance performances can be considered to be sequences composed of
a finite number of patterns. Musical performances generally consist of a set of patterns or
themes that repeat throughout the musical piece. Similarly a progression of patterns or
themes exists in a dance performance. Current approaches to this problem have only
considered the local properties of musical and motion segments in finding an optimal
match. The global or thematic structures of the musical performance and the reconstructed dance motion are ignored. Consequently, dance motion produced using current approaches is not convincing.
In this dissertation, we propose a novel motion-to-music matching method that extracts thirty musical features from musical data as well as thirty-seven motion features from motion data. A matching process is then performed between the two
feature spaces considering the correspondence of the relative changes in both feature
spaces and the correlations between musical and motion features. Similarity matrices are
introduced to match the amounts of relative change in both feature spaces, and correlation coefficients are used to establish the correlations between musical and motion features by measuring the strength of correlation between each pair of features. By doing this, the progressions of musical and dance motion patterns, and the perceptual changes between consecutive musical and motion segments, are matched.
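A minimal numeric sketch of these two ingredients follows, assuming per-segment feature matrices are already in hand. The distance-based similarity matrix and plain Pearson correlations used here are simplifications for illustration, not the dissertation's exact formulation.

```python
import numpy as np

def self_similarity(features):
    """Similarity matrix: pairwise distances between segment feature
    vectors. Comparable structure in the music and motion matrices
    indicates matching progressions of patterns."""
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def relative_changes(features):
    """Magnitude of change between consecutive segments in a feature space."""
    return np.linalg.norm(np.diff(features, axis=0), axis=1)

def feature_correlations(music, motion):
    """Pearson correlation between every (musical, motion) feature pair,
    assuming both sequences contain the same number of segments."""
    m = (music - music.mean(0)) / music.std(0)
    q = (motion - motion.mean(0)) / motion.std(0)
    return m.T @ q / len(music)   # shape (n_music_feats, n_motion_feats)

rng = np.random.default_rng(1)
music = rng.normal(size=(16, 30))     # 16 segments x 30 musical features
motion = rng.normal(size=(16, 37))    # 16 segments x 37 motion features
print(self_similarity(music).shape)               # (16, 16)
print(feature_correlations(music, motion).shape)  # (30, 37)
```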
To demonstrate the effectiveness of the proposed approach, some measure of
perceived quality will have to be developed. Evaluating the perceived quality of an
animated dance performance using an analytic or algorithmic solution, however, is not
feasible due to the complexity of articulated human figure animation as well as the
perceptual, psychological, and aesthetic aspects that contribute to the perceived quality of
the generated dance performance. We therefore designed and carried out a user opinion
study to assess the perceived quality of the proposed approach.
1.4 Original Contributions
This dissertation will make several original contributions in the area of human figure
animation:
• A novel approach has been developed to address the problem of automatic dance
motion generation using a thematic analysis of music and dance motion.
• The perceptual relationship between musical and motion features has been
explored and established to match musical contents and motion contents.
• A set of motion features useful in describing the postural and dynamic properties of human motion has been developed. These features are useful not only for motion to music matching but also for many other applications such as motion analysis and motion retrieval.
• The proposed approach can be used in a number of application areas requiring a perceptually optimal mapping between two different media types. Examples include abstract animation, movie clip generation, texture generation from musical data, and automatic music or sound effect generation from motion data.
1.5 Document Organization
The remainder of this document is organized as follows: Chapter 2 reviews previous
work in the various related domains including music-to-motion matching, human motion
synthesis, music visualization, and speech animation. Chapter 3 describes the proposed approach in detail. Chapter 4 discusses the user opinion study of the proposed approach to motion-to-music synchronization. Chapter 5 concludes the dissertation and discusses future work.
Chapter 2 -- Related Work
In this chapter, we review previous work related to this dissertation. The related
research areas are divided into the following four categories: 1) human motion synthesis,
2) music visualization, 3) speech animation, and 4) dance motion generation from music.
Major issues regarding previous approaches in each area are described along with the
advantages and limitations of each approach.
2.1 Human Motion Synthesis
Human motion synthesis or articulated figure animation has been an important topic
in computer animation research. It has many application areas and much research has
been performed on creating animation of human behaviors. Some research efforts have focused on the appearance of the generated human motion, while others have focused on the physical correctness of the movements of the human body, depending on the application area
or requirements of the problem. Research in human motion synthesis can be divided into
several categories – keyframing, inverse kinematics, and dynamic simulation – in terms
of the solution methods used in generating animation.
Keyframing approaches have been widely used in classical animation. Animators
specify the keyframes of the animation and the in-between frames are generated
automatically using interpolation methods. The merit of this approach is that it gives animators full control over the human characters' movements. However, this approach requires much time and effort as well as a high degree of expertise. The inverse
kinematics approach allows animators to specify only the positions of end-effectors, such
as hands and feet, and all the joint angles are automatically obtained by applying the
inverse kinematics method to define the pose of each frame in the animation sequence.
Additional constraints can be defined to address the ambiguity that frequently occurs when using this approach; the ambiguity can be resolved by optimization methods using the given constraints. Inverse kinematics approaches require less time and effort than keyframing [CHU99], but the procedure is nonetheless tedious and time-consuming. Finally, dynamic simulation approaches solve dynamic constraint formulations to automatically produce physically correct motion. While this approach is the least labor intensive, it is very difficult to specify the constraints necessary to produce a desired motion, and the computational cost is very high [HOD95][HOD97].
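As an illustration of the in-betweening that keyframing systems automate, the sketch below linearly interpolates joint angles between the two keyframes that bracket a query time. This is a deliberately simplified, assumption-laden version; production systems interpolate with splines and quaternions rather than linear Euler angles.

```python
import numpy as np

def inbetween(key_times, key_poses, t):
    """Linearly interpolate joint angles (one row of Euler angles per
    keyframe) between the two keyframes bracketing time t."""
    i = np.clip(np.searchsorted(key_times, t) - 1, 0, len(key_times) - 2)
    u = (t - key_times[i]) / (key_times[i + 1] - key_times[i])
    return (1 - u) * key_poses[i] + u * key_poses[i + 1]

key_times = np.array([0.0, 0.5, 1.0])         # keyframe times in seconds
key_poses = np.array([[ 0.0,  10.0],          # two joints, three keyframes
                      [45.0, -20.0],
                      [90.0,   0.0]])
print(inbetween(key_times, key_poses, 0.25))  # -> [22.5, -5.0]
```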
Since the advent of motion capture technology, approaches using motion capture data have been widely used in generating realistic human motion. Human body movements generated using this approach are highly realistic because the motion capture data is recorded from real human actors' movements [BOD97][MOL96][OBR00]. Motion capture data can be reused by modifying the
captured motion data using various techniques such as signal processing, space-time
constraints, and displacement mapping. Motion editing techniques modify the motion
capture data to meet user-specified requirements or environmental constraints while
keeping the original quality of the motion [BRU95][WIT95][GLE97] [GLE98]. Motion
graphs have been used to generate new sequences of motion by stitching many small
clips of motion data according to the specifications and constraints given by users and the
environment. User specifications can be given as a set of key character poses, a path traveled by the character, or a reference motion specified by the user and captured by a video camera [ARI02][KOV02][LEE02][LI02].
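The motion graph idea can be sketched as follows: clips are nodes, an edge is added wherever the end pose of one clip is close enough to the start pose of another to blend, and a walk over the graph stitches clips into a new sequence. The pose distance and threshold here are illustrative assumptions; practical systems compare windows of frames and blend across each transition.

```python
import numpy as np

rng = np.random.default_rng(2)
# 10 toy clips: 30 frames x 20 degrees of freedom of random "poses".
clips = [rng.normal(size=(30, 20)) for _ in range(10)]

def pose_distance(a, b):
    return float(np.linalg.norm(a - b))

THRESHOLD = 6.5  # arbitrary; chosen so some transitions exist for toy data

# Edge i -> j if clip i ends near where clip j begins.
edges = {i: [j for j in range(len(clips))
             if j != i and pose_distance(clips[i][-1], clips[j][0]) < THRESHOLD]
         for i in range(len(clips))}

# A random walk over the graph yields a new, longer motion sequence.
walk, node = [], 0
for _ in range(5):
    walk.append(node)
    if not edges[node]:
        break
    node = int(rng.choice(edges[node]))
print(walk)  # indices of clips to stitch together
```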
While a variety of approaches have been developed for the specification of desired motion in human motion synthesis systems, little work has been done on the generation of human motion from musical cues extracted from a musical performance [SHI06a][SHI06b].
2.2 Music Visualization
Music visualization is the process of rendering music graphically by analyzing and
visually mapping the properties of the musical sound. Music visualization can be grouped
into two categories: one is visualization for analysis of musical contents and the other is
visualization for artistic expression.
There are two approaches to visualizing musical content for analysis purposes: one visualizes direct musical data and the other visualizes interpreted musical data. Here, direct musical data refers to information extracted directly from the musical data, such as pitch or onset time. Interpreted musical data, on the other hand, denotes higher level information extracted from musical sound, such as tempo or key. Misra et al. developed methods of direct music visualization in which they displayed 2D waveforms or spectrograms where the x-axis represented time and the y-axis represented the primary values of interest [MIS05]. Interpreted music visualization renders static or animated
imagery to represent structural characteristics, tonal contexts, tempo and loudness
variations of the given musical data. Researchers have used a number of approaches in
order to represent the interpreted music, including a 2D grid, a sequence of translucent arches, a toroid representation, and moving dots [COO01].
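A direct visualization of the kind described by Misra et al. can be computed with a short-time Fourier transform. The sketch below is a minimal version under assumed window parameters, producing the magnitude spectrogram whose x-axis is time and whose y-axis is frequency.

```python
import numpy as np

def spectrogram(signal, win=1024, hop=512):
    """Magnitude spectrogram via a windowed FFT over overlapping frames."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, n_frames)

sr = 22050
t = np.arange(sr) / sr                  # one second of audio
tone = np.sin(2 * np.pi * 440 * t)      # a 440 Hz test tone
S = spectrogram(tone)
print(S.shape, S[:, 0].argmax())        # peak bin ~ 440 * 1024 / 22050 ~ 20
```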
Approaches to music visualization for artistic expression establish a mapping between musical and visual content and, in some cases, enable interactivity between a musician and the generated visualization. In these approaches, music is visualized through responsive video imagery, virtual character behavior, and responsive virtual
environments. Ox developed a system that allows users to navigate virtual landscapes
generated by assigning geometric and color data to characteristics found within input
musical streams [OX02]. Oliver et al. developed a virtual environment containing both
real and abstract elements where user vocalizations result in both auditory and visual
feedback [OLI97]. Levin and Lieberman developed a system which generates glyphs to augment a live performer's vocalizations [LEV04]. Wayne Lytle developed a number of three dimensional animations based on MIDI events. In his work, MIDI data controls and animates preposterous-looking instruments whose movements respond pleasingly to every sound generated from the MIDI data [LYT].
Various commercial media players such as Winamp, Microsoft Media Player, and RealPlayer have the ability to generate animated imagery based on a piece of recorded music. They visualize music based on basic musical features such as loudness and frequency content.
The problem addressed by this dissertation can be viewed as an approach to music visualization using a human character's dance performance. In this regard, the proposed approach is closely related to the problem of music visualization for artistic expression. However, as discussed in this section, much of the research effort to date has focused on the problem of music visualization for analysis of musical content, while the problem of visualization for artistic expression has not been well studied.
Unfortunately, due to the complexity of the problem, current approaches in music
visualization are not applicable to the problem of automatic dance performance
generation from music. The information extracted from musical performance for the
purpose of musical analysis or artistic visualization is not sufficient for the problem of
dance motion generation. Current approaches generally extract a small number of musical
features for the purposes of visualization. Global or structural information regarding the
musical composition is not extracted. Therefore, current approaches do not provide
sufficient information for the generation of realistic dance performance.
2.3 Speech Animation
Synchronizing the lip movement of an animated character with speech is an important
topic in human facial animation. This problem is comparable to the motion-to-music matching problem because both deal with generating animation for given audio signals. Research efforts on the synchronization of lip movement with speech are concerned with several issues: (1) how to represent the facial model, (2) how to control the facial model, (3) how to label visemes, or the visual configurations, and (4) how to generate the motion sequence [COH93][BRE97].
Generic 3D mesh models, 3D scans, real images, or hand-drawn images have been
used as human facial models in this area [PAR72][LEW91][GUI94][MOR91]. The
control parameters are defined depending on the facial models used. Examples of control
parameters are three dimensional deformations or the labels attributed to specific facial
locations. Target utterances are matched with corresponding phoneme labels with which
associated visual configurations have been assigned. Those associated visual
configurations are used to generate the animated images in the synthesis phase. The
phoneme labels can be assigned manually or automatically.
Keyframing methods and physics-based methods have been used in synchronizing lip
movement with speech and more recently machine learning methods have been used as
well. With keyframing methods, the animator specifies particular keyframes and the system generates the intermediate frames to produce animated images [PAR74][PEA86][COH90]. With physics-based methods, physically based models are used to determine the mouth movement for a given initial condition and a set of forces acting on the facial model. To do this the facial model must include a representation of the underlying facial muscles and skin [WAT87][LEE95]. In machine learning methods, systems are trained on recorded and labeled data and then used to synthesize new motions [BRA99][BRO94].
The approaches used in speech animation are not directly applicable to the problem of
dance motion generation because the objectives of the two problems are different. The
objective of speech animation is to generate correct mouth movement that can realize the
appropriate phonemic targets, while the objective of dance motion generation is to generate perceptually matched dance movements that are aesthetically satisfying. The phonemic labeling approach used in speech animation systems, which assigns a predefined visual configuration to each phoneme, is not applicable to dance motion generation because a given set of musical data can have multiple visual configurations that are perceptually well matched.
2.4 Synchronization between Music and Motion
There have been some recent efforts addressing the problem of music to motion
synchronization. Those efforts are focused on establishing a correlation between musical
and motion features that represent the perceptual properties of music and motion
respectively. These efforts have addressed the problem using two approaches: one maps music to motion, and the other maps motion to music. Each approach is applicable to different application areas, but whatever the approach, the essence of the problem is the same.
Musical features that have been used in this research include both MIDI (Musical
Instrument Digital Interface) data as well as audio signal data. MIDI is a standardized
protocol for communication between electronic music devices as well as between those
devices and host computers. MIDI data contains information on event messages such as
the pitch and intensity of musical notes, control signals for sound generation parameters
such as volume, vibrato and panning, as well as clock signals to set the tempo. Because
MIDI data already contains much information about the music while audio signal data
does not, it is generally easier to extract musical features from MIDI data. Much of the
music to motion synchronization research has therefore utilized MIDI data as input due to
its simplicity. Most music, however, is not stored as MIDI data, but as either an analog or
digital signal. This limits the usefulness of approaches based on MIDI data. Approaches
using audio signal data are much more widely applicable.
We can divide research efforts in music to motion synchronization into the following
three categories: 1) Event-based matching, 2) Feature-based matching, and 3) Emotion-
based matching. In this section, we investigate these approaches and discuss their
strengths and limitations.
2.4.1 Event-Based Matching
Motion and music can be synchronized by matching events extracted from musical
data with events extracted from motion data. An event in the musical domain is defined as a point in time where a significant perceptual change occurs. Events in the musical domain include dominant drum beats and peak points in amplitude, while events in the motion domain include motion beats, footsteps, arm swings, sudden pauses, and jumps [ALA05][KIM03][LEE05][SAU07].
Sauer and Yang suggested a system for creating an animation that is synchronized to input music by matching musical events such as beat positions and dynamics (e.g., peaks and valleys of amplitude) with predefined actions. They also developed a script language used to define the mapping from musical events to motion events; users can easily define the mapping by editing a script file [SAU07].
Although their script allows users the flexibility to define a mapping scheme, their system has several limitations. First, the variety of movements their system can generate is limited by a small number of predefined movements. Second, the dance performance does not look realistic because it is generated by combining several simple actions. Lastly, the matching is undesirable because it is done manually, ignoring the correlation between music and dance.
Kim et al. suggested an approach for synthesizing synchronized motion from a set of
motion clips using an event matching approach for synchronization. In their approach,
new motion is synthesized by traversing a movement transition graph that is constructed
based on basic movements and acceptable transitions between them. Here, motion
capture data is segmented into small units of motion or basic movements based on motion
beat information extracted from the motion data [KIM03].
In order to synchronize motion to music, musical beats as well as motion beats are
first extracted followed by an incremental time-warping process that aligns the motion
beats and the musical beats so that the synthesized motion is synchronized to the input
sound. While this approach has demonstrated promising results for some classes of periodic motion, it does not perform well in generating novel choreography. This is due to the insufficient set of musical and motion features used in synchronizing motion with music.
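The alignment step can be sketched as a piecewise-linear time warp that maps each detected motion beat onto the corresponding musical beat and remaps every frame timestamp accordingly. This simplified batch version stands in for the incremental time-warping of Kim et al.; the beat times below are made up for illustration.

```python
import numpy as np

def warp_times(frame_times, motion_beats, music_beats):
    """Piecewise-linear warp sending each motion beat to its musical beat;
    frame timestamps are remapped so the motion lands on the music's beats."""
    return np.interp(frame_times, motion_beats, music_beats)

motion_beats = np.array([0.0, 0.9, 2.1, 3.0])  # beats detected in the motion
music_beats  = np.array([0.0, 1.0, 2.0, 3.0])  # beats detected in the music
frames = np.linspace(0.0, 3.0, 7)              # original frame timestamps
print(warp_times(frames, motion_beats, music_beats))
```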
Alankus et al. suggested an approach for synthesizing dance motion utilizing a
database of dance motion clips. In this approach new dance motion is synthesized
through a beat matching process where motion clips of dance moves are recombined in
order to match musical beats that are extracted from input music. Distinct dance moves
are delineated by a motion frame where a significant change in the direction of the
movement of some body part occurs.
In this approach, dance motion is synthesized by traversing a transition graph
consisting of dance motion segments – called dance figures – and acceptable transitions
between them. A matching algorithm traverses the transition graph using two different
algorithms – a fast greedy method and a genetic algorithm. The motion segments are time
warped so that they are aligned with musical beats extracted through a music analysis
process [ALA05].
Although this approach generates perceptually matched dance motions, it does not
create desirable results from a choreographic or aesthetic point of view. The small
number of musical and motion features used in this approach does not sufficiently
describe the properties of the music and dance motion.
Lee and Lee suggested an approach to generate background music from an animation
by matching feature points extracted from musical data and the corresponding feature
points extracted from motion data. Their approach is different from the above mentioned
approaches in that it maps an input motion sequence into a corresponding musical piece
[LEE05].
In this approach, an analysis process is carried out to extract feature points from both
music and motion sequences. Examples of feature points from music include: local peak
points of note volume and points where a note is played near a quarter note (or a note
played on the beat). Examples of feature points from motion include: foot falls and the
transition points of arm swings. Scores are assigned to feature points emphasizing the
more important features.
Various feature points from each source (music or motion) are merged together to
give an overall representation of the source and then the merged feature points of music
are aligned with the merged feature points of motion by time-scaling the music and time-
warping the motion using a dynamic programming algorithm.
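The dynamic-programming step can be sketched with a classic DTW-style recurrence over the two merged sequences of feature points. This toy version aligns bare timestamps and returns only the total cost, omitting the per-point importance scores and the music time-scaling of the actual method.

```python
import numpy as np

def alignment_cost(music_pts, motion_pts):
    """DTW-style dynamic programming over two feature-point sequences;
    D[i, j] is the best cost of aligning the first i and j points."""
    n, m = len(music_pts), len(motion_pts)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(music_pts[i - 1] - motion_pts[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

music_pts  = [0.0, 1.0, 2.0, 3.1]   # e.g., times of on-beat notes
motion_pts = [0.1, 0.9, 2.0, 3.0]   # e.g., times of footfalls
print(alignment_cost(music_pts, motion_pts))  # -> 0.3
```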
They also suggested a novel data structure – a Music Graph – which is used to
synthesize a music sequence for the given motion data. A Music Graph consists of
musical segments (nodes) and the acceptable transitions among them. Music is generated
by traversing the Music Graph based on the motion features of a given animation.
2.4.2 Feature-Based Matching
Another approach that has been used for music-to-motion synchronization is based on
the features of music and motion. Musical features and motion features are parameters
that describe the perceptual properties of music and motion. Events, on the other hand, represent the occurrence of changes in features in the time domain. One typical example of this approach is matching musical intensity with motion intensity [DOB03][SHI06a][SHI06b].
Dobrian and Bevilacqua suggested a method of mapping motion to music by dividing
an input motion sequence into small pieces of motion segments. Motion features such as
the time taken to traverse the segment, the total distance traversed over a segment, the
average speed while traveling along a segment, and the curviness of the path, are
extracted from each motion segment. Those features are then transformed into either MIDI parameters or control parameters for signal processing, based on a user specified mapping, in order to generate a musical sound track [DOB03]. Although this approach provides an intuitive and expressive way to generate a musical sound track, it does not provide a feasible approach for generating dance motion for music due to its manual mapping process between musical and motion features.
Shiratori et al. suggested a feature based method for synthesizing synchronized dance motion based on the rhythm and intensity of motion and music. The key idea here is that musical rhythm has a strong correlation with motion rhythm, while musical intensity, which represents the musical mood, has a strong correlation with motion intensity, which represents the strength of motion [SHI06a][SHI06b].
Their approach synthesizes a dance performance by searching a motion graph of
dance motion clips to select the best matching motion sequence in terms of rhythm and
intensity. This approach is predicated on the idea that rhythm and intensity provide a
sufficient level of information to generate both choreographically as well as aesthetically
convincing dance motion. However, given the complex and dynamic structure of music
and dance, it is unlikely that rhythm and intensity alone are sufficient for this purpose.
2.4.3 Emotion-based Matching
The psychological responses of audiences while auditioning music and dance performances are among the crucial issues in music and motion synchronization research. The emotional assessment of music and motion gestures has been addressed in the fields of music psychology, emotional intelligence, and human computer interaction (HCI). There
are also research efforts in music and motion synchronization that consider the emotional
responses of audiences to music and human motion [CAR02].
Cardle et al. suggested an approach for imbuing generated human motion with
emotional content based on input music. Their approach extracts musical features from
both MIDI data and the corresponding analog audio rendition to extract perceptually
significant musical features. The musical features are used to guide the motion editing
process to synthesize an animation synchronized with the given music [CAR02].
The animator interactively establishes a mapping between musical features and motion editing filters such that any musical feature can be mapped to any motion feature. The motion editing techniques used in their system include motion signal functions that can modify the generated motion to imbue it with a specific (happy, sad, angry) emotional mood.
Morioka et al. suggested an algorithm for synthesizing music that can appropriately express the emotion in a dance. Their algorithm extracts the emotional content from input dance motion and from a library of musical pieces. Music is then synthesized by selecting the musical piece that is emotionally best matched to the emotional content extracted from the dance motion. An eight state emotional space is used that includes solemn, sad, tender, serene, light, happy, exhilarated, and emphatic emotional states [MOR04].
This approach can achieve a convincing, emotionally inspired mapping between dance motion and a corresponding musical performance. However, because this approach only considers emotional state while ignoring all other aspects of dance motion and music (features and events), it is generally not useful for the dance performance generation problem.
2.4.4 Limitations of Previous Approaches
The research efforts that we investigated in this section have several important
limitations when considering the problem of automatic dance motion generation. They
are as follows:
• Not enough information from music and motion is used in the matching process
The approaches discussed in this section all utilize a limited number of features
extracted from human motion and music. The lack of richness in the feature set used
in those efforts limits the effectiveness of the mapping processes because not enough
information is used and thus the mappings are generally not convincing.
• No unified solution framework has been developed
Each approach focuses on one aspect of the motion to music matching problem. For
example, one approach only considers matching events detected from music and
motion while other approaches consider only matching features or emotional states
of the music and motion data. In order to produce convincing results, all aspects
(events, features, emotion) of motion and music should be considered in a unified
mapping framework.
• Global structure of music and motion is ignored in the matching process
Music and dance motion sequences consist of several patterns or themes which change and repeat over the performance. This global structure of music and dance motion provides critically important information that must be considered if convincing dance motion is to be generated. Current efforts, however, focus exclusively on the local matching process, ignoring the global structure of the music and motion.
Chapter 3 -- Dance Performance Generation
As we mentioned in Chapter 1, the problem addressed in this dissertation is the
automatic generation of human dance performances based on an arbitrary musical input.
The proposed solution to this problem consists of the following four components:
• Music analysis
• Motion analysis
• Motion graph construction
• Matching between musical contents and motion contents
We begin with an overview of the proposed solution and then discuss each component.
3.1 Solution Overview
A musical analysis process as well as a motion analysis process is first carried out to
extract useful information from both the input musical data and a set of dance motion
clips in a motion capture database consisting of a variety of recorded dance motions. The
musical analysis process extracts thirty musical features including beat, pitch, and timbre information. A motion analysis process is then carried out to extract thirty-seven motion features consisting of postural and dynamic properties of the motion. Finally, a mapping is
performed between the musical and motion feature sets using a novel mapping algorithm
that will be described below.
As depicted in figure 3.1, the proposed approach constructs a motion graph consisting of human dance motion capture data in a pre-processing phase. When the motion graph is constructed, motion feature vectors are calculated for each motion segment and stored in the corresponding nodes of the graph. At run time, a musical piece is fed into the system and a music analysis process obtains musical feature vectors by analyzing the input musical signal. Finally, a matching process between the musical feature vectors and motion feature vectors is performed by searching the motion graph to select an optimal path whose motion feature vector sequence best matches the sequence of the input musical piece. The search is performed based on a set of matching criteria described in section 3.4.1.

Figure 3.1: Work Flow of Dance Motion Generation System
[Diagram: Music Data (.wav, .mp3) feeds Music Analysis (Musical Feature Extraction), producing Musical Feature Vectors; the Motion Capture Database feeds Motion Graph Construction (Motion Segmentation, Motion Feature Extraction, Motion Transition), producing Motion Feature Vectors; both feed the Matching Algorithm (Matching Progressions of Patterns, Correlating Relative Sensations).]
3.2 Music Analysis
Much research on techniques that extract useful features from sound signals has been
done in the fields of speech signal processing, non-speech sound signal processing, and
musical signal processing. Linear Prediction Coefficients (LPC) and Mel Frequency
Cepstral Coefficients (MFCC) are used in speech synthesis and recognition and they can
also be useful features in representing musical signals [DAV80]. Sound features related
to the spectral shapes such as centroid, rolloff and flux have been used to perform content
based audio classification and retrieval. These features are also useful for representing the
timbre information of musical signals [WOL96]. Research on beat and tempo extraction for analyzing the rhythmic structure of music has also been done; beat tracking has been performed by estimating peaks and their strengths using autocorrelation techniques. Tzanetakis worked on musical feature extraction and genre classification of music, using thirty musical features in the analysis process to perform genre classification [TZA02]. In this research, we used the set of musical features defined in Tzanetakis's work.
The musical analysis process extracts thirty separate features categorized into three
parts: beat, pitch and timbre information. Musical analysis is carried out on each musical
segment. The size of each segment is based on musical beat information where each
segment consists of sixteen beats. To produce a perceptually appropriate mapping between music and motion, the extracted musical features have to correlate well with the listener's perception of the music. We used observational analysis to validate this correlation.
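A minimal sketch of this beat-based segmentation, assuming the beat times have already been detected, simply groups consecutive beats into sixteen-beat analysis segments:

```python
import numpy as np

def sixteen_beat_segments(beat_times):
    """Group detected beat times into sixteen-beat analysis segments,
    returning (start, end) times for each complete segment."""
    n_full = len(beat_times) // 16
    return [(beat_times[i * 16], beat_times[(i + 1) * 16 - 1])
            for i in range(n_full)]

beats = np.arange(0.0, 60.0, 0.5)        # beats at 120 bpm for one minute
print(sixteen_beat_segments(beats)[:2])  # [(0.0, 7.5), (8.0, 15.5)]
```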
3.2.1 Rhythm Information
Rhythm information such as the estimate of the main beat and its strength, the
regularity of the rhythm, the relation of the main beat to the subbeats, and the relative
strength of subbeats to the main beat are extracted to represent the rhythmic structure of
the music by using a beat detection algorithm. The input musical signal is decomposed
into frequency bands using filters and the signal’s envelope is extracted. Periodicity is
detected based on the extracted envelope using an autocorrelation algorithm. The
dominant peaks of the autocorrelation function correspond to the various periodicities of
the signal's envelope [TZA02]. These peaks are accumulated over the whole sound segment into a beat histogram where each bin corresponds to a peak lag. Table 3.1 shows the musical features related to rhythm information.
Table 3.1: Musical Features for Rhythm Information [TZA02]
A0, A1: Relative amplitudes (divided by the sum of amplitudes) of the first and second histogram peaks
RA: Ratio of the amplitude of the second peak to the amplitude of the first peak
P1, P2: Periods of the first and second peaks in bpm
SUM: Overall sum of the histogram (an indication of beat strength)
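The sketch below computes toy versions of these features from an onset-strength envelope using plain autocorrelation over a plausible tempo range. It is an assumption-laden simplification: the actual analysis first decomposes the signal into frequency bands, and the envelope rate and tempo bounds here are arbitrary.

```python
import numpy as np

def beat_features(envelope, sr):
    """Toy beat histogram features: the two strongest envelope
    periodicities yield values analogous to A0, A1, RA, P1, P2, SUM."""
    ac = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
    lags = np.arange(len(ac))
    lo, hi = int(sr * 60 / 200), int(sr * 60 / 40)  # 40-200 bpm lag range
    ac, lags = ac[lo:hi], lags[lo:hi]
    order = np.argsort(ac)[::-1]                    # strongest peaks first
    p1, p2 = lags[order[0]], lags[order[1]]
    total = ac.sum()
    return {"A0": ac[order[0]] / total, "A1": ac[order[1]] / total,
            "RA": ac[order[1]] / ac[order[0]],
            "P1": 60 * sr / p1, "P2": 60 * sr / p2, "SUM": total}

sr = 200                                # envelope sample rate in Hz
t = np.arange(10 * sr) / sr
pulses = (np.sin(2 * np.pi * 2 * t) > 0.99).astype(float)  # 120 bpm pulses
print(beat_features(pulses, sr))        # P1 ~ 120 bpm, P2 ~ 60 bpm
```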
3.2.2 Pitch Information
The procedure for computing pitch information from the input musical data is similar
to that of computing beat information. Both of the procedures are based on an
autocorrelation technique. The difference between the two procedures is that the pitch
detection algorithm analyzes shorter time windows that correspond to human pitch
perception. In this effort we use a multiple pitch detection algorithm described in
[TOL00]. In this algorithm, the signal is decomposed into two frequency bands (below
and above 1000 Hz) and amplitude envelopes are extracted for each frequency band. An
enhanced autocorrelation function is then computed so that the effect of integer multiples of the peak frequencies on multiple pitch detection is reduced [TZA02]. The features computed to represent pitch content are shown in table 3.2.
Table 3.2: Musical Features for Pitch Information [TZA02]
FA0: Amplitude of the maximum peak of the folded histogram. This corresponds to the most dominant pitch class of the song. For tonal music this peak will typically correspond to the tonic or dominant chord. This peak will be higher for songs that do not have many harmonic changes.
UP0: Period of the maximum peak of the unfolded histogram. This corresponds to the octave range of the dominant musical pitch of the song.
FP0: Period of the maximum peak of the folded histogram. This corresponds to the main pitch class of the song.
IPO1: Pitch interval between the two most prominent peaks of the folded histogram. This corresponds to the main tonal interval relation. For pieces with a simple harmonic structure this feature will have value 1 or -1, corresponding to a fifth or fourth interval (tonic-dominant).
SUM: The overall sum of the histogram. This feature is a measure of the strength of the pitch detection.
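The folded and unfolded histograms can be sketched as below, under the simplifying assumption that pitches have already been detected and quantized to MIDI note numbers (the actual system derives them from audio with the enhanced autocorrelation method above). FA0, UP0, and FP0 then fall directly out of the two histograms.

```python
import numpy as np

def pitch_histogram_features(midi_notes):
    """Toy FA0, UP0, FP0 from detected MIDI note numbers: the unfolded
    histogram keeps octave information, the folded one collapses notes
    into twelve pitch classes."""
    unfolded = np.bincount(midi_notes, minlength=128)
    folded = np.array([unfolded[c::12].sum() for c in range(12)])
    return {"FA0": folded.max() / folded.sum(),  # dominant pitch class strength
            "UP0": int(unfolded.argmax()),       # dominant pitch with octave
            "FP0": int(folded.argmax())}         # dominant pitch class (0 = C)

notes = np.array([60, 60, 64, 67, 60, 72, 67])  # C-major-flavored toy input
print(pitch_histogram_features(notes))          # FA0 ~ 0.57, UP0 60, FP0 0
```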
3.2.3 Timbre Information
Nineteen musical features which represent perceptual attributes of the musical timbre
of the input musical data are extracted. Those features include spectral shape features
such as spectral centroid, spectral rolloff, spectral flux, and Mel-Frequency Cepstral
Coefficients which are known to be useful in representing musical timbre. RMS and time
domain zero crossings are also extracted. Table 3.3 shows descriptions of musical
features for timbre information except for MFCC [TZA02].
Mel-Frequency Cepstral Coefficients (MFCC) have been widely used in speech recognition systems. It is known that the first five MFCC coefficients are useful in representing the characteristics of musical signals. MFCC coefficients are obtained by grouping and smoothing the FFT bins of the magnitude spectrum according to the perceptually motivated Mel-frequency scaling; a Discrete Cosine Transform is then performed to decorrelate the resulting feature vectors [TZA02].
Table 3.3: Spectral Shape Features and Other Features [TZA02]
Spectral centroid: A measure of the brightness of the musical segment
Spectral rolloff: A measure of the amount of the signal's energy concentrated in the lower frequencies
Spectral flux: A measure of the amount of local spectral change
RMS: A measure of the loudness of the signal
Zero crossings: A measure of the noisiness of the signal
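A per-frame sketch of these timbre features, computed directly from one windowed FFT frame, is given below. Spectral flux is omitted because it compares consecutive frames, and the 85% rolloff threshold is a common convention assumed here rather than taken from the dissertation.

```python
import numpy as np

def timbre_features(frame, sr):
    """Toy per-frame spectral centroid, rolloff, RMS, and zero crossings."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    centroid = (freqs * mag).sum() / mag.sum()             # brightness
    cum = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cum, 0.85 * cum[-1])]  # energy concentration
    rms = np.sqrt(np.mean(frame ** 2))                     # loudness
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2     # noisiness
    return centroid, rolloff, rms, zcr

sr = 22050
t = np.arange(1024) / sr
print(timbre_features(np.sin(2 * np.pi * 440 * t), sr))  # centroid near 440 Hz
```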
3.2.4 Musical Feature Vector
The analysis described above results in the generation of thirty musical features that
comprise the Musical Feature Vector. Table 3.4 shows all thirty musical features.