M. Emre Sargın, Ferda Ofli, Yelena Yasinnik, Oya Aran, Alexey Karpov, Stephen Wilson, Engin Erzin, Yücel Yemez, A. Murat Tekalp
Combined Gesture-Speech Analysis and Synthesis


Dec 17, 2015

Transcript
Page 1

M. Emre Sargın, Ferda Ofli, Yelena Yasinnik, Oya Aran, Alexey Karpov, Stephen Wilson, Engin Erzin, Yücel Yemez, A. Murat Tekalp

Combined Gesture-Speech Analysis and Synthesis

Page 2

Outline

• Project Objective
• Technical Details

– Preparation of Gesture-Speech Database

– Determination of Gestural-Auditory Events

– Detection of Gestural-Auditory Events

– Gesture-Speech Correlation Analysis

– Synthesis of Gestures Accompanying Speech

• Resources
• Concluding Remarks and Future Work
• Demonstration

Page 3

Project Objective

• The production of speech and gesture is interactive throughout the entire communication process.

• Human-computer interaction systems should be interactive: in an edutainment application, for example, an animated character’s speech should be aided and complemented by its gestures.

• Two main goals of this project:

– Analysis and modeling of the correlation between speech and gestures.

– Synthesis of correlated, natural gestures accompanying speech.

Page 4

Technical Details

• Preparation of Gesture-Speech Database
• Determination of Gestural-Auditory Events
• Detection of Gestural-Auditory Events
• Gesture-Speech Correlation Analysis
• Synthesis of Gestures Accompanying Speech

Page 5

Preparation of Database

• The gestures and speech of a specific subject (“Can-Ann”) were investigated.

• A 25-minute video of a native English speaker giving directions: 25 fps, 38,249 frames.

Page 6

Determination of Gestural – Auditory Events

• Database is manually examined to find specific, repetitive gestural and auditory events.

• Note that the events found for one specific subject are personal and can vary from culture to culture.
– During refusal phrases:
• Turkish style → upward movement of the head
• European style → left-right movement of the head
– The Can-Ann subject does not use these gestural events at all.

• Auditory Events:
– Semantic information (keywords): “Left”, “Right”, and “Straight”.
– Prosodic information: “Accent”.

• Gestural Events:
– Head movements: “Down”, “Tilt”.
– Hand movements: “Left”, “Right”, “Straight”.

Page 7

Correlation Results

[Pie charts: directional word-gesture alignment (Match 65%, Close 13%, Confused 9%, Wrong 13%) and pitch accents (Gesture-marked 65%, Phrase-initial 16%, No gesture 19%)]

Page 8

Detection of Gesture Elements

• In this project, we consider arm and head gestures.
• Gesture features are selected as:

– Head gesture features: global motion parameters calculated within the head region.

– Hand gesture features: hand center-of-mass position and calculated velocity.

• Main tasks included in the detection of gesture elements:

– Tracking of the head region
• Optical flow based

– Tracking of the hand region
• Kalman filter based
• Particle filter based

– Extraction of gesture features
– Recognition and labeling of gestures
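The Kalman-filter-based hand tracking step above can be sketched as a constant-velocity Kalman filter over the measured hand center of mass. The sketch below is illustrative only: the state layout and the noise magnitudes `q` and `r` are assumptions, not values from the project.

```python
import numpy as np

class HandTracker:
    """Constant-velocity Kalman filter tracking the hand's (x, y) center of mass."""

    def __init__(self, x0, y0, dt=1.0, q=1e-2, r=1.0):
        # State: [x, y, vx, vy]; only (x, y) is observed.
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)   # process noise (assumed magnitude)
        self.R = r * np.eye(2)   # measurement noise (assumed magnitude)

    def step(self, z):
        # Predict with the constant-velocity model.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the measured center of mass z = (x, y).
        innov = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innov
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2], self.x[2:]   # position and velocity estimates
```

Feeding the filter per-frame center-of-mass measurements yields smoothed position and velocity, which match the hand gesture features listed above.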

Page 9

Detection of Auditory Elements

• In this project, we consider semantic and prosodic events.
• Main tasks included in the detection of auditory elements:

– Extraction of speech features:
• MFCC
• Pitch
• Intensity

– Keyword spotting:
• HMM based
• Dynamic time warping based

– Accent detection:
• HMM based
• Sliding window based
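Pitch is one of the speech features listed above. A minimal autocorrelation-based estimator sketches how a per-frame pitch value can be obtained; this is an illustration only (the project likely used a dedicated tool such as Praat), and the 75-400 Hz search range is an assumption.

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=75.0, fmax=400.0):
    """Autocorrelation pitch estimate for one speech frame.

    Minimal sketch: finds the lag with maximum autocorrelation inside the
    assumed [fmin, fmax] range and converts it to Hz. Returns 0.0 when no
    valid lag exists (e.g. a silent frame)."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    if not np.any(frame):
        return 0.0
    # One-sided autocorrelation (lag 0 .. len-1).
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    hi = min(hi, len(ac) - 1)
    if lo >= hi:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag
```

Applying this over a sliding window gives the pitch contour used by the accent detector later in the slides.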

Page 10

Keyword Spotting (HMM Based): Training

[Block diagram: training speech with keyword labels feeds HMM training; the testing grammar allows the keywords “left”, “right”, “straight”, plus “silence” and “garbage” models for non-keyword speech]

- The Hidden Markov Toolkit (HTK) was used as the base technology for developing the keyword spotter
- 20 minutes of speech were labelled manually and used for training
- Speaker-dependent speech recognition system
- Each keyword was pronounced at least 30 times in the training speech

Page 11

Keyword Spotting (HMM Based): Testing

- 5.5 minutes of speech were used for testing
- The speech fragment contains approximately 600 words, of which 35 are keywords

First experiments: the keyword spotter was able to find almost all keywords in the test speech, but it gives many false alarms.

[Plot: keyword spotting rate and false alarm rate (%) versus the number of Gaussian mixtures in the HMMs (2-20)]

Page 12

Keyword Spotting (Dynamic Time Warping)

[Diagram: a keyword template slides along the speech signal, and the discrete deviation between the two sequences is accumulated]

• MFCC parameters are used for parameterization.
• Dynamic time warping is used to find an optimal match between two given sequences (e.g. time series).
• Results: 33 keywords recognized, 2 missed, 22 false alarms.
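The matching step above can be sketched with a standard dynamic time warping distance between two MFCC sequences. This is a generic textbook formulation, not necessarily the exact variant used in the project:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (rows = frames, e.g. MFCC vectors per frame).

    Classic O(len(a) * len(b)) dynamic program: D[i, j] is the minimum
    accumulated Euclidean frame-to-frame cost aligning a[:i] with b[:j]."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

In the sliding setup on this slide, a keyword is declared wherever the distance between the template and the current speech segment falls below a threshold. The reported results correspond to a recall of 33/35, roughly 94%, at the cost of 22 false alarms.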

Page 13

Accent Detection (Sliding Window Based)

• Parameters are calculated over a sliding window:
– Pitch contour
– Number of local minima and maxima in the pitch contour
– Intensity

• Windows that have high intensity values are selected.
• Median filtering is used to remove short windows.
• The candidate accent windows are labeled using connected component analysis.
• Candidate accent regions that contain too few or too many local minima and maxima are eliminated.
• The remaining candidate regions are selected as accents.
• The proposed method detects 68% of accents with a 25% false alarm rate.
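The intensity-selection, median-filtering, and connected-component steps above can be sketched as follows. The threshold and filter length are illustrative assumptions, not the project's tuned values, and the pitch-based elimination step is omitted for brevity.

```python
import numpy as np

def candidate_accent_regions(intensity, threshold, min_len=3):
    """Return (start, end) frame-index pairs of candidate accent regions.

    Frames whose intensity exceeds `threshold` are marked, the binary mask
    is median-filtered to drop short windows, and connected components of
    the surviving mask become the candidate regions."""
    mask = (np.asarray(intensity, dtype=float) > threshold).astype(int)
    # Median filter the binary mask; runs shorter than ~k/2 are removed.
    k = min_len if min_len % 2 == 1 else min_len + 1
    padded = np.pad(mask, k // 2, mode="edge")
    smoothed = [int(np.median(padded[i:i + k])) for i in range(len(mask))]
    # 1-D connected-component labeling on the smoothed mask.
    regions, start = [], None
    for i, v in enumerate(smoothed):
        if v and start is None:
            start = i
        elif not v and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions
```

In the full method, each returned region would then be checked against the count of local pitch minima and maxima before being accepted as an accent.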

Page 14

Synthesis of Gestures Accompanying Speech

• Based on the methodology used in the correlation analysis, given a speech signal:

– Features will be extracted.

– The most probable speech label will be assigned to the speech patterns.

– The gesture pattern most correlated with the speech pattern will be used to animate a stick model of a person.
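The last two steps reduce to a lookup driven by the measured speech-gesture correlations. The sketch below uses made-up correlation scores between the event labels from the earlier slides; the project's actual values were estimated from the Can-Ann database.

```python
# Illustrative correlation scores between speech labels and gesture
# patterns (invented numbers, for the sake of the sketch only).
CORRELATION = {
    "left":     {"hand_left": 0.9, "hand_right": 0.1, "head_down": 0.2},
    "right":    {"hand_left": 0.1, "hand_right": 0.9, "head_down": 0.2},
    "straight": {"hand_straight": 0.8, "head_tilt": 0.3},
    "accent":   {"head_down": 0.7, "head_tilt": 0.4},
}

def gesture_for(speech_label):
    """Pick the gesture pattern most correlated with a detected speech label."""
    scores = CORRELATION.get(speech_label)
    if not scores:
        return None   # no correlated gesture: the stick model stays idle
    return max(scores, key=scores.get)

def animate(speech_labels):
    """Gesture sequence driving the stick-model animation."""
    return [gesture_for(label) for label in speech_labels]
```

Each selected gesture label would then index a motion pattern (e.g. a trajectory generated from the gesture's HMM, as on the next slide) to drive the stick model.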

Page 15

Hand Gesture Models

[Figure: original hand trajectories alongside trajectories generated from the trained HMMs]
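Generating a trajectory from a trained gesture HMM amounts to walking the state chain and drawing an emission at each step. The sketch below samples a 2-D hand trajectory from a left-to-right Gaussian HMM with spherical emission noise; all parameter values are assumptions (the project learned its models from the tracked trajectories).

```python
import numpy as np

def sample_trajectory(means, transmat, n_steps, noise=0.05, seed=0):
    """Sample a 2-D hand trajectory from a Gaussian HMM.

    `means[k]` is the emission mean of state k, `transmat[k, j]` the
    probability of moving from state k to state j. Emission covariance is
    assumed spherical with standard deviation `noise`."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    state, points = 0, []
    for _ in range(n_steps):
        # Emit a point around the current state's mean, then transition.
        points.append(means[state] + rng.normal(0.0, noise, size=2))
        state = rng.choice(len(means), p=transmat[state])
    return np.array(points)
```

With a left-to-right transition matrix, the sampled points sweep through the state means in order, producing a smooth stroke resembling the generated trajectories shown on this slide.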

Page 16

Resources

• Database preparation and labeling:
– VirtualDub
– Anvil
– Praat

• Image processing and feature extraction:
– Matlab Image Processing Toolbox
– OpenCV Image Processing Library

• Gesture-speech correlation analysis:
– HTK HMM Toolkit
– Torch Machine Learning Library

Page 17

Concluding Remarks and Future Work

• The database will be extended with new subjects.
• Algorithms and methods will be tested using the new databases.
• An HMM-based accent detector will be implemented.
• Keyword and event sets will be extended.
• Database scenarios will be extended.

Page 18

Demonstration I

Page 19

Demonstration II