Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão

Prosodic and Phonetic Features for Speaking Styles Classification and Detection

IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop

November 21-23 2012, Universidad Autónoma de Madrid, Madrid, SPAIN

Summary

• Objective
• Characterization of the corpus
• Features
• Methods
  – Automatic segmentation
  – Classification
• Results
  – Segmentation
  – Automatic detection: speech versus non-speech, read versus spontaneous
  – Classification: speech versus non-speech, read versus spontaneous
• Conclusions and future work

Objective

Automatic detection of speaking styles, for the purpose of segmenting multimedia data

Style of a speech segment? Examples: slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …

Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared and unprepared speech)

Using a combination of phonetic and prosodic features

Also explore speech/non-speech segmentation

Broadcast News audio corpus

TV Broadcast News MP4 podcasts

Daily download

Extract audio stream and downsample from 44.1 kHz to 16 kHz
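The extraction step could be scripted along the following lines. This is only a sketch: the slides do not say which tool was used, so the choice of the `ffmpeg` command-line program, the mono downmix (`-ac 1`) and the helper name `ffmpeg_extract_cmd` are all assumptions made for illustration.

```python
def ffmpeg_extract_cmd(src, dst, sample_rate=16000):
    """Build an ffmpeg command that drops the video stream and resamples the audio.

    Assumptions (not from the slides): ffmpeg is the tool, and the audio is
    mixed down to one channel.
    """
    return [
        "ffmpeg",
        "-i", src,                 # input MP4 podcast
        "-vn",                     # discard the video stream
        "-ac", "1",                # mono (assumption)
        "-ar", str(sample_rate),   # target sampling rate in Hz
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once per downloaded file.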

30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:

Level 1 – dominant signal: speech, noise, music, silence, clapping, …

For speech:

Level 2 – acoustical environment: clean, music, road, crowd, …

Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)

Level 4 – speaker info: BN anchor, gender, public figures, …

From Level 1 – speech versus non-speech

From Level 3 – read speech (prepared) versus spontaneous speech

Type of segment      Number of segments   Average duration (± std deviation) (s)
Speech               7971                 11.0 (± 9.4)
Non-Speech           2529                 4.1 (± 5.3)
Read Speech          4989                 10.6 (± 8.5)
Spontaneous Speech   1738                 12.0 (± 10.4)

For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed

Features

Phonetic (size of parameter vector for each segment: 214)

• Based on the results of free phone-loop speech recognition

• Phone duration and recognition log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)

• Silence and speech rate
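The phonetic functionals listed above can be illustrated with a small sketch. It does not reproduce the 214-dimensional layout, which the slides do not detail. The input format (a list of `(label, duration, log-likelihood)` tuples), the `"sil"` label, the use of the population standard deviation, and the definitions of `speech_rate` (phones per second) and `silence_rate` (fraction of time in silence) are assumptions.

```python
import statistics

def functionals(values):
    """The five statistical functions named on the slide."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.pstdev(values),  # population std (assumption)
    }

def phonetic_summary(phones, silence_label="sil"):
    """phones: list of (label, duration_s, log_likelihood) for one segment."""
    spoken = [p for p in phones if p[0] != silence_label]
    total = sum(duration for _, duration, _ in phones)
    silence = total - sum(duration for _, duration, _ in spoken)
    features = {}
    for name, column in (("duration", 1), ("loglik", 2)):
        for stat, value in functionals([p[column] for p in spoken]).items():
            features[f"{name}_{stat}"] = value
    features["speech_rate"] = len(spoken) / total   # phones per second (assumed definition)
    features["silence_rate"] = silence / total      # share of silence (assumed definition)
    return features
```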

Prosodic (size of parameter vector for each segment: 108)

• Based on the pitch (F0) and harmonics-to-noise ratio (HNR) envelopes

• First and second order statistics

• Polynomial fit of first and second order

• Reset rate (rate of voiced portions)

• Voiced and unvoiced duration rates
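A few of the prosodic measures above can be sketched on a frame-level F0 contour. This is an illustration under stated assumptions, not the authors' feature extractor: unvoiced frames are assumed to be coded as 0, the first-order polynomial fit is taken over the voiced frames only (gaps ignored), and the reset rate is read as the number of voiced portions per second.

```python
import statistics

def linear_fit(ys):
    """Least-squares slope and intercept of ys against the frame index."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def prosodic_summary(f0, frame_rate=100):
    """f0: one F0 value per frame, 0 for unvoiced frames (assumed coding)."""
    voiced = [v for v in f0 if v > 0]
    # A voiced portion starts wherever a voiced frame follows an unvoiced one.
    portions = sum(1 for i, v in enumerate(f0)
                   if v > 0 and (i == 0 or f0[i - 1] <= 0))
    duration_s = len(f0) / frame_rate
    slope, intercept = linear_fit(voiced)
    return {
        "f0_mean": statistics.mean(voiced),
        "f0_std": statistics.pstdev(voiced),
        "f0_slope": slope,                       # first-order fit, Hz per frame
        "reset_rate": portions / duration_s,     # voiced portions per second
        "voiced_rate": len(voiced) / len(f0),
        "unvoiced_rate": 1 - len(voiced) / len(f0),
    }
```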

Methods

Automatic detection

Implies automatic segmentation and automatic classification

Automatic segmentation based on modified BIC (Bayesian Information Criterion) - DISTBIC

Binary classification: SVM classifiers

Methods

Automatic segmentation

DISTBIC uses a distance (Kullback-Leibler) in the first step to propose candidate marks, and delta BIC (DBIC) in the second step to validate them

[Figure: consecutive segments s(i-1), s(i), s(i+1), s(i+2), with the marks between them labelled DBIC<0 or DBIC>0]

Parameters:

Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (25 ms windows, 10 ms step)

A threshold of 0.6 times the standard deviation of the distance was used to select significant local maxima; window size: 2000 ms, step 100 ms

Silence segments longer than 0.5 seconds are detected and removed before the DISTBIC process
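The selection of significant local maxima in the first DISTBIC step might look like the sketch below. The slides give only the threshold value (0.6) and its relation to the standard deviation of the distance. The exact rule used here, keeping a peak only if it rises above the nearest minimum on each side by more than `alpha` times that standard deviation, is an assumption based on the usual description of DISTBIC.

```python
import statistics

def significant_maxima(dist, alpha=0.6):
    """Return indices of local maxima of `dist` that stand out on both sides.

    Assumed rule: peak minus the nearest left minimum and peak minus the
    nearest right minimum must both exceed alpha * std(dist).
    """
    threshold = alpha * statistics.pstdev(dist)
    peaks = []
    for i in range(1, len(dist) - 1):
        if not (dist[i] > dist[i - 1] and dist[i] > dist[i + 1]):
            continue
        left = i
        while left > 0 and dist[left - 1] < dist[left]:
            left -= 1                     # walk down to the left minimum
        right = i
        while right < len(dist) - 1 and dist[right + 1] < dist[right]:
            right += 1                    # walk down to the right minimum
        if dist[i] - dist[left] > threshold and dist[i] - dist[right] > threshold:
            peaks.append(i)
    return peaks
```

With a 100 ms step, index `i` corresponds to a candidate mark at roughly `i * 0.1` seconds into the distance curve.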

Methods

Classification

SVM classifiers (WEKA tool – SMO, linear kernel, C=14):

• speech / non-speech

• read / spontaneous

Two-step classification approach

[Diagram: the speech / non-speech classifier outputs "non-speech" or "speech"; speech segments are passed to the read / spontaneous classifier, which outputs "spontaneous" or "read"]
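The two-step approach can be written as a small cascade. In this sketch the two classifiers are passed in as plain callables returning `True` or `False`. They stand in for the trained SVMs (WEKA SMO in the slides) and are hypothetical placeholders, as is the function name.

```python
def two_step_label(features, is_speech, is_read):
    """Cascade: speech / non-speech first, then read / spontaneous for speech only.

    is_speech and is_read are stand-ins for the two trained binary classifiers.
    """
    if not is_speech(features):
        return "non-speech"
    return "read" if is_read(features) else "spontaneous"
```

Only segments accepted as speech reach the second classifier, so a speech segment misclassified as non-speech in the first step cannot be recovered by the second.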

Results

Performance measure

Segmentation only:

Collar (detection tolerance) range: 0.5 s to 2.0 s

A detected mark is counted as correct if there is a reference mark less than one “collar” away
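The collar rule can be turned into precision, recall and F1 as in the sketch below. The slides define only when a detected mark counts as correct. Counting a reference mark as found when some detected mark lies within the collar, and the function name itself, are assumptions.

```python
def collar_scores(detected, reference, collar):
    """Precision, recall and F1 for segmentation marks (times in seconds)."""
    correct = sum(1 for d in detected
                  if any(abs(d - r) < collar for r in reference))
    found = sum(1 for r in reference
                if any(abs(d - r) < collar for d in detected))
    precision = correct / len(detected) if detected else 0.0
    recall = found / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```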

Automatic detection: “AT” – agreement time = % of frames correctly classified

Classification only: “Acc.” – accuracy
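Agreement time can be computed by expanding both segmentations to frame labels and counting matches. The sketch assumes segments are given as `(start_s, end_s, label)` tuples that cover the recording without gaps and a frame rate of 100 frames per second. Both are illustrative choices.

```python
def agreement_time(reference, hypothesis, frame_rate=100):
    """Percentage of frames on which two labelled segmentations agree."""
    def to_frames(segments):
        n_frames = round(max(end for _, end, _ in segments) * frame_rate)
        frames = [None] * n_frames
        for start, end, label in segments:
            for i in range(round(start * frame_rate), round(end * frame_rate)):
                frames[i] = label
        return frames

    ref = to_frames(reference)
    hyp = to_frames(hypothesis)
    n = min(len(ref), len(hyp))           # compare the common part only
    agree = sum(1 for a, b in zip(ref, hyp) if a == b)
    return 100.0 * agree / n
```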

Results

Segmentation performance: F1-score for collars from 0.5 s to 2.0 s

[Figure: F1-score (axis from 0.3 to 0.8) versus collar in seconds (0.5, 1.0, 1.5, 2.0)]

Results

Segmentation performance: recall for collars from 0.5 s to 2.0 s

[Figure: recall (vertical axis labelled "Accuracy", from 0.5 to 1.0) versus collar in seconds (0.5, 1.0, 1.5, 2.0)]

Results

Automatic detection

Speech / non-speech detection

Type of features   AT      Speech   Non-speech
Phonetic           91.5%   94.9%    62.2%
Prosodic           93.2%   97.0%    61.0%
Combination        93.3%   96.6%    64.9%

Read / spontaneous detection

Type of features   AT      Read     Spontaneous
Phonetic           76.7%   91.9%    38.6%
Prosodic           81.1%   93.0%    51.2%
Combination        83.3%   92.7%    59.6%

“AT” – agreement time = % of frames correctly classified

Results

Classification only (using given manual segmentation)

Speech / non-speech classifier

Type of features   Acc.    Speech   Non-speech
Phonetic           93.8%   96.7%    82.0%
Prosodic           93.8%   97.5%    81.9%
Combination        94.4%   97.6%    84.0%

Read / spontaneous classifier

Type of features   Acc.    Read     Spontaneous
Phonetic           83.2%   92.8%    55.4%
Prosodic           86.4%   95.0%    61.6%
Combination        87.4%   93.7%    69.5%

“Acc.” – Accuracy

Conclusions and future work

Read speech can be differentiated from spontaneous speech with reasonable accuracy.

Good results were obtained with only a few simple measures of the speech signal.

A combination of phonetic and prosodic features provided the best results (each seems to carry important, complementary information).

We have already implemented several important features, such as hesitation detection, aspiration detection using word-spotting techniques, speaker identification using GMMs and jingle detection based on audio fingerprinting.

We intend to automatically segment all audio genres and speaking styles.

THANK YOU

Appendix – BIC

BIC (Bayesian Information Criterion)

Dissimilarity measure between 2 consecutive segments

Two hypotheses:

H0 – No change of signal characteristics. Model: 1 Gaussian: $X \sim N(x; \mu_X, \Sigma_X)$

H1 – Change of characteristics. 2 Gaussians: $X_1 \sim N(x; \mu_{X_1}, \Sigma_{X_1})$; $X_2 \sim N(x; \mu_{X_2}, \Sigma_{X_2})$

μ – mean vector; Σ – covariance matrix

[Diagram: window X divided into consecutive segments X1 and X2]

Maximum likelihood ratio between H0 and H1:

$R(i) = \frac{N_X}{2} \log|\Sigma_X| - \frac{N_{X_1}}{2} \log|\Sigma_{X_1}| - \frac{N_{X_2}}{2} \log|\Sigma_{X_2}|$

Appendix – BIC

$\Delta BIC(i) = -R(i) + \lambda P$

P – complexity penalization

λ – penalization factor (ideal 1.0)

Change if: $\Delta BIC(i^*) < 0$

Parameters used in this work:

p=16; λ=1.3; frame rate = 100; N=200; M=10
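A delta BIC test for one candidate mark could be sketched as follows, using full-covariance Gaussians estimated by maximum likelihood. Several points are assumptions rather than statements from the slides: the sign convention (delta BIC = −R + λP, with a change accepted when the value is negative, as in the usual DISTBIC formulation), the penalty P = ½ (p + p(p+1)/2) log N for p-dimensional frames, and all function names. Pure Python is used to keep the example self-contained. A numerical library would be the practical choice for 16-dimensional vectors.

```python
import math

def covariance(frames):
    """Maximum-likelihood covariance matrix of a list of feature vectors."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    return [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in frames) / n
             for b in range(d)] for a in range(d)]

def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    d = len(matrix)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(d))

def delta_bic(x1, x2, lam=1.3):
    """Delta BIC for splitting the window x1 + x2 at their boundary.

    Assumed convention: a negative value supports a change point.
    """
    x = x1 + x2
    n, n1, n2, p = len(x), len(x1), len(x2), len(x[0])
    r = 0.5 * (n * log_det(covariance(x))
               - n1 * log_det(covariance(x1))
               - n2 * log_det(covariance(x2)))
    penalty = 0.5 * (p + p * (p + 1) / 2) * math.log(n)  # assumed form of P
    return -r + lam * penalty
```

Under this convention, two halves drawn from the same distribution give a positive value (the penalty dominates), while a clear shift in the mean gives a strongly negative one.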
