9 th Conference on Telecommunications – Conftele 2013 Castelo Branco, Portugal, May 8-10, 2013 Sara Candeias 1 Dirce Celorico 1 Jorge Proença 1 Arlindo Veiga 1,2 Fernando Perdigão 1,2 1 Instituto de Telecomunicações, Polo de Coimbra, Portugal 2 Universidade de Coimbra, DEEC, Portugal Automatically distinguishing Styles of Speech

Dec 27, 2015


9th Conference on Telecommunications – Conftele 2013
Castelo Branco, Portugal, May 8-10, 2013

Sara Candeias 1

Dirce Celorico 1

Jorge Proença 1

Arlindo Veiga 1,2

Fernando Perdigão 1,2

1 Instituto de Telecomunicações, Polo de Coimbra, Portugal
2 Universidade de Coimbra, DEEC, Portugal

Automatically distinguishing Styles of Speech


Summary

Objective

Characterization of the corpus

Automatic segmentation: method, performance

Automatic classification: features, classification method, results (speech versus non-speech, read versus spontaneous)

Conclusions and future work

| Conftele 2013 - Castelo Branco, Portugal - May 8-10 2013


Objective

Automatic detection of styles of speech for segmentation of multimedia data

Speech - Who? What? How?

Style of a speech segment?

Segment broadcast news samples into the two most evident classes: read versus spontaneous speech (prepared and unprepared speech)

Using a combination of phonetic and prosodic features

First explore a speech/non-speech segmentation

Possible style labels: slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …


Characterization of the corpus

Broadcast News audio corpus

TV Broadcast News MP4 podcasts

Daily download

Extract audio stream and downsample from 44.1 kHz to 16 kHz

30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:

Level 1 – dominant signal: speech, noise, music, silence, clapping, …

For speech:

Level 2 – acoustical environment: clean, music, road, crowd, …

Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)

Level 4 – speaker info: BN anchor, gender, public figures, …


Characterization of the corpus

From Level 1 – speech versus non-speech

From Level 3 – read speech (prepared) versus spontaneous speech

Type of segment       Number of segments   Average duration (± std deviation) (s)
Speech                7971                 11.0 (± 9.4)
Non-Speech            2529                 4.1 (± 5.3)
Read Speech           4989                 10.6 (± 8.5)
Spontaneous Speech    1738                 12.0 (± 10.4)
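As a rough consistency check, the segment counts and average durations in the table imply a total duration per class. The sketch below multiplies them out; the dictionary and function names are illustrative only and do not come from the slides. Speech plus non-speech comes to about 27.2 hours, in line with the ~27 hours of annotated programs.

```python
# Segment counts and mean durations (seconds), copied from the table above.
SEGMENTS = {
    "Speech": (7971, 11.0),
    "Non-Speech": (2529, 4.1),
    "Read Speech": (4989, 10.6),
    "Spontaneous Speech": (1738, 12.0),
}


def total_hours(count, mean_duration_s):
    """Total duration in hours implied by a segment count and its mean duration."""
    return count * mean_duration_s / 3600.0


hours = {name: total_hours(n, d) for name, (n, d) in SEGMENTS.items()}
level1_hours = hours["Speech"] + hours["Non-Speech"]

for name, h in hours.items():
    print(f"{name}: {h:.1f} h")
print(f"Speech + Non-Speech: {level1_hours:.1f} h")
```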


Methods

Automatic Detection

1. Automatic Segmentation (find/mark different segments on the audio signal)

2. Automatic Classification (classify the segments)


Methods

1. Automatic segmentation

Based on a modified BIC (Bayesian Information Criterion):

DISTBIC – uses a distance (Kullback-Leibler) in the first step and delta BIC (DBIC) to validate marks

[Diagram: consecutive segments s(i-1), s(i), s(i+1), s(i+2), with candidate marks labeled DBIC<0 and DBIC>0]

Parameters:

Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (windows 25 ms, step 10 ms)

A threshold of 0.6 times the standard deviation of the distance was used to select significant local maxima; window size: 2000 ms, step: 100 ms

Silence segments longer than 0.5 seconds are detected and removed before the DISTBIC process
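The slides do not spell out how a local maximum of the distance curve is judged "significant". A minimal sketch follows, assuming the rule commonly used with DISTBIC: a peak is kept when it rises more than alpha times the standard deviation of the curve above the nearest local minimum on each side. The function name and the toy curve are hypothetical.

```python
import statistics


def significant_maxima(distance, alpha=0.6):
    """Return indices of local maxima that rise more than alpha * sigma above
    the nearest local minimum on each side (sigma = std of the whole curve)."""
    sigma = statistics.pstdev(distance)
    threshold = alpha * sigma
    peaks = []
    for i in range(1, len(distance) - 1):
        if not (distance[i] > distance[i - 1] and distance[i] >= distance[i + 1]):
            continue
        # Walk downhill to the nearest local minimum on the left.
        left = i
        while left > 0 and distance[left - 1] <= distance[left]:
            left -= 1
        # Walk downhill to the nearest local minimum on the right.
        right = i
        while right < len(distance) - 1 and distance[right + 1] <= distance[right]:
            right += 1
        if (distance[i] - distance[left] > threshold
                and distance[i] - distance[right] > threshold):
            peaks.append(i)
    return peaks


# Toy distance curve: two pronounced peaks and two small ripples.
curve = [0.1, 0.2, 0.15, 0.2, 1.0, 0.3, 0.25, 0.3, 0.2, 0.9, 0.1]
print(significant_maxima(curve))
```

In this toy curve only the two pronounced peaks survive; the small ripples fall below the threshold. In the real system the surviving peaks would be the candidate marks passed on to the DBIC validation step.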


Results

Performance measure

Automatic Segmentation:

Collar (detection tolerance) range: 0.5 s to 2.0 s

A detected mark is counted as correct if there is one reference mark inside the interval allowed by the collar

$\text{Recall} = \dfrac{\#\,\text{correctly detected marks}}{\#\,\text{reference marks}}$

$\text{Precision} = \dfrac{\#\,\text{correctly detected marks}}{\#\,\text{correctly detected marks} + \#\,\text{unexpected marks}}$

$F_1\text{-score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
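The collar-based scoring can be sketched as follows. This is an illustration rather than the authors' scoring tool: the one-to-one matching of detected marks to reference marks is an assumption, and the mark times are made up.

```python
def score_marks(reference, detected, collar):
    """Score detected boundary marks (seconds) against reference marks.

    A detected mark is correct if a not-yet-matched reference mark lies
    within +/- collar seconds. One-to-one matching is an assumption here.
    """
    unmatched = sorted(reference)
    correct = 0
    for mark in sorted(detected):
        candidates = [r for r in unmatched if abs(r - mark) <= collar]
        if candidates:
            closest = min(candidates, key=lambda r: abs(r - mark))
            unmatched.remove(closest)
            correct += 1
    unexpected = len(detected) - correct
    recall = correct / len(reference) if reference else 0.0
    precision = correct / (correct + unexpected) if detected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return recall, precision, f1


# Made-up mark times in seconds.
reference = [5.0, 12.0, 20.0, 31.0]
detected = [5.3, 13.2, 19.8, 25.0, 31.4]
for collar in (0.5, 1.0, 1.5, 2.0):
    r, p, f = score_marks(reference, detected, collar)
    print(f"collar {collar:.1f} s: recall={r:.2f} precision={p:.2f} F1={f:.2f}")
```

Widening the collar can only turn unexpected marks into correct ones, so all three scores are non-decreasing in the collar.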


Results

Segmentation performance

[Figure: F1-score as a function of the collar, for collars from 0.5 s to 2.0 s. x-axis: Collar (seconds); y-axis: F1-score, 0.3 to 0.8]


Results

Segmentation performance

[Figure: recall as a function of the collar, for collars from 0.5 s to 2.0 s. x-axis: Collar (seconds); y-axis: Accuracy, 0.5 to 1.0]


Methods

2. Automatic Classification – Features

A vector of 322 features is computed for each segment

Phonetic (size of parameter vector for each segment: 214)

• Based on the results of a free phone loop speech recognition

• Phone duration and recognized log likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)

• Silence and speech rate

Prosodic (size of parameter vector for each segment: 108)

• Based on the pitch (F0) and harmonic to noise ratio (HNR) envelope

• First and second order statistics

• Polynomial fit of first and second order

• Reset rate (rate of voiced portions)

• Voiced and unvoiced duration rates
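To illustrate the kind of per-segment measures listed for the phonetic features, the sketch below computes the five statistical functions over phone durations, plus a silence ratio and a speech rate. The recognizer output is invented, and the definitions of speech rate (phones per second) and standard deviation (population form) are assumptions; the slides do not specify them, and this does not reproduce the full 214-dimensional vector.

```python
import statistics


def five_stats(values):
    """Mean, median, maximum, minimum and standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.pstdev(values),
    }


# Hypothetical recognizer output for one segment: (phone label, duration in s).
phones = [("s", 0.08), ("a", 0.12), ("sil", 0.30), ("l", 0.06), ("u", 0.14)]

speech_durations = [d for label, d in phones if label != "sil"]
duration_stats = five_stats(speech_durations)

total_time = sum(d for _, d in phones)
silence_ratio = sum(d for label, d in phones if label == "sil") / total_time
speech_rate = len(speech_durations) / total_time  # phones per second (assumed definition)

print(duration_stats)
print(f"silence ratio: {silence_ratio:.3f}, speech rate: {speech_rate:.2f} phones/s")
```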


Methods

Classification

SVM (Support Vector Machine) classifiers (WEKA tool, linear kernel, C=14):

• speech / non-speech

• read / spontaneous

2 step classification approach:

Step 1: Speech / non-speech classification (outputs: non-speech or speech)

Step 2: Read / spontaneous classification of the speech segments (outputs: read or spontaneous)
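The two-step decision can be sketched as a simple cascade. The two classifier functions below are toy stand-ins with made-up feature names and thresholds; in the work described they would be the trained SVMs.

```python
def classify_segment(features, is_speech, is_read):
    """Two-step decision: speech/non-speech first, then read/spontaneous."""
    if not is_speech(features):
        return "non-speech"
    return "read" if is_read(features) else "spontaneous"


# Toy stand-ins for the two trained classifiers (thresholds are made up).
def toy_is_speech(features):
    return features["voiced_ratio"] > 0.3


def toy_is_read(features):
    return features["silence_ratio"] < 0.2


segments = [
    {"voiced_ratio": 0.05, "silence_ratio": 0.90},
    {"voiced_ratio": 0.60, "silence_ratio": 0.10},
    {"voiced_ratio": 0.55, "silence_ratio": 0.35},
]
labels = [classify_segment(s, toy_is_speech, toy_is_read) for s in segments]
print(labels)
```

One consequence of the cascade is that an error in the first step cannot be recovered in the second: a speech segment rejected as non-speech never reaches the read/spontaneous classifier.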


Results

Automatic detection (automatic segmentation + classification)

Agreement time = % of frames correctly classified

Speech / non-speech detection

Type of features   All      Speech   Non-speech
Phonetic           91.5%    94.9%    62.2%
Prosodic           93.2%    97.0%    61.0%
Combination        93.3%    96.6%    64.9%

Read / spontaneous detection

Type of features   All      Read     Spontaneous
Phonetic           76.7%    91.9%    38.6%
Prosodic           81.1%    93.0%    51.2%
Combination        83.3%    92.7%    59.6%
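Agreement time can be sketched as a frame-by-frame comparison between the reference annotation and the automatically detected segments. The 10 ms frame step is an assumption (it matches the analysis step given for the acoustic vectors, but the slides do not state the scoring resolution), and the segments below are invented.

```python
def frame_labels(segments, total_s, step_s=0.01):
    """Expand (start_s, end_s, label) segments into one label per frame."""
    n_frames = round(total_s / step_s)
    labels = [None] * n_frames
    for start, end, label in segments:
        first = round(start / step_s)
        last = min(round(end / step_s), n_frames)
        for k in range(first, last):
            labels[k] = label
    return labels


def agreement_time(reference, hypothesis, total_s, step_s=0.01):
    """Percentage of annotated frames whose detected label matches the reference."""
    ref = frame_labels(reference, total_s, step_s)
    hyp = frame_labels(hypothesis, total_s, step_s)
    scored = [(r, h) for r, h in zip(ref, hyp) if r is not None]
    return 100.0 * sum(r == h for r, h in scored) / len(scored)


# Invented example: the detected boundary is 0.5 s late.
reference = [(0.0, 4.0, "speech"), (4.0, 5.0, "non-speech"), (5.0, 10.0, "speech")]
hypothesis = [(0.0, 4.5, "speech"), (4.5, 5.0, "non-speech"), (5.0, 10.0, "speech")]
agreement = agreement_time(reference, hypothesis, total_s=10.0)
print(f"agreement time: {agreement:.1f}%")
```

Because short classes contribute few frames, a late boundary costs little in the overall figure but a lot in the per-class figure of the shorter class, which is one way the lower non-speech and spontaneous columns can arise.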


Results

Classification only (using given manual segmentation)

% - Accuracy

Speech / non-speech classifier

Type of features   All      Speech   Non-speech
Phonetic           93.8%    96.7%    82.0%
Prosodic           93.8%    97.5%    81.9%
Combination        94.4%    97.6%    84.0%

Read / spontaneous classifier

Type of features   All      Read     Spontaneous
Phonetic           83.2%    92.8%    55.4%
Prosodic           86.4%    95.0%    61.6%
Combination        87.4%    93.7%    69.5%


Conclusions and future work

Read speech can be distinguished from spontaneous speech with reasonable accuracy.

Results were obtained with only a few simple measures of the speech signal.

A combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).

We have already implemented several important features, such as hesitation detection, aspiration detection using word spotting techniques, speaker identification using GMM and jingle detection based on audio fingerprint.

We intend to automatically segment all audio genres and speaking styles.


THANK YOU


Appendix – BIC

BIC (Bayesian Information Criterion)

Dissimilarity measure between 2 consecutive segments

Two hypotheses:

H0 – No change of signal characteristics. Model: 1 Gaussian: $X \sim N(x; \mu_X, \Sigma_X)$

H1 – Change of characteristics. 2 Gaussians: $X_1 \sim N(x; \mu_{X_1}, \Sigma_{X_1})$; $X_2 \sim N(x; \mu_{X_2}, \Sigma_{X_2})$

μ – mean vector; Σ – covariance matrix

[Diagram: segment X divided into two consecutive segments X1 and X2]

Maximum likelihood ratio between H0 and H1:

$R(i) = \dfrac{N_X}{2}\log|\Sigma_X| - \dfrac{N_{X_1}}{2}\log|\Sigma_{X_1}| - \dfrac{N_{X_2}}{2}\log|\Sigma_{X_2}|$


Appendix – BIC

$\Delta BIC(i) = -R(i) + \lambda P$

P – complexity penalization

λ – penalization factor (ideal 1.0)

Change if: $\Delta BIC(i) < 0$

Parameters used in this work:

p = 16; λ = 1.3; frame rate = 100; N = 200; M = 10
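A one-dimensional sketch of the BIC test may help make the quantities concrete. It rests on several assumptions that the slides do not confirm: the sign convention (delta BIC = -R(i) + λP, so that a negative value indicates a change), the usual full-covariance penalty P = ½(p + ½p(p+1)) log N, and the use of scalar variances in place of covariance determinants (p = 1 instead of the p = 16 used in the work). Function and variable names are hypothetical.

```python
import math
import statistics


def delta_bic_1d(window, i, lam=1.3):
    """Delta-BIC for a candidate change point i in a 1-D sequence.

    Single Gaussians are fitted to the whole window and to each half; the
    variance stands in for the covariance determinant (p = 1).
    Assumed sign convention: delta_bic = -R(i) + lam * P, so negative
    values favour a change at i.
    """
    x1, x2 = window[:i], window[i:]
    n, n1, n2 = len(window), len(x1), len(x2)
    r = (0.5 * n * math.log(statistics.pvariance(window))
         - 0.5 * n1 * math.log(statistics.pvariance(x1))
         - 0.5 * n2 * math.log(statistics.pvariance(x2)))
    p = 1  # feature dimension in this toy example
    penalty = 0.5 * (p + 0.5 * p * (p + 1)) * math.log(n)
    return -r + lam * penalty


# Toy data: in `changed` the mean jumps halfway through; in `steady` it does not.
changed = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
steady = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2]

print(f"changed: {delta_bic_1d(changed, 6):.2f}")
print(f"steady:  {delta_bic_1d(steady, 6):.2f}")
```

With this convention the jump in `changed` gives a clearly negative value, while `steady` gives a positive value equal to the penalty term, since splitting the window buys no likelihood gain.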
