
Multimedia Database (AC5210)

Jan 28, 2022

Page 1: Multimedia Database (AC5210)

International Research Institute MICA (Multimedia, Information, Communication & Applications)

UMI 2954

Hanoi University of Science and Technology

1 Dai Co Viet - Hanoi - Vietnam

Multimedia Database

(AC5210)

Mac Dang Khoa, SpeechCom department

Le Thi Lan, ComVis department

Page 2: Multimedia Database (AC5210)

Audio and Speech database

Page 3: Multimedia Database (AC5210)

AC5210

2017

Classification

Audio

Sound

Environment

Animal

Noise

Synthetic sound

Music

Sound effect

Speech

Language achievement

Speech processing technology

ASR TTS Others:

Identification

Verification

Authentication

Example of Sound banks

Page 4: Multimedia Database (AC5210)

Speech database

Speech and language achievement and documentation

Page 5: Multimedia Database (AC5210)

Languages in the world

Distribution of speakers:

50% of the world's population speaks 10 languages

95% of the population speaks 5% of the world's languages (about 6,500 in total)

Page 6: Multimedia Database (AC5210)

Languages in the world

Endangered languages

Languages are disappearing (1)

Since 1950, 421 languages have disappeared

“1/3 of the world’s languages are in danger of disappearing in the next few decades”

One language dies every 14 days

Languages change and mix

(1) http://www.sil.org

https://www.ethnologue.com

Saving languages => documentation

Page 7: Multimedia Database (AC5210)

Language documentation

What

“the methods, tools, and theoretical underpinnings for compiling a representative and lasting multipurpose record of a natural language or one of its varieties” (Himmelmann 1998)

To preserve languages (endangered languages)

Material for language study

Material for natural language and speech processing

Tasks

Collecting: recording, taking pictures, gathering written documents, ...

Processing: analysing, systematizing, transcribing, translating, ...

Archiving: storing, publishing

Among 7,000 languages

< 600 well-documented languages

3,349 unwritten languages

Page 8: Multimedia Database (AC5210)

Community

Page 9: Multimedia Database (AC5210)

Vietnamese minority languages

54 ethnic groups

Vietnamese (Kinh): 87%

5 ethnic groups with fewer than 1,000 people each

Page 10: Multimedia Database (AC5210)

MICA’s AuCo collection

ÂuCơ: Audio Corpora

Since 2007

Language documentation:

Vietnam and neighbors

Minorities

Tasks

Collection

Digitization

Documentation: annotation, transcription

Analysis

Archiving: online access

Page 11: Multimedia Database (AC5210)

DoReMiFa project

Données des Recherches linguistiques de Michel Ferlus en Asie du sud-est (data from Michel Ferlus's linguistic research in Southeast Asia)

Digitizing the collections of Michel Ferlus (1963-2003)

2014-2015

Data

> 40 ethnic languages

> 200 cassette tapes, recorded from 1963 to 2003

Project group

6 linguists + IT expert

> 20 linguistics students

Page 12: Multimedia Database (AC5210)

Data collection

Available recorded data

Fieldwork recording

Page 13: Multimedia Database (AC5210)

Data collection

Wordlist

Common/standard wordlist for fieldwork

> 2000 words

Available: HAL

Page 14: Multimedia Database (AC5210)

Digitization

Audio tapes → digital signal (WAV, 24-bit, 48,000 Hz)
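
As a rough sizing aid, the sketch below estimates the uncompressed storage needed for the digitization format named on this slide (24-bit, 48,000 Hz). The mono channel count and the one-hour duration are assumptions chosen for illustration; they are not figures from the slides, and the WAV header is ignored.

```python
def pcm_bytes(duration_s: float, sample_rate: int = 48_000,
              bits_per_sample: int = 24, channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio (WAV header excluded)."""
    bytes_per_sample = bits_per_sample // 8
    return int(duration_s * sample_rate * bytes_per_sample * channels)


# Assumed example: one hour of mono audio.
one_hour = pcm_bytes(3600)
print(f"{one_hour / 1e6:.1f} MB per hour of mono audio")
```

Doubling `channels` doubles the result, which matters when planning storage for stereo transfers of a tape collection.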

Page 15: Multimedia Database (AC5210)

Processing

Transcription: manual

Linguists / phoneticians

Time-consuming

> 10 h of work to transcribe 1 hour of speech (word level)

> 100 h of work to transcribe 1 hour of speech (phone level)
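
The ratios on this slide translate directly into a lower bound on annotation effort. A minimal sketch, assuming a hypothetical 5-hour set of recordings (the 5 hours is an illustration, not a figure from the slides):

```python
# Lower-bound ratios from the slide: working hours per hour of speech.
WORD_LEVEL_RATIO = 10
PHONE_LEVEL_RATIO = 100


def min_effort_hours(recorded_hours: float, ratio: int) -> float:
    """Minimum manual transcription effort, in working hours."""
    return recorded_hours * ratio


recorded = 5  # assumed size of the recording set, in hours
print("word level :", min_effort_hours(recorded, WORD_LEVEL_RATIO), "h at least")
print("phone level:", min_effort_hours(recorded, PHONE_LEVEL_RATIO), "h at least")
```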

Page 16: Multimedia Database (AC5210)

Processing (2)

Transcription: semi-automatic

Multilingual speech recognition

[Figure: input speech → acoustic-phonetic recognizer (multilingual AM) → phone sequence (hypothesis) as an X-SAMPA TextGrid → TextGrid conversion → IPA TextGrid]
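
The label conversion step between X-SAMPA and IPA can be sketched as a table lookup over the phone labels of a tier. This is an illustration only: the mapping below is a small subset of X-SAMPA, the `(start, end, label)` tuple layout is a hypothetical stand-in for TextGrid intervals, and the symbol-by-symbol conversion does not handle multi-character X-SAMPA symbols.

```python
# Small illustrative subset of the X-SAMPA to IPA correspondence.
XSAMPA_TO_IPA = {
    "N": "ŋ", "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð",
    "@": "ə", "E": "ɛ", "O": "ɔ", "?": "ʔ",
}


def xsampa_to_ipa(label: str) -> str:
    """Convert one phone label character by character.

    Characters without an entry in the table pass through unchanged.
    """
    return "".join(XSAMPA_TO_IPA.get(ch, ch) for ch in label)


def convert_tier(intervals):
    """Convert the labels of (start_s, end_s, xsampa_label) intervals."""
    return [(start, end, xsampa_to_ipa(label)) for start, end, label in intervals]


# Hypothetical recognizer output for a short stretch of speech.
hypothesis = [(0.00, 0.08, "N"), (0.08, 0.21, "a"), (0.21, 0.30, "S")]
print(convert_tier(hypothesis))
```

Keeping timestamps untouched and converting only labels means the converted tier stays aligned with the audio, so a linguist can correct the hypothesis instead of labeling from scratch.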

Page 17: Multimedia Database (AC5210)

Processing (3)

Transcription: semi-automatic

Experiments on phone-level transcription of the Green Mong (Mo Piu) language [1]

• < 500 speakers, unwritten language, little linguistic study

• Method: acoustic models supplied by 5 languages, Vietnamese (VN), Mandarin (CH), Khmer (KH), French (FR) and English (EN), each trained on a large corpus

Na language [2]

• Acoustic models from 5 languages: 40 English phones, 43 French phones, 34 Mandarin phones, 41 Vietnamese phones and 36 Khmer phones

1. Caelen-Haumont G, Sam S, Castelli E (2011) Automatic Labeling and Phonetic Assessment for an Unknown Asian Language: The Case of the “Mo Piu” North Vietnamese Minority (early results). In: Asian Language Processing (IALP), 2011 International Conference on. IEEE, pp 260–263

2. Thi-Ngoc-Diep DO, Alexis M, Eric C (2015) Towards the Automatic Processing of Yongning Na (Sino-Tibetan): Developing a “Light” Acoustic Model of the Target Language and Testing “Heavyweight” Models from Five National Languages. In: The 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU’14), St Petersburg, Russia

Page 18: Multimedia Database (AC5210)

Processing (4)

Transcription: semi-automatic

Phone transcription result for Mo Piu

[Figure: automatic labeling compared with manual labeling]

Page 19: Multimedia Database (AC5210)

Processing (5)

Transcription: semi-automatic

Phone transcription result for Mo Piu

[Figure: bar chart, percentage (0 to 50%) of "good + close" versus "acoustic wrong" labels for combinations of the VN, CH, KH, EN and FR acoustic models]

Page 20: Multimedia Database (AC5210)

Archiving (1)

Open Language Archives Community (OLAC)

Page 21: Multimedia Database (AC5210)

Archiving (2)

Participating Archives

Page 22: Multimedia Database (AC5210)

Archiving (3)

Pangloss collection:

Page 23: Multimedia Database (AC5210)

Archiving (5)

Data format standard:

Page 24: Multimedia Database (AC5210)

Data packaging

Annotation

TextGrid

Wordlist

XML generation

Page 25: Multimedia Database (AC5210)

Publishing

Current:

- 42 languages/dialects

- 120 hours of recordings

- 100 annotated documents

Page 26: Multimedia Database (AC5210)

Publishing (2)

Examples: http://lacito.vjf.cnrs.fr/archivage/index_en.htm

Page 27: Multimedia Database (AC5210)

Speech database

For speech processing

Page 28: Multimedia Database (AC5210)

The task-specific voice control and dialog system

[Figure: block diagram showing the position of the text corpus and the speech corpus]

Speech recognizer (vocabulary & grammar model): converts spoken input into grammatically correct text → (text)

Language analyzer (semantic rules): extracts meaning from text → (meaning)

Expert system: selects desired action, issues commands to the system, constructs reply in text form → (reply text)

Text-to-speech synthesizer (pronunciation rules): converts text reply into machine-generated speech → (speech), voice output

System under voice control: executes commands, reports status → output action

Corpora: transcribed speech corpus, text corpus

Page 29: Multimedia Database (AC5210)

Corpus building

Recording

High quality

Well controlled

Not natural

Expensive

=> Specific purposes, text-to-speech

Crowdsourcing

Source: available speech sources

Size: huge

Different types/quality

Natural quality

Not expensive for a big corpus

=> ASR

Page 30: Multimedia Database (AC5210)

Evaluation problems

How to evaluate an ASR (or ASPR) system?

Tests on common databases (benchmarks)

Evaluation campaigns (DARPA for ASR and NIST for ASPR)

For French: AUPELF

Common databases

Page 31: Multimedia Database (AC5210)

Some speech databases

TIMIT: 630 American speakers, recorded in good conditions, 1 recording session

Bref80: 80 French-speaking speakers reading texts from the newspaper « Le Monde » (5330 sentences)

M2VTS: multimodal database (voice + face images)

Switchboard: speech in English, telephone quality, several recording sessions

There are very few databases for mobile phone speech

CTIMIT (TIMIT corpus re-recorded through a cellular mobile phone inside a moving truck)

Cellular Switchboard

Page 32: Multimedia Database (AC5210)

VNSpeechCorpus

Page 33: Multimedia Database (AC5210)

Text corpus - Vietnamese: data collection and normalization (1/2)

Remark:

We can get a large text corpus from the Web. This corpus contains a large number of words in different contexts and different domains.

It is very useful for analyzing acoustic units: it can represent the statistical distribution of the Vietnamese language in general.

Page 34: Multimedia Database (AC5210)

Text corpus - Vietnamese: data collection and normalization (2/2)

Web page collection and data preparation:

Documents were gathered from the Internet by web robots

Constructing the text corpus from HTML pages

Normalizing or rewriting non-standard words

[Figure: a web page whose main content is surrounded by menus, links, references and advertisements; this redundancy must be removed]

[Figure: WWW → data collection → HTML (2.5 GB) → normalization → normalized text (868 MB)]

Page 35: Multimedia Database (AC5210)

Text corpus: data collection and normalization (2/3)

Web page collection and data preparation:

Documents were gathered from the Internet by a web robot

Constructing the text corpus from HTML pages

Normalizing or rewriting non-standard words

All characters were converted to Unicode (UTF-8) by our tools

[Figure: processing chain www → html → txt → sent, built from fixed modules and variable modules]

Modules:

data collecting

html2text

Data preparation:

1. token normalizing

2. character converting

3. sentence splitting

4. word splitting

5. case changing

6. lexicon constructing

7. number2text

8. sentence filtering
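
A few of the preparation steps named on this slide can be sketched as a small pipeline. This is an illustrative sketch, not the authors' toolkit: the function names are hypothetical, `number2text` spells digits one by one rather than reading full numbers, and sentence splitting is a simple punctuation rule.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html2text(html):
    """Strip markup and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


DIGITS_VI = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def number2text(sentence):
    """Spell out each digit separately (a simplification of number reading)."""
    return re.sub(r"\d", lambda m: " " + DIGITS_VI[int(m.group())] + " ", sentence)


def split_sentences(text):
    """Split after sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize(html):
    """html -> list of lower-cased, punctuation-free sentences."""
    text = unicodedata.normalize("NFC", html2text(html))  # character converting
    sentences = []
    for s in split_sentences(text):
        s = number2text(s).lower()              # number2text, case changing
        s = re.sub(r"[^\w\s]", " ", s)          # drop punctuation
        s = re.sub(r"\s+", " ", s).strip()
        if s:                                   # sentence filtering (empty only)
            sentences.append(s)
    return sentences


page = "<p>Xin chào. Tôi có 2 con mèo!</p><script>var x = 1;</script>"
print(normalize(page))
```

Menus, links and advertisements would need page-specific rules on top of this; the sketch only removes markup and script content.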

Page 36: Multimedia Database (AC5210)

Text corpus: data collection and normalization (3/3)

All redundancy removed

Menus

Links

References

Advertisements

Vietnamese: 2.5 GB of HTML pages → 868 MB of text corpus = 10,020,267 sentences

Page 37: Multimedia Database (AC5210)

Text corpus - Vietnamese: text corpus evaluation

Perplexity of the language models

Perplexity is used to evaluate the quality of a language model built from a text corpus.
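
To make the measure concrete: perplexity is the exponential of the average negative log-probability that the model assigns to each token of a test text, so lower values mean the model predicts the text better. The slides do not say which language model was used; the sketch below assumes a unigram model with add-one smoothing purely for illustration.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return a probability function for an add-one smoothed unigram model.

    All unseen words share one extra vocabulary slot, so the
    probabilities of seen words plus that slot sum to 1.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1

    def prob(word):
        return (counts[word] + 1) / (total + vocab)

    return prob


def perplexity(prob, tokens):
    """exp of the average negative log-probability per token."""
    log_sum = sum(math.log(prob(w)) for w in tokens)
    return math.exp(-log_sum / len(tokens))


train = "tôi đi học tôi đi chơi".split()
model = train_unigram(train)
print(perplexity(model, "tôi đi học".split()))   # text close to the training data
print(perplexity(model, "mèo ăn cá".split()))    # unseen words give a higher value
```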

Page 38: Multimedia Database (AC5210)

Speech corpus - Vietnamese: data collection

Text corpus for recording:

Goal: cover the most frequently used words and sentences, with sufficient variation to support flexible and natural spoken language generation.

Content:

Phonemes, digits and strings of digits, application words.

Sentences, short paragraphs.

Domains:

Law, culture and society, sports, science and technology, policy, medicine, business, weather.

Everyday conversations.

Source: web, books, newspapers.

Page 39: Multimedia Database (AC5210)

Quiet studio for recording

Page 40: Multimedia Database (AC5210)

Speech corpus - Vietnamese: requirements of the speech corpus

Phonemes: study the acoustic and spectral characteristics of Vietnamese phonemes. Ex: a /a/ , a/a/ - a ha /a - aha/

Tones: study the acoustic and spectral characteristics of the six tones. Ex: ba, bá, bà, bã, bả, bạ

Digits and strings of digits: to build isolated/connected digit recognition/synthesis systems.

Digits: 0-9

Strings of digits: telephone numbers, credit card numbers

Application words: used in speech-controlled systems such as telephone services, human-machine interfaces...

Ex: đóng - close, mở - open, đo - measure...

Sentences and paragraphs: used for training and testing continuous speech recognition systems:

Dialogs, short paragraphs

Page 41: Multimedia Database (AC5210)

Speech corpus - Vietnamese: speaker selection

The most important speaker characteristics:

sex

regional dialectal background

level of education

age and physical health

Our speakers:

Age: from 15 to 45 years old

Of the 50 speakers, 25 are female and 25 are male

They come from 4 big cities and provinces (Hanoi, Nghe An, Ha Tinh, Ho Chi Minh City), representing the 3 major dialect regions: North, Central and South

Page 42: Multimedia Database (AC5210)

Speech corpus - Vietnamese: VNSpeechCorpus

Common part: 45 minutes of signal covering phonemes, tones, digits and strings of digits, application words, and common sentences and paragraphs:

contains 4955 isolated words, 1257 of them distinct.

840 mono-words of the text corpus are among the 1327 most used mono-words of the web corpus

Private part: 15 minutes of signal, about 40 short paragraphs

2000 short paragraphs in total (70-80 words each), on 20 subjects.

40 short paragraphs per speaker.

Page 43: Multimedia Database (AC5210)

Speech corpus - Vietnamese: speech database evaluation (1/6)

Evaluating the corpus by analyzing the distributions of acoustic units, including:

mono-words, base syllables,

initial-final parts,

phonemes, di-phones, tri-phones and tones

Comparing these distributions with those obtained from the text corpus (Web corpus)

Page 44: Multimedia Database (AC5210)

Speech corpus - Vietnamese: speech database evaluation (2/6)

Distribution of mono-phones in the common part, the private part and the Web corpus

Page 45: Multimedia Database (AC5210)

Speech corpus - Vietnamese: speech database evaluation (3/6)

Distribution of the six tones in the common part, the private part and the Web corpus

The distributions of acoustic units correspond to those of a huge text corpus (Web)
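
A tone distribution of this kind can be estimated from text alone, because Vietnamese orthography writes five of the six tones as diacritics and leaves the level tone unmarked. The sketch below is an assumption about how such a count could be done, not the authors' tool; it treats every whitespace-separated token containing a letter as one syllable and uses ASCII tone names as dictionary keys.

```python
import unicodedata
from collections import Counter

# Combining marks that encode the five marked tones.
TONE_MARKS = {
    "\u0301": "sac",    # acute (sắc)
    "\u0300": "huyen",  # grave (huyền)
    "\u0309": "hoi",    # hook above (hỏi)
    "\u0303": "nga",    # tilde (ngã)
    "\u0323": "nang",   # dot below (nặng)
}


def syllable_tone(syllable):
    """Return the tone of one written syllable."""
    # NFD separates base letters from their combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"  # no tone mark: level tone


def tone_distribution(text):
    """Relative frequency of each tone over the syllables of a text."""
    syllables = [s for s in text.split() if any(c.isalpha() for c in s)]
    counts = Counter(syllable_tone(s) for s in syllables)
    total = sum(counts.values())
    return {tone: n / total for tone, n in counts.items()} if total else {}


print([syllable_tone(s) for s in "ba bá bà bã bả bạ".split()])
print(tone_distribution("ba bá bà bã bả bạ"))
```

Running the same function over the recording scripts and over a Web text sample gives two frequency vectors that can then be compared.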

Page 46: Multimedia Database (AC5210)

Speech corpus - Vietnamese: speech database evaluation (4/6)

We calculated the correlation coefficients between the distributions of the common part and the private part and those of the Web reference corpus

vector x: occurrence frequencies of the acoustic units in VNSpeechCorpus

vector y: occurrence frequencies of the acoustic units in the text corpus

corr(x, y): correlation coefficient between x and y

$$\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$

$$\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

$$\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
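
The correlation coefficient defined on this slide (covariance of x and y divided by the product of their standard deviations, each computed with a 1/n factor) can be sketched directly. The frequency vectors in the example are made-up numbers for illustration, not values from the corpus.

```python
import math


def mean(v):
    return sum(v) / len(v)


def cov(x, y):
    """Population covariance (1/n factor)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def std(x):
    """Population standard deviation, i.e. sqrt(cov(x, x))."""
    return math.sqrt(cov(x, x))


def corr(x, y):
    """Correlation coefficient between two equally long vectors."""
    return cov(x, y) / (std(x) * std(y))


# Hypothetical relative frequencies of four acoustic units.
speech_corpus = [0.40, 0.30, 0.20, 0.10]
text_corpus = [0.38, 0.32, 0.18, 0.12]
print(round(corr(speech_corpus, text_corpus), 3))
```

A value close to 1 indicates that the units occur in the speech corpus in nearly the same proportions as in the reference text.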

Page 47: Multimedia Database (AC5210)

Speech corpus - Vietnamese: speech database evaluation (5/6)

Correlation coefficients of acoustic units between the common part, the private part and the Web data:

Our corpus is acceptable and correctly balanced in terms of acoustic units and tones

Page 48: Multimedia Database (AC5210)

Speech corpus - Vietnamese: speech database evaluation (6/6)

Comparison of VNSpeechCorpus with other speech databases

Database | No. of speakers | Recording time | Purpose

SPEECHDAT | 5000 | 10'/speaker | Training speech recognition systems via telephone network

BREF | 90 | 40'-70'/speaker | Training and testing speech recognition systems for French

SESP | 45 | 10'/speaker | Speaker recognition

Korean Speech Database | 50; 150 | 10 h office; 10 h studio | Training and testing speech recognition systems for Korean

VNSpeechCorpus | 50 | 50 h office; 50 h studio | Training and testing speech recognition/synthesis systems for Vietnamese

Page 49: Multimedia Database (AC5210)

VNSpeechCorpus +

Goals

100 h of transcribed studio speech

> 100 speakers

> 100 h of natural recordings made by smartphone

> 100 h of crowdsourced speech

Page 50: Multimedia Database (AC5210)

Homework

Page 51: Multimedia Database (AC5210)

Next week: presentation

Sound bank examples

Speech Database for ASR

Page 52: Multimedia Database (AC5210)

Course projects

Speech/audio crowdsourcing

Read and summarize: Eskenazi, Maxine, Gina-Anne Levow, Helen Meng, Gabriel Parent, and David Suendermann. Crowdsourcing for speech processing: Applications to data collection, transcription and assessment. John Wiley & Sons, 2013

Tools for speech/audio crowding

Demo: develop a tool for online audio news crowding (VOV, VTV online, etc.)

Speech/language online collection

OLAC collection: overview, standards

Speech segmentation techniques

Demo: speech segmentation (sentence level)