Novel Speech Recognition Models for Arabic The Arabic Speech Recognition Team JHU Workshop Final Presentations August 21, 2002
Page 1

Novel Speech Recognition Models for Arabic

The Arabic Speech Recognition Team
JHU Workshop Final Presentations

August 21, 2002

Page 2

Arabic ASR Workshop Team

Senior Participants:
Katrin Kirchhoff, UW
Jeff Bilmes, UW
John Henderson, MITRE
Mohamed Noamany, BBN
Pat Schone, DoD
Rich Schwartz, BBN

Graduate Students:
Sourin Das, JHU
Gang Ji, UW

Undergraduate Students:
Melissa Egan, Pomona College
Feng He, Swarthmore College

Affiliates:
Dimitra Vergyri, SRI
Daben Liu, BBN
Nicolae Duta, BBN
Ivan Bulyko, UW
Mari Ostendorf, UW

Page 3

“Arabic”

Dialects used for informal conversation:
• Gulf Arabic
• Egyptian Arabic
• Levantine Arabic
• North-African Arabic

Cross-regional standard, used for formal communication:
• Modern Standard Arabic (MSA)

Page 4

Arabic ASR: Previous Work

• dictation: IBM ViaVoice for Arabic
• Broadcast News: BBN TIDES OnTap
• conversational speech: 1996/1997 NIST CallHome evaluations
• little work compared to other languages
• few standardized ASR resources

Page 5

Arabic ASR: State of the Art (before WS02)

• BBN TIDES OnTap: 15.3% WER
• BBN CallHome system: 55.8% WER
• WER on conversational speech noticeably higher than for other languages (e.g., 30% WER for English CallHome)

→ focus on recognition of conversational Arabic

Page 6

Problems for Arabic ASR

• language-external problems:
  – data sparsity: only 1 (!) standardized corpus of conversational Arabic available
• language-internal problems:
  – complex morphology, large number of possible word forms (similar to Russian, German, Turkish, …)
  – differences between written and spoken representation: lack of short vowels and other pronunciation information (similar to Hebrew, Farsi, Urdu, Pashto, …)

Page 7

Corpus: LDC ECA CallHome

• phone conversations between family members/friends
• Egyptian Colloquial Arabic (Cairene dialect)
• high degree of disfluencies (9%), out-of-vocabulary words (9.6%), foreign words (1.6%)
• noisy channels
• training: 80 calls (14 hrs), dev: 20 calls (3.5 hrs), eval: 20 calls (1.5 hrs)
• very small amount of data for language modeling (150K words)!

Page 8

MSA - ECA differences

• Phonology:
  – /th/ → /s/ or /t/: thalatha - talata (‘three’)
  – /dh/ → /z/ or /d/: dhahab - dahab (‘gold’)
  – /zh/ → /g/: zhadeed - gideed (‘new’)
  – /ay/ → /e:/: Sayf - Seef (‘summer’)
  – /aw/ → /o:/: lawn - loon (‘color’)
• Morphology:
  – inflections: yatakallamu - yitkallim (‘he speaks’)
• Vocabulary:
  – different terms: TAwila - tarabeeza (‘table’)
• Syntax:
  – word order differences: SVO - VSO

Page 9

Workshop Goals

improvements to Arabic ASR through:
• developing novel models to better exploit available data
  → factored language modeling, automatic romanization
• developing techniques for using out-of-corpus data
  → integration of MSA text data

Page 10

Factored Language Models

• complex morphological structure leads to a large number of possible word forms
• break words up into separate components
• build statistical n-gram models over individual morphological components rather than complete word forms
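A toy illustration of breaking words into components before n-gram modeling; the affix lists below are invented for the sketch, whereas the workshop derived decompositions from lexicons or learned morphology:

```python
def split_particles(word, prefixes=("wa", "al"), suffixes=("tu", "na")):
    """Greedily strip at most one prefix and one suffix from a word,
    returning the particle sequence (prefix, stem, suffix)."""
    parts = []
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            parts.append(p + "-")
            word = word[len(p):]
            break
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            parts.append(word[: -len(s)])
            parts.append("-" + s)
            break
    else:
        parts.append(word)
    return parts

print(split_particles("wasakantu"))  # ['wa-', 'sakan', '-tu']
```

The n-gram model is then estimated over these particle tokens instead of full word forms.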

Page 11

Automatic Romanization

• Arabic script lacks short vowels and other pronunciation markers
• comparable English example:

  th fsh stcks f th nrth tlntc hv bn dpletd
  → the fish stocks of the north atlantic have been depleted

• lack of vowels results in lexical ambiguity; affects acoustic and language model training
• try to predict vowelization automatically from data and use the result for recognizer training

Page 12

Out-of-corpus text data

• no corpora of transcribed conversational speech available

• large amounts of written (Modern Standard Arabic) data available (e.g. Newspaper text)

• Can MSA text data be used to improve language modeling for conversational speech?

• Try to integrate data from newspapers, transcribed TV broadcasts, etc.

Page 13

Recognition Infrastructure

• baseline system: BBN recognition system
• N-best list rescoring
• language model training: SRI LM toolkit, with significant additions implemented during this workshop
• note: no work on acoustic modeling, speaker adaptation, noise robustness, etc.
• two different recognition approaches: grapheme-based vs. phoneme-based

Page 14

Summary of Results (WER)

Grapheme-based recognizer:
• Baseline: 59.0%
• Automatic romanization: 57.9%

Phone-based recognizer:
• Random: 62.7%
• Baseline: 55.8%
• Additional CallHome data: 55.1%
• True romanization: 54.9%
• Language modeling: 53.8%
• Oracle: 46%

Page 15

Novel research

• new strategies for language modeling based on morphological features
• new graph-based backoff schemes allowing a wider range of smoothing techniques in language modeling
• new techniques for automatic vowel insertion
• first investigation of the use of automatically vowelized data for ASR
• first attempt at using MSA data for language modeling for conversational Arabic
• morphology induction for Arabic

Page 16

Key Insights

• Automatic romanization improves grapheme-based Arabic recognition systems
• trend: morphological information helps in language modeling (needs to be confirmed on a larger data set)
• Using MSA text data does not help
• We need more data!

Page 17

Resources

• significant add-on to SRILM toolkit for general factored language modeling

• techniques/software for automatic romanization of Arabic script

• part-of-speech tagger for MSA & tagged text

Page 18

Outline of Presentations
• 1:30 - 1:45: Introduction (Katrin Kirchhoff)
• 1:45 - 1:55: Baseline system (Rich Schwartz)
• 1:55 - 2:20: Automatic romanization (John Henderson, Melissa Egan)
• 2:20 - 2:35: Language modeling - overview (Katrin Kirchhoff)
• 2:35 - 2:50: Factored language modeling (Jeff Bilmes)
• 2:50 - 3:05: Coffee break
• 3:05 - 3:10: Automatic morphology learning (Pat Schone)
• 3:15 - 3:30: Text selection (Feng He)
• 3:30 - 4:00: Graduate student proposals (Gang Ji, Sourin Das)
• 4:00 - 4:30: Discussion and questions

Page 19

Thank you!
• Fred Jelinek, Sanjeev Khudanpur, Laura Graham
• Jacob Laderman + assistants
• Workshop sponsors
• Mark Liberman, Chris Cieri, Tim Buckwalter
• Kareem Darwish, Kathleen Egan
• Bill Belfield & colleagues from BBN
• Apptek

Page 20
Page 21

BBN Baseline System for Arabic

Richard Schwartz, Mohamed Noamany, Daben Liu, Bill Belfield, Nicolae Duta

JHU Workshop
August 21, 2002

Page 22

BBN BYBLOS System

• Rough’n’Ready / OnTAP / OASIS system

• Version of BYBLOS optimized for Broadcast News

• OASIS system fielded in Bangkok and Amman

• Real-Time operation with 1-minute delay

• 10%-20% WER, depending on data

Page 23

BYBLOS Configuration

• 3 passes of recognition:
  – Forward fast-match uses PTM models and approximate bigram search
  – Backward pass uses SCTM models and approximate trigram search, creates N-best lists
  – Rescoring pass uses cross-word SCTM models and a trigram LM
• All runs in real time
  – Minimal difference from running slowly

Page 24

Use for Arabic Broadcast News

• Transcriptions are in normal Arabic script, omitting short vowels and other diacritics.

• We used each Arabic letter as if it were a phoneme.

• This allowed addition of large text corpora for language modeling.

Page 25

Initial BN Baseline

• 37.5 hours of acoustic training
• Acoustic training data (230K words) used for LM training
• 64K-word vocabulary (4% OOV)
• Initial word error rate (WER) = 31.2%

Page 26

Speech Recognition Performance

System (all real-time results) WER (%)

Baseline 31.2

+ 145M word LM (Al Hayat) 26.6

+ System Improvements (MLLR and tuning) 21.0

+ 128k Lexicon (OOV reduced to 2%) 20.4

+ Additional 20 hours acoustic data 19.1

+ 290M word LM + improved lexicon 17.3

+ New scoring (remove hamza from alif) 15.3

Page 27

Call Home Experiments

• Modified OnTAP system to make it more appropriate for Call Home data.
• Added features from LVCSR research to the OnTAP system for Call Home data.
• Experiments:
  – Acoustic training: 80 conversations (15 hours), transcribed with diacritics
  – Acoustic training data (150K words) used for LM
  – Real-time

Page 28

Using OnTAP system for Call Home

System WER (%)

Baseline for OASIS 64.1

+ Bypass BN segmenter 63.4

+ Cepstral Mean Subtraction on conversations 62.4

+ Incremental MLLR on whole conversation 61.8

+ 1-level CMS (instead of 2) 60.8

Page 29

Additions from LVCSR

System WER (%)

Baseline for OASIS 60.8

+ VTL on training and decoding (unoptimized) 59.0

+ LPC Smoothing with 40 poles 58.7

+ ‘split-init training’ 58.1

+ HLDA (not used for workshop) 56.6

+ Modified backoff (not used for workshop) 56.0

Page 30

Output Provided for Workshop

• OASIS was run on various sets of training as needed.
• Systems were run either for Arabic script phonemes or ‘Romanized’ phonemes (with diacritics).
• In addition to workshop participants, others at BBN provided assistance and worked on workshop problems.
• Output provided for the workshop was N-best sentences
  – with separate scores for HMM, LM, #words, #phones, #silences
  – due to the high error rate (56%), the oracle error rate for the 100 N-best was about 46%.
• Unigram lattices were also provided, with an oracle error rate of 15%.

Page 31

Phoneme HMM Topology Experiment

• The phoneme HMM topology was increased for the Arabic script system from 5 states to 10 states in order to accommodate a consonant and possible vowel.

• The gain was small (0.3% WER)

Page 32

OOV Problem

• OOV rate is 10%
  – 50% is morphological variants of words in the training set
  – 10% is proper names
  – 40% is other unobserved words
• Tried adding words from BN and from a morphological transducer
  – added too many words with too small a gain

Page 33

Use BN to Reduce OOV

• Can we add words from BN to reduce OOV?
• BN text contains 1.8M distinct words.
• Adding the entire 1.8M words reduces OOV from 10% to 3.9%.
• Adding the top 15K words reduces OOV to 8.9%.
• Adding the top 25K words reduces OOV to 8.4%.
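The OOV figures above are just the fraction of test tokens outside the recognition vocabulary; a quick sketch with made-up words:

```python
def oov_rate(vocab, tokens):
    """Fraction of tokens not covered by the recognition vocabulary."""
    return sum(1 for w in tokens if w not in vocab) / len(tokens)

# Invented words, only to show the computation
vocab = {"kitab", "aktib", "bayt"}
tokens = ["kitab", "madrasa", "bayt", "qalam"]

base = oov_rate(vocab, tokens)                    # 0.5
expanded = oov_rate(vocab | {"madrasa"}, tokens)  # 0.25, after growing the vocab
```

Growing the vocabulary trades OOV reduction against lexicon size, which is the tension the slide describes.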

Page 34

Use Morphological Transducer

• Use the LDC Arabic transducer to expand verbs to all forms
  – produces > 1M words
• Reduces OOV to 7%

Page 35

Language Modeling Experiments

Described in other talks:
• Searched for available dialect transcriptions
• Combine BN (300M words) with CH (230K)
• Use BN to define word classes
• Constrained back-off for BN+CH

Page 36
Page 37

Autoromanization of Arabic Script

Melissa Egan and John Henderson

Page 38

Autoromanization (AR) goal
• Expand the Arabic script representation to include short vowels and other pronunciation information.
• Phenomena not typically marked in non-diacritized script include:
  – short vowels {a, i, u}
  – repeated consonants (shadda)
  – extra phonemes for Egyptian Arabic {f/v, j/g}
  – a grammatical marker that adds an ‘n’ to the pronunciation (tanween)
• Example:
  Non-diacritized form: ktb – write
  Expansions: kitab – book
              aktib – I write
              kataba – he wrote
              kattaba – he caused to write

Page 39

AR motivation

• Romanized text can be used to produce better output from an ASR system.
  – Acoustic models will be able to better disambiguate based on extra information in the text.
  – Conditioning events in the LM will contain more information.
• Romanized ASR output can be converted to script for alternative WER measurement.
• Eval96 results (BBN recognizer, 80 conv. train):
  – script recognizer: 61.1 WERG (grapheme)
  – romanized recognizer: 55.8 WERR (roman)

Page 40

AR data

[Diagram: romanizer training and romanizer testing pipeline]

CallHome Arabic from LDC: conversational speech transcripts (ECA) in both script and a roman specification that includes short vowels, repeats, etc.

set               conversations  words
asrtrain          80             135K
dev               20             35K
eval96 (asrtest)  20             15K
eval97            20             18K
h5_new            20             18K

Page 41

Data format
• Script without and with diacritics
• CallHome in script and roman forms

Script: AlHmd_llh kwIsB w AntI AzIk
Roman: ilHamdulillA kuwayyisaB~ wi inti izzayyik

Our task: map the script form to the roman form.

Page 42

Autoromanization (AR) WER baseline

• Train on 32K words in eval97+h5_new
• Test on 137K words in ASR_train+h5_new

Status (in train)  portion in test  error % in test  % of total error
unambig.           68.0%            1.8%             6.2%
ambig.             15.5%            13.9%            10.8%
unknown            16.5%            99.8%            83.0%
total              100%             19.9%            100.0%

The biggest potential error reduction would come from predicting romanized forms for unknown words.

Page 43

AR “knitting” example

1. Find a close known word:
   unknown: tbqwA → known: ybqwA
2. Record the ops required to make the roman form from the known word:
   known: y bqwA → kn. roman: yibqu (ops: ciccrd)
3. Construct the new roman form using the same ops:
   unknown: t bqwA + ops: ciccrd → new roman: tibqu
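The three steps can be sketched in code; the op-code semantics (c = copy a script character, i = insert a character from the known roman form, r = replace, d = delete) are inferred from the ciccrd example rather than spelled out on the slide:

```python
def apply_ops(short, ops, known_roman):
    """Replay the edit ops recorded from a known (script, roman) pair
    onto a new script form.  c=copy, i=insert roman char,
    r=replace with roman char, d=delete script char."""
    out, si, ri = [], 0, 0
    for op in ops:
        if op == "c":            # keep the script character
            out.append(short[si]); si += 1; ri += 1
        elif op == "i":          # insert e.g. a short vowel from the known roman
            out.append(known_roman[ri]); ri += 1
        elif op == "r":          # substitute the known roman's character
            out.append(known_roman[ri]); si += 1; ri += 1
        elif op == "d":          # drop the script character
            si += 1
    return "".join(out)

# Slide's example: ops learned from ybqwA -> yibqu, replayed on tbqwA
print(apply_ops("tbqwA", "ciccrd", "yibqu"))  # tibqu
```

Replaying the ops on the known word itself reproduces its roman form, which is a useful sanity check.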

Page 44

Experiment 1 (best match)

Observed patterns in the known short/long pairs: some characters in the short forms are consistently found with particular, non-identical characters in the long forms.

Example rule: A → a

Page 45

Experiment 2 (rules)

• Some output forms depend on output context.
• Rule:
  – ‘u’ occurs only between two non-vowels.
  – ‘w’ occurs elsewhere.
• Accurate for 99.7% of the instances of ‘u’ and ‘w’ in the training dictionary long forms. A similar rule may be formulated for ‘i’ and ‘y’.

Environments in which ‘w’ occurs in training dictionary long forms:
Env    Freq
C _ V  149
V _ #  8
# _ V  81
C _ #  5
V _ V  121
V _ C  118

Environments in which ‘u’ occurs in training dictionary long forms:
Env    Freq
C _ C  1179
C _ #  301
# _ C  29
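The ‘u’/‘w’ rule reduces to a one-line context check; treating the boundary symbol ‘#’ as a non-vowel is our assumption, consistent with the C _ # and # _ C rows in the ‘u’ counts:

```python
VOWELS = set("aiu")  # short vowels of the romanization

def u_or_w(left, right):
    """Slide's rule: 'u' only between two non-vowels, 'w' elsewhere.
    '#' (word boundary) is not in VOWELS, so it counts as a non-vowel."""
    if left not in VOWELS and right not in VOWELS:
        return "u"
    return "w"

print(u_or_w("k", "t"))  # u  (C _ C)
print(u_or_w("a", "i"))  # w  (V _ V)
```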

Page 46

Experiment 3 (local model)

• Move to a more data-driven model
  – Found some rules manually.
  – Look for all of them, systematically.
• Use the best-scoring candidate for replacement
  – environment likelihood score
  – character alignment score

Known long:  H a n s A h a
Known short: H A n s A h A
input:       H A m D y h A
result:      H a m D I h a

Page 47

Experiment 4 (n-best)
• Instead of generating the romanized form using the single best short form in the dictionary, generate romanized forms using the top n best short forms.

Example (n = 5)

Page 48

Character error rate (CER)

• Measurement of insertions, deletions, and substitutions in character strings should more closely track phoneme error rate.
• More sensitive than WER
  – stronger statistics from the same data
• Test set results
  – Baseline: 49.89 character error rate (CER)
  – Best model: 24.58 CER
  – Oracle 2-best list: 17.60 CER, suggesting more room for gain
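CER as described (insertions, deletions, and substitutions over characters) is Levenshtein edit distance normalized by the reference length; a minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance over characters (ins, del, sub all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate: edit distance over reference length."""
    return edit_distance(hyp, ref) / len(ref)

print(cer("tibqu", "yibqu"))  # 0.2: one substitution over five characters
```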

Page 49

Summary of performance (dev set)

Method                                    Accuracy  CER
Baseline                                  8.4%      41.4%
Knitting                                  16.9%     29.5%
Knitting + best match + rules             18.4%     28.6%
Knitting + local model                    19.4%     27.0%
Knitting + local model + n-best (n = 25)  30.0%     23.1%

Page 50

Varying the number of dictionary matches

[Chart: accuracy and CER vs. number of dictionary matches per form (0–200); performance axis ranges roughly 18–30]

Page 51

ASR scenarios

1) Have a script recognizer, but want to produce the romanized form.
   → postprocessing ASR output
2) Have a small amount of romanized data and a large amount of script data available for recognizer training.
   → preprocessing the ASR training set

Page 52

ASR experiments

[Diagram:
Postprocessing: script train → script ASR → script result (WERG); script result → AR → roman result (WERR)
Preprocessing: script train → AR → roman ASR → roman result (WERR); roman result → script result (WERGR2S)]

Page 53

Experiment: adding script data

Future training set: ASR train 100 conv; AR train 40

• Script LM training data could be acquired from found text.
• Script transcription is cheaper than roman transcription.
• Simulate a preponderance of script by training AR on a separate set.
• ASR is then trained on the output of AR.

Page 54

Eval 96 experiments, 80 conv

Config           WERR  WERG
script baseline  N/A   59.8
postprocessing   61.5  59.8
preprocessing    59.9  59.2 (-0.6)
Roman baseline   55.8  55.6 (-4.2)

Bounding experiment:
• No overlap between ASR train and AR train.
• Poor pronunciations for “made-up” words.

Page 55

Eval 96 experiments, 100 conv

Config           WERR  WERG
script baseline  N/A   59.0
postprocessing   60.7  59.0
preprocessing    58.5  57.5 (-1.5)
Roman baseline   55.1  54.9 (-4.1)

More realistic experiment:
• 20 conversation overlap between ASR train and AR train.
• Better pronunciations for “made-up” words.

Page 56

Remaining challenges

• Correct “dangling tails” in short matches

• Merge unaligned characters

Page 57

Bigram translation model

input s:        t b q w A
output r:       □ t i b q u □
kn. roman d_l:  y i b q u

r* = argmax_r p(s, d_s) p(r | s, d_l) d(d_s, d_l)
   ≈ argmax_r p(s, d_s) p(r, s, d_l) d(d_s, d_l)

p(r, s, d_l) ≈ ∏_i p(r_i | r_{i-1}) p(s_j | r_i) p(d_{l,k} | r_i)

with a character bigram term p(r_i | r_{i-1}), a script-emission term p(s_j | r_i), and a known-roman term p(d_{l,k} | r_i).

Page 58

Future work

• Context provides information for disambiguating both known and unknown words
  – Bigrams for unknown words will also be unknown; use part-of-speech tags or morphology.
• Acoustics
  – Use acoustics to help disambiguate vowels?
  – Provide n-best output as alternative pronunciations for ASR training.

Page 59
Page 60

Factored Language Modeling

Katrin Kirchhoff, Jeff Bilmes, Dimitra Vergyri, Pat Schone, Gang Ji, Sourin Das

Page 61

Arabic morphology

• structure of Arabic derived words

Example: LIVE + past + 1st-sg + particle: “so I lived”
  root: s k n (LIVE)
  pattern: -a-a-
  affix: -tu
  particle: fa-

Page 62

Arabic morphology
• ~5000 roots
• several hundred patterns
• dozens of affixes
→ large number of possible word forms
→ problems training a robust language model
→ large number of OOV words

Page 63

Vocabulary Growth - full word forms

[Chart: CallHome vocabulary size (0–16,000) vs. # word tokens, for English and Arabic full word forms]

Page 64

Vocabulary Growth - stemmed words

[Chart: CallHome vocabulary size (0–16,000) vs. # word tokens, for EN words, AR words, EN stems, AR stems]

Page 65

Particle model

• Break words into sequences of stems + affixes:
  W → π_1, π_2, …, π_M
• Approximate the probability of the word sequence by the probability of the particle sequence:
  P(W_1, W_2, …, W_N) ≈ ∏_{t=1}^{T} P(π_t | π_{t-1}, π_{t-2}, …)
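A toy rendering of the particle model, scoring a particle sequence with a bigram model; the add-one smoothing below is our choice for the sketch, not the workshop's:

```python
import math
from collections import defaultdict

def particle_logprob(particles, bigram_counts, unigram_counts, vocab_size):
    """Log probability of a particle sequence under an
    add-one-smoothed bigram model over particles."""
    lp = 0.0
    for prev, cur in zip(particles, particles[1:]):
        lp += math.log((bigram_counts[(prev, cur)] + 1)
                       / (unigram_counts[prev] + vocab_size))
    return lp

# Toy training data: words already split into particle sequences
corpus = [["wa", "katab", "tu"], ["wa", "katab", "at"]]
bi, uni = defaultdict(int), defaultdict(int)
for sent in corpus:
    for p in sent:
        uni[p] += 1
    for a, b in zip(sent, sent[1:]):
        bi[(a, b)] += 1

score = particle_logprob(["wa", "katab", "tu"], bi, uni, len(uni))
# log(3/6) + log(2/6): smoothed P(katab|wa) and P(tu|katab)
```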

Page 66

Factored Language Model

• Problem: how can we estimate P(W_t | W_{t-1}, W_{t-2}, ...)?
• Solution: decompose W into its morphological components: affixes, stems, roots, patterns
• words can be viewed as bundles of features

[Diagram: each word W_t represented as a bundle of pattern P_t, root R_t, affixes A_t, and stem S_t, shown for W_{t-2}, W_{t-1}, W_t]

Page 67

Statistical models for factored representations

• Class-based LM:
  P(W_t | W_{t-1}, W_{t-2}) ≈ P(W_t | F_t) P(F_t | F_{t-1}, F_{t-2})

• Single-stream LM:
  P(F_t | F_{t-1}, F_{t-2}, …, F_1) ≈ P(F_t | F_{t-1}, F_{t-2})
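The class-based factorization can be sketched with toy probability tables; all words, classes, and numbers below are illustrative, not from the workshop systems:

```python
def class_lm_prob(w, hist, word_class, p_w_given_c, p_c_given_hist):
    """P(W_t | W_{t-1}, W_{t-2}) ~= P(W_t | F_t) * P(F_t | F_{t-1}, F_{t-2}),
    where F is the word's (morphological) class."""
    c = word_class[w]
    c_hist = tuple(word_class[h] for h in hist)
    return p_w_given_c[(w, c)] * p_c_given_hist[(c, c_hist)]

# Toy tables: a noun following a determiner and a verb
word_class = {"katab": "V", "kitab": "N", "al": "DET"}
p_w_given_c = {("kitab", "N"): 0.5, ("katab", "V"): 0.4}
p_c_given_hist = {("N", ("DET", "V")): 0.6}

p = class_lm_prob("kitab", ["al", "katab"], word_class, p_w_given_c, p_c_given_hist)
# 0.5 * 0.6 = 0.3
```

Because class sequences are far less sparse than word sequences, the class trigram can be estimated from much less data.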

Page 68

Full Factored Language Model

Assume w_i = (a_i, r_i, π_i), where w = word, r = root, π = pattern, a = affixes. Then

P(w_i | w_{i-1}, w_{i-2})
  = P(a_i, r_i, π_i | a_{i-1}, r_{i-1}, π_{i-1}, a_{i-2}, r_{i-2}, π_{i-2})
  = P(a_i | r_i, π_i, a_{i-1}, r_{i-1}, π_{i-1}, a_{i-2}, r_{i-2}, π_{i-2})
  × P(r_i | π_i, a_{i-1}, r_{i-1}, π_{i-1}, a_{i-2}, r_{i-2}, π_{i-2})
  × P(π_i | a_{i-1}, r_{i-1}, π_{i-1}, a_{i-2}, r_{i-2}, π_{i-2})

• Goal: find appropriate conditional independence statements to simplify this model.

Page 69

Experimental Infrastructure

• All language models were tested using N-best rescoring
• two baseline word-based LMs:
  – B1: BBN LM, WER 55.1%
  – B2: WS02 baseline LM, WER 54.8%
• combination of baselines: 54.5%
• new language models were used in combination with one or both baseline LMs
• log-linear score combination scheme

Page 70

Log-linear combination

For m information sources, each producing a maximum-likelihood estimate for W:

P(W | I) = (1 / Z(I)) ∏_{i=1}^{m} P(W | I_i)^{k_i}

I: total information available
I_i: the i-th information source
k_i: weight for the i-th information source
Z(I): normalization factor
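In log space the combination is simply a weighted sum of per-source log probabilities, and Z(I) can be dropped when ranking hypotheses since it is identical for all of them; a sketch with invented scores:

```python
def combined_logscore(logprobs, weights):
    """Weighted sum of per-source log probabilities: the log of
    prod_i P(W | I_i)^{k_i}, with the normalizer Z(I) omitted
    (it is constant across competing hypotheses)."""
    return sum(k * lp for k, lp in zip(weights, logprobs))

# Two hypotheses, each scored by two LMs (toy log probabilities)
hyp_scores = {"hyp1": [-10.0, -12.0], "hyp2": [-11.0, -9.0]}
weights = [0.7, 0.3]
best = max(hyp_scores, key=lambda h: combined_logscore(hyp_scores[h], weights))
# hyp2 wins: 0.7*(-11) + 0.3*(-9) = -10.4 > -10.6
```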

Page 71

Discriminative combination

• We optimize the combination weights jointly with the language model and insertion penalty to directly minimize WER of the maximum likelihood hypothesis.

• The normalization factor can be ignored since it is the same for all alternative hypotheses.

• Used the simplex optimization method on the 100-bests provided by BBN (optimization algorithm available in the SRILM toolkit).
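The idea can be shown with a one-parameter stand-in for the simplex search (SRILM's actual optimizer tunes several weights jointly); the two-utterance N-best tuples below are invented for illustration:

```python
def nbest_wer(weight, nbest_lists):
    """WER when each utterance's best hypothesis is picked by
    score = lm1 + weight * lm2.  Each hypothesis is a tuple
    (lm1_logprob, lm2_logprob, word_errors, reference_words)."""
    errs = tot = 0
    for hyps in nbest_lists:
        best = max(hyps, key=lambda h: h[0] + weight * h[1])
        errs += best[2]
        tot += best[3]
    return errs / tot

def tune(nbest_lists, grid):
    """Pick the weight on the grid minimizing N-best WER."""
    return min(grid, key=lambda w: nbest_wer(w, nbest_lists))

# Invented N-best lists for two utterances
nbest = [[(-10.0, -8.0, 2, 10), (-9.0, -12.0, 0, 10)],
         [(-5.0, -6.0, 1, 8), (-6.0, -5.0, 0, 8)]]
w = tune(nbest, [i / 10 for i in range(21)])
```

Since WER is a step function of the weights, derivative-free methods such as simplex or grid search are the natural fit.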

Page 72

Word decomposition

• Linguistic decomposition (expert knowledge)

• automatic morphological decomposition: acquire morphological units from data without using human knowledge

• assign words to classes based not on characteristics of word form but based on distributional properties

Page 73

(Mostly) Linguistic Decomposition

• Stems/morph class: information from the LDC CH lexicon:
  $atamna <…> $atam:verb+past-1st-plural  (stem: $atam, morph. tag)
• roots: determined by K. Darwish’s morphological analyzer for MSA
  $atam → $tm
• pattern: determined by subtracting the root from the stem
  $atam → CaCaC

Page 74

Automatic Morphology

• Classes defined by morphological components derived from data

• no expert knowledge
• based on statistics of word forms
• more details in Pat’s presentation

Page 75

Data-driven Classes
• Word clustering based on distributional statistics
• Exchange algorithm (Martin et al., 1998)
  – initially assign words to individual clusters
  – move each word temporarily to all other clusters, compute the change in perplexity (class-based trigram)
  – keep the assignment that minimizes perplexity
  – stop when the class assignment no longer changes
• bottom-up clustering (SRI toolkit)
  – initially assign words to individual clusters
  – successively merge the pairs of clusters with the highest average mutual information
  – stop at a specified number of classes

Page 76

Results

• Best word error rates obtained with:
  – particle model: 54.0% (B1 + particle LM)
  – class-based models: 53.9% (B1 + Morph + Stem)
  – automatic morphology: 54.3% (B1 + B2 + Rule)
  – data-driven classes: 54.1% (B1 + SRILM, 200 classes)
• combination of best models: 53.8%

Page 77

Conclusions
• Overall improvement in WER gained from language modeling (1.3%) is significant
• individual differences between LMs are not significant
• but: adding morphological class models always helps language model combination
• morphological models get the highest weights in combination (in addition to word-based LMs)
• trend needs to be verified on a larger data set
→ application to script-based system?

Page 78
Page 79

Factored Language Models and Generalized

Graph Backoff

Jeff Bilmes, Katrin Kirchhoff
University of Washington, Seattle &

JHU-WS02 ASR Team

Page 80

Outline
• Language Models, Backoff, and Graphical Models

• Factored Language Models (FLMs) as Graphical Models

• Generalized Graph Backoff algorithm

• New features to SRI Language Model Toolkit (SRILM)

Page 81

Standard Language Modeling

P(w_t | h_t) ≈ P(w_t | w_{t-1}, w_{t-2}, w_{t-3})

• Example: standard tri-gram

[Graph: chain over W_{t-4}, W_{t-3}, W_{t-2}, W_{t-1}, W_t]

Page 82

Typical Backoff in LM

W_t | W_{t-1}, W_{t-2}, W_{t-3}
→ W_t | W_{t-1}, W_{t-2}
→ W_t | W_{t-1}
→ W_t

• In typical LM, there is one natural (temporal) path to back off along.

• Well motivated since information often decreases with word distance.

Page 83

Factored LM: Proposed Approach

• Decompose words into smaller morphological or class-based units (e.g., morphological classes, stems, roots, patterns, or other automatically derived units).

• Produce probabilistic models over these units to attempt to improve WER.

Page 84

Example with Words, Stems, and Morphological classes

[Graph: stem (S), word (W), and morph-class (M) streams over times t-3 … t]

P(w_t | s_t, m_t)   P(s_t | m_t, w_{t-1}, w_{t-2})   P(m_t | w_{t-1}, w_{t-2})

Page 85

Example with Words, Stems, and Morphological classes

[Graph: stem (S), word (W), and morph-class (M) streams over times t-3 … t]

P(w_t | w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, m_{t-1}, m_{t-2})

Page 86

In general

[Graph: factors F^1, F^2, F^3 at time t and their counterparts at times t-1, t-2, t-3]

Page 87

General Factored LM

• A word is equivalent to a collection of factors: w_t ≡ {f_t^{1:K}}, where f^k is the kth factor.

• E.g., if K = 3:

P(w_t | w_{t-1}, w_{t-2})
  = P(f_t^1, f_t^2, f_t^3 | f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3)
  = P(f_t^1 | f_t^2, f_t^3, f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3)
  × P(f_t^2 | f_t^3, f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3)
  × P(f_t^3 | f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3)

• Goal: find appropriate conditional independence statements to simplify this sort of model while keeping perplexity and WER low. This is the structure learning problem in graphical models.

Page 88

The General Case

[Graph: factors F^1, F^2, F^3 at times t, t-1, t-2, t-3 with cross-stream dependencies]

Page 89

The General Case

[Graph: node F_i with parents F_{A1}, F_{A2}, F_{A3}]

Page 90

The General Case

[Graphs: all sub-models obtained by dropping subsets of the parents F_{A1}, F_{A2}, F_{A3} from F_i]

Page 91

A Backoff Graph (BG)

F_i | F_{A1}, F_{A2}, F_{A3}
→ F_i | F_{A1}, F_{A2}   F_i | F_{A1}, F_{A3}   F_i | F_{A2}, F_{A3}
→ F_i | F_{A1}   F_i | F_{A2}   F_i | F_{A3}
→ F_i

Page 92

Example: 4-gram Word Generalized Backoff

W_t | W_{t-1}, W_{t-2}, W_{t-3}
→ W_t | W_{t-1}, W_{t-2}   W_t | W_{t-1}, W_{t-3}   W_t | W_{t-2}, W_{t-3}
→ W_t | W_{t-1}   W_t | W_{t-2}   W_t | W_{t-3}
→ W_t

Page 93

How to choose backoff path?

Four basic strategies:
1. Fixed path (based on what seems reasonable, e.g., temporal constraints)
2. Generalized all-child backoff
3. Constrained multi-child backoff
4. Child combination rules

Page 94

Choosing a fixed back-off path

[Backoff graph with one fixed path highlighted from F_i | F_{A1}, F_{A2}, F_{A3} down to F_i]

Page 95

How to choose backoff path?

Four basic strategies:
1. Fixed path (based on what seems reasonable, e.g., temporal constraints)
2. Generalized all-child backoff
3. Constrained multi-child backoff
4. Child combination rules

Page 96

Generalized Backoff

P_GBO(f | f_1, f_2) =
  d_{N(f, f_1, f_2)} · N(f, f_1, f_2) / N(f_1, f_2)    if N(f, f_1, f_2) > 0
  α(f_1, f_2) · g(f, f_1, f_2)                          otherwise

• In typical backoff, we drop 2nd parent and use conditional probability.

g(f, f_1, f_2) = P_BO(f | f_1)

• More generally, g() can be any positive function, but need new algorithm for computing backoff weight (BOW).

Page 97

Computing BOWs

α(f_1, f_2) =
  ( 1 − Σ_{f : N(f, f_1, f_2) > 0} d_{N(f, f_1, f_2)} · N(f, f_1, f_2) / N(f_1, f_2) )
  / ( Σ_{f : N(f, f_1, f_2) = 0} g(f, f_1, f_2) )

• Many possible choices for g() functions (next few slides)

• Caveat: certain g() functions can make the LM much more computationally costly than standard LMs.
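The two formulas above fit together as follows; a minimal Python sketch, assuming a single fixed discount `d` in place of the count-dependent discounts d_N used by the real toolkit:

```python
from collections import Counter

def make_gbo(trigrams, g, d=0.7):
    """Generalized backoff (sketch). trigrams is a list of (f, f1, f2)
    events; g(f, f1, f2) is any positive backoff function. Returns a
    probability function p(f, f1, f2) that discounts seen events and
    spreads the left-over mass over unseen f in proportion to g()."""
    n3 = Counter(trigrams)
    n2 = Counter((f1, f2) for _, f1, f2 in trigrams)
    vocab = sorted({f for f, _, _ in trigrams})

    def bow(f1, f2):
        # alpha(f1, f2): left-over probability mass / total g() of unseen f
        seen = [f for f in vocab if n3[(f, f1, f2)] > 0]
        unseen = [f for f in vocab if n3[(f, f1, f2)] == 0]
        left = 1.0 - sum(d * n3[(f, f1, f2)] / n2[(f1, f2)] for f in seen)
        denom = sum(g(f, f1, f2) for f in unseen)
        return left / denom if denom > 0 else 0.0

    def p(f, f1, f2):
        if n3[(f, f1, f2)] > 0:
            return d * n3[(f, f1, f2)] / n2[(f1, f2)]
        return bow(f1, f2) * g(f, f1, f2)

    return p
```

With g() constant this reduces to ordinary discounted backoff to a uniform distribution; plugging in g(f, f_1, f_2) = P_BO(f | f_1) recovers the standard one-parent-drop case.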

Page 98

g() functions

• Standard backoff
  g(f, f_1, f_2) = P_BO(f | f_1)

• Max counts
  g(f, f_1, f_2) = P_BO(f | f_{j*}),  j* = argmax_j N(f, f_j)

• Max normalized counts
  j* = argmax_j N(f, f_j) / N(f_j)

Page 99

More g() functions
• Max backoff graph node:
  g(f, f_1, f_2) = P_BO(f | f_{j*}),  j* = argmax_j P_BO(f | f_j)

[Backoff graph: F_i | F_{A1}, F_{A2}, F_{A3} over its two-parent and one-parent children]

Page 100


Page 101

How to choose backoff path?

Four basic strategies:
1. Fixed path (based on what seems reasonable (time))
2. Generalized all-child backoff
3. Constrained multi-child backoff
   • Same as before, but choose a subset of possible paths a priori
4. Child combination rules
   • Combine child nodes via a combination function (mean, weighted avg., etc.)

Page 102

Significant Additions to Stolcke's SRILM, the SRI Language Modeling Toolkit

• New features added to SRILM:
  – Can specify an arbitrary number of graphical-model based factorized models to train, compute perplexity, and rescore N-best lists.
  – Can specify any (possibly constrained) set of backoff paths from the top to the bottom level in the BG.
  – Different smoothing (e.g., Good-Turing, Kneser-Ney) or interpolation methods may be used at each backoff graph node.
  – Supports the generalized backoff algorithms with 18 different possible g() functions at each BG node.

Page 103

Example with Words, Stems, and Morphological classes

[Graph: stem (S), word (W), and morph-class (M) streams over times t-3 … t]

P(w_t | s_t, m_t)   P(s_t | m_t, w_{t-1}, w_{t-2})   P(m_t | w_{t-1}, w_{t-2})

Page 104

How to specify a model

## word given stem, morph
W : 2 S(0) M(0)
  S0,M0 M0 wbdiscount gtmin 1 interpolate
  S0 S0 wbdiscount gtmin 1
  0 0 wbdiscount gtmin 1

## stem given morph, word, word
S : 3 M(0) W(-1) W(-2)
  M0,W1,W2 W2 kndiscount gtmin 1 interpolate
  M0,W1 W1 kndiscount gtmin 1 interpolate
  M0 M0 kndiscount gtmin 1
  0 0 kndiscount gtmin 1

## morph given word, word
M : 2 W(-1) W(-2)
  W1,W2 W2 kndiscount gtmin 1 interpolate
  W1 W1 kndiscount gtmin 1 interpolate
  0 0 kndiscount gtmin 1

Corresponding backoff chains:
  W_t | S_t, M_t → W_t | S_t → W_t
  S_t | M_t, W_{t-1}, W_{t-2} → S_t | M_t, W_{t-1} → S_t | M_t → S_t
  M_t | W_{t-1}, W_{t-2} → M_t | W_{t-1} → M_t

Page 105

Summary
• Language Models, Backoff, and Graphical Models

• Factored Language Models (FLMs) as Graphical Models

• Generalized Graph Backoff algorithm

• New features to SRI Language Model Toolkit (SRILM)

Page 106

Coffee Break

Back in 10 minutes

Page 107

Knowledge-Free Induction of Arabic Morphology

Patrick Schone
21 August 2002

Page 108

Why induce Arabic morphology?

(1) Has not been done before
(2) If it can be done, and if it has value in LM, it can generalize across languages without needing an expert

Page 109

Original Algorithm(Schone & Jurafsky, ‘00/`01)

Look for word inflections among words with frequency > 9

Use a character tree to find word pairs with similar beginnings/endings. Ex: car/cars, car/cares, car/caring

Use Latent Semantic Analysis to induce semantic vectors for each word, then compare word-pair semantics

Use frequencies of word stems/rules to improve the initial semantic estimates

Page 110

Trie-based approach could be a problem for Arabic:
Templates => $aGlaB: { $aGlaB il$AGil $aGlu $AGil }
Result: 3576 words in CallHome lexicon w/ 50+ relationships!

Algorithmic Expansions: IR-Based Minimum Edit Distance

        ∙   $   A   G   i   l
    ∙   0   1   2   3   4   5
    $   1   0   1   2   3   4
    a   2   1   2   3   4   5
    G   3   2   3   2   3   4
    l   4   3   4   3   4   3
    a   5   4   5   4   5   4
    B   6   5   6   5   6   5

Use Minimum Edit Distance to find the relationships (can be weighted)

Use an information-retrieval based approach to facilitate the search for MED candidates
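The edit-distance matrix above is the standard dynamic program; a minimal weighted sketch (cost parameters are illustrative, not the workshop's tuned weights):

```python
def min_edit_distance(a, b, ins=1, dele=1, sub=1):
    """Weighted minimum edit distance, as used to pair template-related
    Arabic forms (e.g. $aGlaB vs. $AGil). D[i][j] is the cheapest way to
    turn a[:i] into b[:j] via insertions, deletions, and substitutions."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * dele           # delete all of a[:i]
    for j in range(1, n + 1):
        D[0][j] = j * ins            # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            D[i][j] = min(D[i - 1][j] + dele,      # deletion
                          D[i][j - 1] + ins,       # insertion
                          D[i - 1][j - 1] + cost)  # match/substitution
    return D[m][n]
```

Non-uniform costs (e.g., making vowel substitutions cheap) let the same routine favor template-internal changes, as the slide's "can be weighted" note suggests.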

Page 111

Algorithmic Expansions: Agglomerative Clustering Using Rules & Stems

#Word Pairs w/ Rule          #Word Pairs w/ Stem
* => il+*        1178        Gayyar        507
* => *u           635        xallaS        503
* => *i           455        makallim$     468
*i => *u          377        qaddim        434
* => fa+*         375        itgawwiz      332
* => bi+*         366        tkallim       285
…                            …

Do bottom-up clustering, where the weight between two words is Ct(Rule) · Ct(PairedStem)^(1/2)

Page 112

Algorithmic ExpansionsUpdated Transitivity

If X~Y and Y~Z and |X∩Y|>2 and X∩Y⊂Z, then X~Z

Page 113

Scoring Induced Morphology

Score in terms of conflation set agreement.
Conflation set (W) = all words morphologically related to W.
Example: $aGlaB: { $aGlaB il$AGil $aGlu $AGil }

If X_W = induced set for W, and Y_W = truth set for W, compute total correct, inserted, and deleted as:

C = Σ_w |X_w ∩ Y_w| / |Y_w|
I = Σ_w |X_w − (X_w ∩ Y_w)| / |Y_w|
D = Σ_w |Y_w − (X_w ∩ Y_w)| / |Y_w|

ErrorRate = 100 · (I + D) / (C + D)
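The scoring above is a direct set computation; a minimal sketch (function name assumed):

```python
def morphology_error_rate(induced, truth):
    """Conflation-set scoring from the slide: induced and truth each map
    a word W to its set of morphologically related words (X_W, Y_W).
    Returns ErrorRate = 100 * (I + D) / (C + D)."""
    C = I = D = 0.0
    for w, Yw in truth.items():
        Xw = induced.get(w, set())
        C += len(Xw & Yw) / len(Yw)   # correct: induced ∩ truth
        I += len(Xw - Yw) / len(Yw)   # inserted: induced but not in truth
        D += len(Yw - Xw) / len(Yw)   # deleted: truth but not induced
    return 100.0 * (I + D) / (C + D) if (C + D) else 0.0
```

Perfect induction gives 0; inducing only half of a two-word truth set gives 50.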

Page 114

Scoring Induced Morphology

Induction error rates on words from the original 80 set:

Exp# | Algorithm          | Words w/ Frq≥10 (Suf / Pref / Gen'l) | All words (Suf / Pref / Gen'l)
1    | Semantics alone    | 20.9 / 11.7 / 39.8                   | 29.7 / 20.6 / 60.7
2    | Exp1+Freq Info     | 19.2 / 11.5 / 39.0                   | 27.6 / 16.8 / 57.6
3    | Exp1+NewData       | 20.3 / 12.5 / 39.6                   | 27.6 / 16.7 / 56.9
4    | Exp2+NewData       | 23.5 / 14.5 / 38.7                   | 25.4 / 15.4 / 55.1
5    | NewData+MED:Sem    | 19.5 / 13.0 / 39.8                   | 27.2 / 17.5 / 57.2
6    | NewData+Clusters   | 17.2 / 11.8 / 36.6                   | 24.8 / 15.9 / 55.5
7    | Union: Exp5, Exp6  | 16.2 / 10.8 / 35.8                   | 23.7 / 14.3 / 54.5
8    | Union: Exp3, Exp6  | 17.5 / 10.6 / 35.9                   | 24.2 / 13.9 / 54.2
9    | Exp7 + NewTrans    | 14.9 /  8.4 / 33.9                   | 22.4 / 12.3 / 53.1
10   | Exp8 + NewTrans    | 16.4 /  8.4 / 33.6                   | 23.3 / 12.3 / 52.7

Page 115

Using Morphology for LM Rescoring

System                  | Word Error Rate
Baseline: L1+L2 only    | 54.5%
Baseline + Root         | 54.3%
Baseline + Stem         | 54.6%
Baseline + Class        | 54.4%
Baseline + Root+Class   | 54.4%

For each word W, use induced morphology to generate:
• Stem = smallest word, z, from X_W where z < w
• Root = character intersection across X_W
• Rule = map of word-to-stem
• Pattern = map of stem-to-root
• Class = map of word-to-root

Page 116

Other Potential Benefits of Morphology:

Morphology-driven Word Generation
• Generate probability-weighted "words" using morphologically-derived rules (like Null => il+NULL)
• Generate only if the initial and final n characters of the stem have been seen before.

Method                  | Number proposed | Coverage | Observed as words
Rule only               | 993398          | 41.3%    | 0.1%
Rule+1-char stem agree  | 98864           | 25.0%    | 1.1%
Rule+2-char stem agree  | 35092           | 14.9%    | 1.8%

Page 117
Page 118

Text Selection for Conversational Arabic

Feng He
ASR (Arabic Speech Recognition) Team

JHU Workshop

Page 119

Motivation
• Group goal: Conversational Arabic Speech Recognition.
• One of the problems: not enough training data to build a Language Model – most available text is in MSA (Modern Standard Arabic) or a mixture of MSA and conversational Arabic.
• One solution: Select from mixed text the segments that are conversational, and use them in training.

Page 120

Task: Text Selection
– Use POS-based language models, because they have been shown to better indicate differences in style, such as formal vs. conversational.
– Method:
  1. Train a POS (part of speech) tagger on available data
  2. Train POS-based language models on formal vs. conversational data
  3. Tag new data
  4. Select the segments of the new data that are closest to the conversational model, using scores from the POS-based language models.

Page 121

Data
• For building the Tagger and Language Models
  – Arabic Treebank: 130K words of hand-tagged newspaper text in MSA.
  – Arabic CallHome: 150K words of transcribed phone conversations. Tags are only in the lexicon.
• For Text Selection
  – Al Jazeera: 9M words of transcribed TV broadcasts. We want to select segments that are closer to conversational Arabic, such as talk shows and interviews.

Page 122

Implementation

• Model (bigram):

T* = argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W)

P(W | T) P(T) = Π_i P(w_i | t_i) · P(t_i | t_{i-1})

[Graphical model: tag chain t_{i-1} → t_i, with each tag t_i emitting word w_i]

Page 123

About unknown words:
• These are words that are not seen in training data, but appear in test data.
• Assume unknown words behave like singletons (words that appear only once in the training data).
• This is done by duplicating the training data with singletons replaced by a special token, then training the tagger on both the original and the duplicate.

Page 124

Tools: GMTK (Graphical Model Toolkit)

Algorithms:
• Training: EM – set parameters so that the joint probability of hidden states and observations is maximized.
• Decoding (tagging): Viterbi – find the hidden state sequence that maximizes the joint probability of hidden states and observations.
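The Viterbi decoding step for the bigram model above can be sketched directly; this stands in for GMTK's general-purpose inference, and the table-based probability lookups (`p_emit`, `p_trans`) are assumptions of the sketch:

```python
from math import log

def viterbi(words, tags, p_emit, p_trans, start="<s>"):
    """Bigram HMM tagging: find the tag sequence maximizing
    prod_i P(w_i | t_i) P(t_i | t_{i-1}), in log space.
    p_emit[(w, t)] and p_trans[(t, t_prev)] are probability tables;
    missing entries get a tiny floor (a stand-in for real smoothing)."""
    FLOOR = 1e-12
    V = [{}]      # V[i][t]: best log score of any tag path ending in t at i
    back = [{}]   # back[i][t]: predecessor tag on that best path
    for t in tags:
        V[0][t] = (log(p_trans.get((t, start), FLOOR)) +
                   log(p_emit.get((words[0], t), FLOOR)))
        back[0][t] = None
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda tp: V[i - 1][tp] +
                       log(p_trans.get((t, tp), FLOOR)))
            V[i][t] = (V[i - 1][prev] +
                       log(p_trans.get((t, prev), FLOOR)) +
                       log(p_emit.get((words[i], t), FLOOR)))
            back[i][t] = prev
    # backtrace from the best final tag
    t = max(tags, key=lambda t: V[-1][t])
    seq = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        seq.append(t)
    return seq[::-1]
```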

Page 125

Experiments

Exp 1: Data: first 100K of English Penn Treebank. Trigram model. Sanity check.
Exp 2: Data: Arabic Treebank. Trigram model.
Exp 3: Data: Arabic Treebank and CallHome. Trigram model.

The above three experiments all used 10-fold cross validation, and are unsupervised.

Exp 4: Data: Arabic Treebank. Supervised trigram model.
Exp 5: Data: Arabic Treebank and CallHome. Partially supervised training using the Treebank's tagged data. Test on the portion of the Treebank not used in training. Trigram model.

Page 126

Results

Experiment                             | Accuracy | Accuracy on OOV | Baseline
1 – tri, en                            | 92.7     | 37.9            | 79.3 – 95.5
2 – tri, ar, tb                        | 79.5     | 19.3            | 75.9
3 – tri, ar, tb+ch                     | 74.6     | 17.6            | 75.9
4 – tri, ar, tb, sup                   | 90.9     | 56.5            | 90.0
5 – repeat 3 with partial supervision  | 83.4     | 43.6            | 90.0

Page 127

Building Language Models and Text Selection

• Use existing scripts to build formal and conversational language models from tagged Arabic Treebank and CallHome data.

• Text selection: use the log likelihood ratio

Score(S_i) = log [ P(S_i | C)^(1/N_i) · P(C) / ( P(S_i | F)^(1/N_i) · P(F) ) ]

S_i: the ith sentence in the data set
C: conversational language model
F: formal language model
N_i: length of S_i
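In log space, the per-sentence score reduces to a difference of length-normalized log-likelihoods plus log priors; a minimal sketch, with the language models abstracted as callables (an assumption, not the workshop's scripts):

```python
from math import log

def selection_score(sentence, lm_conv, lm_formal, p_c=0.5, p_f=0.5):
    """Score(S) = log[ P(S|C)^(1/N) P(C) / (P(S|F)^(1/N) P(F)) ].
    lm_conv / lm_formal are assumed callables returning log P(S | model)
    for a sentence given as a list of POS tags; N is the sentence length."""
    n = len(sentence)
    return ((lm_conv(sentence) / n + log(p_c)) -
            (lm_formal(sentence) / n + log(p_f)))
```

Sentences with positive scores look more conversational than formal; ranking by this score and thresholding gives the selected subset.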

Page 128

Score Distribution

[Plots: log count vs. log likelihood ratio; percentage vs. log likelihood ratio]

Page 129

Assessment

• A subset of Al Jazeera equal in size to Arabic CallHome (150K words) is selected and added to the training data for the speech recognition language model.
• No reduction in perplexity.
• Possible reasons: Al Jazeera has no conversational Arabic, or has only conversational Arabic of a very different style.

Page 130

Text Selection Work Done at BBN

Rich Schwartz
Mohamed Noamany

Daben LiuNicolae Duta

Page 131

Search for Dialect Text

• We have an insufficient amount of CH text for estimating a LM.
• Can we find additional data?
• Many words are unique to dialect text.
• Searched the Internet for 20 common dialect words.
• Most of the data found were jokes or chat rooms – very little data.

Page 132

Search BN Text for Dialect Data

• Search BN text for the same 20 dialect words.
• Found even less than in the CH data.
• Each occurrence was typically an isolated lapse by the speaker into dialect, followed quickly by a recovery to MSA for the rest of the sentence.

Page 133

Combine MSA text with CallHome

• Estimate separate models for MSA text (300M words) and CH text (150K words).
• Use the SRI toolkit to determine a single optimal weight for the combination, using deleted interpolation (EM)
  – Optimal weight for MSA text was 0.03
• Insignificant reduction in perplexity and WER
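The single-weight EM fit mentioned above is a one-parameter mixture estimation; a minimal sketch (names and the per-token probability interface are assumptions, not the SRI toolkit's API):

```python
def em_mixture_weight(dev_probs_msa, dev_probs_ch, iters=50, lam=0.5):
    """Deleted-interpolation style EM: find the weight lam on the MSA
    model that maximizes held-out likelihood of the mixture
    lam * P_msa + (1 - lam) * P_ch, given per-token probabilities that
    each model assigns to a held-out (deleted) data set."""
    for _ in range(iters):
        # E-step: posterior that each token came from the MSA component
        post = [lam * pm / (lam * pm + (1 - lam) * pc)
                for pm, pc in zip(dev_probs_msa, dev_probs_ch)]
        # M-step: new weight = average posterior responsibility
        lam = sum(post) / len(post)
    return lam
```

When the CH model assigns much higher probability almost everywhere, the weight collapses toward 0, consistent with the 0.03 found on the slide.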

Page 134

Classes from BN

Hypothesis:
• Even if MSA n-grams are different, perhaps the classes are the same.
Experiment:
• Determine classes (using the SRI toolkit) from BN+CH data.
• Use CH data to estimate n-grams of classes and/or p(w | class)
• Combine the resulting model with the CH word trigram
Result:
• No gain

Page 135

Hypothesis Test Constrained Back-Off

Hypothesis:
• In combining BN and CH, if a probability is different, it could be for 2 reasons:
  – CH has insufficient training
  – BN and CH truly have different probabilities (likely)
Algorithm:
• Interpolate BN and CH, but limit the probability change to be as much as would be likely due to insufficient training.
• An n-gram count cannot change by more than its square root.
Result:
• No gain
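The square-root constraint amounts to clamping the interpolated count to a one-standard-deviation band around the CH count; a minimal sketch of that step (function name assumed):

```python
from math import sqrt

def constrained_count(ch_count, interp_count):
    """Accept the interpolated n-gram count only insofar as it moves the
    CallHome count by no more than its own sampling noise (~sqrt of the
    count, the Poisson standard deviation): clamp to [c - sqrt(c), c + sqrt(c)]."""
    lo = ch_count - sqrt(ch_count)
    hi = ch_count + sqrt(ch_count)
    return min(max(interp_count, lo), hi)
```

So a CH count of 100 can move at most to 90 or 110, however far the BN interpolation would otherwise push it.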

Page 136
Page 137

Learning & Using Factored Language Models

Gang Ji
Speech, Signal, and Language Interpretation
University of Washington

August 21, 2002

Page 138

Outline

• Factored Language Models (FLMs) overview

• Part I: automatically finding FLM structure

• Part II: first-pass decoding in ASR with FLMs using graphical models

Page 139

Factored Language Models

• Along with words, consider factors as components of the language model
• Factors can be words, stems, morphs, patterns, or roots, which might contain complementary information about the language
• FLMs also provide new possibilities for designing LMs (e.g., multiple back-off paths)
• Problem: We don't know the best model, and the space is huge!!!

Page 140

Factored Language Models

• How to learn FLMs:
  – Solution 1: do it by hand using expert linguistic knowledge
  – Solution 2: data driven; let the data help decide the model
  – Solution 3: combine both linguistic and data-driven techniques

Page 141

Factored Language Models

• A proposed solution: learn FLMs using an evolution-inspired search algorithm
• Idea: survival of the fittest
  – A collection (generation) of models
  – In each generation, only good ones survive
  – The survivors produce the next generation

Page 142

Evolution-Inspired Search

• Selection: choose the good LMs
• Combination: retain useful characteristics
• Mutation: some small change in the next generation

Page 143

Evolution-Inspired Search

• Advantages
  – Can quickly find a good model
  – Retains the goodness of the previous generation while covering a significant portion of the search space
  – Can run in parallel
• How to judge the quality of each model?
  – Perplexity on a development set
  – Rescored WER on a development set
  – Complexity-penalized perplexity

Page 144

Evolution-Inspired Search

• Three steps form new models:
  – Selection (based on perplexity, etc.)
    • E.g., stochastic universal sampling: models are selected in proportion to their "fitness"
  – Combination
  – Mutation
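Stochastic universal sampling, the selection step named above, can be sketched directly (a standard formulation; the function name is mine):

```python
import random

def sus_select(fitnesses, n):
    """Stochastic universal sampling: pick n indices with probability
    proportional to fitness, using n evenly spaced pointers over the
    cumulative fitness line (lower variance than n independent spins)."""
    total = sum(fitnesses)
    step = total / n
    start = random.uniform(0, step)       # one random offset for all pointers
    points = [start + i * step for i in range(n)]
    chosen, cum, i = [], 0.0, 0
    for p in points:                      # walk the cumulative line once
        while cum + fitnesses[i] < p:
            cum += fitnesses[i]
            i += 1
        chosen.append(i)
    return chosen
```

A model whose fitness is a large share of the total is guaranteed a proportional number of survivors, which is why high-perplexity models die out quickly.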

Page 145

Moving from One Generation to Next

• Combination strategies
  – Inherit structures horizontally
  – Inherit structures vertically
  – Random selection
• Mutation
  – Add/remove edges randomly
  – Change back-off/smoothing strategies

Page 146

Combination according to Frames

[Diagram: two parent models' dependency structures over frames t, t-1, t-2 for factors F^1, F^2, F^3, combined frame-by-frame into a child model]

Page 147

Combination according to Factors

[Diagram: two parent models combined factor-by-factor (F^1, F^2, F^3) into a child model over frames t, t-1, t-2]

Page 148

Outline

• Factored Language Models (FLMs) overview

• Part I: automatically finding FLM structure

• Part II: first-pass decoding with FLMs

Page 149

Problem
• May be difficult to improve WER just by rescoring n-best lists
• More gains can be expected from using better models in first-pass decoding
• Solution:
  1. Do first-pass decoding using FLMs
  2. Since FLMs can be viewed as graphical models, use GMTK (most existing tools don't support general graph-based models)
  3. To speed up inference, use generalized graphical-model-based lattices.

Page 150

FLMs as Graphical Models

[Diagram: factor variables F1, F2, F3 determining the word variable, which connects to the graph for the acoustic model]

Page 151

FLMs as Graphical Models

• Problem: decoding can be expensive!
• Solution: multi-pass graphical lattice refinement
  – In the first pass, generate graphical lattices using a simple model (i.e., more independencies)
  – Rescore the lattices using a more complicated model (fewer independencies) but on a much smaller search space

Page 152

Example: Lattices in a Markov Chain

[Diagram: a pruned state lattice over a Markov chain]

This is the same as a word-based lattice

Page 153

Lattices in General Graphs

[Diagram: lattices over a general graphical model, with the surviving state combinations at each frame]

Page 154

Research Plan
• Data
  – Arabic CallHome data
• Tools
  – Tools for evolution-inspired search: most parts already developed during the workshop
  – Training/rescoring FLMs: modified SRI LM toolkit, developed during this workshop
  – Multi-pass decoding: Graphical Models Toolkit (GMTK), developed in the last workshop

Page 155

Summary

• Factored Language Models (FLMs) overview

• Part I: automatically finding FLM structure

• Part II: first-pass decoding of FLMs using GMTK and graphical lattices

Page 156
Page 157

Minimum Divergence Adaptation of an MSA-Based Language Model to Egyptian Arabic

A proposal by
Sourin Das

JHU Workshop Final Presentation
August 21, 2002

Page 158

Motivation for LM Adaptation

• Transcripts of spoken Arabic are expensive to obtain; MSA text is relatively inexpensive (AFP newswire, ELRA Arabic data, Al Jazeera, …)
  – MSA text ought to help; after all, it is Arabic
• However, there are considerable dialectal differences
  – Inferences drawn from CallHome knowledge or data ought to overrule those from MSA whenever the inferences drawn from them disagree: e.g., estimates of N-gram probabilities
  – Cannot interpolate models or merge data naïvely
  – Need instead to fall back to MSA knowledge only when the CallHome model or data is "agnostic" about an inference

Page 159

Motivation for LM Adaptation

• The minimum K-L divergence framework provides a mechanism to achieve this effect
  – First estimate a language model Q* from MSA text only
  – Then find a model P* which matches all major CallHome statistics and is close to Q*.
• Anecdotal evidence: MDI methods were successfully used to adapt models based on NABN text to SWBD: a 2% WER reduction in LM95 from a 50% baseline WER.

Page 160

An Information Geometric View

[Diagram: the space of all language models, showing the set of models satisfying MSA-text marginals and the set satisfying CallHome marginals; marked points: the uniform distribution, the MaxEnt MSA-text LM, the MaxEnt CallHome LM, and the Min Divergence CallHome LM]

Page 161

A Parametric View of MaxEnt Models

• The MSA-text based MaxEnt LM is the ML estimate among exponential models of the form

Q(x) = Z^-1(λ,μ) exp[ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) ]

• The Callhome based MaxEnt LM is the ML estimate among exponential models of the form

P(x) = Z^-1(μ,ν) exp[ Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) ]

• Think of the Callhome LM as being from the family

P(x) = Z^-1(λ,μ,ν) exp[ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) ]

where we set λ=0 based on the MaxEnt principle.

• One could also be agnostic about the values of the λ_i's, since no examples with f_i(x)>0 are seen in Callhome
  – Features (e.g. N-grams) from MSA text which are not seen in Callhome always have f_i(x)=0 in the Callhome training data
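A tiny numerical sketch (with invented feature matrices, not the workshop's actual features) shows what setting λ=0 buys: events that differ only in the MSA-only features f become indistinguishable, while the shared (g) and Callhome-only (h) features still shape the distribution.

```python
import numpy as np

def expmodel(f, g, h, lmbda, mu, nu):
    """P(x) = Z^-1 exp[ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) ]
    over a small finite event space; each feature matrix is
    (n_features, n_events)."""
    scores = np.exp(lmbda @ f + mu @ g + nu @ h)
    return scores / scores.sum()  # normalize: divide by Z(λ,μ,ν)

# 4 events; f = MSA-only, g = shared, h = Callhome-only features
f = np.array([[1.0, 0.0, 0.0, 0.0]])  # fires only on event 0
g = np.array([[1.0, 1.0, 0.0, 0.0]])
h = np.array([[0.0, 0.0, 1.0, 0.0]])
mu, nu = np.array([0.7]), np.array([-0.3])

p_zero = expmodel(f, g, h, np.array([0.0]), mu, nu)  # λ = 0: events 0 and 1 tied
p_msa = expmodel(f, g, h, np.array([1.0]), mu, nu)   # λ ≠ 0 breaks the tie
```

With λ=0, events 0 and 1 share the same g- and h-feature values and therefore get equal probability; carrying a nonzero λ over from MSA would separate them.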

Page 162:

A Pictorial “Interpretation” of the Minimum Divergence Model

All exponential models of the form
P(x) = Z^-1(λ,μ,ν) exp[ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) ]

Subset of all exponential models with ν=0:
Q(x) = Z^-1(λ,μ) exp[ Σ_i λ_i f_i(x) + Σ_j μ_j g_j(x) ]

The ML model for MSA text:
Q*(x) = Z^-1(λ*,μ*) exp[ Σ_i λ_i* f_i(x) + Σ_j μ_j* g_j(x) ]

The ML model for Callhome, with λ=λ* instead of λ=0:
P*(x) = Z^-1(λ*,μ**,ν*) exp[ Σ_i λ_i* f_i(x) + Σ_j μ_j** g_j(x) + Σ_k ν_k* h_k(x) ]

Subset of all exponential models with λ=λ*:
P(x) = Z^-1(λ*,μ,ν) exp[ Σ_i λ_i* f_i(x) + Σ_j μ_j g_j(x) + Σ_k ν_k h_k(x) ]

Page 163:

Details of Proposed Research (1):

A Factored LM for MSA Text

• Notation: W = romanized word, κ = script form, S = stem, R = root, M = tag

Q(κ_i|κ_i-1,κ_i-2) = Q(κ_i | κ_i-1, κ_i-2, S_i-1, S_i-2, M_i-1, M_i-2, R_i-1, R_i-2)

• Examine all 8C2 = 28 trigram "templates" pairing two variables from the history with κ_i
  – Set observations w/ counts above a threshold as features
• Examine all 8C1 = 8 bigram "templates" pairing one variable from the history with κ_i
  – Set observations w/ counts above a threshold as features
• Build a MaxEnt model (use Jun Wu's toolkit):

Q(κ_i|κ_i-1,κ_i-2) = Z^-1(λ,μ) exp[ λ_1 f_1(κ_i,κ_i-1,S_i-2) + λ_2 f_2(κ_i,M_i-1,M_i-2) + … + λ_i f_i(κ_i,κ_i-1) + … + μ_j g_j(κ_i,R_i-1) + … + μ_J g_J(κ_i) ]

• Build the romanized language model:

Q(W_i|W_i-1,W_i-2) = U(W_i|κ_i) Q(κ_i|κ_i-1,κ_i-2)
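The template counts above are just binomial coefficients, and the enumeration can be sketched directly. The variable names below are illustrative stand-ins for the history factors (script forms, stems, tags, and roots of the two previous words):

```python
from itertools import combinations

# the 8 history variables available to the MSA factored model
history = ["K1", "K2", "S1", "S2", "M1", "M2", "R1", "R2"]

# each trigram template pairs two history variables with the predicted
# script form; each bigram template pairs one history variable with it
trigram_templates = list(combinations(history, 2))
bigram_templates = list(combinations(history, 1))

print(len(trigram_templates))  # 8C2 = 28
print(len(bigram_templates))   # 8C1 = 8
```

For the Callhome model, where the history grows to 10 variables by adding the two previous romanized words, the same enumeration gives 10C2 = 45 trigram and 10C1 = 10 bigram templates.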

Page 165:

Details of Proposed Research (2):

Additional Factors in the Callhome LM

• Notation: κ = script form, S = stem, M = tag, R = root, as in the MSA-text model

P(W_i|W_i-1,W_i-2) = P(W_i, κ_i | W_i-1, W_i-2, κ_i-1, κ_i-2, S_i-1, S_i-2, M_i-1, M_i-2, R_i-1, R_i-2)

• Examine all 10C2 = 45 trigram "templates" pairing two variables from the history with W_i or κ_i
  – Set observations w/ counts above a threshold as features
• Examine all 10C1 = 10 bigram "templates" pairing one variable from the history with W_i or κ_i
  – Set observations w/ counts above a threshold as features
• Compute a min divergence model of the form

P(W_i|W_i-1,W_i-2) = Z^-1(λ,μ,ν) exp[ λ_1 f_1(κ_i,κ_i-1,S_i-2) + λ_2 f_2(κ_i,M_i-1,M_i-2) + … + λ_i f_i(κ_i,κ_i-1) + … + μ_j g_j(κ_i,R_i-1) + … + μ_J g_J(κ_i) ]
  × exp[ ν_1 h_1(W_i,W_i-1,S_i-2) + ν_2 h_2(κ_i,W_i-1,S_i-2) + … + ν_k h_k(κ_i,κ_i-1) + … + ν_K h_K(W_i) ]

Page 166:

Research Plan and Conclusion

• Use baseline Callhome results from WS02
  – Investigate treating romanized forms of a script form as alternate pronunciations
• Build the MSA-text MaxEnt model
  – Feature selection is not critical; use high cutoffs
• Choose features for the Callhome model
• Build and test the minimum divergence model
  – Plug in induced structure
  – Experiment with subsets of MSA text

Page 168: