Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 53–61, Athens, Greece, 31 March, 2009. © 2009 Association for Computational Linguistics

Spoken Arabic Dialect Identification Using Phonotactic Modeling

Fadi Biadsy and Julia Hirschberg
Department of Computer Science
Columbia University, New York, USA
{fadi,julia}@cs.columbia.edu

Nizar Habash
Center for Computational Learning Systems
Columbia University, New York, USA
habash@ccls.columbia.edu

Abstract

The Arabic language is a collection of multiple variants, among which Modern Standard Arabic (MSA) has a special status as the formal written standard language of the media, culture and education across the Arab world. The other variants are informal spoken dialects that are the media of communication for daily life. Arabic dialects differ substantially from MSA and each other in terms of phonology, morphology, lexical choice and syntax. In this paper, we describe a system that automatically identifies the Arabic dialect (Gulf, Iraqi, Levantine, Egyptian and MSA) of a speaker given a sample of his/her speech. The phonotactic approach we use proves to be effective in identifying these dialects with considerable overall accuracy — 81.60% using 30s test utterances.

    1 Introduction

For the past three decades, there has been a great deal of work on the automatic identification (ID) of languages from the speech signal alone. Recently, accent and dialect identification have begun to receive attention from the speech science and technology communities. The task of dialect identification is the recognition of a speaker's regional dialect, within a predetermined language, given a sample of his/her speech. The dialect-identification problem has been viewed as more challenging than that of language ID due to the greater similarity between dialects of the same language. Our goal in this paper is to analyze the effectiveness of a phonotactic approach, i.e. making use primarily of the rules that govern phonemes and their sequences in a language — a technique which has often been employed by the language ID community — for the identification of Arabic dialects.

The Arabic language has multiple variants, including Modern Standard Arabic (MSA), the formal written standard language of the media, culture and education, and the informal spoken dialects that are the preferred method of communication in daily life. While there are commercially available Automatic Speech Recognition (ASR) systems for recognizing MSA with low error rates (typically trained on Broadcast News), these recognizers fail when a native Arabic speaker speaks in his/her regional dialect. Even in news broadcasts, speakers often code switch between MSA and dialect, especially in conversational speech, such as that found in interviews and talk shows. Being able to identify dialect vs. MSA, as well as to identify which dialect is spoken during the recognition process, will enable ASR engines to adapt their acoustic, pronunciation, morphological, and language models appropriately and thus improve recognition accuracy.

Identifying the regional dialect of a speaker will also provide important benefits for speech technology beyond improving speech recognition. It will allow us to infer the speaker's regional origin and ethnicity and to adapt the features used in speaker identification to regional origin. It should also prove useful in adapting the output of text-to-speech synthesis to produce regional speech as well as MSA – important for spoken dialogue systems' development.

In Section 2, we describe related work. In Section 3, we discuss some linguistic aspects of Arabic dialects which are important to dialect identification. In Section 4, we describe the Arabic dialect corpora employed in our experiments. In Section 5, we explain our approach to the identification of Arabic dialects. We present our experimental results in Section 6. Finally, we conclude in Section 7 and identify directions for future research.

    2 Related Work

A variety of cues by which humans and machines distinguish one language from another have been explored in previous research on language identification. Examples of such cues include phone inventory and phonotactics, prosody, lexicon, morphology, and syntax. Some of the most successful approaches to language ID have made use of phonotactic variation. For example, the Phone Recognition followed by Language Modeling (PRLM) approach uses phonotactic information to identify languages from the acoustic signal alone (Zissman, 1996). In this approach, a phone recognizer (not necessarily trained on a related language) is used to tokenize training data for each language to be classified. Phonotactic language models generated from this tokenized training speech are used during testing to compute language ID likelihoods for unknown utterances.

Similar cues have successfully been used for the identification of regional dialects. Zissman et al. (1996) show that the PRLM approach yields good results classifying Cuban and Peruvian dialects of Spanish, using an English phone recognizer trained on TIMIT (Garofolo et al., 1993). The recognition accuracy of this system on these two dialects is 84%, using up to 3 minutes of test utterances. Torres-Carrasquillo et al. (2004) developed an alternate system that identifies these two Spanish dialects using Gaussian Mixture Models (GMM) with shifted-delta-cepstral features. This system performs less accurately (accuracy of 70%) than that of Zissman et al. (1996). Alorfi (2008) uses an ergodic HMM to model phonetic differences between two Arabic dialects (Gulf and Egyptian Arabic), employing standard MFCC (Mel Frequency Cepstral Coefficients) and delta features. With the best parameter settings, this system achieves a high accuracy of 96.67% on these two dialects. Ma et al. (2006) use multi-dimensional pitch flux features and MFCC features to distinguish three Chinese dialects. In this system the pitch flux features reduce the error rate by more than 30% when added to a GMM-based MFCC system. Given 15s of test utterances, the system achieves an accuracy of 90% on the three dialects.

Intonational cues have been shown to be good indicators for human subjects identifying regional dialects. Peters et al. (2002) show that human subjects rely on intonational cues to identify two German dialects (Hamburg urban dialects vs. Northern Standard German). Similarly, Barkat et al. (1999) show that subjects distinguish between Western vs. Eastern Arabic dialects significantly above chance based on intonation alone.

Hamdi et al. (2004) show that rhythmic differences exist between Western and Eastern Arabic. The analysis of these differences is done by comparing percentages of vocalic intervals (%V) and the standard deviation of intervocalic intervals (ΔC) across the two groups. These features have been shown to capture the complexity of the syllabic structure of a language/dialect in addition to the existence of vowel reduction. The complexity of the syllabic structure of a language/dialect and the existence of vowel reduction in a language are good correlates with the rhythmic structure of the language/dialect, hence the importance of such a cue for language/dialect identification (Ramus, 2002).
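As an illustration of how these rhythm metrics are computed, the sketch below derives %V and ΔC from a hand-labeled segmentation. The (label, duration) input format and function name are our own assumptions, not part of Hamdi et al.'s tooling:

```python
import statistics

def rhythm_metrics(intervals):
    """Compute %V (percentage of vocalic intervals) and deltaC (standard
    deviation of intervocalic interval durations) from a list of
    (label, duration_in_seconds) pairs, where label is 'V' or 'C'."""
    vocalic = [d for label, d in intervals if label == "V"]
    consonantal = [d for label, d in intervals if label == "C"]
    pct_v = 100.0 * sum(vocalic) / (sum(vocalic) + sum(consonantal))
    delta_c = statistics.pstdev(consonantal)  # population std. deviation
    return pct_v, delta_c
```

Intuitively, a dialect that allows heavy consonant clusters and vowel reduction will show a larger ΔC and a smaller %V than one built on simple CV syllables.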

As far as we could determine, there is no previous work that analyzes the effectiveness of a phonotactic approach, particularly parallel PRLM, for identifying Arabic dialects. In this paper, we build a system based on this approach and evaluate its performance on five Arabic dialects (four regional dialects and MSA). In addition, we experiment with six phone recognizers trained on six languages as well as three MSA phone recognizers and analyze their contribution to this classification task. Moreover, we make use of a discriminative classifier that takes all the perplexities of the language models on the phone sequences and outputs the hypothesized dialect. This classifier turns out to be an important component, although it has not been a standard component in previous work.

    3 Linguistic Aspects of Arabic Dialects

    3.1 Arabic and its Dialects

MSA is the official language of the Arab world. It is the primary language of the media and culture. MSA is syntactically, morphologically and phonologically based on Classical Arabic, the language of the Qur'an (Islam's Holy Book). Lexically, however, it is much more modern. It is not a native language of any Arabs but is the language of education across the Arab world. MSA is primarily written, not spoken.

The Arabic dialects, in contrast, are the true native language forms. They are generally restricted in use to informal daily communication. They are not taught in schools or even standardized, although there is a rich popular dialect culture of folktales, songs, movies, and TV shows. Dialects are primarily spoken, not written. However, this is changing as more Arabs gain access to electronic media such as email and newsgroups. Arabic dialects are loosely related to Classical Arabic. They are the result of the interaction between different ancient dialects of Classical Arabic and other languages that existed in, neighbored and/or colonized what is today the Arab world. For example, Algerian Arabic has many influences from Berber as well as French.

Arabic dialects vary on many dimensions – primarily, geography and social class. Geo-linguistically, the Arab world can be divided in many different ways. The following is only one of many that covers the main Arabic dialects:

• Gulf Arabic (GLF) includes the dialects of Kuwait, Saudi Arabia, Bahrain, Qatar, United Arab Emirates, and Oman.

• Iraqi Arabic (IRQ) is the dialect of Iraq. In some dialect classifications, Iraqi Arabic is considered a sub-dialect of Gulf Arabic.

• Levantine Arabic (LEV) includes the dialects of Lebanon, Syria, Jordan, Palestine and Israel.

• Egyptian Arabic (EGY) covers the dialects of the Nile valley: Egypt and Sudan.

• Maghrebi Arabic covers the dialects of Morocco, Algeria, Tunisia and Mauritania. Libya is sometimes included.

Yemenite Arabic is often considered its own class. Maltese Arabic is not always considered an Arabic dialect. It is the only Arabic variant that is considered a separate language and is written with Latin script.

Socially, it is common to distinguish three sub-dialects within each dialect region: city dwellers, peasants/farmers and Bedouins. The three degrees are often associated with a class hierarchy from rich, settled city-dwellers down to Bedouins. Different social associations exist, as is common in many other languages around the world.

The relationship between MSA and the dialect in a specific region is complex. Arabs do not think of these two as separate languages. This particular perception leads to a special kind of coexistence between the two forms of language that serve different purposes. This kind of situation is what linguists term diglossia. Although the two variants have clear domains of prevalence, formal written (MSA) versus informal spoken (dialect), there is a large gray area in between, and it is often filled with a mixing of the two forms.

In this paper, we focus on classifying the dialect of audio recordings into one of five varieties: MSA, GLF, IRQ, LEV, and EGY. We do not address other dialects or diglossia.

3.2 Phonological Variations among Arabic Dialects

Although Arabic dialects and MSA vary on many different levels — phonology, orthography, morphology, lexical choice and syntax — we will focus on phonological differences in this paper.1

MSA's phonological profile includes 28 consonants, three short vowels, three long vowels and two diphthongs (/ay/ and /aw/). Arabic dialects vary phonologically from standard Arabic and each other. Some of the common variations include the following (Holes, 2004; Habash, 2006):

The MSA consonant (/q/) is realized as a glottal stop /'/ in EGY and LEV and as /g/ in GLF and IRQ. For example, the MSA word /t̠ari:q/ 'road' appears as /t̠ari:'/ (EGY and LEV) and /t̠ari:g/ (GLF and IRQ). Other variants also are found in sub-dialects, such as /k/ in rural Palestinian (LEV) and /dj/ in some GLF dialects. These changes do not apply to modern and religious borrowings from MSA. For instance, the word for 'Qur'an' is never pronounced as anything but /qur'a:n/.

The MSA alveolar affricate (/dj/) is realized as /g/ in EGY, as /j/ in LEV and as /y/ in GLF. IRQ preserves the MSA pronunciation. For example, the word for 'handsome' is /djami:l/ (MSA, IRQ), /gami:l/ (EGY), /jami:l/ (LEV) and /yami:l/ (GLF).

The MSA consonant (/k/) is generally realized as /k/ in Arabic dialects, with the exception of GLF, IRQ and the Palestinian rural sub-dialect of LEV, which allow a /č/ pronunciation in certain contexts. For example, the word for 'fish' is /samak/ in MSA, EGY and most of LEV but /simač/ in IRQ and GLF.

The MSA consonant /θ/ is pronounced as /t/ in LEV and EGY (or /s/ in more recent borrowings from MSA), e.g., the MSA word /θala:θa/ 'three' is pronounced /tala:ta/ in EGY and /tla:te/ in LEV. IRQ and GLF generally preserve the MSA pronunciation.

1 It is important to point out that since Arabic dialects are not standardized, their orthography may not always be consistent. However, this is not a relevant point for this paper, since we are interested in dialect identification using audio recordings and without using the dialectal transcripts at all.

The MSA consonant /δ/ is pronounced as /d/ in LEV and EGY (or /z/ in more recent borrowings from MSA), e.g., the word for 'this' is pronounced /ha:δa/ in MSA versus /ha:da/ (LEV) and /da/ (EGY). IRQ and GLF generally preserve the MSA pronunciation.

The MSA consonants /d̠/ (emphatic/velarized d) and /δ̠/ (emphatic /δ/) are both normalized to /d̠/ in EGY and LEV and to /δ̠/ in GLF and IRQ. For example, the MSA sentence /δ̠alla yad̠rubu/ 'he continued to hit' is pronounced /d̠all yud̠rub/ (LEV) and /δ̠all yuδ̠rub/ (GLF). In modern borrowings from MSA, /δ̠/ is pronounced as /z̠/ (emphatic z) in EGY and LEV. For instance, the word for 'police officer' is /δ̠a:bit̠/ in MSA but /z̠a:bit̠/ in EGY and LEV.

In some dialects, a loss of the emphatic feature of some MSA consonants occurs, e.g., the MSA word /lat̠i:f/ 'pleasant' is pronounced as /lati:f/ in the Lebanese city sub-dialect of LEV. Emphasis typically spreads to neighboring vowels: if a vowel is preceded or succeeded directly by an emphatic consonant (/d̠/, /s̠/, /t̠/, /δ̠/), then the vowel becomes an emphatic vowel. As a result, the loss of the emphatic feature affects not only the consonants but also their neighboring vowels.

Other vocalic differences among MSA and the dialects include the following: First, short vowels change or are completely dropped, e.g., the MSA word /yaktubu/ 'he writes' is pronounced /yiktib/ (EGY and IRQ) or /yoktob/ (LEV). Second, final and unstressed long vowels are shortened, e.g., the word /mat̠a:ra:t/ 'airports' in MSA becomes /mat̠ara:t/ in many dialects. Third, the MSA diphthongs /aw/ and /ay/ have mostly become /o:/ and /e:/, respectively. These vocalic changes, particularly vowel drop, lead to different syllabic structures. MSA syllables are primarily light (CV, CV:, CVC) but can also be CV:C or CVCC in utterance-final positions. EGY syllables are the same as MSA's, although without the utterance-final restriction. LEV, IRQ and GLF allow heavier syllables, including word-initial clusters such as CCV:C and CCVCC.
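The consonant correspondences described in this section can be summarized in a small lookup table. The sketch below is illustrative only — our own encoding of the examples above, ignoring sub-dialect exceptions, the contextual conditions on /k/ → /č/, and borrowings from MSA:

```python
# MSA consonant -> typical realization per variety, summarized from the
# examples in Section 3.2 (illustrative; exceptions are not modeled).
REFLEXES = {
    "q":  {"MSA": "q",  "EGY": "'",  "LEV": "'",  "GLF": "g",  "IRQ": "g"},
    "dj": {"MSA": "dj", "EGY": "g",  "LEV": "j",  "GLF": "y",  "IRQ": "dj"},
    "k":  {"MSA": "k",  "EGY": "k",  "LEV": "k",  "GLF": "č",  "IRQ": "č"},
    "θ":  {"MSA": "θ",  "EGY": "t",  "LEV": "t",  "GLF": "θ",  "IRQ": "θ"},
    "δ":  {"MSA": "δ",  "EGY": "d",  "LEV": "d",  "GLF": "δ",  "IRQ": "δ"},
}

def realize(msa_phones, variety):
    """Map an MSA phone sequence to its typical realization in `variety`,
    leaving phones without a listed correspondence unchanged."""
    return [REFLEXES.get(p, {}).get(variety, p) for p in msa_phones]
```

For example, realize(["dj", "a", "m", "i:", "l"], "EGY") yields the EGY form /gami:l/ of the word for 'handsome'.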

    4 Corpora

When training a system intended to classify languages or dialects, it is of course important to use training and testing corpora recorded under similar acoustic conditions. We are able to obtain corpora from the Linguistic Data Consortium (LDC) with similar recording conditions for four Arabic dialects: Gulf Arabic, Iraqi Arabic, Egyptian Arabic, and Levantine Arabic. These are corpora of spontaneous telephone conversations produced by native speakers of the dialects, speaking with family members, friends, and unrelated individuals, sometimes about predetermined topics. Although the data have been annotated phonetically and/or orthographically by the LDC, we do not make use of any of these annotations in this paper.

We use the speech files of 965 speakers (about 41.02 hours of speech) from the Gulf Arabic Conversational Telephone Speech database for our Gulf Arabic data (Appen Pty Ltd, 2006a).2 From these speakers we hold out 150 speakers for testing (about 6.06 hours of speech).3 We use the Iraqi Arabic Conversational Telephone Speech database (Appen Pty Ltd, 2006b) for the Iraqi dialect, selecting 475 Iraqi Arabic speakers with a total duration of about 25.73 hours of speech. From these speakers we hold out 150 speakers4 for testing (about 7.33 hours of speech). Our Levantine data consists of 1258 speakers from the Arabic CTS Levantine Fisher Training Data Sets 1-3 (Maamouri, 2006). This set contains about 78.79 hours of speech in total. We hold out 150 speakers for testing (about 10 hours of speech) from Set 1.5

For our Egyptian data, we use CallHome Egyptian and its Supplement (Canavan et al., 1997) and CallFriend Egyptian (Canavan and Zipperlen, 1996). We use 398 speakers from these corpora (75.7 hours of speech), holding out 150 speakers for testing (about 28.7 hours of speech).6

Unfortunately, as far as we can determine, there is no data with similar recording conditions for MSA. Therefore, we obtain our MSA training data from TDT4 Arabic broadcast news. We use about 47.6 hours of speech. The acoustic signal was processed using forced alignment with the transcript to remove non-speech data, such as music. For testing we again use 150 speakers, this time identified automatically from the GALE Year 2 Distillation evaluation corpus (about 12.06 hours of speech). Non-speech data (e.g., music) in the test corpus was removed manually. It should be noted that the data includes read speech by anchors and reporters as well as spontaneous speech spoken in interviews in studios and over the phone.

2 We excluded very short speech files from the corpora.
3 The 24 speakers in the devtest folder and the last 63 files, after sorting by file name, in the train2c folder (126 speakers). The sorting is done to make our experiments reproducible by other researchers.
4 Similar to the Gulf corpus: the 24 speakers in the devtest folder and the last 63 files (after sorting by filename) in the train2c folder (126 speakers).
5 We use the last 75 files in Set 1, after sorting by name.
6 The test speakers were from the evaltest and devtest folders in CallHome and CallFriend.

    5 Our Dialect ID Approach

Since, as described in Section 3, Arabic dialects differ in many respects, such as phonology, lexicon, and morphology, it is highly likely that they differ in terms of phone-sequence distribution and phonotactic constraints. Thus, we adopt the phonotactic approach to distinguishing among Arabic dialects.

    5.1 PRLM for dialect ID

As mentioned in Section 2, the PRLM approach to language identification (Zissman, 1996) has had considerable success. Recall that, in the PRLM approach, the phones of the training utterances of a dialect are first identified using a single phone recognizer.7 Then an n-gram language model is trained on the resulting phone sequences for this dialect. This process results in an n-gram language model for each dialect that models the dialect's distribution of phone-sequence occurrences. During recognition, given a test speech segment, we run the phone recognizer to obtain the phone sequence for this segment and then compute the perplexity of each dialect's n-gram model on the sequence. The dialect with the n-gram model that minimizes the perplexity is hypothesized to be the dialect from which the segment comes.
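The PRLM training and decision steps can be sketched as follows. This is a toy illustration, not the paper's implementation: a real system would use a proper language-modeling toolkit with back-off smoothing rather than the add-one smoothing assumed here:

```python
import math
from collections import Counter

def train_phone_lm(phone_seqs, n=2):
    """Train an n-gram model with add-one smoothing over phone tokens."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for seq in phone_seqs:
        toks = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
        vocab.update(toks)
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1
            contexts[tuple(toks[i:i + n - 1])] += 1
    return ngrams, contexts, len(vocab), n

def perplexity(lm, seq):
    """Perplexity of one phone sequence under an n-gram model."""
    ngrams, contexts, v, n = lm
    toks = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
    logp, count = 0.0, 0
    for i in range(len(toks) - n + 1):
        g = tuple(toks[i:i + n])
        logp += math.log((ngrams[g] + 1) / (contexts[g[:-1]] + v))
        count += 1
    return math.exp(-logp / count)

def classify(phone_seq, dialect_lms):
    """PRLM decision rule: pick the dialect whose LM minimizes perplexity."""
    return min(dialect_lms, key=lambda d: perplexity(dialect_lms[d], phone_seq))
```

Training builds one such model per dialect from the tokenized training speech; at test time, the recognizer's phone output is scored against every model and the minimum-perplexity dialect wins.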

Parallel PRLM is an extension of the PRLM approach in which multiple (k) parallel phone recognizers, each trained on a different language, are used instead of a single phone recognizer (Zissman, 1996). For training, we run all phone recognizers in parallel on the set of training utterances of each dialect. An n-gram model on the outputs of each phone recognizer is trained for each dialect. Thus, if we have m dialects, k x m n-gram models are trained. During testing, given a test utterance, we run all phone recognizers on this utterance and compute the perplexity of each n-gram model on the corresponding output phone sequence. Finally, the perplexities are fed to a combiner to determine the hypothesized dialect. In our implementation, we employ a logistic regression classifier as our back-end combiner. We have experimented with different classifiers, such as SVMs and neural networks, but the logistic regression classifier was superior. The system is illustrated in Figure 1.

7 The phone recognizer is typically trained on one of the languages being identified. Nonetheless, a phone recognizer trained on any language might be a good approximation, since languages typically share many phones in their phonetic inventories.

We hypothesize that using multiple phone recognizers, as opposed to only one, allows the system to capture subtle phonetic differences that might be crucial for distinguishing dialects. In particular, since the phone recognizers are trained on different languages, they may be able to model different vocalic and consonantal systems, and hence different phonetic inventories. For example, an MSA phone recognizer typically does not model the phoneme /g/; however, an English phone recognizer does. As described in Section 3, this phoneme is an important cue to distinguishing Egyptian Arabic from other Arabic dialects. Moreover, phone recognizers are prone to many errors; relying upon multiple phone streams rather than one may lead to a more robust model overall.

5.2 Phone Recognizers

In our experiments, we have used phone recognizers for English, German, Japanese, Hindi, Mandarin, and Spanish, from a toolkit developed by Brno University of Technology.8 These phone recognizers were trained on the OGI multilanguage database (Muthusamy et al., 1992) using a hybrid approach based on Neural Networks and Viterbi decoding without language models (open-loop) (Matejka et al., 2005).

Since Arabic dialect identification is our goal, we hypothesize that an Arabic phone recognizer would also be useful, particularly since other phone recognizers do not cover all Arabic consonants, such as pharyngeals and emphatic alveolars. Therefore, we have built our own MSA phone recognizer using the HMM toolkit (HTK) (Young et al., 2006). The monophone acoustic models are built using 3-state continuous HMMs without state-skipping, with a mixture of 12 Gaussians per state. We extract standard Mel Frequency Cepstral Coefficient (MFCC) features from 25 ms frames, with a frame shift of 10 ms. Each feature vector is 39D: 13 features (12 cepstral features plus energy), 13 deltas, and 13 double-deltas. The features are normalized using cepstral mean normalization. We use the Broadcast News TDT4 corpus (Arabic Set 1; 47.61 hours of speech; downsampled to 8 kHz) to train our acoustic models. The …

8 www.fit.vutbr.cz/research/groups/speech/sw/phnrec
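The 39-dimensional feature construction described above (13 MFCCs stacked with deltas and double-deltas, followed by cepstral mean normalization) can be sketched as follows, starting from an already-computed 13 x T matrix of base MFCCs. The delta regression window width is our own assumption; HTK's defaults differ in detail:

```python
import numpy as np

def add_deltas_and_cmn(mfcc, width=2):
    """Stack 13 base MFCCs with deltas and double-deltas into 39-D frames,
    then apply cepstral mean normalization. `mfcc` is a (13, T) array."""
    def delta(x):
        # Standard linear-regression delta over +/- `width` frames,
        # with edge padding at the utterance boundaries.
        denom = 2 * sum(i * i for i in range(1, width + 1))
        padded = np.pad(x, ((0, 0), (width, width)), mode="edge")
        t = x.shape[1]
        return sum(i * (padded[:, width + i:width + i + t]
                        - padded[:, width - i:width - i + t])
                   for i in range(1, width + 1)) / denom

    d1 = delta(mfcc)        # deltas
    d2 = delta(d1)          # double-deltas
    feats = np.vstack([mfcc, d1, d2])                   # (39, T)
    return feats - feats.mean(axis=1, keepdims=True)    # cepstral mean norm.
```

After normalization, each of the 39 feature dimensions has zero mean over the utterance, which removes constant channel effects such as the telephone-line transfer function.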

[Figure 1: The parallel PRLM system. Parallel phone recognizers (English, Japanese, and others) tokenize the input utterance; per-dialect n-gram models score each phone stream, and the resulting perplexities are combined by the logistic-regression back-end.]

4 dialects

seconds   accuracy   Gulf   Iraqi   Levantine   Egyptian
5         60.833     49.2   52.7    58.1        83
15        72.83      60.8   61.2    77.6        91.9
30        78.5       68.7   67.3    84          94
45        81.5       72.6   72.4    86.9        93.7
60        83.33      75.1   75.7    87.9        94.6
120       84         75.1   75.4    89.5        96

[Chart: accuracy (%) vs. test-utterance duration (s), one curve per dialect.]

4 dialects + MSA

seconds   accuracy   Gulf   Iraqi   Levantine   Egyptian
5         68.6667    54.5   50.7    60          77.9
15        76.6667    57.3   62.6    73.8        90.7
30        81.6       68.3   71.7    79.4        90.2
45        84.8       69.9   73.6    86.2        94.9
60        86.933     76.8   76.5    85.4        96.3
120       87.86      79.1   77.4    90.1        93.6

[Chart: accuracy (%) vs. test-utterance duration (s), one curve per dialect plus MSA.]

References

F. S. Alorfi. 2008. Automatic Identification Of Arabic Dialects Using Hidden Markov Models. PhD Dissertation, University of Pittsburgh.

Appen Pty Ltd. 2006a. Gulf Arabic Conversational Telephone Speech. Linguistic Data Consortium, Philadelphia.

Appen Pty Ltd. 2006b. Iraqi Arabic Conversational Telephone Speech. Linguistic Data Consortium, Philadelphia.

M. Barkat, J. Ohala, and F. Pellegrino. 1999. Prosody as a Distinctive Feature for the Discrimination of Arabic Dialects. In Proceedings of Eurospeech'99.

F. Biadsy, N. Habash, and J. Hirschberg. 2009. Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules. In Proceedings of NAACL/HLT 2009, Colorado, USA.

A. Canavan and G. Zipperlen. 1996. CALLFRIEND Egyptian Arabic Speech. Linguistic Data Consortium, Philadelphia.

A. Canavan, G. Zipperlen, and D. Graff. 1997. CALLHOME Egyptian Arabic Speech. Linguistic Data Consortium, Philadelphia.

J. S. Garofolo et al. 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia.

N. Habash. 2006. On Arabic and its Dialects. Multilingual Magazine, 17(81).

R. Hamdi, M. Barkat-Defradas, E. Ferragne, and F. Pellegrino. 2004. Speech Timing and Rhythmic Structure in Arabic Dialects: A Comparison of Two Approaches. In Proceedings of Interspeech'04.

C. Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press. Revised Edition.

B. Ma, D. Zhu, and R. Tong. 2006. Chinese Dialect Identification Using Tone Features Based On Pitch Flux. In Proceedings of ICASSP'06.

M. Maamouri. 2006. Levantine Arabic QT Training Data Set 5, Speech. Linguistic Data Consortium, Philadelphia.

P. Matejka, P. Schwarz, J. Cernocky, and P. Chytil. 2005. Phonotactic Language Identification using High Quality Phoneme Recognition. In Proceedings of Eurospeech'05.

Y. K. Muthusamy, R. A. Cole, and B. T. Oshika. 1992. The OGI Multi-Language Telephone Speech Corpus. In Proceedings of ICSLP'92.

J. Peters, P. Gilles, P. Auer, and M. Selting. 2002. Identification of Regional Varieties by Intonational Cues. An Experimental Study on Hamburg and Berlin German. 45(2):115–139.

F. Ramus. 2002. Acoustic Correlates of Linguistic Rhythm: Perspectives. In Speech Prosody.

A. Stolcke. 2002. SRILM – an Extensible Language Modeling Toolkit. In Proceedings of ICSLP'02, pages 901–904.

P. Torres-Carrasquillo, T. P. Gleason, and D. A. Reynolds. 2004. Dialect Identification using Gaussian Mixture Models. In Proceedings of the Speaker and Language Recognition Workshop, Spain.

S. Young, G. Evermann, M. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. 2006. The HTK Book, version 3.4.

M. A. Zissman, T. Gleason, D. Rekart, and B. Losiewicz. 1996. Automatic Dialect Identification of Extemporaneous Conversational, Latin American Spanish Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, USA.

M. A. Zissman. 1996. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech. IEEE Transactions on Speech and Audio Processing, 4(1).
