THE ELECTRICAL ENGINEERING AND APPLIED SIGNAL PROCESSING SERIES
Edited by Alexander Poularikas
The Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real-Time Systems
Stergios Stergiopoulos
The Transform and Data Compression Handbook
K.R. Rao and P.C. Yip
Handbook of Multisensor Data Fusion
David Hall and James Llinas
Handbook of Neural Network Signal Processing
Yu Hen Hu and Jenq-Neng Hwang
Handbook of Antennas in Wireless Communications
Lal Chand Godara
Noise Reduction in Speech Applications
Gillian M. Davis
Signal Processing Noise
Vyacheslav P. Tuzlukov
Digital Signal Processing with Examples in MATLAB®
Samuel Stearns
Applications in Time-Frequency Signal Processing
Antonia Papandreou-Suppappola
The Digital Color Imaging Handbook
Gaurav Sharma
Pattern Recognition in Speech and Language Processing
Wu Chou and Biing-Hwang Juang
Forthcoming Titles
Propagation Data Handbook for Wireless Communication System Design
Robert Crane
Smart Antennas
Lal Chand Godara
Nonlinear Signal and Image Processing: Theory, Methods, and Applications
Kenneth Barner and Gonzalo R. Arce
Forthcoming Titles (continued)
Soft Computing with MATLAB®
Ali Zilouchian
Signal and Image Processing Navigational Systems
Vyacheslav P. Tuzlukov
Wireless Internet: Technologies and Applications
Apostolis K. Salkintzis and Alexander Poularikas
CRC PRESS
Boca Raton London New York Washington, D.C.
PATTERN RECOGNITION in SPEECH and LANGUAGE PROCESSING

Edited by
WU CHOU
Avaya Labs Research
BIING HWANG JUANG
Georgia Institute of Technology
Preface
Basking Ridge, New Jersey
September, 2002
Contributors
A. Abella
James Allan
T. Alonso
Jerome R. Bellegarda
William Byrne
Wu Chou
Sadaoki Furui
Jean-Luc Gauvain
Vaibhava Goel
Allen L. Gorin
Qiang Huo
Biing-Hwang Juang
Shigeru Katagiri
Lori Lamel
Qi (Peter) Li
John Makhoul
Hermann Ney
F. J. Och
G. Riccardi
Richard M. Schwartz
J. H. Wright
Contents
1 Minimum Classification Error (MCE) Approach in Pattern Recognition
Wu Chou
2 Minimum Bayes-Risk Methods in Automatic Speech Recognition
Vaibhava Goel and William Byrne
3 A Decision Theoretic Formulation for Robust Automatic Speech Recognition
Qiang Huo
4 Speech Pattern Recognition using Neural Networks
Shigeru Katagiri
5 Large Vocabulary Speech Recognition Based on Statistical Methods
Jean-Luc Gauvain and Lori Lamel
6 Toward Spontaneous Speech Recognition and Understanding
Sadaoki Furui
7 Speaker Authentication
Qi Li and Biing-Hwang Juang
8 HMMs for Language Processing Problems
Richard M. Schwartz and John Makhoul
9 Statistical Language Models With Embedded Latent Semantic Knowledge
Jerome R. Bellegarda
10 Semantic Information Processing of Spoken Language – How May I Help You?sm
A. L. Gorin, A. Abella, T. Alonso, G. Riccardi, and J. H. Wright
11 Machine Translation Using Statistical Modeling
Hermann Ney and F. J. Och
12 Modeling Topics for Detection and Tracking
James Allan
1
Minimum Classification Error (MCE) Approach in Pattern Recognition
FIGURE 2.1
An example lattice. The time marks correspond to the node times and the word ending times. The numbers on the edges are logarithms of conditional joint probabilities as described in the text. The partial path log-probability of a partial hypothesis is the log of the probability of its path. The lattice backward log-probability of a partial hypothesis is the log of the sum of probabilities of all lattice paths from the end node of that hypothesis to the lattice end node; for the partial path (‘HELLO’, ‘0.6’) in this lattice, these paths are indicated by dotted lines. The lattice total probability of a partial path is the exponentiated sum of its partial path log-probability and lattice backward log-probability.
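The lattice backward log-probability defined in this caption can be computed with a standard backward recursion over the lattice. The sketch below illustrates this on a hypothetical two-slot lattice; the words, probabilities, and node numbering are invented for illustration and are not those of Figure 2.1:

```python
import math

# A lattice as a list of edges (start_node, end_node, word, log_prob).
# Nodes are numbered in time order. Toy data, not the Figure 2.1 lattice.
edges = [
    (0, 1, "HELLO", math.log(0.6)),
    (0, 1, "YELLOW", math.log(0.4)),
    (1, 2, "WORLD", math.log(0.7)),
    (1, 2, "WORD", math.log(0.3)),
]

def backward_log_probs(edges, end_node):
    """For each node, the log of the sum of probabilities of all lattice
    paths from that node to the lattice end node."""
    beta = {end_node: 0.0}  # log 1 at the end node
    # Visit edges in reverse topological order of their start nodes.
    for start, end, _, logp in sorted(edges, key=lambda e: -e[0]):
        if end in beta:
            contrib = logp + beta[end]
            beta[start] = (contrib if start not in beta else
                           math.log(math.exp(beta[start]) + math.exp(contrib)))
    return beta

beta = backward_log_probs(edges, end_node=2)  # beta[1] = log(0.7 + 0.3)
```

Adding a partial path's log-probability to the backward value at its end node and exponentiating yields the lattice total probability of that partial path.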
2.2.3.2 A* Search Under General Loss Functions
2.2.3.3 Single Stack Search Under Levenshtein Loss Function
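The loss function underlying this search is the Levenshtein (string edit) distance between word sequences. For reference, the standard dynamic program, which is not specific to this chapter:

```python
def levenshtein(ref, hyp):
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn hyp into ref (standard dynamic program)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]
```

Word error rate is this distance normalized by the reference length.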
2.2.3.4 Prefix Tree Search Under Levenshtein Loss Function
2.2.3.5 Pruning and Multistack Organization of the Prefix Tree Search
2.2.3.6 Loss Functions Other than Levenshtein Distance
2.3 Segmental MBR Procedures
2.3.1 Segmental Voting
2.3.2 ROVER
FIGURE 2.2
An example word transition network.
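ROVER combines the outputs of several recognizers by aligning them into a word transition network such as the one in Figure 2.2 and then voting within each correspondence set. The sketch below shows only the voting step, assuming the hypotheses have already been aligned; real ROVER builds the alignment itself by iterative dynamic programming and can also weight votes by word confidence:

```python
from collections import Counter

def rover_vote(aligned_hypotheses):
    """Majority vote over pre-aligned hypotheses. Each hypothesis is a
    list with one entry per correspondence set; '@' marks a null arc
    (no word from that recognizer in that slot)."""
    output = []
    for slot in zip(*aligned_hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "@":
            output.append(word)
    return output

# Three hypothetical recognizer outputs, already aligned slot by slot.
hyps = [
    ["the", "cat", "sat", "@"],
    ["the", "cat", "sat", "down"],
    ["a",   "cat", "sat", "down"],
]
result = rover_vote(hyps)  # ["the", "cat", "sat", "down"]
```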
2.3.3 e-ROVER
FIGURE 2.3
Joining two correspondence sets.
2.4 Experimental Results
2.4.1 Parameter Tuning within the MBR Classification Rule
The MBR classification rule contains several free parameters that must be tuned: the word insertion penalty, the language model scale factor, and the likelihood scale factor.
TABLE 2.1
2.4.1.1 Optimization of Likelihood Parameters
Both supervised optimization and unsupervised optimization of the likelihood parameters were investigated.
2.4.2 Utterance Level MBR Word and Keyword Recognition
abilities, bartenders, calculation, databases
a, and, the, besides, collaboration, distribution
2.4.2.1 Likelihood Scale Factor Tuning
2.4.2.2 N-best List Rescoring and A* Search
TABLE 2.2
2.4.3 ROVER and e-ROVER for Multilingual ASR
FIGURE 2.4
Top panel shows the ratio of the total number of e-ROVER correspondence sets to that of ROVER correspondence sets, as a function of the pinching threshold. Bottom panel shows the WER performance of e-ROVER for these thresholds.
FIGURE 4.1
Architecture of shift-tolerant LVQ classifier [20].
4.3.2.3 LVQ/HMM Hybrid Classifier
FIGURE 4.2
Block diagram of LVQ/HMM hybrid classifier.
4.3.2.4 HMM/LVQ Hybrid Classifier
FIGURE 4.3
Block diagram of HMM/LVQ hybrid classifier.
4.3.3 Squared Error Minimization
4.3.3.1 Training Using the Squared Error Loss
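In outline, training under the squared error loss adjusts the classifier parameters by gradient descent on the summed squared difference between the network outputs and the class targets. A minimal single-layer sigmoid sketch, illustrative only; the chapter's networks and update rules are more elaborate:

```python
import math
import random

def train_squared_error(samples, n_in, n_out, lr=0.5, epochs=2000):
    """Gradient descent on E = 1/2 * sum_k (y_k - t_k)^2 for a
    single-layer sigmoid network (illustrative sketch)."""
    random.seed(0)
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
         for _ in range(n_out)]
    def forward(x):
        x = x + [1.0]  # append bias input
        return [1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(row, x))))
                for row in w]
    for _ in range(epochs):
        for x, t in samples:
            y = forward(x)
            xb = x + [1.0]
            for k in range(n_out):
                # dE/dw = (y - t) * y * (1 - y) * x  (sigmoid derivative)
                delta = (y[k] - t[k]) * y[k] * (1.0 - y[k])
                for i in range(n_in + 1):
                    w[k][i] -= lr * delta * xb[i]
    return forward

# Two 1-D classes with one-hot targets: class 0 near -1, class 1 near +1.
samples = [([-1.0], [1.0, 0.0]), ([1.0], [0.0, 1.0]),
           ([-0.8], [1.0, 0.0]), ([0.9], [0.0, 1.0])]
classify = train_squared_error(samples, n_in=1, n_out=2)
```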
FIGURE 4.4
Architecture of time-delay neural network [27].
4.3.3.2 Time-delay Neural Network
FIGURE 4.5
Schematic description of distance classifier as a single intermediate layer network (2-dimensional input, 3 references/class, 3 classes).
4.3.3.3 Multi-state Time-delay Neural Network
4.3.4 Cross Entropy Minimization
4.3.4.1 Training Using the Cross Entropy Loss
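Training under the cross entropy loss instead measures the mismatch between the target class distribution and the network's output distribution. A minimal sketch of the loss itself:

```python
import math

def cross_entropy(y, t):
    """Cross entropy between target distribution t and output
    distribution y: -sum_k t_k log y_k (terms with t_k = 0 contribute 0)."""
    return -sum(tk * math.log(yk) for yk, tk in zip(y, t) if tk > 0)
```

The loss is minimized (at zero for one-hot targets) exactly when the output distribution matches the target, which is why minimizing it drives the network outputs toward estimates of the class posterior probabilities.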
4.3.4.2 Unidirectional Network Classifier
4.3.4.3 Bidirectional Network Classifier
FIGURE 4.6
Architecture of unidirectional network [23] (ut: input vector, st: state vector, yt: output vector; the state vector feeds back through a time delay).
4.4 Fusion of Multiple Classification Decisions
4.4.1 Principles
FIGURE 4.7
Architecture of bi-directional network [25].
FIGURE 4.8
Typical classifier design schemes of averaging-based decision fusion.
4.4.2 Examples of Embodiment
4.4.2.1 Multi-codebook Classifier Designed with GPD
FIGURE 4.9
Relation between recognition accuracy and the number of prototypes per class and codebook [3].
4.4.2.2 Multi-class Classification Based on Support Vector Machine
4.4.2.3 Decision Fusion Using Different Classifiers
FIGURE 4.10
Typical block diagrams of the MSTDNN-based audio-visual speech recognition [7].
4.4.2.4 Decision Fusion Using Multi-modal Classifiers
FIGURE 4.11
Block diagram of the twofold-HMM-based audio-visual speech recognition [21].
4.5 Concluding Remarks
References
4.6 Appendix: Maximizing Mutual Information
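The quantity maximized in this appendix is the mutual information between two random variables; in standard textbook notation, which may differ from the chapter's:

```latex
I(X;Y) \;=\; \sum_{x,y} P(x,y)\,\log\frac{P(x,y)}{P(x)\,P(y)}
       \;=\; H(Y) \;-\; H(Y \mid X)
```

It is nonnegative and equals zero exactly when X and Y are independent.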
5
Large Vocabulary Speech Recognition Basedon Statistical Methods
Jean-Luc Gauvain and Lori Lamel
LIMSI, France
CONTENTS
5.1 Introduction
5.2 Overview
FIGURE 5.1
LVCSR speech generation model: the word sequence produced by the language model is successively transformed by the pronunciation model and the acoustic model, resulting in the speech signal.
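The decoding problem implied by this generation model is usually written as maximum a posteriori selection of the word sequence; in the conventional formulation (the symbols here are the customary ones, not necessarily the chapter's):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W}\; p(X \mid W)\, P(W)
```

where $p(X \mid W)$ is supplied by the pronunciation and acoustic models and $P(W)$ by the language model.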
5.3 Language Modeling
FIGURE 5.2
System diagram of a generic speech recognizer based on statistical models, including training and decoding processes and the main knowledge sources.
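The language model knowledge source in Figure 5.2 is typically an n-gram estimated from normalized training text. A minimal bigram sketch with add-one smoothing, illustrative only; deployed systems use back-off or interpolated estimates, as discussed in the references:

```python
from collections import Counter

def train_bigram(sentences, vocab_size):
    """Maximum-likelihood bigram probabilities with add-one smoothing
    (illustrative sketch, not a production estimator)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words[:-1])                  # history counts
        bigrams.update(zip(words[:-1], words[1:]))   # bigram counts
    def prob(w_prev, w):
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)
    return prob

prob = train_bigram([["good", "program"], ["good", "morning"]], vocab_size=5)
```

A sentence probability is then the product of successive conditional word probabilities, including the sentence-boundary symbols.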
5.3.1 Text Preparation
FIGURE 5.3
Some example transformation rules applied during text normalization with associated probabilities.
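Such transformation rules can be implemented as ordered rewrite rules over the raw text. A minimal sketch with a few hypothetical rules; the rules and written forms below are invented, and Figure 5.3's actual rules additionally carry probabilities over competing expansions (e.g., "nineteen ninety one" versus "one thousand nine hundred and ninety one"):

```python
import re

# Hypothetical normalization rules mapping written forms to spoken forms.
RULES = [
    (re.compile(r"\$150\b"), "one hundred fifty dollars"),
    (re.compile(r"\b1991\b"), "nineteen ninety one"),
    (re.compile(r"\bDr\."), "doctor"),
]

def normalize(text):
    """Apply each rewrite rule in order, then case-fold, to produce
    language-model training text."""
    for pattern, spoken in RULES:
        text = pattern.sub(spoken, text)
    return text.lower()
```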
FIGURE 5.5
Some example lexical entries and their pronunciations along with estimated probabilities. For the compound words, the original concatenated pronunciation is given in the 1st line and the reduced forms are given in the 2nd line.
interest, conference, company
don’t know, did you, going to
gonna, dunno
5.5 Acoustic Modeling
5.5.1 Acoustic Front-end
FIGURE 5.6
A simple 3-state left-to-right HMM topology commonly used for allophone modeling in LVCSR. The model generates at least 3 speech frames per allophone, resulting in a minimal phone segment duration of 30 ms for a frame rate of 100 Hz.
FIGURE 5.7
Examples of allophonic transcriptions in terms of intra-word triphones and quinphones. Each contextual unit is defined by the central phone followed by its phone context shown in parentheses (left-context, right-context). * is a wildcard signifying any context.
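Given the 3-state left-to-right topology of Figure 5.6, the likelihood of a frame sequence is obtained with the forward algorithm. A minimal log-domain sketch, with emission log-probabilities taken as given (in practice they come from Gaussian mixture output densities):

```python
import math

def forward_log_likelihood(log_trans, log_emit):
    """Forward algorithm in the log domain. log_trans[i][j] is the log
    transition probability from state i to state j; log_emit[t][j] is the
    log probability of frame t under state j's output density. Returns
    the log-likelihood of starting in state 0 at frame 0 and ending in
    the last state at the final frame."""
    NEG_INF = float("-inf")
    n_states = len(log_trans)
    alpha = [NEG_INF] * n_states
    alpha[0] = log_emit[0][0]  # left-to-right: must start in state 0
    for t in range(1, len(log_emit)):
        new = [NEG_INF] * n_states
        for j in range(n_states):
            terms = [alpha[i] + log_trans[i][j] for i in range(n_states)
                     if alpha[i] > NEG_INF and log_trans[i][j] > NEG_INF]
            if terms:  # log-sum-exp over predecessor states
                m = max(terms)
                new[j] = (m + math.log(sum(math.exp(x - m) for x in terms))
                          + log_emit[t][j])
        alpha = new
    return alpha[-1]
```

For the 3-state topology this returns log-probability minus infinity for fewer than 3 frames, reflecting the minimum duration the caption describes.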
FIGURE 5.8
Example questions used for decision tree clustering, grouped by type: position, general classes, vowel classes, consonant classes, and individual phones.
The resulting clustered states are referred to in the literature as senones, genones, PELs, or tied-states.
5.5.3 HMM Parameter Estimation
FIGURE 5.9
The most frequently used decision tree questions, with their log likelihood gains, for an American English broadcast news transcription system [40]. The [+1] and [-1] indicate that the question has been applied to the right or left context respectively, and [0] to the phone itself.
5.5.4 HMM Adaptation
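A widely used adaptation scheme of this kind is maximum likelihood linear regression, which re-estimates only an affine transform (A, b) shared by many Gaussians; in its standard form the adapted means are

```latex
\hat{\mu} \;=\; A\,\mu \;+\; b
```

with $A$ and $b$ chosen to maximize the likelihood of the adaptation data, so that a small amount of speaker-specific speech suffices to adapt a large model set.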
5.6 Decoding
5.6.1 Speech/Non-speech Detection
5.6.2 Decoding Strategies
FIGURE 5.10
Example word lattice generated by a speech recognizer using a bigram language model for a 2.1 s utterance. Each graph edge corresponds to a word hypothesis and a time interval (as specified by the time information on the nodes). In this example the word transcription with the highest likelihood is “sil IT WAS A GOOD PROGRAM sil” which happens to be what was said. (The acoustic and language model likelihoods are not given on the figure.)
5.6.3 Efficiency
5.6.4 Confidence Measures
5.7 Indicative Performance Levels
Performance is measured by the word error rate, which counts substitutions, insertions, and deletions.
5.7.1 Dictation
5.7.2 Speech Recognition for Dialog Systems
5.7.3 Transcription for Audio Indexation
5.8 Portability and Language Dependencies
References
The THISL Broadcast News Retrieval System,
Experiments in Vocal Tract Normalization,
A Compact Model for Speaker Adaptation Training,
One Pass Cross Word Decoding for Large Vocabularies Based on a Lexical Tree Search Organization, 4
The Forward-Backward Search Strategy for Real-Time Speech Recognition,
Preliminary results on the performance of a system for the automatic recognition of continuous speech,
Acoustic Markov Models used in the Tangora Speech Recognition System, 1
A Maximum Likelihood Approach to Continuous Speech Recognition, PAMI-5
A Fast Match for Continuous Speech Recognition Using Allophonic Models, 1
Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems,
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, 41
Vector quantization for efficient computation of continuous density likelihoods, 2
A Baseline for the Transcription of Italian Broadcast News,
Word and acoustic confidence annotation for large vocabulary speech recognition,
Improvements in Language, Lexical and Phonetic Modeling in Sphinx-II,
An empirical study of smoothing techniques for language modeling, 13
Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion,
The Role of Word-Dependent Coarticulatory Effects in a Phoneme-Based Speech Recognition System, 3
Statistical Language Modelling using the CMU-Cambridge Toolkit,
Comparison of Parametric Representations of Monosyllabic Word Recognition in Continuously Spoken Sentences, 28
Maximum Likelihood from Incomplete Data via the EM Algorithm, 39
Human Speech Recognition Performance on the 1995 CSR Hub-3 Corpus,
Genones: Optimizing the Degree of Tying in a Large Vocabulary HMM-based Speech Recognizer, 1
Speaker adaptation using constrained estimation of Gaussian mixtures, 3
Sonograph and Sound Mechanics, 22
Automatic Recognition of Phonetic Patterns in Speech, 30
Human Speech Recognition Performance on the 1994 CSR Spoke 10 Corpus,
Comparison of speaker recognition methods using statistical features and dynamic features, ASSP-29
An improved approach to hidden Markov model decomposition of speech and noise,
Robust Continuous Speech Recognition using Parallel Model Combination, 9
Cluster Adaptive Training for Speech Recognition,
Semi-Tied Covariance Matrices for Hidden Markov Models, 7
Transcribing Broadcast News: The LIMSI Nov96 Hub4 System,
Spoken Language component of the MASK Kiosk,
Speech Recognition for an Information Kiosk,
Partitioning and Transcription of Broadcast News Data, 5
Developments in Continuous Speech Dictation using the ARPA WSJ Task,
Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, 2
The LIMSI Broadcast News Transcription System, 37
A Rapid Match Algorithm for Continuous Speech Recognition,
A Probabilistic Approach to Confidence Measure Estimation and Evaluation,
Real-time Telephone-based Speech Recognition in the Jupiter Domain, 1
SWITCHBOARD: Telephone Speech Corpus for Research and Development,
The Population Frequencies of Species and the Estimation of Population Parameters, 40
A tree search strategy for large-vocabulary continuous speech recognition, 1
Linear Discriminant Analysis for Improved Large Vocabulary Continuous Speech Recognition, 1
Segment Generation and Clustering in the HTK Broadcast News Transcription System,
News-on-Demand - 'An Application of Informedia Technology',
The ATIS Spoken Language Systems Pilot Corpus,
Perceptual linear predictive (PLP) analysis of speech, 87
Large vocabulary continuous speech recognition using a hybrid connectionist-HMM system,
Signal Representation
Subphonetic Modeling with Markov States - Senone, 1
Predicting Unseen Triphones with Senones, II
Continuous Speech Recognition by Statistical Methods, 64
Statistical Methods for Speech Recognition,
A Dynamic Language Model for Speech Recognition,
Speech Based Video Retrieval,
Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains, 64
Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer, ASSP-35
Unsupervised Training of a Speech Recognizer: Recent Experiments, 6
The 1995 Abbot hybrid connectionist-HMM large-vocabulary recognition system,
Improved Clustering Techniques for Class-Based Statistical Language Modelling,
Improved backing-off for n-gram language modeling, 1
Design of the 1994 CSR Benchmark Tests,
Toward Automatic Recognition of Broadcast News,
Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, 26
Eigenvoices for Speaker Adaptation,
On Designing Pronunciation Lexicons for Large Vocabulary, Continuous Speech Recognition, 1
Speech Recognition of European Languages,
Continuous Speech Recognition at LIMSI,
A Phone-based Approach to Non-Linguistic Speech Feature Identification, 9
Lightly Supervised and Unsupervised Acoustic Model Training, 16
Development of Spoken Language Corpora for Travel Information, 3
Large-vocabulary speaker-independent continuous speech recognition: The SPHINX system,
Speaker Normalization Using Efficient Frequency Warping Procedures, 1
Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models, 9
Maximum Likelihood Estimation for Multivariate Observations of Markov Sources, IT-28
Speech recognition by machines and humans, 22
Fast Speaker Change Detection for Broadcast News Transcription and Indexing, 3
Multi-site Data Collection for a Spoken Language Corpus,
Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks,
Subspace distribution clustering for continuous observation density hidden Markov models,
Spoken Language Processing and Human-Machine Communication in the European Union Programs,
An overview of EU programs related to conversational/interactive systems,
Algorithms for Bigram and Trigram Clustering,
News on Demand, 43
Named Entity Extraction from Broadcast News,
Full Expansion of Context-Dependent Networks in Large Vocabulary Speech Recognition,
Large-Vocabulary Dictation using SRI's Decipher Speech Recognition System: Progressive Search Techniques, II
The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition, ASSP-32
Improvements in Beam Search for 10000-Word Continuous Speech Recognition, I
Single-Tree Method for Grammar-Directed Search, 2
The Use of Decision Trees with Context Sensitive Phoneme Modelling,
A One Pass Decoder Design for Large Vocabulary Recognition,
Recent Advances in Japanese Broadcast News Transcription, 2
Modeling Inverse Covariance Matrices by Basis Expansion,
Language-model look-ahead for large vocabulary speech recognition,
A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition, 11
The Role of Phonological Rules in Speech Understanding Research, ASSP-23
Continuous Word Recognition Based on the Stochastic Segment Model,
1993 Benchmark Tests for the ARPA Spoken Language Program,
1994 Benchmark Tests for the ARPA Spoken Language Program,
1995 Hub-3 Multiple Microphone Corpus Benchmark Tests,
1998 Broadcast News Benchmark Test Results: English and Non-English Word Error Rate Performance Measures,
An efficient A* stack decoder algorithm for continuous speech recognition with a stochastic language model,
Improved Discriminative Training Techniques For Large Vocabulary Continuous Speech Recognition,
Evaluation of Spoken Language Systems: The ATIS Domain,
An Introduction to Hidden Markov Models, ASSP-3
Efficient Algorithms for Speech Recognition,
Stochastic pronunciation modelling from hand-labelled phonetic corpora, 29
Improvements in Stochastic Language Modeling,
Adaptive Statistical Language Modeling,
Two Decades of Statistical Language Modeling: Where Do We Go From Here?, 88
Language-independent and language-adaptive acoustic modeling for speech recognition, 35
Memory-efficient LVCSR search using a one-pass stack decoder, 14
New uses for N-Best Sentence Hypotheses within the BYBLOS Speech Recognition System, I
Improved Hidden Markov Modeling of Phonemes for Continuous Speech Recognition, 3
NYU Language Modeling Experiments for the 1995 CSR Evaluation,
A Markov Random Field Approach to Bayesian Speaker Adaptation,
Modeling Those F-Conditions - Or Not,
Scalable backoff language models, 1
Automatic Segmentation, Classification and Clustering of Broadcast News Audio,
Evaluation of word confidence for speech recognition systems, 13
Entropy-based Pruning of Backoff Language Models,
Four-level Tied Structure for Efficient Representation of Acoustic Modeling,
An Investigation into Vocal Tract Length Normalization,
Human Benchmarks for Speaker Independent Large Vocabulary Recognition Performance,
Speech discrimination by dynamic programming, 4
Element-wise recognition of continuous speech composed of words from a specified dictionary, 7
Verbmobil: Translation of Face-to-Face Dialogs, Plenary
Multilinguality in Speech and Spoken Language Systems, 88
Probabilistic Models for Topic Detection and Tracking, 1
FIGURE 6.12
A phrase structure tree based on a dependency structure.
6.6.2 Summarization of Multiple Utterances
6.6.3 Evaluation
6.6.3.1 Word Network of Manual Summarization Results for Evaluation
6.6.3.2 Evaluation Data
6.6.3.3 Training Data for Summarization Models
FIGURE 6.13
Summarizations of each utterance at 70% summarization ratio.
6.6.3.4 Evaluation Results
FIGURE 6.14
Article summarizations at 30% summarization ratio.
6.6.4 Discussion
6.7 Spontaneous Speech Recognition and Understanding ResearchIssues
6.7.1 Language Models and Corpora
6.7.2 Message-driven Speech Recognition and Understanding
FIGURE 6.15
A communication-theoretic view of speech generation and recognition.
6.7.3 Statistical Approaches and Speech Science
6.7.4 Research on the Human Brain
6.7.5 Dynamic Spectral Features
FIGURE 6.16
Speech-generation and speech-perception processes.
6.8 Conclusion
References
7
Speaker Authentication
Qi Li and Biing-Hwang Juang
Bell Labs; Avaya Labs Research
CONTENTS
7.1 Introduction
FIGURE 7.1
Speaker authentication approaches.
Speaker authentication
7.1.1 Speaker Recognition and Verification
Speaker recognition encompasses speaker verification, which is a hypothesis testing problem, and speaker identification, which is a classification problem.
FIGURE 7.2
A speaker verification system.
In the direct method, the speaker's identity claim is verified directly from the speech signal. Depending on the text used, an SV system may be a fixed pass-phrase system, a text-prompted system, or a text-independent SV system; evaluations are further characterized as closed tests or open tests.
7.1.2 Verbal Information Verification
FIGURE 7.3
An example of verbal information verification by asking sequential questions. (Similar sequential tests can also be applied in speaker verification and other biometric or multi-modality verification.)
indirect method
7.2 Pattern Recognition in Speaker Authentication
7.2.1 Bayesian Decision Theory
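This section rests on the standard Bayes decision rule; in textbook form (the notation here is the conventional one and may differ from the chapter's), for classes $\omega_i$ and observation $x$:

```latex
P(\omega_i \mid x) \;=\; \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)},
\qquad
\hat{\omega} \;=\; \arg\max_i\; P(\omega_i \mid x)
```

For the two-class accept/reject decision in verification, this reduces to comparing the likelihood ratio $p(x \mid \omega_1) / p(x \mid \omega_2)$ against a threshold determined by the priors and the costs of the two error types.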
7.2.2 Stochastic Models for Stationary Process
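A common choice of stochastic model for a stationary process in speaker authentication is a mixture of Gaussian densities, trained per speaker. A minimal 1-D sketch of likelihood evaluation; real systems use multivariate mixtures over cepstral feature vectors:

```python
import math

def gmm_log_pdf(x, weights, means, variances):
    """Log density of scalar observation x under a 1-D Gaussian mixture:
    log sum_k w_k N(x; mu_k, var_k), computed stably via log-sum-exp."""
    comps = [math.log(w)
             - 0.5 * math.log(2 * math.pi * var)
             - (x - mu) ** 2 / (2 * var)
             for w, mu, var in zip(weights, means, variances)]
    m = max(comps)
    return m + math.log(sum(math.exp(c - m) for c in comps))

def sequence_log_likelihood(frames, weights, means, variances):
    """Frames are modeled as independent draws from the mixture, so the
    sequence log-likelihood is the sum of per-frame log densities."""
    return sum(gmm_log_pdf(x, weights, means, variances) for x in frames)
```

Verification then compares this score (normalized by a background model) against a decision threshold.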
7.2.3 Stochastic Models for Non-Stationary Process
10
Semantic Information Processing of Spoken Language – How May I Help You?℠
A. L. Gorin, A. Abella, T. Alonso, G. Riccardi, and J. H. Wright, AT&T Laboratories
CONTENTS
10.1 Introduction
AT&T’s ‘How May I Help You?’℠
“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; … These semantic aspects of communication are irrelevant to the engineering problem.”
confirm, clarify
“Do you want to make a collect call?”
“Charge this call please”
“How do you want to charge this call, to a credit card or to a third number?”
“What is your home phone number?”
Construct Algebra
dialog motivators
inheritance hierarchy
‘is a’ and ‘has a’
10.2 Call-Classification
‘press one if you want x, press two if you want y’
‘please say collect, calling card’ ‘press or say one if you want x’
‘How may I help you?’
“I want to reverse the charges on this call.”
“Can you tell me what time it is in Tokyo?”
“I was trying to call my sister and dialed a wrong number.”
“I’ve been trying to dial this number all day and can’t get through.”
“How much money do I owe you?”
“I don’t recognize this phone call to Tallahassee on October 4.”
“What’s this charge for one dollar and fifty cents?”
“I have a question about my bill.”
FIGURE 10.1 Call classification and routing in HMIHY.
‘How may I help you?’
‘How may I help you?’
FIGURE 10.2 Inheritance hierarchy of task knowledge in operator services.
perplexity
PP(W) = P(w₁ w₂ … wₙ)^(−1/n)
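Perplexity is the standard intrinsic measure for a language model: the inverse probability of the word sequence, normalized by its length. A minimal sketch (the per-word probabilities passed in are hypothetical, as a model would supply them):

```python
import math

def perplexity(probs):
    """Perplexity of a sequence given the model probability of each word.

    PP = P(w1..wn)^(-1/n), computed in log space for numerical stability.
    """
    n = len(probs)
    log_p = sum(math.log(p) for p in probs)
    return math.exp(-log_p / n)

# A model assigning probability 0.25 to each of 4 words behaves like a
# uniform choice among 4 alternatives: PP = (0.25^4)^(-1/4) = 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # → 4.0
```

Lower perplexity means the model found the utterances less surprising, which usually correlates with lower recognition error.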
Evaluating Call Classification.
FIGURE 10.3 Histogram of utterance lengths.
false rejection
correct classification
true rejection rate
Remark:
“I want to know how to pay my bill”
10.3 Language Modeling for Recognition and Understanding
W = w₁ w₂ … wₙ
P(W) = Πᵢ P(wᵢ | w₁ … wᵢ₋₁)
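The recognizer's language model assigns P(W) by chaining conditional n-gram probabilities. A minimal maximum-likelihood bigram sketch over a hypothetical two-sentence corpus (real systems add smoothing for unseen events):

```python
from collections import Counter

def train_bigram(sentences):
    """MLE bigram model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])              # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))  # pair counts
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

corpus = ["I want to make a collect call",
          "I want to make a card call"]
p = train_bigram(corpus)
print(p("make", "a"))     # "make" is always followed by "a" → 1.0
print(p("a", "collect"))  # "a" precedes "collect" in 1 of 2 cases → 0.5
```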
‘I want to make a’ ‘collect call’ ‘card call’
‘wrong’ ‘wrong number’
‘dialed a wrong number’
‘dialed a wrong number’ ‘dialed the wrong number’
FIGURE 10.4 A salient grammar fragment.
salient grammar fragments
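HMIHY bases call classification on salient phrase fragments detected in the recognized utterance. The fragment table below is hypothetical (the system described here learns the fragments and their association strengths from data); this sketch only illustrates the lookup step:

```python
# Hypothetical salient fragments and the call types they indicate.
SALIENT = {
    "collect call": "COLLECT",
    "reverse the charges": "COLLECT",
    "calling card": "CALLING_CARD",
    "wrong number": "BILLING_CREDIT",
    "my bill": "BILLING_QUESTION",
}

def classify(utterance):
    """Return call types whose salient fragments occur in the utterance."""
    text = utterance.lower()
    return sorted({label for frag, label in SALIENT.items() if frag in text})

print(classify("I want to reverse the charges on this call"))  # → ['COLLECT']
print(classify("I dialed a wrong number"))  # → ['BILLING_CREDIT']
```

When no fragment matches (or several conflict), a dialog system can fall back on a clarification question rather than routing blindly.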
User: yeah I’m not AT&T WIRELESS PHONE and when I got and she told me that I would be switched to 7 CENTS A MINUTES FOR ALL my AT&T long distance on that I was on 10 10 cents ONE RATE PLAN
FIGURE 12.1 A sample detection error tradeoff (DET) curve for the TDT tracking task with one training story (Nₜ = 1).
minimum
12.2 Basic Topic Models
12.2.1 Vector Space
sim(d₁, d₂) = (d₁ · d₂) / (‖d₁‖ ‖d₂‖)
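In the vector space model, stories are term-weight vectors compared by cosine similarity. A minimal sketch over sparse dictionary vectors (the weights shown are hypothetical; practical trackers use tf-idf weighting):

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

a = {"oklahoma": 2.0, "bombing": 1.0}
b = {"oklahoma": 1.0, "trial": 1.0}
print(round(cosine(a, b), 3))  # → 0.632
```

A story is then judged on-topic if its cosine to the topic representation exceeds a tuned threshold.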
12.2.2 Language Models
P(w | topic) = λ · P_ML(w | topic) + (1 − λ) · P(w | background)
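In the language-model approach, a story is scored by the log-likelihood ratio of a topic model against a background (general English) model, with the topic model smoothed by the background. A sketch under those assumptions; λ and all the counts are hypothetical:

```python
import math
from collections import Counter

def topic_score(story, topic_counts, bg_counts, lam=0.7):
    """Sum over story words of log[P(w|topic) / P(w|background)], where
    P(w|topic) = lam * P_ml(w|topic) + (1 - lam) * P(w|background)."""
    t_total = sum(topic_counts.values())
    b_total = sum(bg_counts.values())
    score = 0.0
    for w in story:
        p_bg = bg_counts.get(w, 1) / b_total  # crude floor for unseen words
        p_t = lam * topic_counts.get(w, 0) / t_total + (1 - lam) * p_bg
        score += math.log(p_t / p_bg)
    return score

topic = Counter({"oklahoma": 5, "bombing": 3, "mcveigh": 2})
background = Counter({"the": 50, "oklahoma": 1, "bombing": 1, "game": 10})
print(topic_score(["oklahoma", "bombing"], topic, background) > 0)  # True
print(topic_score(["game", "the"], topic, background) > 0)          # False
```

Positive scores (story words better explained by the topic than by general English) suggest the story belongs to the topic.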
12.3 Implementing the Models
12.3.1 Named Entities
President Bush / George Bush
12.3.2 Document Expansion
12.3.3 Clustering
12.3.4 Time Decay
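Time decay exploits the fact that stories about the same event cluster in time: comparisons between stories far apart are down-weighted. One plausible form, an exponential schedule (the shape and the half-life here are illustrative, not the chapter's exact formulation):

```python
def decayed_similarity(sim, days_apart, half_life=30.0):
    """Down-weight a raw similarity score by the time gap between two
    stories; the weight halves every `half_life` days (hypothetical)."""
    return sim * 0.5 ** (days_apart / half_life)

print(round(decayed_similarity(0.8, 0), 3))   # same day: no penalty
print(round(decayed_similarity(0.8, 30), 3))  # one half-life later
```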
12.4 Comparing Models
12.4.1 Nearest Neighbors
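A nearest-neighbor comparison labels a new story by majority vote among its most similar training stories. A self-contained sketch using raw term counts as a stand-in for the tf-idf weights a real tracker would use:

```python
import math
from collections import Counter

def knn_label(story, labeled, k=3):
    """Label a story by majority vote among its k most similar labeled
    stories, using cosine similarity over term counts."""
    def cos(a, b):
        dot = sum(c * b.get(t, 0) for t, c in a.items())
        na = math.sqrt(sum(c * c for c in a.values()))
        nb = math.sqrt(sum(c * c for c in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = Counter(story.split())
    ranked = sorted(labeled, key=lambda dl: cos(q, Counter(dl[0].split())),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [("oklahoma city bombing trial", "bombing"),
         ("bombing suspect arrested", "bombing"),
         ("school shooting in oregon", "shooting")]
print(knn_label("trial of bombing suspect", train, k=3))  # → bombing
```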
12.4.2 Decision Trees
12.4.3 Model-to-Model
12.5 Miscellaneous Issues
12.5.1 Deferral
12.5.2 Multi-modal Issues
third
12.5.3 Multi-lingual Issues
FIGURE 12.2 Screen snapshot of the Lighthouse system that was created to portray TDT topic clusters and their relationships.
12.6 Using TDT Interactively
12.6.1 Demonstrations
12.6.2 Timelines
Oklahoma
Oklahoma McVeigh Simpson
FIGURE 12.3 Overview of January–June 1998. The topic labeled monica lewinsky allegation is the highest ranked topic. The pop-up on oregon school shooting shows significant named entities for that event. The other pop-up displays a sub-menu for obtaining more information on the name kip kinkel.
12.7 Modeling Events
12.8 Conclusion
References
Proceedings of Conference on Information Retrieval Research (SIGIR)
Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop
Proceedings of Conference on Information Retrieval Research (SIGIR)
Information Retrieval
Topic Detection and Tracking: Event-based Information Organization
In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL’98)
Proceedings for Empirical Methods in NLP
Proceedings of the Text Retrieval Conference (TREC-3)
Proceedings of the DARPA Broadcast News Workshop
Topic Detection and Tracking: Event-based Information Organization
Topic Detection and Tracking: Event-based Information Organization
Proceedings of the DELOS-NSF Workshop on Personalization and Recommender Systems in Digital Libraries
Topic Detection and Tracking: Event-based Information Organization
Topic Detection and Tracking: Event-based Information Organization
Proceedings of the Text Retrieval Conference (TREC-2)
Topic Detection and Tracking: Event-based Information Organization
Proceedings of the Human Language Technology Conference (HLT)
Proceedings of the Text Retrieval Conference (TREC-8)
Proceedings of ACM SIGIR Conference on Research in Information Retrieval
Topic Detection and Tracking: Event-based Information Organization
Proceedings of the IEEE Symposium on Information Visualization 2000 (InfoVis 2000)
Foundations of Statistical Natural Language Processing
EuroSpeech
Proceedings of the DARPA Broadcast News Workshop
Proceedings of the 2000 Speech Transcription Workshop
Proceedings of the DARPA Broadcast News Workshop
On-line New Event Detection, Clustering, and Tracking
Advances in Information Retrieval: Recent Research from the CIIR
Proceedings of the DARPA Broadcast News Workshop
Proceedings of SIGIR
A Language Modeling Approach to Information Retrieval
Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries (ECDL)
Proceedings of the Text Retrieval Conference (TREC-9)
Introduction to Modern Information Retrieval
Topic Detection and Tracking: Event-based Information Organization
Proceedings of the DARPA Broadcast News Workshop
Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM99)
Proceedings of SIGIR
Proceedings of KDD 2000 Conference
Information Retrieval
Proceedings of the Text Retrieval Conference (TREC-8)
Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop
ACM Transactions on Information Systems (TOIS)
Topic Detection and Tracking: Event-based Information Organization