... NOT JUST ANOTHER PUBLIC DOMAIN SOFTWARE PROJECT ... UAB – CIS
Joseph Picone
Institute for Signal and Information Processing
Department of Electrical and Computer Engineering
Mississippi State University
• Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi 39762; Tel: 662-325-3149; Fax: 662-325-2298; Email: [email protected]
• URL: www.isip.msstate.edu/publications/seminars/external/2003/uab
• Acknowledgement: Supported by several NSF grants (e.g., EIA-9809300).
... NOT JUST ANOTHER PUBLIC DOMAIN SOFTWARE PROJECT ...
• Origins date to work at Texas Instruments in 1985.
• The Institute for Signal and Information Processing (ISIP) was created in 1994 at Mississippi State University with a simple vision to develop public domain software.
• Key differentiating characteristics of this project are:
Public Domain: unrestricted software (including commercial use); no copyrights, licenses, or research-only restrictions.
Increase Participation: competitive technology plus application-specific toolkits reduce start-up costs.
Lasting Infrastructure: Support, training, education, dissemination of information are priorities.
APPROACH: FLEXIBLE YET EFFICIENT
Research:
Rapid Prototyping
“Fair” Evaluations
Ease of Use
Lightweight Programming
Efficiency:
Memory
Hyper-real time training
Parallel processing
Data intensive
Research:
• Matlab
• Octave
• Python
ASR:
• HTK
• SPHINX
• CSLU
ISIP:
• IFCs
• Java Apps
• Toolkits
APPROACH: PLATFORMS AND COMPILERS
Supported platforms:
• Linux (Red Hat 6.1 or greater)
• Sun x86 Solaris 7 or greater
• Windows (Cygwin tools)
• (Recently phased out Sun SPARC)
Languages and Compilers:
• Remember Lisp? Java? Tk/Tcl?
• Avoid a reliance on Perl!
• C++ was the obvious choice as a tradeoff between stability, standardization, and efficiency.
DOCUMENTATION AND WORKSHOPS
• Extensive online software documentation, tutorials, and training materials
• Self-documenting software
• Over 100 students and professionals representing 25 countries and 75 institutions have attended our workshops
• Over a dozen companies have trained in our lab
APPROACH
• Metadata extraction from conversational speech
• Automatic gisting and intelligence gathering
• Speech to text is the core technology challenge
• Machines vs. humans
• Real-time audio indexing
• Time-varying channel
• Dynamic language model
• Multilingual and cross-lingual
APPLICATIONS: REAL-TIME INFORMATION EXTRACTION
• In-vehicle dialog systems improve information access.
• Advanced user interfaces enhance workforce training and increase manufacturing efficiency.
• Noise robustness in both environments to improve recognition performance
• Advanced statistical models and machine learning technology
APPLICATIONS: DIALOG SYSTEMS FOR THE CAR
APPLICATIONS: SPEAKER RECOGNITION
• Voice verification for calling card security
• First widespread deployment of recognition technology in the telephone network
• Extension of same statistical modeling technology used in speech recognition
APPLICATIONS: SPEAKER STRESS AND FATIGUE
• Recognition of emotion, stress, fatigue, and other voice qualities is possible from enhanced descriptions of the speech signal
• Fundamentally the same statistical modeling problem as other speech applications
• Fatigue analysis from voice is under development through an SBIR grant
• RVMs yield a large reduction in the parameter count while attaining superior performance
• Computational cost for RVMs lies mainly in training, but it is still prohibitive for larger training sets
EXPERIMENTAL RESULTS: SVM/RVM ALPHADIGIT COMPARISON

Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM        16.4%        257                 0.5 hours       30 mins
RVM        16.2%        12                  30 days         1 min
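The sparsity behind the RVM's small parameter count can be sketched with Tipping-style sparse Bayesian regression. The toy NumPy example below (synthetic sinc data, a fixed noise precision, and illustrative kernel width and pruning threshold; none of it from the ISIP toolkit) shows most basis-function weights being driven to zero by the evidence re-estimation loop.

```python
import numpy as np

# Toy sketch of sparse Bayesian (RVM-style) regression after Tipping (2001).
# Data, kernel width, and thresholds are illustrative assumptions; this is
# not the ISIP implementation.
rng = np.random.default_rng(0)

N = 100
X = rng.uniform(-10.0, 10.0, N)
t = np.sinc(X / np.pi) + 0.05 * rng.standard_normal(N)  # sin(x)/x plus noise

def design(X, centers, width=2.0):
    """RBF design matrix with a bias column: one basis per training point."""
    Phi = np.exp(-((X[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))
    return np.hstack([np.ones((len(X), 1)), Phi])

Phi = design(X, X)
M = Phi.shape[1]

alpha = np.ones(M)        # per-weight prior precisions (these drive sparsity)
beta = 1.0 / 0.05 ** 2    # noise precision, held fixed here for simplicity
keep = np.arange(M)       # indices of surviving basis functions

for _ in range(500):
    P = Phi[:, keep]
    Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
    mu = beta * Sigma @ P.T @ t
    gamma = 1.0 - alpha[keep] * np.diag(Sigma)   # "well-determinedness" of each weight
    alpha[keep] = gamma / (mu ** 2 + 1e-12)      # evidence re-estimation
    keep = keep[alpha[keep] < 1e6]               # prune weights driven to zero

print(f"relevance vectors kept: {len(keep)} of {M} basis functions")
```

The weights whose precisions diverge are the irrelevant ones; the handful of basis functions that survive play the same role as the small parameter counts the RVM rows report above.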
EXPERIMENTAL RESULTS: PRACTICAL RISK MINIMIZATION?
• Reduction of complexity at the same level of performance is interesting:
• Results hold across tasks
• RVMs have been trained on 100,000 vectors
• Results suggest integrated training is critical
• Risk minimization provides a family of solutions:
• Is there a better solution than minimum risk?
• What is the impact on complexity and robustness?
• Applications to other problems?
• Speech/Non-speech classification?
• Speaker adaptation?
• Language modeling?
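One concrete way to see the complexity question raised above: for a kernel SVM, the effective parameter count is the number of support vectors, and risk minimization alone does not keep that number small. A scikit-learn sketch on synthetic two-class data (an illustrative toy problem, not the Alphadigits task) makes the count easy to inspect:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class problem; sizes and SVM settings are illustrative.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# An SVM's model size is its support-vector count.
n_sv = clf.support_vectors_.shape[0]
print(f"support vectors: {n_sv} of {len(X)} training examples")
```

Comparing this count against an RVM's relevance-vector count on the same data is exactly the complexity-versus-performance tradeoff the bullets above ask about.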
EXPERIMENTAL RESULTS: PRELIMINARY RESULTS
Approach            Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM                 15.5%        994                 3 hours         1.5 hours
RVM (Constructive)  14.8%        72                  5 days          5 mins
RVM (Reduction)     14.8%        74                  6 days          5 mins
• Data increased to 10,000 training vectors
• The Reduction method has been trained on up to 100k vectors (on a toy task); this is not possible for the Constructive method
SUMMARY: RELEVANT SOFTWARE RESOURCES
• Pattern Recognition Applet: compare popular algorithms on standard or custom data sets
• Speech Processing Toolkits: speech recognition, speaker recognition and verification, statistical modeling, machine learning, state of the art toolkits
• Fun Stuff: have you seen our commercial on the Home Shopping Channel?
• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches
SUMMARY: BRIEF BIBLIOGRAPHY
Applications to Speech Recognition:
1. J. Hamaker and J. Picone, “Advances in Speech Recognition Using Sparse Bayesian Methods,” submitted to the IEEE Transactions on Speech and Audio Processing, January 2003.
2. A. Ganapathiraju, J. Hamaker and J. Picone, “Applications of Risk Minimization to Speech Recognition,” submitted to the IEEE Transactions on Signal Processing, July 2003.
3. J. Hamaker, J. Picone, and A. Ganapathiraju, “A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines,” Proceedings of the International Conference of Spoken Language Processing, vol. 2, pp. 1001-1004, Denver, Colorado, USA, September 2002.
4. J. Hamaker, Sparse Bayesian Methods for Continuous Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, December 2003.
5. A. Ganapathiraju, Support Vector Machines for Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.
Influential work:
6. M. Tipping, “Sparse Bayesian Learning and the Relevance Vector Machine,” Journal of Machine Learning Research, vol. 1, pp. 211-244, June 2001.
7. D. J. C. MacKay, “Probable networks and plausible predictions --- a review of practical Bayesian methods for supervised neural networks,” Network: Computation in Neural Systems, 6, pp. 469-505, 1995.
8. D. J. C. MacKay, Bayesian Methods for Adaptive Models, Ph. D. thesis, California Institute of Technology, Pasadena, California, USA, 1991.
9. E. T. Jaynes, “Bayesian Methods: General Background,” Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25, Cambridge Univ. Press, Cambridge, UK, 1986.
10. V.N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA, 1998.
11. V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA, 1995.
12. C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.