Feature Extraction Using PCA

Speech Recognition

Definition

• Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words.

• The recognised words can be an end in themselves, as in applications such as command and control, data entry, and document preparation.

• They can also serve as the input to further linguistic processing in order to achieve speech understanding.

Speech Processing

• Signal processing:

– Convert the audio wave into a sequence of feature vectors

• Speech recognition:

– Decode the sequence of feature vectors into a sequence of words

• Semantic interpretation:

– Determine the meaning of the recognized words

• Dialog Management:

– Correct errors and help get the task done

• Response Generation:

– What words to use to maximize user understanding

• Speech synthesis (Text to Speech):

– Generate synthetic speech from a ‘marked-up’ word string

Many kinds of Speech Recognition Systems

• Speech recognition systems can be characterised by many parameters.

• An isolated-word (Discrete) speech recognition system requires that the speaker pauses briefly between words, whereas a continuous speech recognition system does not.

Spontaneous vs. Scripted

• Spontaneous speech contains disfluencies, pauses and restarts, and is much more difficult to recognise than speech read from a script.

Enrolment

• Some systems require speaker enrolment: a user must provide samples of his or her speech before using them. Other systems are said to be speaker-independent, in that no enrolment is necessary.

Signal Variability

• Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal.

• The acoustic realisations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear.

• This phonetic variability is exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in English.

• At word boundaries, contextual variations can be quite dramatic: devo andare sounds like devandare in Italian.

More

• Acoustic variability can result from changes in the environment as well as in the position and characteristics of the transducer.

• Within-speaker variability can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality.

• Differences in socio-linguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variability.

What is a speech recognition system?

• Speech recognition is generally used as a human-computer interface for other software. When it functions in this role, three primary tasks need to be performed.

• Pre-processing, the conversion of spoken input into a form the recogniser can process.

• Recognition, the identification of what has been said.

• Communication, to send the recognised input to the application that requested it.

How is pre-processing performed?

• To understand how the first of these functions is performed, we must examine:

• Articulation, the production of the sound.

• Acoustics, the stream of speech itself.

• Auditory perception, which characterises the ability to understand spoken input.

Acoustics

• Articulation provides valuable information about how speech sounds are produced, but a speech recognition system cannot analyse movements of the mouth.

• Instead, the data source for speech recognition is the stream of speech itself.

• This is an analogue signal: a sound stream, a continuous flow of sound waves and silence.

Important Features (Acoustics)

• Four important features of the acoustic analysis of speech are (Carter, 1984):

• Frequency, the number of vibrations per second a sound produces.

• Amplitude, the loudness of the sound.

• Harmonic structure: added to the fundamental frequency of a sound are other frequencies that contribute to its quality or timbre.

• Resonance.
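
As a rough illustration of the first two features, the sketch below generates a pure tone and then estimates its frequency and amplitude from the samples alone. The function names, the 440 Hz tone and the 8 kHz sampling rate are assumptions made for this example; counting zero crossings is only one simple way to estimate frequency and works well only for clean, single-tone signals.

```python
import math

def sine_wave(freq_hz, amplitude, sample_rate, duration_s):
    """Generate a pure tone as a list of samples (illustrative input only)."""
    n = int(sample_rate * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

def estimate_frequency(samples, sample_rate):
    """Estimate frequency (Hz) by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / sample_rate)

def rms_amplitude(samples):
    """Root-mean-square amplitude, a common proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

tone = sine_wave(freq_hz=440, amplitude=0.5, sample_rate=8000, duration_s=1.0)
print(estimate_frequency(tone, 8000))  # close to 440
print(rms_amplitude(tone))             # close to 0.5 / sqrt(2)
```

Real speech contains many frequencies at once, so practical systems use spectral analysis rather than zero-crossing counts.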

Auditory perception, hearing speech.

• "Phonemes tend to be abstractions that are implicitly defined by the pronunciation of the words in the language. In particular, the acoustic realisation of a phoneme may heavily depend on the acoustic context in which it occurs. This effect is usually called co-articulation", (Ney, 1994).

• The way a phoneme is pronounced can be affected by its position in a word, neighbouring phonemes and even the word's position in a sentence. This effect is called the co-articulation effect.

• The variability in the speech signal caused by co-articulation and other sources makes speech analysis very difficult.

Human Hearing

• The human ear can detect frequencies from 20Hz to 20,000Hz but it is most sensitive in the critical frequency range, 1000Hz to 6000Hz, (Ghitza, 1994).

• Recent research has shown that humans do not process individual frequencies.

• Instead, we hear groups of frequencies, such as formant patterns, as cohesive units and we are capable of distinguishing them from surrounding sound patterns (Carrell and Opie, 1992).

• This capability, called auditory object formation, or auditory image formation, helps explain how humans can discern the speech of individual people at cocktail parties and separate a voice from noise over a poor telephone channel, (Markowitz, 1995).

Pre-processing Speech

• Like all sounds, speech is an analogue waveform. In order for a recognition system to act on speech, it must be represented digitally.

• All noise patterns, silences and co-articulation effects must be captured.

• This is accomplished by digital signal processing. The way the analogue speech is processed is one of the most complex elements of a Speech Recognition system.
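
A minimal sketch of the digitisation step, assuming the analogue signal can be modelled as a function of time with values in the range -1 to 1. The function name `digitise`, the 8 kHz rate and the 16-bit depth are illustrative choices, not values taken from the slides.

```python
import math

def digitise(analogue, sample_rate, duration_s, bits=16):
    """Sample a continuous signal at fixed intervals and quantise each
    sample to a signed integer of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1          # 32767 for 16 bits
    n = int(sample_rate * duration_s)
    samples = []
    for i in range(n):
        value = analogue(i / sample_rate)    # sampling: read at time i / rate
        value = max(-1.0, min(1.0, value))   # clip to the representable range
        samples.append(round(value * max_level))  # quantisation
    return samples

# Hypothetical "analogue" input: a 440 Hz tone at half of full scale.
def tone(t):
    return 0.5 * math.sin(2 * math.pi * 440 * t)

pcm = digitise(tone, sample_rate=8000, duration_s=0.01)
print(len(pcm), pcm[:5])
```

Rounding to the nearest level is itself a small distortion, which is why the bit depth matters for the quality of the representation.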

Recognition Accuracy

• To achieve high recognition accuracy the speech representation process should (Markowitz, 1995):

• Include all critical data.

• Remove Redundancies.

• Remove Noise and Distortion.

• Avoid introducing new distortions.

Signal Representation

• In statistically based automatic speech recognition, the speech waveform is sampled at a rate between 6.6 kHz and 20 kHz and processed to produce a new representation as a sequence of vectors containing values of what are generally called parameters.

• The vectors typically comprise between 10 and 20 parameters, and are usually computed every 10 or 20 milliseconds.
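
The framing described above can be sketched as follows. This is only an illustration: the 16 kHz rate, the 20 ms window and the 10 ms step are assumed values within the ranges quoted, and the two parameters computed per frame (log energy and zero-crossing rate) are simple stand-ins for the 10 to 20 parameters a real front end would produce.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=20, step_ms=10):
    """Cut the sampled waveform into overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, step)]

def frame_features(frame):
    """Turn one frame into a small parameter vector (illustrative only)."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-12)    # small offset avoids log(0)
    sign_changes = sum(1 for a, b in zip(frame, frame[1:])
                       if (a < 0) != (b < 0))
    return [log_energy, sign_changes / len(frame)]

sample_rate = 16000
# One second of a 200 Hz tone as a stand-in for recorded speech.
signal = [math.sin(2 * math.pi * 200 * i / sample_rate)
          for i in range(sample_rate)]
vectors = [frame_features(f) for f in frame_signal(signal, sample_rate)]
print(len(vectors), len(vectors[0]))
```

Each entry of `vectors` is one feature vector; the recogniser works on this sequence rather than on the raw waveform.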

Signal Recognition Technologies

• Signal recognition methodologies fall into four categories; most systems apply one or more of them in the conversion process.

Template Matching

• Template matching is the oldest and least effective method. It is a form of pattern recognition.

• It was the dominant technology in the 1950s and 1960s.

• Each word or phrase in an application is stored as a template.

• The user input is also arranged into templates at the word level and the best match with a system template is found.

• Although Template matching is currently in decline as the basic approach to recognition, it has been adapted for use in word spotting applications. It also remains the primary technology applied to speaker verification, (Moore, 1982).
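
A minimal sketch of finding the best-matching template. The slides do not say how the match is scored; this example assumes dynamic time warping (DTW), a common way to compare sequences spoken at different speeds. The templates and the one-dimensional "feature" values are invented for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def best_template(utterance, templates):
    """Return the word whose stored template is closest to the input."""
    return min(templates,
               key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical stored templates, one per vocabulary word.
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 2, 1, 2, 4]}
utterance = [1, 1, 3, 4, 4, 3, 1]   # a slower rendition of "yes"
print(best_template(utterance, templates))
```

Because every vocabulary item needs its own template, the storage and comparison cost grows with the vocabulary, which is one reason the approach scales poorly.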

Acoustic-Phonetic Recognition

• Acoustic-phonetic recognition functions at the phoneme level. It is an attractive approach to speech as it limits the number of representations that must be stored. In English there are about forty discernible phonemes no matter how large the vocabulary, (Markowitz, 1995).

• Acoustic-phonetic recognition involves three steps:

• Feature extraction.

• Segmentation and labelling.

• Word-level recognition.

• Acoustic-phonetic recognition supplanted template matching in the early 1970s.

• The successful ARPA SUR systems highlighted potential benefits of this approach. Unfortunately, acoustic phonetics was at the time a poorly researched area and many of the expected advances failed to materialise.

• The high degree of acoustic similarity among phonemes, combined with phoneme variability resulting from the co-articulation effect and other sources, creates uncertainty with regard to potential phoneme labels (Cole, 1986).

• If these problems can be overcome, there is certainly an opportunity for this technology to play a part in future speech recognition systems.

Stochastic Processing

• The term stochastic refers to the process of making a sequence of non-deterministic selections from among a set of alternatives.

• The selections are non-deterministic because the choices made during the recognition process are governed by the characteristics of the input and are not specified in advance (Markowitz, 1995).

• Like template matching, stochastic processing requires the creation and storage of models of each of the items that will be recognised.

• It is based on a series of complex statistical or probabilistic analyses. These statistics are stored in a network-like structure called a Hidden Markov Model (HMM), (Paul, 1990).

HMM

• A Hidden Markov Model is made up of states and transitions, which are shown in the diagram. Each state of an HMM holds statistics for a segment of a word, which describe the values and variations that are found in the model of that word segment. The transitions allow for speech variations such as:

• The prolonging of a word segment, which causes several recursive transitions (a state looping back to itself) in the recogniser.

• The omission of a word segment, which causes a transition that skips a state.

• Stochastic processing using Hidden Markov Models is accurate, flexible, and capable of being fully automated, (Rabiner and Juang, 1986).
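
The state and transition structure can be sketched as a small left-to-right HMM in which each state may repeat (a prolonged segment), move to the next state, or skip one state (an omitted segment). All probabilities, the three-symbol alphabet and the function names are assumptions chosen for illustration. The likelihood is computed with the standard forward algorithm.

```python
def left_to_right_transitions(n_states, p_stay=0.5, p_next=0.4, p_skip=0.1):
    """Build a transition matrix with self-loop, next and skip transitions.
    Rows near the end are renormalised so each row sums to 1."""
    trans = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        row = {s: p_stay}                    # recursive transition
        if s + 1 < n_states:
            row[s + 1] = p_next              # move to the next segment
        if s + 2 < n_states:
            row[s + 2] = p_skip              # skip a segment
        total = sum(row.values())
        for target, p in row.items():
            trans[s][target] = p / total
    return trans

def forward_likelihood(observations, trans, emit, start):
    """Probability of the observation sequence under the model."""
    n = len(trans)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs]
                 for s in range(n)]
    return sum(alpha)

symbols = ["a", "b", "c"]
# State s mostly emits symbols[s]; the values are illustrative.
emit = [{sym: (0.8 if i == s else 0.1) for i, sym in enumerate(symbols)}
        for s in range(3)]
trans = left_to_right_transitions(3)
start = [1.0, 0.0, 0.0]

in_order = forward_likelihood(["a", "b", "c"], trans, emit, start)
reversed_order = forward_likelihood(["c", "b", "a"], trans, emit, start)
print(in_order > reversed_order)
```

In a recogniser, one such model would be stored per word (or sub-word unit) and the model with the highest likelihood for the input would be chosen.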

Neural networks

• "if speech recognition systems could learn speech knowledge automatically and represent this knowledge in a parallel distributed fashion for rapid evaluation … such a system would mimic the function of the human brain, which consists of several billion simple, inaccurate and slow processors that perform reliable speech processing", (Waibel and Hampshire, 1989).

• An artificial neural network is a computer program which attempts to emulate the biological functions of the human brain. Neural networks are excellent classification systems, and have been effective with noisy, patterned, variable data streams containing multiple, overlapping, interacting and incomplete cues (Markowitz, 1995).

• Neural networks do not require the complete specification of a problem, learning instead through exposure to large amounts of example data. Neural networks comprise an input layer, one or more hidden layers, and one output layer. The way in which the nodes and layers of a network are organised is called the network's architecture.

• The allure of neural networks for speech recognition lies in their superior classification abilities.

• Considerable effort has been directed towards development of networks to do word, syllable and phoneme classification.
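
The layered architecture can be sketched as a forward pass through a tiny network with one hidden layer. The weights below are arbitrary numbers picked for illustration; in practice they would be learned from example data. The layer sizes, the sigmoid hidden units and the softmax output are assumptions, not details given in the slides.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per node plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    """Convert output scores into class probabilities."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, hidden_w, hidden_b, out_w, out_b):
    hidden = sigmoid(dense(features, hidden_w, hidden_b))   # hidden layer
    return softmax(dense(hidden, out_w, out_b))             # output layer

# Hypothetical network: 2 inputs, 3 hidden nodes, 2 output classes.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[0.7, -0.5, 0.2], [-0.4, 0.6, 0.1]]
out_b = [0.0, 0.0]

scores = classify([0.9, 0.1], hidden_w, hidden_b, out_w, out_b)
print(scores)
```

For phoneme classification, the input would be a feature vector for a frame of speech and each output node would correspond to one phoneme.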

Auditory Models

• The aim of auditory models is to allow a speech recognition system to screen all noise from the signal and concentrate on the central speech pattern in a similar way to the human brain.

• Auditory modelling offers the promise of being able to develop robust Speech Recognition systems that are capable of working in difficult environments.

• Currently, it is purely an experimental technology.

Performance of Speech Recognition Systems

• Performance of speech recognition systems is typically described in terms of word error rate, which counts three types of error:

• Deletion: the loss of a word within the original speech. The system outputs "A E I U" while the input was "A E I O U".

• Substitution: the replacement of an element of the input, such as a word, with another. The system outputs "song" while the input was "long".

• Insertion: the system adds an element, such as a word, when no word was input. The system outputs "A E I O U" while the input was "A E I U".
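
Word error rate is conventionally computed as (S + D + I) / N, where S, D and I are the numbers of substitutions, deletions and insertions, and N is the number of words in the reference (what was actually said). The sketch below finds a minimum-cost alignment with a standard edit-distance table; the function name and return format are choices made for this example.

```python
def word_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions, word error rate)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    n, m = len(ref), len(hyp)
    # cell[i][j] = (total, subs, dels, ins) for ref[:i] against hyp[:j]
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)            # every reference word deleted
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, 0, j)            # every output word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, k)      # words match, no error
            else:
                diagonal = (t + 1, s + 1, d, k)
            t, s, d, k = cell[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = cell[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            # Tuples compare on the total first, so min picks the cheapest.
            cell[i][j] = min(diagonal, deletion, insertion)
    total, subs, dels, ins = cell[n][m]
    return subs, dels, ins, total / n

print(word_errors("A E I O U", "A E I U"))   # one deletion
print(word_errors("long", "song"))           # one substitution
print(word_errors("A E I U", "A E I O U"))   # one insertion
```

Because insertions are counted against the reference length, the rate can exceed 100% when the system outputs many spurious words.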

Speech Recognition as Assistive Technology

• Main use is as an alternative hands-free data entry mechanism

• Very effective

• Much faster than switch access

• Mainstream technology

• Used in many applications where hands are needed for other things e.g. mobile phone while driving, in surgical theatres

• People with speech impairment (dysarthric speech) have shown improved articulation after using SR systems, especially discrete systems

Reasons why SR may fail some people

• Crowded room - Cannot have everyone talking at once

• Too many errors because all noises, coughs, throat clearances etc are picked up

• Speech not good enough to use it

• Not enough training

• Cognitive overhead too much for some people

Some links

• The following are links to major speech recognition demos

Carnegie Mellon Speech Demos

• CMU Communicator

– Call: 1-877-CMU-PLAN (268-7526), also 268-5144, or x8-1084

– The information is accurate; you can use it for your own travel planning…

• CMU Universal Speech Interface (USI)

• CMU Movie Line

– Seems to be about apartments now…

– Call: (412) 268-1185

Telephone Demos

• Nuance http://www.nuance.com

– Banking: 1-650-847-7438

– Travel Planning: 1-650-847-7427

– Stock Quotes: 1-650-847-7423

• SpeechWorks http://www.speechworks.com/demos/demos.htm

– Banking: 1-888-729-3366

– Stock Trading: 1-800-786-2571

• COV_Principle_Components_2_variables.xls

• MIT Spoken Language Systems Laboratory http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html

– Travel Plans (Pegasus): 1-877-648-8255

– Weather (Jupiter): 1-888-573-8255

• IBM http://www-3.ibm.com/software/speech/

– Mutual Funds, Name Dialing: 1-877-VIA-VOICE
