Senior Design Final Report

Winter/Spring 2003

University of California, Riverside Department of Electrical Engineering

Voice Command Recognition: ROBOKART

Prepared by Adrian Abordo, Jon Liao

Technical Faculty Advisor: Yingbo Hua

Submitted: June 6th, 2003


Table of Contents

Introduction
Part 1: Preparation
  1.1 Summary of Speech Recognition Systems
  1.2 Project Overview & Roadmap
  1.3 Project Goals & Specifications
Part 2: Implementation
  2.1 System Overview & Block Diagrams
  2.2 Project Challenges
Part 3: Theory and Algorithms
  3.1 Guiding Philosophy
  3.2 Overview of the Problem
  3.3 Feature Extraction
  3.4 Pattern Detection
  3.5 Run-Time Operation
Part 4: User's Guide
  4.1 General Overview
  4.2 Template Creation
  4.3 Program Execution
Part 5: Post-Project Analysis
  5.1 Performance Evaluation
Part 6: Administrative
  Expenses
  Equipment List
  Glossary
  References
Appendix:
  Photographs
  Source Code


INTRODUCTION

This report is intended to provide a detailed walkthrough of our “Robokart”

Speech Recognition, Senior Design Project. Part 1 covers all necessary preparatory

information—overview of past and current approaches in speech recognition technology,

our project summary and roadmap, project goals and initial design specifications. Part 2

covers implementation details: a system overview and block diagrams, the challenges

faced in creating each part of the project, and how they were solved. Part 3 provides a

description of particularly important recognition sub-systems and lays down the

theoretical and mathematical basis for the operation of our algorithms. Part 4 is a User’s

Guide which is meant to teach any interested individuals how to use the software

included in this report, how to set the various parameters, and how to prepare a speech

recognition session using our current system. Part 5 provides performance evaluation

information, as well as future improvements that need to be made. Part 6 provides

logistical information: project expenses, a parts list, glossary, and references. Our source

code is included in the Appendix at the end of the report.

Project Results at a Glance:

Primary Objective (Recognition system works): Operational, in testing

Secondary Objective (Robokart car runs): Incomplete, on hold

Tertiary Objective (System flexible & adaptable): Achieved, in testing


PART ONE

1.1 Summary of Speech Recognition Systems

Major interest in Automatic Speech Recognition (ASR) had its origins during the

early part of the Cold War, when the U.S. Government needed an automated system that

could analyze and translate intercepted Russian radio transmissions. When initial

government efforts failed to produce a reliable system, the department known today as

the Defense Advanced Research Projects Agency funded programs at top academic

institutions in the country to provide research information needed to make a system

viable, jumpstarting research and commercial efforts that continue to this day.1

In the 1950’s, Bell Labs created a recognition system capable of identifying the

spoken digits 0 through 9, while later systems were able to recognize vowels. During the

1960’s, users . . . were . . . required . . . to . . . pronounce . . . each . . . word . . . separately.

Eventually, continuous speech systems that accepted naturally spoken sentences became

available in the 70’s. Ironically, modern systems that utilize Hidden Markov Models

(HMM’s) work better with continuous sentences than with discrete words.

Today speech recognition systems are commonplace and find applications in

automatic dictation, simple instrument and computer control, personal identification, and

toys. IBM’s ViaVoice software, voice-activated dialing in cellphones, OTG’s SecurNT

voice authentication system, and the Sony Aibo are products that utilize speech

recognition. However, anyone who has ever used an ASR system can attest that modern

systems work reasonably well, but can never recognize speech with the same accuracy

and robustness that human beings can. Even though speech recognition has made great


advances over several decades, the pinnacle of speech recognition ability—mimicry of

human performance—has yet to be reached.

Certain aspects of human speech and hearing make successful machine-

recognition extremely difficult. For starters, psycho-acoustical experiments showed that

human hearing has a nonlinear frequency and intensity response, and moreover, those

responses are coupled. Before delving into details, let’s first review how human hearing

works.

The human hearing system detects tiny variations in air pressure and converts

those changes into nerve signals which the brain converts into sensations of loudness and

pitch. Air pressure for voiced sounds is typically measured in units of micropascals (µPa) or microbars (µbar) on a decibel scale. The power per unit area imparted by the changes in air pressure is called intensity, and is very roughly (but not directly) related to

loudness. It is important to point out that loudness is a sensation and is not an actual

physical quantity, whereas intensity is. Therefore, the answer to the classic question, “If a

tree falls in a forest, and there is nobody there to hear it, does it make a sound?” would

be: No it does not make a sound, but it would probably make some high-intensity air

pressure changes. Generally speaking, higher intensities of fluctuation result in our ears

perceiving a sound that is louder, as long as certain conditions are met.

The second quality related to hearing is the frequency of vibrations. What we

perceive as pure tones are air-pressure changes which fluctuate at a single frequency.

Human voice, far from producing pure tones, imparts pressure changes at a multitude of

different frequencies, which the inner ear is designed to detect. The classic

theory used to explain our ability to recognize frequencies (pitch) says that tens of


thousands of tiny hair-cells arranged carefully in the cochlea act like small bandpass

filters that are individually tuned to a particular frequency. When a particular frequency

of vibration hits the ear, the hair cell(s) that are tuned to that frequency will fire nerve

impulses. The brain then uses these patterns of firing to estimate the frequency content of

the signal. The frequency content of a sound is primarily what the brain relies on to

recognize speech, and it is for the most part very good at detecting different

frequencies—up to a point.

Psychoacoustic experiments conducted in the 1930s by Fletcher and Munson showed

that perceived loudness is not simply a function of intensity, but also of the sound’s

frequency:

Figure 1:


Figure 1 shows that tones that have the same intensity but oscillate at different

frequencies can be perceived with different loudness. The bottom curve shows that a 10

dB sound oscillating at 1000 Hz sounds just as loud as a more powerful 76 dB sound at

100 Hz.

Not only does human hearing have a frequency-intensity tradeoff, but we also

perceive some frequencies better than others. The bandwidth of human hearing is from 20

Hz through 20 kHz, with peak sensitivity around 3-4 kHz. Sounds that fall outside the

peak range are gradually attenuated and are not perceived too well. (Interesting side note:

3-4 kHz is roughly the frequency range that babies cry at). Figure 2 shows an A-weighted approximation of the frequency response of human hearing. Peak perception

between 2-4 kHz and gradual attenuation outside that range can easily be seen.

Figure 2:

The last property worth mentioning is the frequency resolution of human ears.

Two tones of the same intensity, played at the same time but at two different frequencies

may be perceived as being only one tone if their frequencies are too close to each other, a

condition known as masking. Therefore, if two sounds are too close to each other in


frequency, one sound may mask the other, and the second sound may not be perceived at

all. In order for sounds to be successfully identified, they must be separated in frequency

by a minimum distance called the “critical bandwidth” (CBW). To make matters worse,

the critical bandwidth of hearing is not a constant. It depends on the center frequencies

that each part of the cochlea is tuned to detect. It has been found, for example, that the

group of hair cells tuned to a center frequency of 1000 Hz has a smaller bandwidth

(higher resolution) than hair cells tuned to 8000 Hz (which have poorer resolution).

The problems that automatic speech recognition systems face can finally be

explained. For starters, one of the main points of this section is that what we perceive as

sound is really not tied to any single physical property. A great deal of biological and

psychological conditioning takes place between the time the pressure waves first hit our

eardrums and before they are “perceived,” so much so that what we finally sense is quite

different from the physical event that generated it. The intensity and frequency

dependence of human hearing is merely the tip of the iceberg—our brains also utilize a

great deal of information based on timing, duration, learning, and contextual cues.

Modern ASR’s vs Humans:

Microphones, the entry point of practically all current automatic speech

recognition systems, are very poor substitutes for the human ear, as they are limited to

simply measuring the voltages generated by changing air pressures. The electrical signals

of a microphone are not to be confused with the nerve signals that our brains utilize to

recognize sound, mainly because sounds go through many stages of transduction in the

human ear before they are finally processed by the brain. Our inner ear detects frequencies

instantaneously and in parallel, whereas computers utilize the Short-Time Fourier


Transform to estimate the spectral content of a signal, something which can neither be

done instantaneously (a window of samples first needs to be taken) nor in parallel (one

window after another needs to be extracted before frequency changes can be estimated).

Table 1 summarizes the relationship between artificial and human systems, highlighting

the essential differences between the two.

Table 1: Human Beings vs. Artificial Systems

1. Humans detect frequencies instantaneously; artificial systems rely on the Fourier Transform.

2. Human frequency measurements are precise; for artificial systems, window length dictates precision.

3. Neural processing is done in parallel; computer processing is sequential.

4. The brain is very good at pattern detection; computers are not.

5. Humans can recognize words in a wide variety of conditions, environments, and interference; artificial recognizers are only as robust as the models they are built on.

6. Humans have other sources of information, such as sight or context, to understand what was said; ASRs can only rely on the signals generated by a microphone.

To counteract the failings inherent to computers, a variety of signal processing

techniques and algorithms have been employed to bring machines up to par with their

human counterparts. These algorithms are generally aimed at emulating some known

properties of human recognition. DSP “front-ends” are algorithms meant to simulate the


hearing process and try to account for such observed properties as critical bandwidth and

nonlinear frequency response, and more importantly are designed to measure features in

the speech signal (power, rate of change, spectrum, cepstral coefficients, etc.) that are

thought to play an important role in humans’ recognition of speech. A graphical overview

of a class of front-ends that utilize spectral analysis is shown in Figure 3 below.

Figure 3:

“Back-ends” typically work in conjunction with front-ends and employ higher

level algorithms such as pattern recognition, statistical prediction, and neural networks to

make sense of the features that have just been extracted by the front-end. Figure 4 shows

the processing sequence involved in the creation of a speech recognition template starting

with front-end systems and ending with a back-end model.


Figure 4:

With the exception of the Fourier Transform Filter Bank Model (which will be

covered in greater detail in Part 3), explanation of the other systems shown in Figures 3

and 4 is far beyond the scope of this report. For implementation details, interested readers

are strongly encouraged to read Joseph Picone’s paper on “Signal Modeling Techniques

in Speech Recognition.” 2

In the broadest sense, a speech recognition system can either be speaker-

dependent or speaker-independent. The early systems were speaker-dependent. They

were tailored to recognize the voice of only a single person, or at most a handful of people, mainly because those speakers' voice templates were the only training the system had from which to perform recognition tasks. Security and authentication systems that use

a person’s voice to perform personal identification are speaker-dependent systems. The

current trend in speech recognition is to have systems that are speaker-independent.

Systems that are speaker independent are designed to recognize words from a wide

variety of speakers.

Recognition systems can also operate at a variety of levels. Small-vocabulary

systems can operate with a vocabulary of roughly 100 words and can be used for

individual letter and digit dictation. Medium-vocabulary systems are capable of

recognizing around 1000 words, and large-vocabulary systems are meant to recognize


more than 5000 words. Additionally, systems can also be tailored to work at the sentence,

word, syllable, or allophone level. The sentence level, as the name implies, is a collection

of words that are grouped together into a single unit of meaning, such as, “Open the

garage door.” Word-level systems attempt to recognize spoken input through each

individual word, such as “Open” “the” “garage” “door.” Syllable-level systems operate at

a lower level and are designed to recognize syllables: “O” “pen” “the” “ga” “rage”

“door.” Finally, allophone-level systems work at even smaller individual units of speech:

O-p-e-n th-e g-a-r-a-g-e d-oo-r (in practice, though, triphones, rather than individual

allophones, are used. See below).

If a system operates at a lower level of recognition, it can theoretically possess a

larger vocabulary using a smaller set of templates (which is desirable for flexibility and

memory reasons). For example, there are an infinite number of sentences in the English

language, but those sentences utilize roughly 500,000 words, which in turn are built upon

approximately 1,000 syllables, which are comprised of about 47 phonemes. If a

recognition system were designed to recognize 47 phonemes (and can recognize them

accurately!), it can theoretically recognize all the words in the English language if it were

also programmed with information on how to chain those individual units together. In

practice, however, ASR accuracy decreases at lower levels of recognition because it has

fewer and shorter samples from which to make a decision. Longer units of meaning (like

sentences) are easy to differentiate from each other because there are more phonetic

features to use for comparison but are necessarily limited in the number of units they can

recognize. Whereas smaller units can be combined to form some very large vocabularies,

they are also harder to differentiate. A computer, for example, might have a difficult time


distinguishing between the vowel sound in “said” versus the vowel sound in “head.”

Therefore, practical voice dictation software utilizes triphones (sets of three allophones

chained together) as a compromise between large vocabulary and accuracy.

The most popular large-vocabulary speech recognition systems today use Hidden

Markov Models that utilize the statistical probabilities of a sound showing up in a

particular word, as well as the probabilities of that sound transitioning into different

allophones. They have been quite successful but require a large database of words spoken

by a wide variety of people from which statistical and probabilistic information is

extracted.

1.2 Project Overview & Roadmap

Our senior design project focused on building a small-vocabulary (24-word) voice

command recognition system to be used in directing the movements of a small “robokart”

in real time. In choosing this project, we were faced with the task of creating a robust and

accurate system that is intelligent enough to recognize spoken word-level commands and

operate with reasonable speed. The commands we used were as follows:

1) Robokart          2) Rotate            3) Clockwise         4) Counterclockwise
5) Proceed           6) Forward           7) Backward          8) Stop
9) Turn              10) Left             11) Right            12) Speed
13) Up               14) Slow             15) Down             16) Dance


17) Charge           18) Retreat          19) Good             20) Bad
21) Go               22) To               23) Sleep            24) Wake

To demonstrate our results, we designed a system to physically display the

obtained solutions via a remote-controlled car. Thus, the project consisted of two main

parts: the software component built to carry out the actual speech recognition and the

hardware component (Robokart) which had been planned to carry out the spoken

commands.

Using a microphone attached to a personal computer, we obtained voice samples

for template creation and training. We processed the training input using algorithms

written in MATLAB 6.5 and automated the acquisition and analysis of data in real-time

using Matlab’s Data Acquisition Toolbox v2.2. We used mean zero-crossings, mean

power, and a 24-band Bark-scale filter bank to build the feature space from which to perform

recognition.

The analog speech input, after being processed by our speech recognition

program, was planned to produce a binary 8-bit output to be transmitted through the

serial port of a personal computer into the Motorola MC68HC11 microcontroller, which is embedded on the CME119-EVBU evaluation board. The binary commands from Matlab would be output through the PC's serial port interface and sent to the evaluation board

via the 68HC11’s asynchronous serial communication interface (SCI). The 68HC11 in


turn was connected to the transmitter of a remote-controlled car, from which it would

have directed the car’s movements.

Our project was divided into the following stages:

1.) Research speech recognition technology and methods.

2.) Acquire parts needed to build initial voice training set (PC microphone, voice

recording program).

3.) Compile the training set via Matlab and incorporate the sounds into a library

which will be used as the starting point for template creation.

4.) Acquire the toolboxes needed to perform signal processing and real-time

acquisition (Signal Processing Toolbox and Data Acquisition Toolbox).

5.) Extract features from the sound library and use them to build the reference

template on which recognition will be based.

6.) Create the actual software engine which will work in real-time, process new

inputs, compare them to the template, and output commands through the serial

port.

7.) Acquire and modify the circuitry of a toy remote-controlled car.

8.) Interface the 68HC11 to the car’s remote control transmitter and write programs

needed for the serial communication interface.

9.) Test the speech recognition engine.

10.) Test the microcontroller-to-transmitter interface.

11.) Test the microcontroller-to-serial port interface.

12.) Integrate the speech recognition engine with 68HC11 microcontroller.

13.) Test the final product.


1.3 Project Goals & Specifications

We wanted to develop a speech recognition system that was flexible, but not

necessarily speaker-independent (for speaker-independence is a very hard thing to

achieve). Since the speech recognition system is geared towards the control of an

instrument, we placed particular importance on accuracy and robustness, envisioning that

this system could one day be incorporated into the hands-free control of certain devices

and instruments (microwave oven, car dashboard, non-critical airplane flight controls).

On a personal level, we wanted a system that relied on our own ingenuity and originality.

We therefore used current speech recognition approaches as a starting point for building a

knowledge base, but we tried not to copy approaches that are already known to work. In

short, we wanted to come up with a speech recognition system that was, for the most part,

original and new (at least to us).

Minimum performance specifications are listed below. Highest priorities are listed

first:

1.) Recognition catch rate (RCR) of at least 50 percent (i.e. the spoken word should

appear in the “Recognized Word Buffer” for at least half the duration that it was

spoken).

2.) Ability of the recognition system to operate without physical prompting (control

should be achieved through voice alone, without requiring the user to press keys,

signal an intent to speak, or to issue a record command by pressing some button.

System, once started, must be completely hands-free).

3.) Ability to perform word-recognition within 1 second of when the person has

finished speaking (near real-time performance).


4.) Ability for the engine to output serial-port commands as soon as a recognized

word has been detected (again, within 1 second).

5.) Ability to physically carry out those commands quickly (this relies on the

microcontroller’s ability to receive and interpret the serial commands within 1

second).

6.) Total system lag from voice-input to Robokart response of no more than 3

seconds (it would be disastrous to say "Stop," only to have Robokart respond too slowly and plow into a wall).

7.) Be able to run the recognition engine for an indefinite length of time without it

running out of memory.

PART TWO

2.1 System Overview & Block Diagrams

A word about notation: some blocks will have a number written inside them. This

indicates that the block is composed of several sub-components which are not visible in

that particular diagram but which will be covered in more detail later on. For example,

“Block 1.4” indicates that a particular block is the fourth sub-component of “Block 1.”

The number notation allows you to track the hierarchical relationships between systems.


Figure 5:

0: The sound card was set to a sampling rate of 22.05 kHz and to single-channel

(mono) acquisition of sound.

1: All the speech processing takes place in Matlab, using functions written from

scratch. The bulk of the senior design project has been spent creating, coding, and

testing algorithms used in this component.

2: The 68HC11 is the middleman between the software and the actual Robokart car.

Figure 6:

1.1: The Data Acquisition Engine (DAQ) is a Matlab toolbox that captures data in real


time. In this case, the engine is instructed to pull data from the computer’s sound

card.

1.2: The feature extractor contains all the functions needed to extract relevant features

from the user’s voice. The output is a feature vector which is fed to the

comparator.

1.3: The comparator is the heart of the speech recognition engine. It finds the most

likely match between the user’s inputted features and those features that are stored

in memory. Using relationships that have been trained into the reference template,

the Comparator determines the most likely word that has been inputted.

1.4: The reference template is the brain of the speech recognition engine. It contains

three sub-templates which are crucial for determining what parts of a word have

been said: the word-association matrix, the reference feature vectors, and the

reference time positions.

1.5: Serial port output is an 8-bit binary signal that is sent to the 68HC11 and contains

control instructions.


Figure 7:

1.1.1: Using information stored in “trigger condition,” the trigger detector constantly

scans the digital input stream, looking for an event that would initiate data-

logging.

1.1.2: In order to differentiate background noise from a user attempting to speak, the

trigger condition is set to a band of amplitudes within which the intrinsic noise of

the system is expected to stay. If the microphone detects a voltage level that

leaves this pre-defined band, the trigger detector is instructed to begin logging

samples.

1.1.3: The data-logging system is pre-set to record 0.5 seconds worth of samples every

time the trigger is activated. Too short a value means data might be accidentally

missed. Too long a value slows down the recognition system’s response time.
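To make the trigger logic concrete, the following is a minimal Matlab sketch of the idea behind Blocks 1.1.1-1.1.3, written as an offline check on a vector of recorded samples rather than as our actual Data Acquisition Engine configuration. The noise-band value, the placeholder signal, and the variable names are illustrative assumptions.

    fs        = 22050;                          % sound card sampling rate (Hz)
    noiseBand = 0.05;                           % assumed intrinsic-noise amplitude band (volts)
    logLen    = round(0.5 * fs);                % always log 0.5 s worth of samples per trigger

    x = randn(2*fs, 1) * 0.01;                  % placeholder: 2 s of low-level noise...
    x(fs:fs+2000) = x(fs:fs+2000) + 0.5;        % ...with a burst that leaves the noise band

    idx = find(abs(x) > noiseBand);             % samples that leave the pre-defined band (Block 1.1.2)
    if ~isempty(idx)
        trigIdx = idx(1);                       % trigger event detected (Block 1.1.1)
        lastIdx = min(trigIdx + logLen - 1, length(x));
        seg     = x(trigIdx:lastIdx);           % 0.5 s of logged samples (Block 1.1.3)
    end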


Figure 8:

1.2.1: As an additional safeguard against sudden bursts of noise, the power discriminator

ensures that the signal coming in from the Data Acquisition Engine is of sufficient

power to preclude it from being a random hardware event. Signals whose power

exceeds one and a half times the baseline magnitude are assumed to originate from

the microphone and not from intrinsic noise.

1.2.2: The Data Acquisition Engine always records 0.5 seconds worth of sound every

time it is triggered. Because the user may not be speaking during that entire

period of time, the pure sound extractor passes on portions of the data where it

believes somebody is speaking and deletes those portions where there is silence.

1.2.3: Sound samples that have been extracted and purified are temporarily kept in a

buffer for subsequent processing. However, since the system is very fast, the


partitioned sound buffer is mainly there to allow the user to hear the quality of the

sound that has just been recorded for troubleshooting purposes.

1.2.4: Feature intervals are chunks of samples over which features are measured and

compared against. The feature-length parser divides the sound vector into portions

of desired length (often with a sliding window) and passes the portions to the

feature extraction functions.

1.2.5: The mean zero-crossings counter counts the average number of times that the

microphone’s amplitude signal crosses the zero axis over a certain interval. The

number of zero-crossings is a time-domain feature that we believe might be useful

in characterizing certain sounds.

1.2.6: The average signal power over an interval is another time-domain feature that we

employ to help determine what word has been said.

1.2.7: The filter bank is based on a Holmes filter bank architecture and consists of 24

triangular bandpass filters arranged nonlinearly on the frequency axis. Each filter

measures the spectral power within its range and outputs the log10 of that power

amplitude. The filter bank outputs a 24-element power spectrum vector which

plays a very big role in characterizing speech for our system.

1.2.8: The feature vector is a 26-element vector that combines the time-domain and

frequency-domain feature values extracted from each segment of sound. It is sent

to the comparator for analysis.
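As an illustration of Blocks 1.2.1 and 1.2.2, the sketch below applies the 1.5x-baseline power test to short chunks of a logged segment and keeps only the chunks judged to contain speech. The chunk length, baseline value, and variable names are assumptions made for the example; the actual functions appear in the Appendix source code.

    fs       = 22050;
    seg      = randn(round(0.5*fs), 1) * 0.02;  % placeholder for a 0.5 s logged segment
    frameLen = round(0.032 * fs);               % examine the segment in 32 ms chunks
    baseline = 1e-4;                            % assumed baseline (intrinsic-noise) power

    nFrames = floor(length(seg) / frameLen);
    speech  = [];
    for k = 1:nFrames
        chunk = seg((k-1)*frameLen + 1 : k*frameLen);
        if mean(chunk.^2) > 1.5 * baseline      % power discriminator (Block 1.2.1)
            speech = [speech; chunk(:)];        % pure-sound extractor keeps this chunk (Block 1.2.2)
        end
    end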


Figure 9:

1.3.1: The comparator treats the output from the feature extractor as a 26-dimensional

position vector. This component calculates the distance between the current input

position and all positions stored in the reference template.

1.3.2: The resulting distance vectors are sorted from low to high, allowing the most

promising template candidates (those whose positions are the closest to the

current input) to be placed near the top of the stack.

1.3.3: The desired number of possible word candidates is extracted from the top of the

distance stack.

1.3.4: Using a lookup table, the word groups whose features match the current best candidates are stored in memory.

1.3.5: The relative position within each word is estimated by a position lookup-table

which matches taught feature vectors with the relative positions within the word

that they are known to occur at.


1.3.6: The identity and position of each recognized word fragment is stored in a large

matrix where trends can be detected over time.

1.3.7: The chain detector looks for those word fragments that occur repeatedly and

consistently and checks to make sure that each candidate word position follows

the forward flow of time. It scores each word fragment according to which ones

have the best consistency and causality.

1.3.8: Based on the chain detector’s recommendation, the candidate word is chosen as

soon as meaningful input from the user is deemed complete.
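A minimal Matlab sketch of the distance calculation, sorting, and lookups performed in Blocks 1.3.1-1.3.5 is shown below. The function and variable names are illustrative (they are not the names used in our source code), and Euclidean distance is assumed for the 26-dimensional comparison.

    function [words, positions, dists] = compareToTemplate(featVec, refTemplate, refWords, refPositions, K)
    % featVec      : 1 x 26 feature vector from the feature extractor
    % refTemplate  : M x 26 matrix of stored reference feature vectors
    % refWords     : M x 1 word index associated with each reference vector
    % refPositions : M x 1 relative time position within each word
    % K            : number of candidates kept from the top of the distance stack

    diffs = refTemplate - repmat(featVec, size(refTemplate, 1), 1);  % offsets in 26-D feature space (Block 1.3.1)
    d     = sqrt(sum(diffs.^2, 2));                                  % assumed Euclidean distance to every stored vector
    [dSorted, order] = sort(d);                                      % closest candidates rise to the top (Block 1.3.2)

    dists     = dSorted(1:K);                   % top of the distance stack (Block 1.3.3)
    words     = refWords(order(1:K));           % word-group lookup (Block 1.3.4)
    positions = refPositions(order(1:K));       % relative-position lookup (Block 1.3.5)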

Figure 10:

2.1: A short program burned into EEPROM that sets the necessary Serial

Communication Interface control registers, checks the SCI status registers through

polling, and stores 8-bit data that has been received.

2.2: Based on the data received by the SCI, it outputs the programmed movement

signals to the radio-control transmitter connected to Port D.
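For reference, an 8-bit command could be pushed out COM1 from Matlab with something like the sketch below (Block 1.5 feeding the SCI program of Block 2.1). The baud rate and the command-byte mapping are assumptions, and, as noted in Section 2.2, we did not get this link fully working.

    s = serial('COM1', 'BaudRate', 9600);       % legacy Matlab serial-port object (baud rate assumed)
    fopen(s);
    fwrite(s, uint8(3));                        % one 8-bit command byte; the code-to-command mapping is hypothetical
    fclose(s);
    delete(s);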

2.2 Project Challenges

The feature-extractor went through a great many changes over a long period of

time. This was due mainly to the fact that we were not sure what features were critical in

recognizing human speech. Initially, we relied entirely on 8 time-domain features: mean


zero-crossings, mean sound power, mean diffcode (binary backward-looking difference),

mean variability (average positive and negative peaks over an interval), zero-crossings

delta (multiplicative change from one interval to the next), power delta, diffcode delta,

and variability delta. Subsequent testing showed that most of these time-domain features

were unsuitable for our application. All signals had widely differing power changes,

making their power deltas an unreliable and unpredictable measure. Mean variability and

mean diffcode, due to their reliance on binary backward difference, were much too easily

corrupted by noise, and by extension, so too were their deltas. In the end, we decided to

keep average zero-crossings and average power since our tests showed an acceptable, but

small, correlation between their values and words that were spoken.

Selection of proper frequency-domain features also proved troublesome. We

initially relied on a “top N approach,” where the power-spectral-density (PSD) of each

input interval was measured, the top N strongest peaks were extracted (we used N = 30 at

this point), and those top N peaks’ values and locations were compared with peaks stored

in the template. We discovered that finding the top N peaks slowed down our programs

considerably, and the computations needed to evaluate their closeness to the template in

magnitude and in frequency almost brought the recognition engine to a crawl at run-time.

We therefore endeavored to find a feature set that would give us a set of single numbers,

rather than a set of magnitude-frequency pairs, which became very unwieldy over time.

It was during further research that we stumbled across early attempts at emulating

the frequency response of the human cochlea. Psychoacoustic researchers discovered that

the ear had a minimum resolution below which it could not differentiate sounds, and that

some sounds were perceived as being louder than others based on their frequency.


Keeping these characteristics in mind, early researchers started with analog bandpass

circuits tuned to certain frequencies and used these “filter banks” to decompose the

spectral content of an input sound in a way that, it was assumed, roughly matched how the human ear worked. We also discovered that the outputs of these filter banks (a vector of power amplitudes) were used by neural network researchers as the inputs to their systems.

Since we were interested in solving the speech recognition riddle through an artificial

intelligence approach, we adopted the most common filter bank model—the Holmes filter

bank—and made it part of the spectral feature extraction process.

Subsequent challenges involved optimizing our code to accept massive amounts

of data, compare them to an even more monstrous set of data, perform complex iterative

calculations on huge matrices, sort those matrices, extract the best candidates from each

matrix, implement some meaningful scoring and weighting scheme, and somehow do all of it quickly enough to run in near real-time. Our early data structures were

very inefficient and the results were disastrous. At one point, it took our 2.53 GHz

machine almost 5 minutes to analyze a 0.04-second sound fragment. We trimmed the

number of features to use (which was another reason why 8 initial time-domain features

were cut down to a mere 2) and attempted to make every portion of our code as

vectorized as possible, since Matlab works best with matrix operations rather than loops. It took

almost two weeks, but we got our code running at a speed that resembled real-time.

An even bigger challenge, one that almost derailed the entire project, involved the

actual word-recognition process itself. Our original template data was unreliable, so we

were extracting statistical information that was very misleading. Our early attempts at

recognition produced almost random results because the statistical occurrences of each


word fragment were being misrepresented. We corrected the problem by writing a pair of

comprehensive “intra-entity” and “cross-entity” association functions which found the

similarities within and across words. The “association matrices” are the outputs of those

functions and form the core of our system.

When configuring ROBOKART for asynchronous SCI communication, we came across some barriers. Deciding which electronic RC car to use was easy: the Tyco Fast Traxx was the more versatile option and the easier to modify. The real problem arose when linking the two software systems (MATLAB and IAR Embedded Software) together, namely getting the COM port to accept signals from MATLAB. The COM1 port was perpetually busy whenever we loaded our MCU C code. The hardware setup was configured to take output responses from the MCU on Port D, but we never had a chance to obtain the input from MATLAB. Therefore, we needed more time to get the communication between the devices working properly.

PART THREE

3.1 Guiding Philosophy

Automatic speech recognition is essentially just a very complicated template-

matching problem. On one end, you have some set of predefined values that are

associated with a particular word, and on the other end, you have an input that can vary in

a multitude of ways. Matching the two sets of values has been, and still is, a daunting task, especially when there are many other competing templates involved in the

matching process. We have discovered in the past two quarters that speech recognition—


attempting to match an ever-changing signal to an essentially stationary one—is an

exercise in controlling chaos.

Despite the enormous challenges, there were a set of guiding principles which

were helpful to us and formed the basis for our design approach:

1.) When faced with an ambiguity—when a clear choice for a word has not

been recognized—rather than assume that nothing has been said, the

human brain will almost always fall back to the next closest match.

2.) The key to successful speech recognition does not depend on finding the

magical set of features that absolutely differentiates a word from other

sounds, but rather lies in how those words are perceived by the listener.

This is a bit of a philosophical abstraction, but we strongly believe that what

makes the word “lollipop” stand apart from the word “Volkswagen” is not

some critical feature quantity (like zero-crossings or cepstral coefficients) that

can objectively be quantified. Instead, we treat them as different words

because we have been taught to do so by experience. Basically, words are

different not because of some quantitative property of the signal, but because

we “perceive” them differently.

3.) Therefore, between front-end digital signal processing and back-end

pattern detection, as long as the front-end is stable, the bulk of our work

should be focused on creating an intelligent system that can detect the

patterns of similarities and differences between stored words and use that

pattern to classify future inputs. In other words, we decided that the key to

building a working recognition system did not rest in superior signal


processing, but rather depended on creating an artificially intelligent system

that can teach itself to find similarities and differences between words.

4.) In the end, what we wanted is a system that can truly recognize a word, a

syllable, or a sound without resorting to arbitrarily programmed models,

statistics, or clever signal-processing tricks. We wanted a system that can

learn.

5.) We wanted to do this not by imitating tried and true methods but by

coming up with a (hopefully) new approach.

3.2 Overview of the Problem:

Any sound measured by a microphone is simply a sequence of numbers (in our

case, a sequence of voltages). The reference template is also a sequence of numbers.

Speech recognition is “simply” the process by which one sequence of numbers is

compared to another sequence of numbers in an attempt to find the best fit. The main

difficulty lies in the fact that the template is usually a stationary sequence, whereas the

input sequence of spoken words can change in a variety of unpredictable ways. In

particular, an utterance can differ from a stored template in 3 ways:

1.) Delta-M (∆M) Error: interference, noise, and other magnitude distortions corrupt the input signal and can make it sound different from the reference signal.

2.) Delta-T (∆T) Error: unexpected pauses, unusually fast or slow speaking styles, and other changes in speed can randomly shift the position of the input relative to the template.

3.) Combination ∆M and ∆T Error randomly distorts a signal’s values and also

shifts its position randomly in time. Real speech falls under this category because


people never say the same word exactly the same way twice, in addition to

whatever background noise might be present in the environment. People can also

pause unexpectedly, or say a word faster or shorter than expected, or stutter, or

jump-around, or even be uncooperative. Over a sufficiently long interval, an input

signal can vary from the “ideal” in a multitude of ways.

Two recordings of the word "counterclockwise" spoken by the same person are shown

in Figure 11:

Figure 11

Since both instances of “counterclockwise” are spoken by the same person, these

plots are generally similar. However, there are also differences. Note that

“Counterclockwise” 1 generally has less energy. Note, too, that “Counterclockwise” 2 is


slightly faster (left-shifted in time) compared to “Counterclockwise” 1. These recordings

were made by the same person on the same day.

Figure 12 below shows the same word “counterclockwise” spoken by the same

person but on different days. “Counterclockwise” 3 was recorded approximately 4

months before “Counterclockwise” 1 was recorded.

Figure 12

Notice that the differences between the two signals are magnified. While the

difference in energies is slight, “Counterclockwise” 3 is stretched compared to

“Counterclockwise” 1. They start at roughly the same time, but there is a pause in the

middle that causes “Counterclockwise” 3 to be skewed in time.


Such differences are called intra-speaker differences. The same person can utter

the same word in slightly different ways each time. The person can pause, speak faster,

speak slower, or emphasize certain syllables. A recognition system needs to be robust

enough to understand that these different pronunciations of the same word are not

entirely different words, but simply different examples.

Matching words across different speakers is an even more challenging task.

Whereas differences between words spoken by the same person are relatively small,

inter-speaker differences are huge. Figure 13 shows “counterclockwise” spoken by two

different people:

Figure 13

As can be seen in the figure above, the difference in magnitude (∆M) and the


difference in time (∆T) is large across different speakers, even if they are saying the same

word. Not only does “Counterclockwise” 4 have different energies (on a per-syllable

basis, not across the entire word) than “Counterclockwise” 1, it is substantially stretched

in time.

At this point you might wonder if the differences between words are truly as

substantial as we claim. A quick visual review of "Counterclockwise" 1-4 shows that they each have roughly four lumps (syllables) which are separated by periods of low energy.

Moreover, except for small variations, these lumps have roughly the same size, and with

a little bit of stretching here and a little bit of imagination there, you may visualize them

as occurring at roughly the same times. So what’s the big deal?

The problem is that computers do not have the faculties that humans have. They

operate at the numerical level. They are also not capable of asking “what if”—what if I

stretch this portion here? or stretch this portion there? or superimpose these two lumps?

All computers have to work with is the data that is given to them, as it is given, without

recourse to imagination. They crunch numbers, and depending on the numbers that come

out, they classify their inputs accordingly. To teach a computer “what if” and try to teach

it all the fuzzy and indeterminate situations where it can adjust the data would be quite an

accomplishment in artificial intelligence. “Counterclockwise” might seem easy to

recognize because it is the only word in our command set that has four syllables, but what

if we decide to include another four-syllable word in the future? Numerically,

“Counterclockwise” 1-4 look like:


Table 2

Counterclockwise    Duration (s)    Signal Energy    Signal Power
1                   1.4048          9.0664           2.9270e-004
2                   1.5905          24.4054          6.9591e-004
3                   1.4623          12.0885          3.7492e-004
4                   1.7524          9.6690           2.5023e-004
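The table entries are consistent with treating signal energy as the sum of squared samples and signal power as that energy divided by the number of samples (for example, 9.0664 / (1.4048 x 22050) is approximately 2.9270e-4). A small Matlab sketch of that calculation, using a placeholder signal, is shown below.

    fs       = 22050;
    x        = randn(round(1.4048 * fs), 1) * 0.017;  % placeholder signal about 1.4 s long
    duration = length(x) / fs;                        % Duration (s)
    energy   = sum(x.^2);                             % Signal Energy: sum of squared samples
    power    = energy / length(x);                    % Signal Power: energy per sample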

By looking at the table and these particular features, you might notice that there

are some big variations over the same word. On the other hand, you might be able to find

a pattern in the similarities among the four examples of “Counterclockwise.” If so, you

are using a human trait—the innate ability to find similarities between examples. With

some clever programming, computers might be able to do the same thing, but the task

becomes much more difficult if the number of possible words is not limited to 1 but 24.

The computer then has to ask itself, How long is long enough? When should I begin

measuring a sound? When should I stop measuring a sound? How do I know that this is

really the beginning of the word? How do I know that this is really the end of the word,

and the speaker simply didn’t pause? Statistically, if this half of the word is similar to this

other word, but the later half is closer to this one, how do I decide which example is

which? How do I even know whether I am measuring these words correctly and at the

right times? How do I know if the input is gibberish?


The problem of signal distortion is less severe in the frequency domain but is not

the cure-all needed for recognition. Figure 14 shows the power spectral magnitude of

“Counterclockwise” 1 and 4 as well as “Robokart.”

Figure 14

“Counterclockwise” 1 and 4 look fairly similar; they both dip below -60 dB at

roughly the 5700 Hz mark, whereas “Robokart” dips as early as 3900 Hz. From

experience, when we examined the spectrum of all 24 words, there were many instances

when the distinction was blurred. We saw that there were gray areas where the spectrum

of different examples of the same word will be close, but there will sometimes be another

word whose example is even closer, and the problem turns into trying to determine which

example should belong to which word. Setting thresholds by saying, “If this input’s

spectrum dips below so-and-so at this point in frequency at this point in time means that


it belongs to this group, otherwise it belongs to that group” turned out to be futile. Time

and time again, we have found exceptions to the artificial boundaries that we have

attempted to set, almost as if the words themselves were taunting us. And once we made

an exception for one case, it became hard to determine under what conditions exceptions

were justified. Eventually, once all examples of all words were considered, the distinction

between them became muddled.

To illustrate this “feature ambiguity” which applies to both the frequency and

time domains, Figure 15 shows our attempt at using a time-domain feature called

“variability” to distinguish between different syllables:

Figure 15

Overlapping segments are undesirable because they are syllables for which the

value of a particular feature is similar. In theory, different syllables ought to have


different values, yet this graph shows that for the feature known as “variability,” many

syllables are indistinguishable because they share a similar range of values.

The astute reader would point out that we should simply add more features. After

all, in an environment encompassing 100 different features (dimensions), syllables or

words overlapping in one feature cannot (or should not!) overlap when all other

dimensions are considered. If a chicken walks like a duck, and “walking” is the only

feature that you are measuring, you might come to the conclusion that a chicken is the

exact same thing as a duck. But if you consider more features and test to see if a chicken

walks like a duck and looks like a duck and sounds like a duck, then you would realize

that ducks and chickens are different because they neither look nor sound the same

(though they might walk the same).

Realizing this, our early attempts utilized as many as 38 features simultaneously:

8 time-domain features, and 30 frequency domain features (highest spectral peaks). Using

these features, we attempted to build statistical models of occurrences, hoping to find the

one pattern that distinguished, say, the word “stop” from the word “up.” The process was

very time-consuming and we realized that there were invariably unusual exceptions

(extreme feature values) that skewed the means and standard deviations of our different

“patterns.” It was difficult to decide whether to keep these deviant values or whether to

throw them out, and trying to decide the conditions they were useful for was extremely

difficult. For 24 words, trying to calculate patterns and set thresholds across different

features of a signal bogged us down in work. During this endeavor we realized that our

models would only work for the particular set of words that we were working with. What

if the user wanted to work with a whole different set of words? What if the user wanted to


program his or her own unique voice into the system? Would the user have to wade

through the sea of statistical data and manually establish the relationships between new

words? How could our current system even begin to handle words that are completely

new, words for which we have no statistical models?

In the end, we decided that for the sake of flexibility (and our own sanity), we

would somehow force the computer to make those decisions for itself. And so, after

almost four months of trying different signal representations in the time and frequency

domains, after trying to find statistical patterns buried in the signal, and after extracting

all sorts of different features that hopefully would make one word stand out from another,

we concluded that words are not uniquely identifiable through some all-encompassing,

objective property of the signal. Rather, we decided that humans recognize words

because the experience of hearing a word is somehow associated with other experiences

already stored in memory, and an association is formed among similar “experiences”

which allows us to relate new inputs to old.

Like human beings, a speech recognition system would have to be taught that

certain examples belong to the same word. It would also have to determine exactly what

makes such examples similar and what makes them different from other words. Instead of

manually organizing the pattern ourselves and trying to set up a myriad of laws and rules,

the computer would have to form “associations” for itself.

Henceforth, we decided not to torture ourselves by trying to find the perfect

feature extraction algorithm and instead focused on building an intelligent, automated

pattern-recognizer.


3.3 Feature Extraction

Finalized Extraction Procedure:

All features are extracted across a constant interval. We selected a feature length

of 32 ms (706 samples at 22.05 kHz sampling rate) over which to extract each feature.

Succeeding feature chunks are taken by using a sliding, half-overlap window (353 old

samples + 353 new samples). For time-domain features, the samples were unity-

weighted. For frequency-domain features, the samples were weighted with a Hamming

window.

Commonly used measuring intervals are from 20-40 ms. In the frequency domain,

shorter intervals give you good time resolution but poorer frequency resolution, and

longer intervals give you poorer time resolution but better frequency resolution. In the

time domain, 20-40 ms is also the range over which individual components of the speech

signal (the “r” sound in “robokart,” for example) remain essentially the same, allowing us

to measure relevant features across specific “sounds.” 32 ms was selected as a

compromise between having a feature-length short enough to resolve individual sound

details, but long enough to process the signal quickly.

Our final system consists of two time-domain features (zero-crossings, mean

power) coupled with 24 spectral power outputs of a digital filter bank, for a total feature

vector of 26 elements. Zero-crossings is the average number of times a signal crosses the

zero-axis over an interval. Mean power for a signal g(t) over an interval N is simply

given by:

p = (1/N) · Σ g(t)²,  with the sum taken over the N samples in the interval (t = 0, 1, ..., N − 1)          (Equation 1)
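A minimal Matlab sketch of these two time-domain features, measured over sliding, half-overlapping 32 ms frames, is given below; the placeholder signal and variable names are illustrative, not taken from our source code.

    fs  = 22050;
    N   = 706;                                  % 32 ms feature length (samples)
    hop = 353;                                  % half-overlap: 353 old + 353 new samples
    x   = randn(fs, 1) * 0.01;                  % placeholder signal (one second of noise)

    nFrames = floor((length(x) - N) / hop) + 1;
    feat    = zeros(nFrames, 2);
    for k = 1:nFrames
        g          = x((k-1)*hop + 1 : (k-1)*hop + N);   % unity-weighted time-domain frame
        zc         = sum(abs(diff(sign(g))) > 0);        % zero-axis crossings over the interval
        p          = sum(g.^2) / N;                      % mean power (Equation 1)
        feat(k, :) = [zc, p];
    end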


The time-domain features were very easy to implement, but choosing the right frequency-

domain features took a lot of research. We eventually chose one of the earliest systems

that tried to emulate the way human hearing works.

The Digital Filter Bank

Based in part on psychoacoustic measurements that seemed to demonstrate that

the human inner ear has a finite frequency resolution*, early speech researchers designed

an overlapping bank of bandpass filters to mimic the frequency response of the human

cochlea. The bandpass filters were tuned to different frequencies, and the passbands were

made similar to the observed bandwidths of the human ear. Of several types of filter

banks available, the Holmes Filter Bank was used for this project.

To simulate the limited frequency resolution of our ears, the filter bank consists of

bandpass filters whose center frequencies are arranged nonlinearly on the frequency axis.

The bandpass filters also have different bandwidths, which are thought to account for our

ears’ limited spectral resolution at high frequencies. To determine the placement of each

bandpass filter, the standard frequency axis is first warped onto a nonlinear scale, where

the whole-numbered integers in that scale determine where the filter centers are placed.

The two scales commonly used to achieve nonlinear frequency warping are the

Mel Scale and the Bark Scale. Both scales attempt to model experimental data where the

ear’s critical bandwidth at different input frequencies were measured. These scales

therefore approximate human ears’ limited ability to differentiate tones that are too close

in frequency.

* Subjects were tested on how well they could perceive pure tones that were played simultaneously. The limited frequency resolution of hearing is based only on simultaneous tones. In reality, the ear has complex mechanisms in place that allow people to separate tones that have even small changes in timing. But accounting for these timing mechanisms would have made the model too complex. Therefore, a simplified version is used here.


An equation for the Mel scale is given by:

m = 1125 log (0.0016f + 1) (Equation 2)

An equation for the Bark scale (Traunmüller's version, 1990) is given by:

B = 26.81f / (1960 + f) − 0.53          (Equation 3)

The Mel scale is in common use by engineers but is typically only used for sampling

frequencies at or below 10 kHz. A side-by-side comparison of Bark and Mel scale center

frequencies and bandwidths is shown in Table 3:


Table 3

In the end we selected the Bark scale because it covers a higher range of frequencies (up to roughly 27 kHz, easily accommodating the 11.025 kHz Nyquist frequency of our 22.05 kHz sampling rate), but more importantly because it seemed to fit the empirical data much better.

A plot of the frequency-to-bark-transformation is shown in Figure 16:


Figure 16

Before going into the details of the filter bank itself, it is worth mentioning that

people’s increased sensitivity to particular frequencies (see Figure 2) can be modeled by

a pre-emphasis filter which amplifies raw frequencies around the 4 kHz range and

attenuates them outside that range. One such pre-emphasis filter is given by:

E(w) = ((w^2 + 56.8*10^6) * w^4) / ((w^2 + 6.3*10^6) * (w^2 + 0.38*10^9) * (w^6 + 9.58*10^26)), where w = 2*pi*f (Equation 4, up to a constant gain factor; this is the weighting computed by the preemphasis( ) function listed in the appendix)

A pre-emphasis filter greatly attenuates frequencies below 100 Hz, amplifies

frequencies around 3-4 kHz, and gradually attenuates frequencies above 6 kHz. It is

meant to approximate the different sensitivity of human ears to different frequencies.


Early models of our feature extractor utilized a pre-emphasis filter in order to

more closely approximate human hearing, but we had difficulty determining the

appropriate gain for the transfer function. Should we boost the 4 kHz region by a factor

of 10? A factor of 100? A factor of 300? By what constant factor should we attenuate

frequencies outside that range? Because the raw spectral magnitudes of our inputs tended

to change unpredictably, a wrong pre-emphasis gain could squelch the entire spectrum or

excessively amplify an already large 4 kHz response. Rather than spend a great deal of

time tuning the filter, in the end we decided not to use a pre-emphasis filter and

instead fed the raw, unweighted input spectrum directly to our filter bank, hoping that our

pattern-recognition back-end could sort it all out later.

Because the Bark scale goes up to a maximum value of 24, a filter bank based on

the Bark scale utilizes 24 bandpass filters which are centered around the published Bark

center frequencies and whose bandwidths are equivalent to the accepted Bark critical

bandwidths. Because these bandpass filters generally overlap, a triangular weighting

scheme is applied to each filter in order to give the center frequency the greatest weight.
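This step is implemented by the filter_bank( ) function in the appendix; its core is a short loop (Bark_Indices{k} holds the FFT-bin indices that fall inside the k-th critical band):

bank_powers = zeros(1, length(Bark_Indices));       % one output per critical band
for k = 1:length(Bark_Indices)
    chunk = spectrum(Bark_Indices{k});              % magnitude-spectrum bins inside band k
    chunk = chunk(:);                               % force column orientation
    w = triang(length(chunk));                      % triangular weighting, peaked at the band center
    bank_powers(k) = log10(mean((w .* chunk).^2));  % base-10 log power of the weighted band
end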

Figure 17 shows a plot of the actual filter bank that we utilized in this project:


Figure 17

To ensure that no spectral data is lost, the filters overlap by a large amount.

Resolution is intentionally decreased at higher frequencies (the filter bandwidths are

made progressively larger), and the filter centers are nonlinearly spaced. This represents

early attempts at mimicking some observed properties of human hearing.

Figure 18 shows an example of an input to the digital filter bank:


Figure 18

Figure 19 shows the output of the digital filter bank for this particular input

sequence:

Figure 19


The entire frequency-domain feature extraction process is shown in Figure 20:

Figure 20

1: A portion of the sound vector is captured and weighted by a sliding Hamming window.

2: If the captured portion is shorter than the pre-defined feature length (N), zero-pad it to

N samples.

3: Perform the N-point FFT. Discard the upper-half of the data, which is redundant.

4: The spectrum is fed to the filter bank. Each set of frequencies is triangularly weighted,

and the base-10 log power of the spectrum is calculated over each filter interval.

5: Individual power values are concatenated together to form a single 24-element feature vector (a condensed code sketch of these steps follows).
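Putting steps 1 through 5 together, a condensed sketch of the spectral front end looks like the following (frame is a hypothetical placeholder for one windowed portion of the sound vector, stored as a row vector; hamming( ) and the filter_bank( ) helper are the same ones used in the appendix):

N = 706;                                        % feature length: 32 ms at 22.05 kHz
ham = hamming(N)';                              % step 1: Hamming window for the sliding frame
if length(frame) < N                            % step 2: zero-pad short frames to N samples
    frame = [frame, zeros(1, N - length(frame))];
end
spectrum = abs(fft(frame .* ham, N));           % step 3: N-point FFT of the windowed frame
spectrum = spectrum(1:ceil(N/2)+1);             % step 3: keep only the non-redundant lower half
bank_powers = filter_bank(spectrum);            % steps 4-5: 24 triangularly weighted log band powers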


3.4 Pattern Detection:

At the end of the feature extraction stage, 2 time-domain features and 24 frequency-domain power values are concatenated to form a single 26-element feature vector. At every point in time (or rather, at every “feature interval”), a sound is decomposed into a vector which can be thought of as a position coordinate in a 26-dimensional feature space. Since 26 dimensions cannot be visualized directly, a 3-dimensional feature space is shown in Figure 21:

Figure 21


As can be seen in Table 4, the output of the feature extractor is simply a set of

values specifying a position within this feature space (only feature elements 1-10 are

included to allow the table to fit on the page):

Table 4

Rows correspond to sampling intervals (time), and columns correspond to

individual features (vector components). Thus, each horizontal slice represents a

particular 26-D position at a particular point in time.

Template Creation:

Before the recognition engine can recognize new voice inputs, it must first be able

to recognize the original data provided for its training. This process is called “template

creation” and only needs to be performed once for a particular training set. It is, however, the most time-consuming step, typically taking up to two hours to complete for a 24-word training set with 5 examples of each word. Figure 22 shows the parts involved in template creation:


Figure 22

The two most important functions are the Intra-Entity Associator and the Cross-

Entity Associator. In general, the Intra-Entity Associator tells the recognition system

where and how examples of the same word are similar, while the Cross-Entity Associator

finds similarities across different word groups. Both are based on the same general

principles, however.

The Intra-Entity Associator:

If the training set is reliable (i.e. all word examples were recorded in a reasonably

noise-free environment by a trainer who spoke clearly and consistently), the feature-

vectors of words and examples that sound similar should cluster around similar regions in

the feature space. If the front-end used is a reasonable approximation to human hearing

(but then again, no one yet knows how hearing truly works), the feature vectors ought to

correspond to how particular units of speech actually sound. Thus, perceptual similarity

ought to be reflected in the spatial similarity of groups within the feature space. The set

of points belonging to “dog” is expected to be near the set of points describing “hog.”


Likewise, the points belonging to “height” should be closer to those of “kite” than to those of “brown.” Of course, this is a bit of a simplification, since longer words that are composed of multiple distinct sounds, such as “counterclockwise” or “dilapidated,” will probably be spread out in some complex pattern in the feature space. This is acceptable, because what we really want is not so much the distribution pattern of whole words, but rather the distribution of the individual units of sound within each word.

Because our feature length is so short (a mere 706 samples, or about 32 ms), our feature vectors are capable of representing a word's contents at the level of individual units of sound. Therefore,

given a sufficiently small feature-length, the individual “r” “o” “b” “o” “k” “a” “r” “t”

sounds in “robokart” are plotted in the feature space. The job of the intra-entity associator

is to find the units of sound that are similar across different examples of the same word,

as well as find their positions in time. How does it do it?

Given two position vectors V1 and V2 which represent point-positions within

some N-dimensional space, the distance between the points described by V1 and V2 is

given by the familiar Euclidean distance formula:

D = sqrt( sum( (V1 - V2)^2 ) ) (Equation 5)

The square operator is applied individually to each coordinate pair. In a 3-

dimensional space where two points are described by X1, Y1, Z1 and X2, Y2, Z2, the

distance between the two points is given by:

D = sqrt( (X2 - X1)^2 + (Y2 - Y1)^2 + (Z2 - Z1)^2 ) (Equation 6)


In a 26-dimensional space described by coordinates C1, C2, C3, C4, . . . C26, the

distance between points A and B is given by:

D = sqrt( (C1A - C1B)^2 + (C2A - C2B)^2 + (C3A - C3B)^2 + (C4A - C4B)^2 + ... + (C26A - C26B)^2 ) (Equation 7)

Keep in mind that we are working with a matrix where each row corresponds to a

point in time and each column corresponds to a particular feature coordinate. The

distance-finding formula is very easy to implement over matrices.
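For example, the distance from one acquired 26-element feature vector to every stored row of a feature matrix can be computed in a few lines, without an explicit loop (this mirrors the error calculation inside the recognizer( ) function in the appendix, which sorts the distances instead of taking only the minimum):

replicant = repmat(acquired_features, [size(Prime_Features,1), 1]);  % copy the input onto every row
D = sqrt(sum((replicant - Prime_Features).^2, 2));                   % Equation 7 applied to all rows at once
[best_distance, best_row] = min(D);                                  % index of the closest stored point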

A note about our matrices:

The data for all our feature coordinates are stored in a 4-dimensional matrix. Each

row (y-direction) corresponds to a point in time, each column (x-direction) corresponds

to a particular feature, each Z slice corresponds to a particular example, and each block in

the 4th-dimension corresponds to a particular word. Because word samples are generally

non-uniform in length, empty spaces in the matrices are filled with NaN’s (not-a-

number). Figure 23 is a graphical summary of our storage scheme.


Figure 23
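As a small illustrative sketch of this storage scheme (the sizes below are hypothetical; the real matrices are produced by make_template( ) and ccm_to_matrix( )):

max_time = 120; n_features = 26; n_examples = 5; n_words = 24;
Template = repmat(NaN, [max_time, n_features, n_examples, n_words]);  % pre-fill everything with NaN padding
% store one example (number 2) of one word (number 7) that is 83 feature intervals long:
Template(1:83, :, 2, 7) = example_features;   % example_features is a hypothetical 83-by-26 matrix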

The intra-entity associator iteratively peeks at each example of each word. Within

each example, it latches on to a point in time and establishes the feature coordinates in

that period as the “reference point.” Labeling the reference point as Point A, the

associator calculates the distances of all other points belonging to different examples of

the same word. It produces a 3-dimensional matrix of values that correspond to the


distances of different example points from Point A. Within each Z-slice (example), it sorts the distances in ascending order. The distances at the top of each sample stack correspond to the points that are closest to the reference Point A and are assumed to sound the most similar to the unit of sound that Point A represents. The closest example feature distances are extracted and stored in a 1-dimensional array, and the positions in time at which they occur are stored in another 1-dimensional array. Algorithm flow is as follows:

For word = 1:max(word),

For example = 1:max(examples),

For time = 1:max(time),

Reference point = feature coordinates at (time,example,word)

Calculate the distances of all other points from the Reference point

Find the minimum distances within each example

Record the locations in time that they occur at

Calculate the standard deviation of minimum distances

Calculate the standard deviation of time-locations

Calculate the best mean distance from Reference point

Store the previous three values in such a way that they can be

retrieved easily for that Reference point.

end

end

end
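A sketch of the inner step for a single reference point (assuming dist_to_ref is a time-by-1-by-examples array holding the Equation 7 distances from the reference point to every point in every other example of the same word; note that the actual example_matcher( ) in the appendix uses a closely related relative-error measure rather than a raw Euclidean distance):

[sorted_D, sorted_t] = sort(dist_to_ref, 1);           % sort distances within each example, ascending
best_D = squeeze(sorted_D(1, 1, :));                   % closest distance found in every other example
best_t = squeeze(sorted_t(1, 1, :));                   % time index at which each closest match occurs
ref_stats = [std(best_D), std(best_t), mean(best_D)];  % the three values stored for this reference point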


The Cross-Entity (Inter-Entity) Associator:

Whereas the intra-entity associator finds the distances and times of those points

that are closest to a unit of sound at each point in time for each example within a

particular word, the inter-entity associator finds the distances and times of points

belonging to other words. Using essentially the same algorithm as the intra-entity

associator, the inter-entity associator calculates distances of points outside the current

word. It once again finds the distances and times of each external point and records the

particular word that they belong to. It sorts the candidates and finds the closest points for

each example within each word. How does it determine whether an external point is close enough to the reference point to sound the same? The two groups are assumed to be equivalent, and an association is formed between them, when all three of the following conditions hold:

1: The mean distance of the word cluster is <= the mean distance of the examples within each word.

2: The standard deviation of the word distances is <= the standard deviation of the best example distances of the reference point.

3: The standard deviation of the word times is <= the best standard deviation of the example times associated with the reference point.
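Expressed in MATLAB, the test is a single three-part comparison (a sketch only; the variable names are illustrative and do not appear in the actual associator( ) code):

is_associated = (mean_word_dist <= mean_example_dist) & ...
                (std_word_dists <= std_best_example_dists) & ...
                (std_word_times <= std_best_example_times);
if is_associated
    % record an association between the reference point and this external word group
end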

The inter-entity associator has 3 outputs: an organized list of feature coordinates,

an organized list of word associations, and an organized list of word-times. All three

matrices are arranged in lookup-table form, where the indices of one matrix correspond

exactly to the same data in another matrix. The feature matrix (truncated) looks like:


Table 5

In the future, if the recognition system encounters a feature vector resembling

(.0042493, 2.3811e-006, -1.6506, -2.2281, -3.205, -3.264, -3.4798, -3.8434, -4.2645,

-4.175 …) it will be referred to Row Index 4. The recognition system can then go to the

association matrix:

Table 6

Here, in Row Index 4, the system will discover that the new input’s feature vector

is most similar to feature vectors located in words 4, 13, 14, 16, 20, and 24. Presumably, therefore, the unit of sound that has just been input sounds a great deal like units in those particular words. To get more specific, the recognition engine then looks up the

time matrix:


Table 7

In Row Index 4, the recognition engine would find the relative times that the matching

sounds occur at within those words. It uses this time information to track potential word

candidates across time.

3.5 Run-Time Operation

The Feature Matrix, Association Matrix, and Time Matrix are the components

that the system needs to perform recognition. Suppose the user speaks into the

microphone. The data acquisition engine captures the sound, front-end algorithms remove

periods of silence, and the feature extractor creates a set of feature vectors. The distances

between the current input sequence and the coordinates stored in the Feature Matrix are then calculated. The row index that corresponds to the closest match is retrieved. Using the

row index, potential words are retrieved from the Association Matrix and stored in a Run-

Time Word Matrix. Using the row index, matching time locations are retrieved from the

Time Matrix and stored in a Run-Time Time Matrix. As the user inputs more samples,

the system finds word-numbers that occur repeatedly in the Run-Time Word Matrix. It

also examines the time-locations in the Run-Time Time Matrix. Using Matlab’s


intersect( ) and diff( ) functions, the system checks that a word number occurs repeatedly in each succeeding set and that the corresponding time locations are causal; when both checks pass, that word is selected.
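A sketch of that selection test (the names are illustrative; current_times and previous_times are assumed to be 1-by-24 vectors holding the most recent within-word match times, with zeros where a word has no match, in the spirit of the Recog_Space rows built inside recognizer( ); the -0.1 tolerance comes from the appendix code):

recurring = intersect(find(current_times > 0), find(previous_times > 0));  % words seen in both sets
dt = current_times(recurring) - previous_times(recurring);                 % change in within-word position
causal = recurring(dt > -0.1);                                             % keep words whose times move forward
if ~isempty(causal)
    chosen_word = causal(1);                                               % candidate word to report
end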

PART FOUR—USER’S GUIDE

4.1 General Overview

Creating the various functions in our system initially took a great deal of time. To

speed up the development and troubleshooting process, we made each function as

flexible and modular as possible, setting up the system so that we could change important

variables quickly and easily. This had the added benefit of allowing new people, with a

little bit of training, to use our recognition programs to create and run their own templates

and adjust the performance parameters to suit their requirements. Because the system is

implemented in a single environment (Matlab), the learning curve should be fairly gentle,

but it is assumed that the potential user has at least a passing familiarity with the Matlab

environment.

Before you begin, make sure you have the following software:

Matlab 6.5 Release 13 (or later)

Data Acquisition Toolbox v2.2 (or later)

Signal Processing Toolbox

To get started, make sure that you are in a reasonably quiet environment and that

all our functions are located in the same directory. Set the Matlab work path to the

directory in which the functions are stored. Once that is done, you should first calibrate the

system so that it can measure the ambient noise level and the noise level intrinsic to your

particular hardware:


Step 0: Calibration

Type the following command in the Matlab workspace:

[dc,bp,ln,un] = sound_calibrate;

The command line will be blocked for approximately 50 seconds while the system

calibrates itself. Try not to make any noise during this period as these settings will be

needed for all future readings. Once the program has finished running, you should save

the calibration settings via the save command:

save calibration1;

4.2 Template Creation

After calibration, the first phase involves supplying the system with a library of

sound samples from which to build a template. This can be accomplished via an external

sound recording program and then using Matlab’s built-in wavread( ) function or by

using our voice_record( ) function. Because of the nature of the Data Acquisition

Toolbox, you must first declare the following as a global variable:

global Recorded_Samples

This is the variable where all recorded sound samples will be stored.

The voice_record( ) function requires you to enter 5 input parameters: dc, bp, ln, un,

quantity. The values for dc, bp, ln, and un should come from the sound_calibrate( )

function. The fifth input parameter, quantity, is a value that tells the program how many

examples of each word you want to give. For example, if you want to form a training set

based on 7 examples of each word, you should set the value of quantity to 7. We

recommend that you choose a value of at least 5 but no more than 8, as higher


numbers of examples will require you to wait a longer amount of time when you reach

the Association Stage.

By default, the voice_record( ) function will give you a window of 3 seconds within

which you can speak a single example. To give yourself a longer or shorter acquisition

duration, open up the voice_record( ) function, scroll down to the CUSTOMIZABLE

SETTINGS portion of the program, which should be near the top, and then change the

value of record_length to the number of seconds that you wish the system to record.

From beginning to end, here is a sample recording session. Suppose you wanted to record

a single word, say, “Proceed” 6 times and store all 6 recordings in a variable called

word_set1:

global Recorded_Samples

voice_record(dc, bp, ln, un, 6);

The system is smart enough to detect the difference between silence and a sound directed

at the microphone. Therefore, at each prompt, the system will wait for you to begin

before it commences recording that example. You must, however, finish the word before

the allotted time expires. At the end of every example, you will see a counter that tracks

your progress by telling you how many recordings you have made so far. Once your set

number of examples has been reached, you should store the samples:

word_set1 = Recorded_Samples


If you want to continue and record a different word, for example, “stop,” simply press the

up arrow key or type once again:

voice_record(dc, bp, ln, un, 6);

The program will again wait for you to begin. When you are finished, you should store

your new set:

word_set2 = Recorded_Samples;

IMPORTANT: When you make a template of, say, 5 words, you MUST record the same

number of examples for each word. Once you’ve decided on how many samples you

want the recognition system to work with, you must supply it with exactly that many

samples. In the case above, once you’ve decided on six samples, the “Proceed” command

must be recorded six times, the “Stop” command must be recorded six times, and any

additional words you want to train it with must also be recorded the same number of

times. Failure to have a consistent number of examples will result in the system crashing.

Finally, let’s say you want to record a third word, such as “Charge.” Either press the up

arrow key or type in:

voice_record(dc, bp, ln, un, 6);

Record your six samples and store them into a variable:

word_set3 = Recorded_Samples;

When you have finished recording all the words you want the system to recognize, you

must group each set of words into the same array:


complete_set = {word_set1, word_set2, word_set3}

The array known as complete_set should contain all the examples of each word that you

want the system to process. It is strongly recommended that you

save complete_set

before continuing.

The next step involves extracting the relevant features from your voice samples. Do this

by executing the following line:

[td,fd] = make_template(complete_set);

There might be a short pause while the system extracts the required features. When the

system is finished, you must convert the cell arrays td and fd into matrices:

td = ccm_to_matrix(td);

fd = ccm_to_matrix(fd);

When that is finished, you must combine the time-domain and frequency-domain

matrices into a single matrix:

combined_template = cat(2,td,fd);

Now you are ready to perform “Intra-Entity Association”:

intra_associations = example_matcher(combined_template);

This process can take anywhere from 10 minutes to 1 ½ hours depending on how many

different words you recorded and how many examples of each that you have. A counter

on the screen will update you of its progress.


When finished, you should now perform a “Cross-Entity Association”:

cross_associations = associator(combined_template, intra_associations);

Again, this step can take anywhere from 10 minutes to 1 ½ hours depending on how

many word groups you recorded and how many examples of each that you have

provided. A counter will update you of its progress.

When this step is finished, one thing needs to be done before template creation can be

finished:

[features,final_associations,times] = final_template(combined_template,

cross_associations);

When this step is finished, you should save all workspace data:

save Finalized_Template;
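For reference, the entire template-creation session described above can be collected into one short script (shown for the three-word example; the actual recordings must still be made interactively at each voice_record( ) prompt):

global Recorded_Samples
% ... record word_set1, word_set2, and word_set3 with voice_record( ) as described above ...
complete_set       = {word_set1, word_set2, word_set3};
[td, fd]           = make_template(complete_set);
td                 = ccm_to_matrix(td);
fd                 = ccm_to_matrix(fd);
combined_template  = cat(2, td, fd);
intra_associations = example_matcher(combined_template);
cross_associations = associator(combined_template, intra_associations);
[features, final_associations, times] = final_template(combined_template, cross_associations);
save Finalized_Template;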

4.3 Program Execution

You are now ready to run the recognition program!

But first, you need to make sure the engine is set to the appropriate settings. Open the

recognizer( ) function via Matlab. Scroll down until you see the LOAD TEMPLATE

FILES section:

change the line load Prime_Template_706; to match the workspace name that you just saved all of your variables to. In this case, the line should be changed to:

load Finalized_Template;

You should also change the line that says Prime_Features = FEATURES_706 into:

Prime_Features = features;

The two remaining lines should also be changed to:

Associations = final_associations;


Times = times;

Finally, scroll down to the very bottom of the recognizer( ) function and you will see a

section where the recognized words are output to the screen. Change them to the written forms of the words that you have just trained, making sure that they are in the correct order.

You are now ready to run the program! Type:

recognizer(dc,bp,ln,un)

and give it a shot. You can tweak the variables in the USER SETTINGS section of the

function to obtain better performance. The program does not know when to end, so when

you get tired of running it, hit ctrl-break and the program should stop executing.

PART FIVE

5.1 Performance Evaluation:

For the most part, our voice recognition system still has a few rough edges that need tweaking for optimal results. It takes about 30 minutes to record a template of 24 commands: 24 commands with 5 samples each = 120 total recordings, and each recording takes about 15 seconds, so 120 * 15 = 1800 seconds = 30 minutes. Feeding the template into the association stage and computing the word associations then takes a grueling 2 hours. Because 26 features must be associated across all 120 recordings, these calculations would be patience-testing on a slower processor, so we made good use of our processor power during template creation.

The 2.53 GHz computer that we used also gave us a boost when actually running tests. Our final product had only about a 2.5-second delay between vocal input and


screen output. Although far from instantaneous, some of this delay is unavoidable, since the system must find the end of a word and then search for a matching template. Our goal was to create smooth communication between man and machine, and in terms of speed our product came fairly close. When outside noise was introduced at the vocal input stage, the speed did not deteriorate; only the accuracy did.

Accuracy is a problem that can be tweaked by changing settings and thresholds in the “recognizer” program. We sometimes had to repeat the same voice command in order for the system to output a result. In a tightly controlled environment, with no outside noise or frequency disturbances, we can achieve close to perfect results for most of our commands. Some commands, such as “Clockwise,” were almost never obtained as an output; we attribute this to the fact that “Clockwise” sounds extremely close to “Counterclockwise.” Other words also turned out to be similar. For example, the commands “Up” and “Stop” were very close because the lengths of the two commands are similar, and the “S” sound in “Stop” has a smaller power spectrum than other consonants, so it may be harder to detect. When we introduced noise into the system during our testing, the loss of accuracy became much more apparent.

Several future improvements remain. Due to time constraints, there was only so much polishing we could accomplish in the final days. One feature of our system that we would like to improve is user-friendliness (see the User’s Guide). The process of creating a user-defined template and loading it into our system has many steps; our vision of a better system would require only the basics: prompt the user for the number of words and the number of samples of each, and then automatically load them into the association matrix. After that process is complete (we would also like to shorten the


loading time to much less than 2 hours), the user could define his or her own threshold values, which affect the system's speed and accuracy. The process of obtaining the vocal input for comparison with the newly loaded template could also be implemented in a GUI.

Our conclusion from the performance evaluation is that the final product is nearly complete; it needs some minor modifications to make the system easier to use, and perhaps further speed optimization. Making the system robust is challenging, but removing unwanted noise might be accomplished with additional filtering techniques. More testing will certainly be required, and each iteration should bring us closer to that goal.


PART 6

Parts List and Costs: Items Purchased During Senior Design

Project Purchases                                        Costs
Sony Vaio 2.53 GHz computer                              $1299.99
Noise-canceling microphone                               $24.99
Omnidirectional desktop microphone                       $5.99
MATLAB + 2 module toolboxes                              $159.97
Amerikit Radio Control Car Kit                           $24.95
9.6 V NiMH battery pack, 9 V batteries, solder wire      $34.19
RadioShack digital multimeter                            $39.99

Total (without tax and shipping) = $1590.07


List of Equipment Used (Final Products only)

1) Omnidirectional desktop microphone
2) 2.53 GHz desktop computer (Sony Vaio)
3) MATLAB student edition (Release 13)
4) MATLAB Signal Processing Toolbox
5) MATLAB Data Acquisition Toolbox
6) Tyco Fast Traxx
7) 9.6 V NiMH battery and charger
8) Axiom CME119-EVBU (w/ M68HC11 chip)
9) Black and Decker power screwdriver
10) Solder wire and iron
11) AxIDE software
12) IAR Embedded Workbench (v2.0)


Glossary of Terms, Acronyms, and Abbreviations

Peripheral – A hardware device connected to a central hardware device that performs a specific function within a complete hardware system.
Microphone – A peripheral device that converts analog sound into an electrical signal that can be digitized for application use.
ROBOKART – The name of our robot system, which accepts voice commands provided by a user.
Feature Vector – A set of numerical values (26 in our system) that describes one short frame of sound.
Feature Space – The multidimensional space in which feature vectors are placed for processing and comparison.
Template – A pre-defined collection of data that acts as a model for a command. It is used by the voice command recognition system for comparison with new input.
Algorithm – A step-by-step procedure that a program follows to accomplish a task.
Asynchronous – Describes communication between two peripherals, or between a peripheral and the main system, that does not share a common clock.
Byte – Eight bits of data.
Noise – Foreign information introduced into the system that disrupts or distorts the input.
SCI – Serial Communications Interface.
68HC11 – Motorola microcontroller model number/type.
MCU – Microcontroller unit.
CME119-EVBU – Model number and type of evaluation board; contains the Motorola 68HC11 MCU.
RC – Radio control.
Tyco Fast Traxx – Device used for demonstrating ROBOKART; an RC car distributed by Tyco International Ltd.
Turbo King – Alternative device used for radio control testing, built from an electronics kit.


Figure References

1. http://tonydude.net/physics201/p201chapter6.htm
2. http://hyperphysics.phy-astr.gsu.edu/hbase/sound/acont.html#c1
3, 4. http://www.isip.msstate.edu/publications/journals/ieee_proceedings/1993/signal_modeling/paper_v2.pdf
21. http://www.st-andrews.ac.uk/~wjh/dataview/cluster.html

Table References

3. http://www.isip.msstate.edu/publications/journals/ieee_proceedings/1993/signal_modeling/paper_v2.pdf

References

1. “History of Speech Recognition,” http://www.stanford.edu/~jmaurer/history.htm
2. Joseph Picone, “Signal Modeling Techniques in Speech Recognition,” http://www.isip.msstate.edu/publications/journals/ieee_proceedings/1993/signal_modeling/paper_v2.pdf
3. “Columbia University Speech Recognition Lecture Notes,” http://www.ee.columbia.edu/~dpwe/e6820/lectures/

Documents

[1] Petr Blaha and Pavel Vaclavek, “Performance Analysis of Serial Port Interface,” Centre for Applied Cybernetics.
[2] Robert D. Rodman, Computer Speech Technology, Artech House, Boston, 1999.


Internet Websites

[1] HC11 compendium, http://ee.cleversoul.com/hc11.html
[2] Motorola website, http://www.motorola.com


PHOTOS


APPENDIX

SOURCE CODE

C code for ROBOKART, tested in the IAR Embedded Workbench

#include <io6811.h> #include <intr6811.h> #include <stdio.h> // unsigned char baudrate = BAUD; unsigned char record; // unsigned char Trdat = 0xa5; void writesci(unsigned char); unsigned char readsci(void); void main(void) { // configures for SCI BAUD = 0x30; SCCR1 = 0x00; SCCR2 = 0x0c; record = SCSR; record = SCDR; DDRD = 0x3c; // set initial output settings for PORTD // DDRC = 0xff; /*set PORTC as 8-bit output*/ // PORTB = 0x00; // initialize to zero outputs // PORTC = 0x00; while(1) { writesci(record); record = readsci(); /* if (CNT_DIR == 0) PORTB = 0x0f; else PORTB = 0x70; if (ROT_DIR == 0) PORTC = 0x0f; else PORTC = 0x70; */ }


} void writesci(unsigned char data) { // data = 0xf1; (use this for testing) // get transmitted data, wait until register flag empty TDRE set SCDR = data; while ((SCDR & (SCSR & 0x80)) == 0); } unsigned char readsci(void) { // wait for RDRF sets then return SCDR data int y; while ((SCSR & (SCSR & 0x20)) == 0); y = 0; PORTD = 0x00; switch(PORTD) { case 0: // Proceed Forward SCDR = 0x80; while(1) { while(y < 100000, y++) {} PORTD = 0x2b; } PORTD = 0x00; break; case 1: // Proceed Backward SCDR = 0x81; while(1) { while(y < 100000, y++) {} PORTD = 0x17; } PORTD = 0x00; break; case 2: // Turn Left SCDR = 0x88; while(1) { while(y < 100000, y++) {} PORTD = 0x23; } PORTD = 0x00; break;


case 3: // Turn Right SCDR = 0x89; while(1) { while(y < 100000, y++) {} PORTD = 0x08; } PORTD = 0x00; break; case 4: // Rotate Clockwise SCDR = 0x10; while(1) { while(y < 100000, y++) {} PORTD = 0x27; } PORTD = 0x00; break; case 5: // Rotate Counterclockwise SCDR = 0x11; while(1) { while(y < 100000, y++) {} PORTD = 0x1b; } PORTD = 0x00; break; case 6: // Stop SCDR = 0x00; while(1) { while(y < 100000, y++) {} PORTD = 0x03; } PORTD = 0x00; break; case 7: // Charge SCDR = 0x1f; while(1) { while(y < 1000000, y++) {} PORTD = 0x2b; } PORTD = 0x00; break;


case 8: // Retreat SCDR = 0xf1; while(1) { while(y < 1000000, y++) {} PORTD = 0x17; } PORTD = 0x00; break; } return (SCDR); } function [TD_template,FD_template] = make_template(ccv) % Standard Template Format: % Rows of each template correspond to a span (or sampling chunk) in time % For Time-Domain templates, each column corresponds to a unique feature % value % For Frequency-Domain templates, every column corresponds to the same % feature type but at a different location in frequency % TD_template legend: % TD_template(row,:) = [mean_zero_crossings mean_power] % 1 2 % All changes are expressed as a simple ratio between current and previous % values warning off MATLAB:divideByZero global Bark_Indices global Ham_Window % User Customizable Values: feature_length = 706; sampling_rate = 22050; overlap = round(feature_length/2); % Prepare spectral window Ham_Window = hamming(feature_length)'; % Find Bark Intervals f_scale = sampling_rate*(0:ceil(feature_length/2))/feature_length; % Find the frequency indices that correspond to each Bark number Bark_Indices = bark_grouper(f_scale);


% Calculate Equal-Loudness Preemphasis weighting vector preemphasis_weights = preemphasis(f_scale); for c1_index = 1:length(ccv), cv = ccv{c1_index}; for c2_index = 1:length(cv), raw_v = cv{c2_index}; v = sound_finder(raw_v); windowed_samples = slider(v,feature_length,overlap); for k = 1:length(windowed_samples), % Extract time-domain features v_features = vector_features_static(windowed_samples{k}); TD_template{c1_index}{c2_index}(k,:) = v_features; % Extract spectral features if length(windowed_samples{k}) < feature_length, shortfall = feature_length - length(windowed_samples{k}); % if length of current data is less than required feature_length, add zero padding: windowed_samples{k} = cat(2,windowed_samples{k},zeros(1,shortfall)); end spectrum = abs(fft(windowed_samples{k}.*Ham_Window,feature_length)); spectrum = spectrum(1:ceil(feature_length/2)+1); weighted_spectrum = spectrum; bank_powers = filter_bank(weighted_spectrum); FD_template{c1_index}{c2_index}(k,:) = bank_powers; end end end % ----------------------------------------------------------------------- % function [pure_sound] = sound_finder(vector) baseline_power = 9.698568682625995e-007; % measured 'average' power for a reasonably quiet environment minimum_power = baseline_power * 1.5; power_interval = 105; min_level = -0.00386435942872 * 1.5; max_level = 0.00386267192857 * 1.5; % Find Start Index, for k = 1:length(vector), if vector(k) > max_level | vector(k) < min_level, sound_chunk = vector(k:(k+power_interval)-1); power_chunk = mean(sound_chunk.^2); if power_chunk > minimum_power,


start_index = k; break end end end % Find Stop Index, for k = length(vector):-1:1, sound_chunk = vector(k:-1:(k-power_interval)+1); power_chunk = mean(sound_chunk.^2); if power_chunk > minimum_power, stop_index = k; break end end pure_sound = vector(start_index:stop_index); % ----------------------------------------------------------------------- % function [windowed_samples] = slider(vector,window_length,overlap) % For this version, overlap must be greater than zero % NOTE: windowed_samples{k} = column vector [rows,cols] = size(vector); if rows > 1, vector = vector'; end start_indices = 1:window_length-overlap:length(vector)-(window_length-overlap); stop_indices = window_length:window_length-overlap:length(vector); if length(stop_indices) < length(start_indices), stop_indices = cat(2,stop_indices,length(vector)); end windowed_samples = cell(length(start_indices),1); % pre-allocation for k = 1:length(start_indices), windowed_samples{k} = vector(start_indices(k):stop_indices(k)); end % ----------------------------------------------------------------------- % function [v_features_static] = vector_features_static(row_vector) mean_zero_crossings = mean(crossing_counter(row_vector,0));


sound_power = mean(row_vector.^2); v_features_static = [mean_zero_crossings sound_power]; % ----------------------------------------------------------------------- % function [output] = crossing_counter(vector,target_value) % forward-looking function that flags instances when a specified value is % crossed. % output will be the same format as the input vector % output length will be the same length as the input [in_rows,in_cols] = size(vector); if in_rows > 1, output = zeros(length(vector),1); end if in_cols > 1, output = zeros(1,length(vector)); end for index = 1:length(vector)-1, % check for upward crossing: if vector(index) < target_value & vector(index+1) > target_value, output(index+1) = 1; elseif vector(index) == target_value & vector(index+1) > target_value, output(index) = 1; end % check for downward crossing: if vector(index) > target_value & vector(index+1) < target_value, output(index+1) = 1; elseif vector(index) == target_value & vector(index+1) < target_value, output(index) = 1; end end % ----------------------------------------------------------------------- % function [weights] = preemphasis(fvector) % OFFICIAL (SORT OF) VERSION OF THE PRE-EMPHASIS FUNCTION % Maximum sensitivity at 4722.37 Hz % Peak Gain = 390 (approx) function_boost = 1.640399551056837e+021; w = 2*pi*fvector;


weights = (((w.^2 + 56.8*10^6).*w.^4)./((w.^2 + 6.3*10^6).*(w.^2 + 0.38*10^9).*(w.^6+9.58*10^26)))*function_boost; % ----------------------------------------------------------------------- % function [bank_powers] = filter_bank(spectrum) % spectrum must be a column vector global Bark_Indices; bank_powers = zeros(1,length(Bark_Indices)); % pre-allocate 'bank_powers' as a column vector for k = 1:length(Bark_Indices), triangle_window = triang(length(Bark_Indices{k}))'; % triangle_window is a column vector chunk = spectrum(Bark_Indices{k}); windowed_chunk = triangle_window.*chunk; bank_powers(k) = log10(mean(windowed_chunk.^2)); end % ----------------------------------------------------------------------- % function [barked_indices] = bark_grouper(fvector) % output is a CV (Cells of Vectors) % make sure that fvector is in column form [rows,cols] = size(fvector); if rows > 1, fvector = fvector'; end center_freq = [50 150 250 350 450 570 700 840 1000 1170 1370 1600 1850 2150 2500 2900 3400 4000 4800 5800 7000 8500 10500 13500]; bandwidth = [100 100 100 100 110 120 140 150 160 190 210 240 280 320 380 450 550 700 900 1100 1300 1800 2500 3500]; f = fvector; for k = 1:24, if k == 1, lower_bound = 0; upper_bound = 150; else lower_bound = center_freq(k) - bandwidth(k); upper_bound = center_freq(k) + bandwidth(k); end test = find(f >= lower_bound & f <= upper_bound); if isempty(test) == 1,


break else barked_indices{k} = test; end end function [] = recognizer(dc_offset,base_power,ln,un) % USER-CUSTOMIZABLE VARIABLES: global Sampling_Rate; % frequency rate at which voice will be sampled at global Trigger_Interval; % number of samples to acquire during every trigger global Buffer_Size; % size of the various storage arrays global Feature_Length; % must be an even number global Overlap; % number of samples by which each sliding window overlaps global DC; % average value of intrinsic noise global Baseline_Power; % average power of intrinsic noise global Minimum_Power; % minimum required power that a spoken utterance ought to have global Min_Power_Interval; % minimum number of samples required to make an accurate power estimate global Min_Power_Chunks; % minimum number of power measurements per trigger global Elasticity; % permissible magnitude variance (must be an integer) global Attention; % number of promising candidates to pay attention to global Entity_Groups; % the number of "words" that the system is required to recognize global Recog_Span global Recog_Space global Mag_Threshold global Time_Threshold_High global Time_Threshold_Low global Chain_Length global Minimum_Length % DATA-STORAGE STRUCTURES global Partitioned_Sound_Buffer; % place to store trigger-collected samples after noise and silence have been removed global Vector_Features_Buffer


global Raw_Sound_Chunk; % the most recently acquired, yet-unprocessed trigger sample global Recurring_Matrix global Current_Recurring_Entities global Previous_Recurring_Entities % STATUS FLAGS global PSB_Full; % flags a '1' if Partitioned Sound Buffer is full global First_Time; % INDEX (LOCATION) TRACKERS global Current_PSB_Count; % index pointing to the location of the most recent value in the Partitioned Sound Buffer global Current_VFB_Count; % index pointing to the location of the most recent value in the Vector Features Buffer global Recurring_Matrix_Count; % AUTOMATICALLY-DETERMINED PARAMETERS global Power_Interval; % the automatically-determined number of samples to be used on every power measurement global Power_Chunk_Indices; % an automatic, pre-set vector used to determine the sample cutoff points for each power measurement global Bark_Indices; % frequency index values that correspond to each integer Bark value (of 1 through 24) global Feature_Count; % Number of features involved in time-domain analysis global Ham_Window; global Prime_Features global Associations global Times global Template_Size % Loaded settings DC = dc_offset; Baseline_Power = base_power; lower_noise = ln; upper_noise = un; % User Settings: Sampling_Rate = 22050; Trigger_Interval = 11025; % must be an even number Buffer_Size = 120; Feature_Length = 706; % if using half-overlap, this must be an even number


Overlap = round(Feature_Length/2); % must be smaller than Feature_Length Minimum_Power = Baseline_Power * 1.5; Min_Power_Interval = 60; Min_Power_Chunks = 7; Elasticity = 1; Attention = 5; Entity_Groups = 24; Recog_Span = 90; Mag_Threshold = 0.6; Time_Threshold_Low = 0.27; Time_Threshold_High = 0.77; Chain_Length = 3; Minimum_Length = 30; % LOAD TEMPLATE FILES load Prime_Template_706; Prime_Features = FEATURES_706; Associations = ASSOCIATIONS_706; Times = TIMES_706; % Extracted parameters: lower_trigger = (lower_noise * 1.5) + DC; upper_trigger = (upper_noise * 1.5) + DC; % Data Acquisition Parameters: source = analoginput('winsound'); addchannel(source,1); source.TriggerChannel = source.Channel(1); source.BitsPerSample = 16; source.LoggingMode = 'memory'; source.SampleRate = Sampling_Rate; source.SamplesPerTrigger = Trigger_Interval; source.TriggerType = 'software'; source.TriggerCondition = 'leaving'; source.TriggerConditionValue = [lower_trigger upper_trigger]; source.TriggerDelay = 0; source.TriggerDelayUnits = 'samples'; source.TriggerRepeat = Inf; source.StartFcn = @start_prep; source.StopFcn = @shutdown; source.SamplesAcquiredFcnCount = Trigger_Interval; source.TriggerFcn = @extract_features; source.SamplesAcquiredFcn = @all_clear; % Pre-Allocate Storage Structures:


Partitioned_Sound_Buffer = cell(1,Buffer_Size); % Calculate hamming window: Ham_Window = hamming(Feature_Length)'; % Intelligent power interval determination: upper_choice = floor(Trigger_Interval/Min_Power_Chunks); choices = Min_Power_Interval:upper_choice; remainders = mod(Trigger_Interval,choices); indices = find(remainders == 0); if isempty(indices) ~= 1, qualified_choices = choices(indices); sorted_qualified_choices = sort(qualified_choices); qualified_indices = find(sorted_qualified_choices >= Min_Power_Interval); if isempty(qualified_indices) ~= 1, best_choices = sorted_qualified_choices(qualified_indices); Power_Interval = best_choices(1); else backup_qualified_indices = find(sorted_qualified_choices < Min_Power_Interval); next_best_choices = sorted_qualified_choices(backup_qualified_indices); Power_Interval = max(next_best_choices); end else Power_Interval = Trigger_Interval; end Power_Chunk_Indices = 1:Power_Interval:Trigger_Interval; % Find Bark Intervals f_scale = Sampling_Rate*(0:ceil(Feature_Length/2))/Feature_Length; % Find the frequency indices that correspond to each Bark number Bark_Indices = bark_grouper(f_scale); % Find Template Size [rows,cols] = size(Prime_Features); Template_Size = rows; warning off MATLAB:divideByZero First_Time = 1; disp(' ') disp(' ') disp(' ') start(source); % ----------------------------------------------------------------------- %


function [] = start_prep(source,Start) global Partitioned_Sound_Buffer global Current_PSB_Count; global Current_VFB_Count; global Vector_Features_Buffer; global Raw_Sound_Chunk; global PSB_Full; global Entity_Groups; global Recog_Space global Recurring_Matrix Current_PSB_Count = 0; Current_VFB_Count = 0; Partitioned_Sound_Buffer = {}; Raw_Sound_Chunk = 0; PSB_Full = 0; Vector_Features_Buffer = []; Recog_Space = []; Recurring_Matrix = []; % ----------------------------------------------------------------------- % function [] = all_clear(source,SamplesAcquired) global First_Time First_Time = 0; % ----------------------------------------------------------------------- % function [] = shutdown(source,Stop) delete(source); clear source; % ----------------------------------------------------------------------- % function [] = extract_features(source,Trigger) global Sampling_Rate; global Trigger_Interval; global Buffer_Size; global Feature_Length; global Overlap; global DC; global Baseline_Power; global Minimum_Power; global Min_Power_Interval; global Min_Power_Chunks; global Elasticity;


global Attention; global Entity_Groups; global Recog_Span global Recog_Space global Mag_Threshold global Time_Threshold_High global Time_Threshold_Low global Chain_Length global Minimum_Length global Partitioned_Sound_Buffer; global Vector_Features_Buffer global Raw_Sound_Chunk; global Recurring_Matrix; global Current_Recurring_Entities global Previous_Recurring_Entities global PSB_Full; global First_Time global Current_PSB_Count; global Current_VFB_Count; global Recurring_Matrix_Count; global Power_Interval; global Power_Chunk_Indices; global Bark_Indices; global Ham_Window; global Prime_Features global Associations global Times global Template_Size if First_Time == 1, return end Raw_Sound_Chunk = getdata(source) - DC; % Determine if acquired trigger sample has sufficient active power and % store those samples with sufficient power [pure_sound,valid] = partitioner(Raw_Sound_Chunk); if valid == 1, Current_PSB_Count = Current_PSB_Count + 1; Partitioned_Sound_Buffer{Current_PSB_Count} = pure_sound; % pure_sound = row vector


end % Extract features from active samples if valid == 1, % NOTE: windowed_samples{k} = column vector if length(pure_sound) >= Feature_Length, windowed_samples = slider(pure_sound,Feature_Length,Overlap); else windowed_samples{1} = pure_sound'; end % Acquire each window of samples for k = 1:length(windowed_samples), % Extract time-domain features v_features = vector_features_static(windowed_samples{k}); Current_VFB_Count = Current_VFB_Count + 1; Vector_Features_Buffer(Current_VFB_Count,:) = v_features; % Extract spectral features if length(windowed_samples{k}) < Feature_Length, shortfall = Feature_Length - length(windowed_samples{k}); % if length of current data is less than required Feature_Length, add zero padding: windowed_samples{k} = cat(2,windowed_samples{k},zeros(1,shortfall)); end spectrum = abs(fft(windowed_samples{k}.*Ham_Window,Feature_Length)); spectrum = spectrum(1:ceil(Feature_Length/2)+1); weighted_spectrum = spectrum; bank_powers = filter_bank(weighted_spectrum); acquired_features = cat(2,v_features,bank_powers); % Calculate Error replicant = repmat(acquired_features,[Template_Size,1]); error_distances = sqrt(sum((replicant-Prime_Features).^2,2)); [Y,I] = sort(error_distances); candidate_entities_matrix = Associations(I(1:Attention),:); candidate_entities_matrix = candidate_entities_matrix'; candidate_entities = cat(1,candidate_entities_matrix(:))'; candidate_entities(candidate_entities == 0) = []; candidate_times_matrix = Times(I(1:Attention),:); candidate_times_matrix = candidate_times_matrix'; candidate_times = cat(1,candidate_times_matrix(:))'; candidate_times(candidate_times == 0) = []; pad = zeros(1,Entity_Groups); pad(candidate_entities) = candidate_times; Recog_Space = cat(1,Recog_Space,pad);


[recog_rows,cols] = size(Recog_Space); if recog_rows > Recog_Span, Recog_Space = Recog_Space(2:recog_rows,:); end [recog_rows,cols] = size(Recog_Space); if recog_rows > Chain_Length + 1, previous_chunk = Recog_Space(recog_rows-Chain_Length:recog_rows-1,:); [d,previous_recog_entities] = find(previous_chunk > 0); previous_recog_entities = unique(previous_recog_entities); current_recog_line = Recog_Space(recog_rows,:); current_recog_entities = find(current_recog_line > 0); recurring_entities = intersect(current_recog_entities,previous_recog_entities); % check for causality: if isempty(recurring_entities) ~= 1, previous_recurring_times = Recog_Space(recog_rows-Chain_Length:recog_rows-1,recurring_entities); previous_recurring_times_mean = mean(previous_recurring_times,1); current_recurring_times = current_recog_line(recurring_entities); current_minus_previous = current_recurring_times-previous_recurring_times_mean; pseudo_indices = find(current_minus_previous > -0.1); if isempty(pseudo_indices) ~= 1, qualified_recurring_entities = recurring_entities(pseudo_indices); current_qualified_recurring_times = Recog_Space(recog_rows,qualified_recurring_entities); pad = zeros(1,Entity_Groups); pad(qualified_recurring_entities) = current_qualified_recurring_times; Recurring_Matrix = cat(1,Recurring_Matrix,pad); [check_rows,check_cols] = size(Recurring_Matrix); if check_rows > Recog_Span, Recurring_Matrix = Recurring_Matrix(2:check_rows,:); end [Recurring_Matrix_Count,check_cols] = size(Recurring_Matrix); % check for finalists current_recurring_matrix_line = Recurring_Matrix(Recurring_Matrix_Count,:); potential_finalist_entities = find(current_recurring_matrix_line >= Time_Threshold_High); if isempty(potential_finalist_entities) ~= 1, absolute_final = zeros(1,Entity_Groups); for k = 1:length(potential_finalist_entities), current_finalist_entity = potential_finalist_entities(k); current_finalist_times = Recurring_Matrix(:,current_finalist_entity); current_finalist_times_locs = find(current_finalist_times > 0); time_loc_span = length(current_finalist_times_locs);


if time_loc_span < Minimum_Length, %Recurring_Matrix(:,current_finalist_entity) = 0; continue end if min(current_finalist_times) <= Time_Threshold_Low, absolute_final(current_finalist_entity) = time_loc_span; end end if sum(absolute_final) ~= 0, [test_value,Entity] = max(absolute_final); flasher(Entity); Recurring_Matrix(:,Entity) = 0; Recurring_Matrix(:,potential_finalist_entities) = 0; if Entity == 8, disp('ENGINE HAS STOPPED') stop(source); end end end end end end end end % ----------------------------------------------------------------------- % function [v_features_static] = vector_features_static(row_vector) zero_crossings = mean(crossing_counter(row_vector,0)); sound_power = mean(row_vector.^2); v_features_static = [zero_crossings sound_power]; % ----------------------------------------------------------------------- % function [output] = crossing_counter(vector,target_value) % forward-looking function that flags instances when a specified value is % crossed. % output will be the same format as the input vector % output length will be the same length as the input [in_rows,in_cols] = size(vector); if in_rows > 1, output = zeros(length(vector),1); end if in_cols > 1,


output = zeros(1,length(vector)); end for index = 1:length(vector)-1, % check for upward crossing: if vector(index) < target_value & vector(index+1) > target_value, output(index+1) = 1; elseif vector(index) == target_value & vector(index+1) > target_value, output(index) = 1; end % check for downward crossing: if vector(index) > target_value & vector(index+1) < target_value, output(index+1) = 1; elseif vector(index) == target_value & vector(index+1) < target_value, output(index) = 1; end end % ----------------------------------------------------------------------- % function [pure_sound,valid] = partitioner(vector) global Minimum_Power; global Power_Interval; global Power_Chunk_Indices; % make sure 'vector' is in column form vector = vector'; squared = vector.^2; power_chunks = zeros(length(Power_Chunk_Indices)-1,1); % pre-allocation vector_chunks = zeros(length(Power_Chunk_Indices)-1,Power_Interval); % pre-allocation for k = 1:length(Power_Chunk_Indices)-1, power_chunks(k,:) = mean(squared(Power_Chunk_Indices(k):Power_Chunk_Indices(k+1)-1)); vector_chunks(k,:) = vector(Power_Chunk_Indices(k):Power_Chunk_Indices(k+1)-1); end qualified_power_indices = find(power_chunks > Minimum_Power); if isempty(qualified_power_indices) ~= 1, qualified_vector_chunks = vector_chunks(qualified_power_indices,:); pure_sound = reshape(qualified_vector_chunks',[],1); % turn pure_sound into a continuous row vector valid = 1; else pure_sound = NaN;


valid = 0; end % ----------------------------------------------------------------------- % function [windowed_samples] = slider(vector,window_length,overlap) % For this version, overlap must be greater than zero % NOTE: windowed_samples{k} = column vector [rows,cols] = size(vector); if rows > 1, vector = vector'; end start_indices = 1:window_length-overlap:length(vector)-(window_length-overlap); stop_indices = window_length:window_length-overlap:length(vector); if length(stop_indices) < length(start_indices), stop_indices = cat(2,stop_indices,length(vector)); end windowed_samples = cell(length(start_indices),1); % pre-allocation for k = 1:length(start_indices), windowed_samples{k} = vector(start_indices(k):stop_indices(k)); end % ----------------------------------------------------------------------- % function [barks] = bark(fvector) % Implements Traunmuller's Bark Scale f = fvector; barks = ((26.81*f)./(1960+f))-0.53; % Apply low and high-frequency corrections for index = 1:length(barks), if barks(index) < 2, barks(index) = barks(index)+0.15*(2-barks(index)); elseif barks(index) > 20.1, barks(index) = barks(index)+0.22*(barks(index)-20.1); end end % ----------------------------------------------------------------------- % function [bank_powers] = filter_bank(spectrum)


% spectrum must be a column vector global Bark_Indices; bank_powers = zeros(1,length(Bark_Indices)); % pre-allocate 'bank_powers' as a column vector for k = 1:length(Bark_Indices), triangle_window = triang(length(Bark_Indices{k}))'; % triangle_window is a column vector chunk = spectrum(Bark_Indices{k}); windowed_chunk = triangle_window.*chunk; bank_powers(k) = log10(mean(windowed_chunk.^2)); end % ----------------------------------------------------------------------- % function [barked_indices] = bark_grouper(fvector) % output is a CV (Cells of Vectors) % make sure that fvector is in column form [rows,cols] = size(fvector); if rows > 1, fvector = fvector'; end center_freq = [50 150 250 350 450 570 700 840 1000 1170 1370 1600 1850 2150 2500 2900 3400 4000 4800 5800 7000 8500 10500 13500]; bandwidth = [100 100 100 100 110 120 140 150 160 190 210 240 280 320 380 450 550 700 900 1100 1300 1800 2500 3500]; f = fvector; for k = 1:24, if k == 1, lower_bound = 0; upper_bound = 150; else lower_bound = center_freq(k) - bandwidth(k); upper_bound = center_freq(k) + bandwidth(k); end test = find(f >= lower_bound & f <= upper_bound); if isempty(test) == 1, break else barked_indices{k} = test; end end


%---------------------------------------------------------------------- function [] = flasher(entity) if entity == 1, disp('ROBOKART') disp(' ') disp(' ') disp(' ') disp(' ') end if entity == 2, disp('ROTATE') disp(' ') disp(' ') disp(' ') disp(' ') end if entity == 3, disp('CLOCKWISE') disp(' ') disp(' ') disp(' ') disp(' ') end if entity == 4, disp('COUNTERCLOCKWISE') disp(' ') disp(' ') disp(' ') disp(' ') end if entity == 5, disp('PROCEED') disp(' ') disp(' ') disp(' ') disp(' ') end if entity == 6, disp('FORWARD') disp(' ') disp(' ') disp(' ') disp(' ') end if entity == 7,


    disp('BACKWARD'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 8,
    disp('STOP COMMAND ISSUED'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 9,
    disp('TURN'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 10,
    disp('LEFT'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 11,
    disp('RIGHT'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 12,
    disp('SPEED'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 13,
    disp('UP'), disp(' '), disp(' '), disp(' ')


    disp(' ')
end
if entity == 14,
    disp('SLOW'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 15,
    disp('DOWN'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 16,
    disp('DANCE'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 17,
    disp('CHARGE'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 18,
    disp('RETREAT'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 19,
    disp('GOOD'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 20,
    disp('BAD')


    disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 21,
    disp('GO'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 22,
    disp('TO'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 23,
    disp('SLEEP'), disp(' '), disp(' '), disp(' '), disp(' ')
end
if entity == 24,
    disp('WAKE'), disp(' '), disp(' '), disp(' '), disp(' ')
end

% ----------------------------------------------------------------------- %

function [intra_entity_synonyms] = example_matcher(combined_template)

% combined_template must be a 4D matrix
% intra_entity_synonyms: 3D Cell (m o p) of 2 row vectors: example numbers
%                        aligned with sample indices

% DATA-STORAGE STRUCTURES
global Local_Template

% AUTOMATICALLY-DETERMINED PARAMETERS
global Feature_Count;


global Elasticity

% Settings:
Elasticity = 1;

warning off MATLAB:divideByZero

Template = combined_template;
[m,Feature_Count,o,p] = size(Template);

% Find maximum durations of each example
for p_index = 1:p,
    for o_index = 1:o,
        example = Template(:,:,o_index,p_index);
        vertical_slice = example(:,1);
        valid_slice = vertical_slice(finite(vertical_slice));
        duration = length(valid_slice);
        durations(o_index,p_index) = duration;
    end
end

intra_entity_synonyms = cell(m,o,p);

for p_index = 1:p,
    p_index
    Local_Template = Template(:,:,:,p_index);
    for o_index = 1:o,
        o_index
        example_duration = durations(o_index,p_index);
        for m_index = 1:example_duration,
            acquired_sample = Local_Template(m_index,:,o_index);
            Local_Template(m_index,:,o_index) = repmat(NaN,[1 Feature_Count]); % remove the current sample so it cannot match itself
            [best_examples,best_sample_indices] = local_matcher(acquired_sample);
            Local_Template(m_index,:,o_index) = acquired_sample;               % restore it
            synonyms = cat(2,best_examples,best_sample_indices);
            intra_entity_synonyms{m_index,o_index,p_index} = synonyms;
        end
    end
end

% ----------------------------------------------------------------------- %

function [delta_m_best,best_indices] = local_matcher(acquired_features)

global Elasticity;
global Local_Template


% Create error template
[m,n,o] = size(Local_Template);
error_template = repmat(NaN,[m n o]);

for feature_index = 1:length(acquired_features),
    current_feature_value = acquired_features(feature_index);
    error_template(:,feature_index,:) = abs((current_feature_value-Local_Template(:,feature_index,:))./Local_Template(:,feature_index,:));
end

% Calculate Delta-M Error Values:
unsorted_delta_m_cf_error = mean(error_template,2);
[delta_m_cf_error,delta_m_cf_indices] = sort(unsorted_delta_m_cf_error,1);

best_delta_m_cf_error = squeeze(delta_m_cf_error(Elasticity,:,:));
best_delta_m_cf_indices = squeeze(delta_m_cf_indices(Elasticity,:,:));

[Y,I] = sort(best_delta_m_cf_error);
delta_m_best = I;                          % example numbers, row vector
best_indices = best_delta_m_cf_indices(I); % in-example indices, row vector

% ----------------------------------------------------------------------- %

function [associations] = associator(template,synonyms)

tic

% template must be a 4-D matrix
% synonyms must be a 3-D Cell

[m,n,o,p] = size(template);

% Find maximum durations of each example
for p_index = 1:p,
    for o_index = 1:o,
        example = template(:,:,o_index,p_index);
        vertical_slice = example(:,1);
        valid_slice = vertical_slice(finite(vertical_slice));
        duration = length(valid_slice);
        durations(o_index,p_index) = duration;
    end
end

for p_index = 1:p,
    reference_entity = p_index
    target_entities = 1:p;
    for o_index = 1:o,
        o_index
        max_duration = durations(o_index,p_index);
        for m_index = 1:max_duration,
            % Retrieve reference data:


            reference_point = template(m_index,:,o_index,p_index);
            reference_point_neighbor_distances = synonyms{m_index,o_index,p_index}(:,3);
            reference_point_neighbor_times = synonyms{m_index,o_index,p_index}(:,2);

            % Remove current entity example from reference data to
            % maintain an accurate standard deviation
            reference_point_neighbor_distances(1) = [];
            reference_point_neighbor_times(1) = [];

            % Set the standards
            distance_to_beat = mean(reference_point_neighbor_distances);
            distance_dev_to_beat = std(reference_point_neighbor_distances);
            time_dev_to_beat = std(reference_point_neighbor_times);

            % Calculate competing entity stats:
            competition_matrix = template(:,:,:,target_entities);
            [c1,c2,c3,c4] = size(competition_matrix);
            reference_matrix = repmat(reference_point,[c1 1 c3 c4]);
            distance_matrix = sqrt(sum((competition_matrix-reference_matrix).^2,2));
            [sorted_distances,local_times] = sort(distance_matrix,1);
            best_distances = squeeze(sorted_distances(1,:,:,:));
            best_times = squeeze(local_times(1,:,:,:));

            % Make sure current entity always wins:
            best_distances(:,reference_entity) = 0;
            best_times(:,reference_entity) = m_index;

            % Judgement time:
            consolidated_error = mean(best_distances,1);
            consolidated_error_dev = std(best_distances,0,1);
            consolidated_time_dev = std(best_times,0,1);

            best_error_candidates = find(consolidated_error <= distance_to_beat);
            best_error_dev_candidates = find(consolidated_error_dev <= distance_dev_to_beat);
            best_time_dev_candidates = find(consolidated_time_dev <= time_dev_to_beat);

            pre_candidates = intersect(best_error_candidates,best_error_dev_candidates);
            Best_Candidates = intersect(pre_candidates,best_time_dev_candidates);

            chain = [];
            for k = Best_Candidates,
                [hits,best_local_time] = max(hist(best_times(:,k),[1:max(best_times(:,k))]));
                chain = cat(2,chain,best_local_time);
            end


            % Store the winners:
            Best_Entities = Best_Candidates;
            Best_Times = chain;
            associations{m_index,o_index,p_index} = cat(1,Best_Entities,Best_Times);
        end
    end
end

toc

% ----------------------------------------------------------------------- %

function [matrix] = ccm_to_matrix(ccm)

% Converts a Cell-Cell-Matrix (CCM) into a 4-dimensional matrix
% C1 is formed along the 4th dimension
% C2 is formed along the 3rd (Z) dimension
% M is already formed along the 1st (X) and 2nd (Y) dimensions

% Find maximum lengths:
c1_max = length(ccm);
c2_max = 0;
row_max = 0;
col_max = 0;

for c1_index = 1:length(ccm),
    cm = ccm{c1_index};
    if length(cm) > c2_max,
        c2_max = length(cm);
    end
    for c2_index = 1:length(cm),
        m = cm{c2_index};
        [rows,cols] = size(m);
        if rows > row_max,
            row_max = rows;
        end
        if cols > col_max,
            col_max = cols;
        end
    end
end

% Pre-allocate the matrix, pad with NaN's
matrix = zeros(row_max,col_max,c2_max,c1_max) + NaN;

% Populate the matrix
for c1_index = 1:length(ccm),
    cm = ccm{c1_index};
    for c2_index = 1:length(cm),


        m = cm{c2_index};
        [current_rows,current_cols] = size(m);
        matrix(1:current_rows,1:current_cols,c2_index,c1_index) = m;
    end
end

% ----------------------------------------------------------------------- %

function [condensed_template] = condenser(template,intra_synonyms,duration)

tic

% template must be a 4-D Matrix
% intra_synonyms must be a 3-D Cell
% duration: 1 = shortest, 2 = mean, 3 = longest
%           controls the size standard by which each example is to be compared against

[m,n,o,p] = size(template);

% Find desired durations
for p_index = 1:p,
    for o_index = 1:o,
        example = template(:,:,o_index,p_index);
        vert_slice = example(:,1);
        I = find(finite(vert_slice));
        example_lengths(o_index,p_index) = length(I);
    end
    [min_lengths(p_index),min_example(p_index)] = min(example_lengths(:,p_index));
    % find mean:
    theoretical_mean = mean(example_lengths(:,p_index));
    % calculate error offsets:
    error_line = abs((theoretical_mean-example_lengths(:,p_index))./example_lengths(:,p_index));
    [Y,I] = sort(error_line);
    mean_lengths(p_index) = example_lengths(I(1),p_index);
    mean_example(p_index) = I(1);
    [max_lengths(p_index),max_example(p_index)] = max(example_lengths(:,p_index));
end

if duration == 1, % SHORTEST
    target_lengths = min_lengths;
    target_examples = min_example;
end
if duration == 2, % MEAN
    target_lengths = mean_lengths;
    target_examples = mean_example;
end
if duration == 3, % LONGEST


    target_lengths = max_lengths;
    target_examples = max_example;
end

% pre-allocation
condensed_template = repmat(NaN,[max(target_lengths) n 3 p]);

for p_index = 1:p,
    p_index
    target = target_examples(p_index);
    duration = target_lengths(p_index);
    for m_index = 1:duration,
        associations = intra_synonyms{m_index,target,p_index};
        examples = associations(:,1);
        locations = associations(:,2);
        % Individual feature differentiation:
        for n_index = 1:n, % n = feature count
            feature_list = [];
            for k = 1:length(examples),
                feature_list = cat(1,feature_list,template(locations(k),n_index,examples(k),p_index));
                min_feature = min(feature_list);
                mean_feature = mean(feature_list);
                max_feature = max(feature_list);
                condensed_template(m_index,n_index,1,p_index) = min_feature;
                condensed_template(m_index,n_index,2,p_index) = mean_feature;
                condensed_template(m_index,n_index,3,p_index) = max_feature;
            end
        end
    end
end

toc

% ----------------------------------------------------------------------- %

function [features,associations,times] = final_template(raw_template,raw_associations)

% raw_template must be a 4-D matrix
% raw_associations must be a 3-D cell array

tic

[m,n,o,p] = size(raw_template);

% Find maximum durations of each example
for p_index = 1:p,
    for o_index = 1:o,
        example = raw_template(:,:,o_index,p_index);
        vertical_slice = example(:,1);


        valid_slice = vertical_slice(finite(vertical_slice));
        duration = length(valid_slice);
        durations(o_index,p_index) = duration;
    end
    mean_durations(p_index) = mean(durations(:,p_index));
end

[m2,o2,p2] = size(raw_associations);

max_col = 0;
for p2_index = 1:p2,
    for o2_index = 1:o2,
        for m2_index = 1:m2,
            retrieved = raw_associations{m2_index,o2_index,p2_index};
            if isempty(retrieved) ~= 1,
                sample = retrieved(1,:);
                if length(sample) > max_col,
                    max_col = length(sample);
                end
            end
        end
    end
end

k = 1;
for p_index = 1:p,
    p_index
    for o_index = 1:o,
        max_duration = durations(o_index,p_index);
        for m_index = 1:max_duration,
            features(k,:) = raw_template(m_index,:,o_index,p_index);
            retrieved = raw_associations{m_index,o_index,p_index}(1,:);
            padding = zeros(1,max_col-length(retrieved));
            associations(k,:) = cat(2,raw_associations{m_index,o_index,p_index}(1,:),padding);
            normalized_time = (raw_associations{m_index,o_index,p_index}(2,:))./mean_durations(retrieved(1,:));
            times(k,:) = cat(2,normalized_time,padding);
            k = k + 1;
        end
    end
end

toc

% ----------------------------------------------------------------------- %

function [output] = statistics(DataVector)

% This function calculates the mean, standard deviation,


% gaussian minimum, gaussian maximum, actual minimum,
% actual maximum, and a 'normal distribution figure of merit'
% INPUT MUST BE A ONE-DIMENSIONAL VECTOR
% OUTPUT WILL BE A ROW VECTOR

DataMean = mean(DataVector);
Sdev = std(DataVector);
Gmin = DataMean - 3*Sdev;
Gmax = DataMean + 3*Sdev;
Amin = min(DataVector);
Amax = max(DataVector);
normality = 1 - abs(DataMean-median(DataVector))/DataMean;

output = [DataMean Sdev Gmin Gmax Amin Amax normality];
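
% A minimal usage sketch (illustration only; the variable name and data are
% arbitrary examples):
%
%   frame_energies = [0.2 0.4 0.3 0.5 0.1];
%   stats = statistics(frame_energies);
%
% returns [mean sdev mean-3*sdev mean+3*sdev min max normality], i.e.
% approximately [0.30 0.16 -0.17 0.77 0.10 0.50 1.00] for this data.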