INTRODUCTION
The objective of this thesis is to research and develop prosodic
features for discriminating proper name uses in an alerting context
(e.g., John, can I have that book?) from a referential context
(e.g., I saw John yesterday). Prosodic measurements based on pitch
and energy are analyzed to introduce new prosody-based features
into the Wake-Up-Word Speech Recognition system. In the process of
finding the prosodic features, an innovative data collection method
was designed and developed.
In a conventional automatic speech recognition system, users are
required to physically activate the recognition system by clicking
a button or by manually starting the application. The Wake-Up-Word
Speech Recognition system invented by Kpuska changes the way people
activate their systems by enabling users to do so with their voice
alone. The Wake-Up-Word Speech Recognition system will eventually
further improve the way people use speech recognition systems by
enabling speech-only interfaces.
In the Wake-Up-Word Speech Recognition system, a word or phrase
is used as a Wake-Up-Word (WUW), indicating to the system that the
user requires its attention (e.g., an alerting context). Any user
can activate the system by uttering the WUW (e.g., Operator), which
will enable the application to accept the following command (e.g.,
Next slide please). Since the same word may occur in a referential
context, where no attention from the system is needed, it is
important to discriminate accurately between the two. This use of
the same word is referred to as a non-Wake-Up-Word (nonWUW)
context. The following examples further demonstrate the use of the
word Operator in those two contexts:
Example sentence 1: Operator, please go to the next slide.
Example sentence 2: We are using the word operator as the WUW.
The cases depicted above indicate different user intentions. In
the first example, the word operator is used as a way to alert the
system and get its attention. In the second example, the same word,
operator, is used to refer to the word itself, hence the term
referential context. The current Wake-Up-Word Speech Recognition
system implements only the pre- and post-WUW silence as a prosodic
feature to differentiate the alerting and referential contexts. In
this thesis, pitch- and energy-based prosodic features are used.
The problem of general prosodic analysis is introduced in Section
1.1.
In Chapter 2, the use of pitch as a prosodic feature is
described. Pitch in general represents the intonation of speech,
and intonation is used to convey linguistic and paralinguistic
information (Lehiste, 1970). The definition and characteristics of
pitch are covered in Section 2.1. In Section 2.2, a pitch
estimation method named eSRFD (Enhanced Super Resolution
Fundamental Frequency Determinator) (Bagshaw, 1994) is introduced.
Finally, Section 2.3 presents the derivation of multiple
pitch-based features from pitch measurements to find the best
feature for discriminating the WUW used in an alerting context
from a referential one.
In Chapter 3, an additional prosodic feature based on energy is
described. The definition of prominence, an important prosodic
feature based on energy and pitch, and its characteristics are
covered in Section 3.1. In the following Section 3.2, the
computation of energy is presented. Finally, in Section 3.3, the
derivation of multiple energy features from the energy measurement
is presented and analyzed.
In Chapter 4, an innovative approach to speech data collection
is presented. After a number of prosodic analysis experiments
conducted using the WUWII Corpus (Tudor, 2007), validation of the
obtained results on a different data set was deemed necessary.
Since, to our knowledge, no specialized speech database is
available, Dr. Wallace's idea of collecting data from movies was
adopted. We designed a system which extracts speech from the audio
channel and, if necessary, video information from recorded media
(e.g., DVDs) of movies and/or TV series. This project is currently
under development by Dr. Kpuska's VoiceKey Group.
The problem definition and system introduction are explained
in Section 4.1, followed by the system design in Section 4.2.
1.1 Prosodic Analysis
The word prosody refers to the intonational and rhythmic aspects
of a language (Merriam-Webster Dictionary). Its etymology comes
from ancient Greek, where it referred to a song sung with
instrumental music. Later, the word was used for the science of
versification and the laws of meter, governing the modulation of
the human voice in reading poetry aloud. In modern phonetics the
word prosody most often refers to those properties of speech that
cannot be derived from the segmental sequence of phonemes
underlying human utterances (William J. Hardcastle, 1997).
Based on the phonological aspect, prosody may be classified
into structure, tune, and prominence.
1. Prosodic structure refers to the noticeable breaks or
disjunctures between words in sentences, which can also be
interpreted as the duration of the silence between words as a
person speaks. This factor has been considered in the current
Wake-Up-Word Speech Recognition system, where a minimal silence
period before and after the WUW must be present. The silence period
before the WUW is usually longer than the average silence period of
a nonWUW or other parts of the sentence.
2. Tune refers to the intonational melody of an utterance
(Jurafsky &amp; Martin), which can be quantified by the pitch
measurement, also known as the fundamental frequency of speech. The
details of pitch characteristics, the pitch estimation algorithm,
and the usage of pitch features are presented and explained in
Chapter 2.
3. Finally, prominence includes the measurement of stress and
accent in speech. Prominence is measured in our experiments using
the energy of the sound. The details of energy computation, feature
derivation based on energy, and experimental results are presented
in Chapter 3.
PITCH FEATURES
In this chapter, the intonational melody of an utterance,
computed using the pitch measurement, is described. The pitch
feature, also referred to as the fundamental frequency, and a
comparison of various pitch estimation algorithms are covered in
Section 2.1. Based on the results from multiple fundamental
frequency determination algorithms (FDAs), the eSRFD (Enhanced
Super Resolution Fundamental Frequency Determinator) is selected as
the algorithm of choice to perform the pitch estimation. The
details of the eSRFD algorithm are covered in Section 2.2. The
derivation of multiple pitch-based features and their performance
evaluations are covered in Section 2.3.
2.1 Pitch and Pitch Estimation Methods
Intonation is one of the prosodic features that may contain the
key information for discriminating the referential context from
the alerting context. The intonation of speech is strictly
interpreted as the ensemble of pitch variations in the course of an
utterance (Hart, 1975). Tonal languages such as Mandarin Chinese
have lexical forms that are distinguished by different levels or
patterns of pitch of a particular phoneme. In contrast, pitch in
intonation languages, such as English, the Germanic languages, the
Romance languages, and Japanese, is used syntactically. In
addition, the intonation patterns in intonation languages are
grouped over a number of words, called intonation groups.
Intonation groups of words are usually uttered in one single
breath. The pitch measurement in intonation languages reveals the
emotion of a person and the intention of his/her speech. For
example:
Can you pass me the phone?
The pattern of continuously rising pitch over the last three
words in the above sentence indicates a request.
In strict terms, pitch is defined as the fundamental frequency,
or fundamental repetition rate, of a sound. The typical pitch range
is 60-200 Hz for an adult male and 200-400 Hz for adult females and
children. Contraction of the vocal folds produces a relatively high
pitch and, vice versa, expanded vocal folds produce a lower pitch.
This explains why a person's pitch rises when he/she gets nervous
or surprised. The reason a male usually has a lower pitch than
females and children can also be explained by the fact that males
usually have longer and larger vocal folds.
After years of development, pitch estimation methods can be
classified into the following three categories:
1. Frequency-domain methods, such as CFD (cepstrum-based F0
determinator) and HPS (harmonic product spectrum), use a
frequency-domain representation of the speech signal to find the
fundamental frequency.
2. Time-domain methods, such as FBFT (feature-based F0 tracker)
(Phillips, 1985), which uses perceptually motivated features, and
PP (parallel processing method), produce fundamental frequency
estimates by analyzing the waveform in the time domain.
3. Cross-correlation methods, such as IFTA (integrated F0
tracking algorithm) and SRFD (super resolution F0 determinator),
use a waveform similarity metric based on a normalized
cross-correlation coefficient.
The eSRFD (Enhanced Super Resolution Fundamental Frequency
Determinator) method (Bagshaw, 1994) was chosen to extract the
pitch measurement for the Wake-Up-Word because of its high overall
accuracy. According to Bagshaw's experiments, the eSRFD algorithm
achieves a combined voiced and unvoiced error rate below 17% and
low-gross fundamental frequency error rates of 2.1% and 4.2% for
male and female speech, respectively. Figure 2.1 and Figure 2.2
below show the error rate comparison charts between eSRFD and other
FDAs for male and female voices, respectively.
Figure 2.1 FDA Evaluation Chart: Male Speech. Reproduced from
(Bagshaw, 1994)
In Figure 2.1 and Figure 2.2, the purple bars indicate the
low-gross F0 error, which refers to the halving error, where the
pitch has been estimated wrongly at a value about half of the
actual pitch. The green bars represent the high-gross F0 error,
which refers to the doubling error, where the pitch has been
estimated wrongly at a value about twice the actual pitch. The
voiced error, represented by red bars, refers to unvoiced frames
misidentified as voiced ones by the FDA. Finally, the unvoiced
error, represented by blue bars, means that voiced data has been
misidentified as unvoiced.
Figure 2.2 FDA Evaluation Chart: Female Speech. Reproduced from
(Bagshaw, 1994)
Figure 2.1 and Figure 2.2 refer to the male and female
fundamental frequency evaluation charts. They show that the eSRFD
algorithm achieves the lowest overall error rate. This result was
confirmed in the more recent study of (Veprek &amp; Scordilis,
2002). Consequently, eSRFD has been chosen as the FDA in our
project.
2.2 eSRFD Frequency Determinator Algorithm
The eSRFD is an enhanced version of SRFD (Medan, 1991). The
program flow chart of the eSRFD FDA is illustrated in Figure
2.3.
The theory behind the SRFD algorithm is to use a normalized
cross-correlation coefficient to quantify the degree of similarity
between two adjacent, non-overlapping sections of speech. In eSRFD,
a frame is divided into three consecutive sections instead of two
as in the original SRFD algorithm.
At the beginning, the sampled waveform is passed through a
low-pass filter to remove signal noise. The sampled utterance is
then divided into non-overlapping frames of 6.5 ms length
(tinterval = 6.5 ms). Each frame contains a set of samples, SN,
which is divided into three consecutive segments, each containing
an equal number of samples, n, where n varies with the candidate
fundamental period. The segmentation is defined by Equation 2-1
below and further described in Figure 2.4 below.
Figure 2.3 eSRFD flow chart
Equation 2-1
Figure 2.4 Analysis segments of the eSRFD FDA
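The framing and three-section layout described above can be sketched as follows; this is a minimal illustration, in which the frame reference index t, the sampling rate, and the handling of frames near the signal boundaries are assumptions rather than details taken from Bagshaw's implementation.

```python
import numpy as np

FRAME_INTERVAL_S = 0.0065  # 6.5 ms non-overlapping frames, as in the text

def frame_starts(num_samples, fs):
    """Start indices of the 6.5 ms non-overlapping analysis frames."""
    step = int(round(FRAME_INTERVAL_S * fs))
    return list(range(0, num_samples - step + 1, step))

def segments(signal, t, n):
    """Three consecutive sections of n samples around reference index t.

    The layout follows the description of eSRFD: x precedes the frame
    reference, y and z follow it.  Returns None for any section that
    falls outside the signal (such frames are labeled unvoiced).
    """
    x = signal[t - n:t] if t - n >= 0 else None
    y = signal[t:t + n] if t + n <= len(signal) else None
    z = signal[t + n:t + 2 * n] if t + 2 * n <= len(signal) else None
    return x, y, z
```

The segment length n is not fixed: it is swept over the candidate fundamental periods, so the three sections are re-cut for every candidate.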
In eSRFD, each frame is processed by a silence detector, which
labels the frame as unvoiced if the sum of the absolute values of
xmin, xmax, ymin, ymax, zmin, and zmax is smaller than a preset
value (e.g., a 50 dB signal-to-noise level); conversely, the frame
is considered voiced if that sum is larger than the preset value.
No fundamental frequency search is performed if the frame is marked
as unvoiced. In cases where at least one of the segments xn, yn, or
zn is not defined, which usually happens at the beginning and the
end of the speech file, the frames are labeled as unvoiced and no
FDA is applied to them.
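The silence test just described can be sketched as below; the threshold value is an assumption (the text quotes only an example preset level), and undefined boundary segments are represented here as None.

```python
import numpy as np

def is_unvoiced(x, y, z, threshold):
    """Silence detector described above: the frame is unvoiced when any
    of the three sections is undefined, or when the summed absolute
    peak values of x, y and z fall below the preset level."""
    if x is None or y is None or z is None:
        return True  # boundary frames: no FDA is applied
    peak_sum = sum(abs(s.min()) + abs(s.max()) for s in (x, y, z))
    return peak_sum < threshold
```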
If the frame is not labeled as silence, then candidate values
for the fundamental period are searched over values of n within the
range Nmin to Nmax using the normalized cross-correlation
coefficient Px,y(n), as described by Equation 2-2.
Equation 2-2
In Equation 2-2, the decimation factor L is used to lower the
computational load of the algorithm. Smaller L values allow higher
resolution but also increase the computational load of the FDA;
larger L values produce faster computation with a lower-resolution
search. L is set to 1 here, since the purpose of this research is
to find as accurate a relationship as possible between pitch
measurements in WUW words; computational speed is considered
secondary and thus is not taken into account. However, the variable
L will be reconsidered when this algorithm is integrated into the
WUW Speech Recognition System.
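The candidate search can be sketched with the standard form of a normalized cross-correlation under decimation by L; this is an illustration of the general technique, not a reproduction of Equation 2-2, and the peak-picking step is simplified here to plain thresholding, which is an assumption.

```python
import numpy as np

def norm_xcorr(a, b, L=1):
    """Normalized cross-correlation between two equal-length sections,
    decimated by factor L (L = 1 keeps full resolution, as in the text)."""
    a, b = np.asarray(a, float)[::L], np.asarray(b, float)[::L]
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def period_candidates(signal, t, n_min, n_max, threshold, L=1):
    """Candidate fundamental periods n in [n_min, n_max] whose
    coefficient P_x,y(n) exceeds the (adaptive) threshold T_srfd."""
    out = []
    for n in range(n_min, n_max + 1):
        if t - n < 0 or t + n > len(signal):
            continue  # section undefined at the signal boundary
        p = norm_xcorr(signal[t - n:t], signal[t:t + n], L)
        if p > threshold:
            out.append((n, p))
    return out
```

On a perfectly periodic signal the true period and its multiples all correlate strongly, which is exactly why the later scoring and 0.77-biased selection stages are needed.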
Figure 2.5 Analysis segments for Px,y(n) in the eSRFD
The candidate values of the fundamental period of a frame are
found by locating peaks in the normalized cross-correlation
Px,y(n). If this value exceeds a specified threshold, Tsrfd, the
frame is further considered a voiced candidate. This threshold is
adaptive and depends on the voicing classification of the previous
frame and three preset parameters. The definition of Tsrfd is given
in Equation 2-3. If the previous frame is unvoiced or silent, Tsrfd
is equal to 0.88. If the previous frame is voiced, Tsrfd is equal
to the larger of 0.75 and 0.85 times the Px,y value of the previous
frame. The threshold is adjusted because the present frame is more
likely to be voiced if the previous frame is voiced as well.
Tsrfd = 0.88, if the previous frame is unvoiced or silent
Tsrfd = max(0.75, 0.85 · Px,y of the previous frame), if the
previous frame is voiced
Equation 2-3
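The adaptive threshold can be sketched directly from the values quoted above (0.88, 0.75, and 0.85 are the three preset parameters named in the text):

```python
def adaptive_threshold(prev_voiced, prev_pxy=None,
                       t_unvoiced=0.88, floor=0.75, decay=0.85):
    """T_srfd as described above: 0.88 after an unvoiced/silent frame,
    otherwise the larger of 0.75 and 0.85 times the previous frame's
    P_x,y value."""
    if not prev_voiced:
        return t_unvoiced
    return max(floor, decay * prev_pxy)
```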
If no candidates for the fundamental period are found in the
frame, the frame is reclassified as unvoiced and no further
processing is applied to it. Otherwise, the frame is classified as
voiced and the optimal candidate is found as described next.
After computing the first normalized cross-correlation
coefficient, Px,y, the second normalized cross-correlation
coefficient, Py,z, is calculated for the voiced frame. Py,z is
described by Equation 2-4 below.
Equation 2-4
After the second normalized cross-correlation, a score is given
to each candidate. If a candidate pitch value of a frame has both
Px,y and Py,z larger than Tsrfd, a score of 2 is given to the
candidate. If only Px,y is above Tsrfd, a score of 1 is assigned. A
higher score indicates a higher probability that the candidate
represents the fundamental period of the frame. After the candidate
scores are assigned, if there are one or more candidates with a
score of 2, all candidates with a score of 1 in that frame are
removed from the candidate list. If there is exactly one candidate
with a score of 2, that candidate is taken as the best estimate of
the fundamental period of the particular frame. If there are
multiple candidates with a score of 1 but none with a score of 2,
an optimal fundamental period is sought from the remaining
candidates.
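The scoring and pruning rules can be sketched as follows; representing the Py,z values as a dictionary keyed by candidate period is an illustrative choice, not the thesis implementation, and the input candidates are assumed to have already passed the Px,y threshold test.

```python
def score_candidates(cands, pyz, t_srfd):
    """Score each period candidate as described above.

    `cands` is a list of (n, p_xy) pairs that already satisfy
    P_x,y(n) > T_srfd; `pyz` maps n -> P_y,z(n).  Score 2: both
    coefficients exceed T_srfd; score 1: only P_x,y does.  If any
    candidate scores 2, the score-1 candidates are discarded.
    """
    scored = [(n, 2 if pyz.get(n, 0.0) > t_srfd else 1) for n, _ in cands]
    if any(s == 2 for _, s in scored):
        scored = [(n, s) for n, s in scored if s == 2]
    return sorted(scored)  # ascending fundamental period
```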
In the case of multiple candidates with a score of 1 and none
with a score of 2, the candidates are sorted in ascending order of
fundamental period. The last candidate on the list, which has the
largest fundamental period, represents a fundamental period of nM,
while the mth candidate represents a fundamental period of nm.
Figure 2.6 Analysis segments for q(nm) in the eSRFD
Then the third normalized cross-correlation coefficient, q(nm),
between two sections of length nM spaced nm apart, is calculated
for each candidate. Equation 2-5 describes the normalized
cross-correlation coefficient q(nm) used in this case.
Equation 2-5
After the third normalized cross-correlation coefficient is
generated, the q(nm) of the first candidate on the list is assumed
to be the optimal value. If a subsequent q(nm), multiplied by 0.77,
is larger than the current optimal value, the candidate for which
that q(nm) was computed becomes the new optimum. The same rule is
applied through the list of candidates, yielding the optimal
candidate value.
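The 0.77-biased selection among the remaining score-1 candidates can be sketched as below; `q` maps each candidate period to its q(nm) value and is assumed precomputed via Equation 2-5.

```python
def pick_optimal(periods, q, bias=0.77):
    """Choose among score-1 candidates as described above: candidates
    are visited in ascending period order, and a later candidate
    replaces the current optimum only if bias * q(n_m) exceeds it,
    which favours shorter (higher-frequency) candidates."""
    periods = sorted(periods)
    best_n, best_q = periods[0], q[periods[0]]
    for n in periods[1:]:
        if bias * q[n] > best_q:
            best_n, best_q = n, q[n]
    return best_n
```

The bias factor makes a longer-period candidate win only when its correlation is markedly stronger, which counteracts the tendency of period multiples to correlate almost as well as the true period.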
In the case where only one candidate has a score of 1 and no
candidate has a score of 2, the probability that the candidate is
the true fundamental period of the frame is low. In such a case, if
both the previous and the subsequent frames are silent, the current
frame is an isolated frame and is reclassified as silent. If either
the previous or the next frame is voiced, we assume the candidate
of the current frame is optimal and it defines the fundamental
period of the current frame.
The above algorithm has a high probability of misidentifying
voiced frames as unvoiced or silent frames. In order to counteract
this imbalance, a bias is applied when all three of the conditions
below are satisfied:
The two previous frames were voiced frames.
The fundamental period of the previous frame is not temporarily
on hold.
The fundamental frequency of the previous frame is less than 7/4
times the fundamental frequency of its next voiced frame and
greater than 5/8 of the next frame.
After the fundamental frequency is obtained, the pitch contour
is passed through a median filter in order to further minimize the
occurrence of doubling or halving errors.
The median filter has a default length of 7, but the size is
decreased to 5 or 3 when there are fewer than 7 consecutive voiced
frames. Figure 2.7 below shows an example of doubling points being
corrected by the median filter. In Figure 2.7, the top row shows
the pitch measurement generated by the eSRFD FDA and the bottom row
shows the measurement corrected by the median filter. As can be
seen from the figure, the two points marked as doubling errors were
fixed by the median filter.
Figure 2.7 Median filter example
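The shrinking median filter can be sketched as below; the handling of the ends of a voiced run with a shrunken centred window is an assumption, since the text does not specify the edge behaviour.

```python
import numpy as np

def median_smooth(pitch_run, max_width=7):
    """Median-filter one run of consecutive voiced pitch values.

    The window width is 7 by default, reduced to 5 or 3 when the run
    is shorter, as described above; isolated doubling/halving points
    are replaced by the local median.
    """
    width = max_width
    while width > len(pitch_run) and width > 3:
        width -= 2  # shrink 7 -> 5 -> 3 for short voiced runs
    half = width // 2
    out = []
    for i in range(len(pitch_run)):
        lo, hi = max(0, i - half), min(len(pitch_run), i + half + 1)
        out.append(float(np.median(pitch_run[lo:hi])))
    return out
```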
We applied the above pitch estimation method to the WUWII
(Wake-Up-Word II) corpus, which contains approximately 3410
utterances, every one of which contains at least one WUW. Figure
2.8 displays a sample utterance containing the following
sentence:
Hi. You know, I have this cool wildfire service and, you know,
I'm gonna try to invoke it right now. Wildfire
Figure 2.8 Example, WUWII00073_009.ulaw
In Figure 2.8, the first row shows the waveform of the speech,
the second row shows the pitch estimate from the eSRFD FDA, the
third shows the pitch estimate after the median filter, and the
last row shows the spectrogram of the speech. The WUW of this
sentence is Wildfire, which is the section delineated between the
two red lines.
2.3 Pitch Features
The pattern of the fundamental frequency contour of an utterance
waveform represents the intonation of the speech. Since, to the
best of our knowledge, the problem of discriminating the use of
words in an alerting context from a referential context has never
been addressed before, a specialized corpus containing WUWs is
necessary. In this project, the corpus named WUWII was chosen. The
WUWII corpus contains 3410 sample utterances, and each utterance
contains at least one of five different WUWs: Wildfire, Operator,
ThinkEngine, Onword, and Voyager.
Our hypothesis is that the intonation rises on the WUW; thus
there should be an increase in the average pitch and the maximum
pitch of the WUW sections compared to the nonWUW sections.
Based on this hypothesis, the average pitch and maximum pitch of
the WUW are considered and the following twelve features are
derived.
1. APW_AP1SBW: The relative change of the average pitch of WUW
to the average pitch of the previous section just before WUW.
2. AP1sSW_AP1SBW: The relative change of the average pitch of
the first section of WUW to the average pitch of previous section
just before WUW.
3. APW_APALL: The relative change of the average pitch of WUW to
the average pitch of the entire speech sample excluding the WUW
sections.
4. AP1sSW_APALL: The relative change of the average pitch of the
first section of the WUW to the average pitch of the entire speech
sample excluding the WUW sections.
5. APW_APALLBW: The relative change of the average pitch of the
WUW to the average pitch of entire speech sample before the
WUW.
6. AP1sSW_APALLBW: The relative change of the average pitch of
the first section of the WUW to the average pitch of the entire
speech sample before the WUW.
7. MaxP_MaxP1SBW: The relative change of the maximum pitch in
the WUW sections to the maximum pitch in the previous section just
before the WUW.
8. MaxP1sSW_MaxP1SBW: The relative change of the maximum pitch
in the first section of the WUW to the maximum pitch of the
previous section just before the WUW.
9. MaxPW_MaxPAll: The relative change of the maximum pitch of
the WUW to the maximum pitch of the entire speech sample excluding
the WUW sections.
10. MaxP1sSW_MaxPAll: The relative change of the maximum pitch
of the first section of the WUW to the maximum pitch of the entire
speech sample excluding the WUW sections.
11. MaxP1sSW_MaxPAllBW: The relative change of the maximum
pitch in the first section of the WUW to the maximum pitch of the
entire speech before the WUW.
12. MaxPW_MaxPAllBW: The relative change of the maximum pitch
in the WUW sections to the maximum pitch of the entire speech
sample before the WUW.
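As an illustration, the first feature (APW_AP1SBW) might be computed as below; the (A - B)/B form of the relative change, the per-frame representation of the pitch track (0 for unvoiced frames), and the (start, end) frame ranges are assumptions made for this sketch.

```python
import numpy as np

def apw_ap1sbw(pitch, wuw, before):
    """Sketch of feature 1 (APW_AP1SBW): relative change of the average
    pitch inside the WUW region with respect to the average pitch of
    the section just before it.

    `pitch` is the per-frame pitch track with 0 for unvoiced frames;
    `wuw` and `before` are (start, end) frame ranges assumed given by
    the WUW segmentation.  Only voiced frames enter the averages.
    """
    def voiced_avg(rng):
        seg = np.asarray(pitch[rng[0]:rng[1]], float)
        voiced = seg[seg > 0]
        return voiced.mean() if voiced.size else float("nan")
    a, b = voiced_avg(wuw), voiced_avg(before)
    return (a - b) / b  # relative change, assumed (A - B) / B
```

The remaining eleven features differ only in which region pair is compared and in whether the average or the maximum of the voiced pitch values is taken.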
In the presented experiment, no significantly discriminating
pattern was found in the results. The results of the WUW
experiments using the pitch features defined above are shown in
Table 2-1. The best feature across all WUWs is the relative change
of the maximum pitch of the WUW to the maximum pitch of the section
just before the WUW. The results could be improved if clear
syllabic boundaries were defined; however, syllable boundaries in
English are not clearly defined. The details of the results are
shown in Appendix A.
Besides the above features, other approaches, such as pitch
measurement patterns, can also be used to discriminate WUWs from
nonWUWs. This is one of the current research topics of Raymond
Sastraputera, a graduate student working with Dr. Kepuska. The
potential approaches to pitch-based features are covered in
Chapter 5.
WUW: All

Feature              Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
APW_AP1SBW                 1415      726      51        0       0      689      49
AP1sSW_AP1SBW              1415      735      52        0       0      680      48
APW_APALL                  2282      947      41        0       0     1335      59
AP1sSW_APALL               2282      996      44        2       0     1284      56
APW_APALLBW                2188      962      44        0       0     1226      56
AP1sSW_APALLBW             2188     1003      46        2       0     1183      54
MaxP_MaxP1SBW              1415      948      67       53       4      414      29
MaxP1sSW_MaxP1SBW          1415      719      51       54       4      642      45
MaxPW_MaxPAll              2282     1020      45      109       5     1153      51
MaxP1sSW_MaxPAll           2282      716      31      213       9     1353      59
MaxP1sSW_MaxPAllBW         2188     1069      49      111       5     1008      46
MaxPW_MaxPAllBW            2188     1003      35        2      10     1183      55

Table 2-1 Pitch Features Result, All WUWs
ENERGY FEATURES
As mentioned in Section 1.1, prominence can be measured using
the energy of the utterance. If pitch represents the intonation of
speech, then energy represents its stress. In this chapter, the
same concept that was applied to pitch in Chapter 2 is used with
energy to generate a similar feature set.
3.1 Energy Characteristic
In an English sentence, certain syllables are more prominent
than others; these are called accented syllables. Accented
syllables are usually either louder or longer than the other
syllables in the same word. In English, a different position of the
accented syllable in the same word is used to differentiate the
meaning of the word. For example, the word object used as a noun,
with the accent on the first syllable, compared to the same word
object used as a verb, with the accent on the second syllable
(Cutler, 1986), has a different placement of the accented syllable,
indicated by the stress mark in the phonetic transcription. If this
idea of accented speech is applied to the entire sentence instead
of a single word, it may provide additional clues about the use of
a word of interest and its meaning within the sentence.
Classifying the factors that model a speaker's speech and how a
speaker chooses to accentuate a particular syllable within the
whole sentence is a very complex problem. However, the measurement
of accented syllables can be done simply by using the energy of the
speech signal and its pitch change.
3.2 Energy Extraction
The energy of a speech signal can be expressed through
Parseval's theorem, as in Equation 3-1 below.
E = Σn |x[n]|² = (1/2π) ∫ |X(ω)|² dω
Equation 3-1
In Equation 3-1, the energy of a signal is defined in both the
time and frequency domains. Both |x[n]|² and |X(ω)|² represent an
energy density, which can be thought of as energy per unit of time
and energy per unit of frequency, respectively.
A fixed frame size (6.5 ms), the same as in the pitch
computation, is used here as well. After the energy is calculated
for all samples of each utterance in the WUWII corpus, the energy
features are computed in a similar fashion as the pitch features of
Section 2.3, as described in the next section.
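A small sketch of the per-frame energy computation follows; the frequency-domain form is included to illustrate Parseval's relation, which for the DFT matches the time-domain sum up to the 1/N normalization of the transform.

```python
import numpy as np

def frame_energy(frame):
    """Time-domain energy of one 6.5 ms analysis frame: sum of |x[n]|^2."""
    frame = np.asarray(frame, float)
    return float(np.sum(frame * frame))

def spectral_energy(frame):
    """The same energy computed in the frequency domain; by Parseval's
    theorem for the DFT, sum |X[k]|^2 / N equals sum |x[n]|^2."""
    X = np.fft.fft(np.asarray(frame, float))
    return float(np.sum(np.abs(X) ** 2) / len(frame))
```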
3.3 Energy Features
As in the previous experiments with pitch features, 12
energy-based features were computed and tested. The features are
represented as relative changes, as defined in Equation 3-2.
Relative change between A and B = (A − B) / B
Equation 3-2
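The relative change and the tallies reported in the results tables (Pt > 0, Pt = 0, Pt < 0) can be sketched as follows; the (A - B)/B reading of the relative change is an assumption of this sketch.

```python
def relative_change(a, b):
    """Relative change of A with respect to B, read here as (A - B) / B,
    so a positive value means A exceeds the reference B."""
    return (a - b) / b

def tally_signs(values):
    """Counts reported in the results tables: how many feature values
    are positive (Pt > 0), zero (Pt = 0) and negative (Pt < 0)."""
    pos = sum(1 for v in values if v > 0)
    zero = sum(1 for v in values if v == 0)
    neg = sum(1 for v in values if v < 0)
    return pos, zero, neg
```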
The features are listed below:
1. AEW_AE1SBW: The relative change of the average energy of the
WUW to the average energy of previous section just before the
WUW.
2. AE1sSW_AE1SBW: The relative change of the average energy of
the first section of the WUW to the average energy of previous
section just before the WUW.
3. AEW_AEAll: The relative change of the average energy of the
WUW to the average energy of the entire sample speech excluding the
WUW sections.
4. AE1sSW_AEAll: The relative change of the average energy of
the first section in the WUW to the average energy of the entire
utterance excluding the WUW sections.
5. AEW_AEAllBW: The relative change of the average energy of the
WUW to the average energy of all speech before the WUW.
6. AE1sSW_AEAllBW: The relative change of the average energy of
the first section in the WUW to the average energy of the entire
sample speech before the WUW.
7. MaxEW_MaxE1SBW: The relative change of the maximum energy in
the WUW sections to the maximum energy in the previous section just
before the WUW.
8. MaxE1sSW_MaxE1SBW: The relative change of the maximum energy
in the first section of the WUW to the maximum energy in the
previous section just before the WUW.
9. MaxEW_MaxEAll: The relative change of the maximum energy in
the WUW to the maximum energy of the entire speech sample excluding
the WUW section.
10. MaxE1sSW_MaxEAll: The relative change of the maximum energy
in the first section of the WUW to the maximum energy of the entire
speech sample excluding the WUW section.
11. MaxE1sSW_MaxEAllBW: The relative change of the maximum
energy in the first section of the WUW to the maximum energy of the
entire speech before the WUW.
12. MaxEW_MaxEAllBW: The relative change of the maximum energy
in the WUW sections to the maximum energy of the entire speech
sample before the WUW.
In this experiment, a few of the features may not be
implementable in a real-time application, since they rely on
measurements taken after the WUW of interest; however, they may
still lead to interesting conclusions. For real-time speech
recognition systems, the features that do not rely on measurements
past the WUW of interest are the most useful. Table 3-1 below shows
the results of the energy feature measurements based on all WUWs of
the WUWII corpus, namely the words Operator, ThinkEngine, Onword,
Wildfire, and Voyager. The details broken down for each word are
included in Appendix B.
WUW: All WUWs

Feature              Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
AEW_AE1SBW                 1479     1164      79        0       0      315      21
AE1sSW_AE1SBW              1479     1283      84        1       0      240      16
AEW_AEAll                  2175     1059      49        9       9     1116      51
AE1sSW_AEAll               2175     1155      53        2       0     1018      47
AEW_AEAllBW                1969     1427      72        0       0      542      28
AE1sSW_AEAllBW             1969     1562      79        3       0      404      21
MaxEW_MaxE1SBW             1479     1244      84       20       1      215      15
MaxE1sSW_MaxE1SBW          1479     1221      83       13       1      245      17
MaxEW_MaxEAll              2175     1373      63       13       1      245      17
MaxE1sSW_MaxEAll           2175     1336      61       25       1      814      37
MaxE1sSW_MaxEAllBW         1969     1209      61       16       1      744      38
MaxEW_MaxEAllBW            1969     1562      60        3       1      404      39

Table 3-1 Energy Feature Result of All WUWs
Based on the results shown in Table 3-1 above, the following
three features performed best in discriminating the WUW from other
word tokens:
AE1sSW_AE1SBW: The relative change of the average energy of the
first section of the WUW compared to the average energy of the last
section before the WUW. Using this feature, 84% of the data shows
that the average energy of the first section of the WUW is higher
than the average energy of the previous section. The result is
illustrated in Figure 3.1 below, depicting the distribution of the
feature values as well as the cumulative distribution.
Figure 3.1 Distribution and cumulative plots of the energy
feature AE1sSW_AE1SBW
MaxEW_MaxE1SBW: The relative change of the maximum energy in the
WUW sections compared to the maximum energy in the last section
before the WUW. Using this feature, 84% of the samples show that
the maximum energy in the WUW sections is higher than the maximum
energy of the previous section. The distribution of the feature
values as well as the cumulative distribution are shown in Figure
3.2 below.
Figure 3.2 Distribution and cumulative plot of the energy
feature, the maximum energy of the WUW
MaxE1sSW_MaxE1SBW: The relative change of the maximum energy of
the first section of the WUW compared to the maximum energy in the
last section before the WUW. This feature correctly discriminated
83% of cases, which exhibited a higher maximum energy in the first
section of the WUW than in the previous section. The distribution
and cumulative plots of this feature are shown in Figure 3.3.
Figure 3.3 Distribution and cumulative plot of the energy
feature, the maximum energy of the first section of the WUW
The above results are based on all the data, including all five
different WUWs. Thus, investigating each word independently may be
more appropriate. The detailed performance results for each
individual WUW are covered in Appendix B.
Linguistically, one of the more appropriate WUWs is the word
Operator. This word is also used in the current Wake-Up-Word
Speech Recognition System. Based on the results in Table 3-2, two
features show that in over 90% of the WUW cases the average or
maximum energy is higher than in the other regions of the speech.
These two features are:
AE1sSW_AE1SBW: The relative change of the average energy of the
first section of the WUW compared to the average energy of the last
section before the WUW. Using this feature, 94% of samples show the
first section of the WUW with a higher average energy than the
previous section.
AE1sSW_AEAllBW: The relative change of the average energy of the
first section of the WUW compared to the average energy of the
entire speech before the WUW. Using this feature, 91% of samples
show that the first section of the WUW has a higher average
energy.
WUW: Operator

Feature              Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
AEW_AE1SBW                  275      228      83        0       0       47      17
AE1sSW_AE1SBW               275      258      94        0       0       17       6
AEW_AEAll                   418      248      59        0       0      170      41
AE1sSW_AEAll                418      290      69        1       0      127      30
AEW_AEAllBW                 394      303      77        0       0       91      23
AE1sSW_AEAllBW              394      359      91        1       0       34       9
MaxEW_MaxE1SBW              275      240      87        1       0       34      12
MaxE1sSW_MaxE1SBW           275      243      88        0       0       32      12
MaxEW_MaxEAll               418      290      69        4       1      124      30
MaxE1sSW_MaxEAll            418      285      68        6       1      127      30
MaxE1sSW_MaxEAllBW          394      272      69        4       1      118      30
MaxEW_MaxEAllBW             394      359      68        1       1       34      30

Table 3-2 Energy Feature Result of WUW Operator
Based on the performed experiments, the WUW Wildfire achieved the best overall result. For this word, four features scored higher than 90%. The results are shown in the table below. The four best features are:
AEW_AE1SBW: the relative change of the average energy of the entire WUW compared to the average energy of the last section just before the WUW. For 90% of the samples, the average energy of the WUW is higher than that of the preceding section.
AE1sSW_AE1SBW: the relative change of the average energy of the first section of the WUW compared to the average energy of the last section before the WUW. With this feature, 93% of the samples show that the first section of the WUW has higher average energy.
MaxEW_MaxE1SBW: the relative change of the maximum energy of the WUW sections compared to the maximum energy of the last section before the WUW. With this feature, 91% of the samples show that the WUW has higher maximum energy.
MaxE1sSW_MaxEAllBW: the relative change of the maximum energy of the first section of the WUW compared to the maximum energy of all sections before the WUW. With this feature, 90% of the samples show that the first section of the WUW has higher maximum energy.
WUW: Wildfire

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
AEW_AE1SBW                  282     253     90       0      0      29     10
AE1sSW_AE1SBW               282     261     93       0      0      21      7
AEW_AEAll                   340     173     51       0      0     167     49
AE1sSW_AEAll                340     185     54       0      0     155     46
AEW_AEAllBW                 298     252     85       0      0      46     15
AE1sSW_AEAllBW              298     265     89       0      0      33     11
MaxEW_MaxE1SBW              282     258     91       8      3      16      6
MaxE1sSW_MaxEAllBW          282     253     90       2      1      27     10
MaxEW_MaxEAll               340     230     68       4      1     106     31
MaxE1sSW_MaxEAll            340     219     64       4      1     117     34
MaxE1sSW_MaxEAllBW          298     195     65       4      1      99     33
MaxEW_MaxEAllBW             298     265     62       0      1      33     36

Table A3 Energy Feature Result of WUW Wildfire
The complete results are shown in Appendix B.
From the obtained results, it can be concluded that the WUW is frequently accentuated compared to the rest of the words in the utterance.
DATA COLLECTION
In this chapter, we introduce a novel way to collect speech samples, together with the preliminary design of the data collection system.
4.1 Introduction to the data collection
After developing WUW discriminant features based on the two prosodic measurements of pitch and energy, described in Chapters 2 and 3, we realized that the data used to generate those features may not be the most suitable. The corpus used in the project was the WUWII corpus. It only provides data on the WUW in the alerting situation and does not contain data for the same word used in the referential situation. As a result, we could only analyze the changes between the alerting type of WUW and the overall sentence, without information on the same word in the referential situation. Another drawback of the current WUWII corpus is that its speech is not spontaneous. The testers were told to use the WUW to make up a sentence; under such circumstances, a tester may change the way he or she normally speaks.
In order to perform a more complete analysis, we need a corpus that includes both alerting and referential WUW contexts with naturally spoken utterances. Dr. Wallace came up with the idea of extracting audio samples from movies and TV series.
Extracting speech samples from movies and TV series has the following advantages compared to the previous data collection method:
1. The speech examples are more natural. The speech from professional actors is more natural since they tend to think and speak like a particular character and to act out the situation of the character they are depicting.
2. The data collection process costs much less, since we are not compensating individuals to record their voices. We are not currently considering copyright issues, since we use the data for scientific research purposes only.
3. A large amount of data can be collected in a short period of time once the process is fully automated.
4. The voice channel data is of CD quality. In this project, we extract speech data from recorded videos, as opposed to the conventional phone-line or cell-phone recordings contained in the WUWII corpus.
5. No manual labeling is required. We plan to use the transcripts obtained from the video channel (see System Design). The transcripts provide time stamps for all spoken sentences; thus, manual labeling is not needed.
Given these advantages, we plan to design an automatic data collection system to collect speech data suitable for prosodic analysis of proper name use in the referential context vs. the alerting (or WUW) context.
4.2 System Design
The data collection project is part of the prosodic feature analysis project, which is illustrated by the program flow chart in Figure 4.1. The prosodic feature analysis project can be divided into three sub-projects.
Figure A1 Program Flow Chart
The green boxes in Figure 4.1 represent the functions of the prosodic feature extraction and analysis project, described in Chapters 2 and 3 of this thesis. The blue boxes depict the WUW data collection project. Finally, the purple boxes represent the future project on video analysis.
In the prosodic feature analysis project, we use prosodic features generated from acoustic measurements to differentiate the context of the words. In part of the WUW data collection project, language analysis tools will be used to automatically classify the words of interest, in this case as referential or alerting. At the moment, the capabilities of this tool, RelEx, must be augmented in order to achieve this goal. The outcome of the WUW Speech Data Collection project will not only be a specialized corpus for the prosodic analysis project, but also a confirmation of the results from the prosodic analysis. The detailed program flow chart of the WUW Speech Data Collection System is shown in Figure 4.2 below.
Figure A2 WUW Audio Data Collection System Program Flow
Diagram
The inputs to the system are (1) the video file of the movie or TV series, (2) the video transcription file, which will be used if provided and otherwise extracted from the video stream, and (3) an English first names dictionary. In the case when there is no video transcription file and the subtitles are encoded into the video stream, the subtitle extractor SubRip will extract the subtitles and the time stamps of the sentences from the video stream. An example of a transcription file is provided in Figure 4.3 below.
Figure A3 Example of Video Transcription File
The transcription files provide the following information: the date and time when the file was created, the subtitle index number, the start and end time of each subtitle, and the subtitle transcription.
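A transcription file in this SubRip (.srt) layout, consisting of an index, a time span of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and the subtitle text, can be parsed with a short routine. The following is a sketch, not the actual parser used in the project:

```python
import re

# One .srt entry: index, time span line, then the text block.
ENTRY = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.S,
)

def to_seconds(ts):
    # "00:01:02,500" -> 62.5
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text):
    """Yield (index, start_sec, end_sec, subtitle_text) for each entry."""
    for m in ENTRY.finditer(text):
        yield (int(m.group(1)), to_seconds(m.group(2)),
               to_seconds(m.group(3)), m.group(4).strip())

sample = "1\n00:00:01,500 --> 00:00:03,000\nJohn, can I have that book?\n\n"
entries = list(parse_srt(sample))
```

The resulting time stamps, in seconds, are what the sentence parser and audio parser below operate on.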
The audio extractor extracts the audio channel from the video file. Then, using the English first names dictionary and the sentence transcriptions with time markers, the sentence parser, an application developed by VoiceKey team members, selects the sentences that include English first names. Figure 4.4 below shows an example of the output of the sentence parser.
Figure A4 Example of output of the sentence parser
In the next step, the audio parser uses the information from the sentence parser to extract the corresponding audio sections from the audio file produced by the media audio extractor.
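This extraction step amounts to cutting the audio channel at the subtitle time stamps. A minimal sketch using Python's standard wave module (file names are hypothetical; the project's actual audio parser is a separate program):

```python
import wave

def extract_segment(in_path, out_path, start_sec, end_sec):
    """Copy the [start_sec, end_sec] portion of a WAV file to out_path."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(int(start_sec * rate))
        frames = src.readframes(int((end_sec - start_sec) * rate))
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)   # frame count is corrected on close
        dst.writeframes(frames)
```

Each selected sentence thus becomes its own audio file, keyed by the subtitle index and time span.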
After extraction of a sentence that contains a name, RelEx is used to analyze the selected sentence. RelEx is an English-language semantic relationship extractor based on the Carnegie Mellon link parser. RelEx is able to provide sentence information on subject, object, and indirect object, as well as various word tags such as verb, gender, and noun. The current status of the WUW data collection project is the development of a rule-based or statistical pattern recognition process based on the relationship information produced by RelEx. Ultimately, the system will be able to accurately identify whether the name in the sentence is used in a WUW or nonWUW context.
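The rule-based step under development can be viewed as pattern matching over the relations RelEx emits. The sketch below is purely illustrative: the relation labels follow RelEx's general naming style, but the actual rules for this project are still being designed.

```python
def classify_name_use(name, relations):
    """Guess WUW (alerting) vs. nonWUW (referential) use of a name.

    relations: (relation, head, dependent) triples, e.g.
    ("_obj", "saw", "John").  If the name fills a subject, object, or
    indirect-object slot of a verb, treat the use as referential; a
    name outside any such relation is assumed to be a vocative
    (alerting) use.  A real classifier would need richer rules.
    """
    for rel, head, dep in relations:
        if name in (head, dep) and rel in ("_subj", "_obj", "_iobj"):
            return "nonWUW"
    return "WUW"

# "I saw John yesterday" vs. "John, can I have that book?"
label_ref = classify_name_use("John", [("_obj", "saw", "John")])
label_wuw = classify_name_use("John", [])
```

In the first call the name is a verb argument, so the use is classified as referential; in the second the name stands outside any argument relation and is taken as alerting.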
A necessary step in the automation process is to obtain precise time markers indicating the words of interest. To achieve this, one could use HTK, the Hidden Markov Model Toolkit, to perform forced alignment on the audio input. HTK was initially developed by the Machine Intelligence Laboratory (formerly known as the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED). HTK uses Hidden Markov models (HMMs), which compare the acoustic features of the incoming audio with the known acoustic features of the (typically 41) English phonemes to predict the most likely combination of phonemes for the audio, and maps the words from the lexicon dictionary. In our case, since the transcription of the sentences is known, HTK is used to map the phonemes of the known words to the corresponding time intervals. The phoneme time labels, or equivalently the word boundaries of the spoken sentence, are used to locate the WUWs or nonWUWs in time. Note that this step can also be performed by Microsoft's SDK speech recognition system, which is fully integrated into Microsoft's Vista OS. The advantage of Microsoft's system is that we do not need to train it, since the acoustic models are pre-built. However, development of an application incorporating the Microsoft SDK features is necessary. Alternatively, HTK does not require any significant integration coding; however, it does require accurate models. Automation of the described data collection process will be made possible by integrating the outputs from RelEx with the forced alignment.
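Whether produced by HTK or by the Microsoft SDK, the alignment output reduces to a list of word boundaries, so locating the name token becomes a simple lookup. A sketch over assumed (start, end, word) tuples:

```python
def locate_word(alignment, word):
    """Return (start_sec, end_sec) of the first occurrence of word in a
    forced-alignment result given as (start, end, word) tuples, or None."""
    for start, end, w in alignment:
        if w.lower() == word.lower():
            return (start, end)
    return None

# Hypothetical alignment of "Operator, please go to the next slide."
alignment = [(0.00, 0.45, "operator"), (0.45, 0.62, "please"),
             (0.62, 0.80, "go"), (0.80, 0.95, "to")]
span = locate_word(alignment, "Operator")
```

The returned span, combined with the RelEx label, gives the time-segmented WUW/nonWUW annotation described next.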
With time-segmented sentence labels of the audio stream indicating the WUW or nonWUW context, a new corpus can be generated, just like the WUWII corpus. This data will be used to perform prosodic analysis and to develop new, or refine existing, prosodic features. It is expected that further study with the new data will not only validate the current prosodic analysis results but also provide directions for developing new prosodic features. The ultimate goal is to find the prosodic patterns of the WUW, the nonWUW, and the other parts of the sentence.
Conclusion
This thesis investigated two types of prosodic features and designed an innovative data collection system.
The pitch based features in section 2.3 did not provide significant discriminating patterns. The following are potential ways to improve the performance:
1. Build a specialized corpus that contains both WUWs and nonWUWs. The speech sentences in the current corpus, WUWII, contain only WUWs and no nonWUWs. A new speech data collection system is designed in Chapter 4 in order to improve the performance of the features.
2. Use different approaches to defining pitch based features. Instead of using average and maximum pitch measurements of the WUW, the pitch contour pattern should also be considered. Since we are interested in the general pattern of WUWs rather than any specific WUW, patterns that exclude the word-specific pitch contour should be developed.
The energy based features in section 3.3 provide significant discriminating patterns. A future improvement is to quantify the level of change by comparing WUWs to nonWUWs.
The new data collection system is an ongoing project that will eventually provide sufficient data on both WUWs and nonWUWs. The data will help us research new patterns for discriminating the alerting context from the referential context.
References
AOAMedia.com. (n.d.). AoA Audio Extractor. Retrieved from AOAMedia.com.
Bagshaw, P. C. (1994). Automatic prosodic analysis for computer aided pronunciation teaching.
Campbell, M. (n.d.). Behind The Name. Retrieved from http://www.behindthename.com/
Cutler, A. (1986). Forbear is a homophone: Lexical prosody does not constrain lexical access. Language and Speech.
Hardcastle, W. J., & Laver, J. (1997). The Handbook of Phonetic Sciences. p. 640.
't Hart, J. (1975). Integrating different levels of intonation analysis. Phonetics, pp. 309-327.
Jurafsky, D., & Martin, J. H. (n.d.). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Kpuska, V. C. (2006). Leading and Trailing Silence in Wake-Up-Word Speech Recognition. Industry, Engineering & Management Systems 2006. Cocoa Beach.
Kpuska, V. WUWII Corpus.
Lehiste, I. (1970). Suprasegmentals. Cambridge, Massachusetts: The MIT Press.
Machine Intelligence Laboratory of the Cambridge University Engineering Department. (n.d.). HTK, The Hidden Markov Model Toolkit.
Medan, Y. (1991). Super resolution pitch determination of speech signals. IEEE Trans. Signal Processing ASSP-39(1), 40-48.
Merriam-Webster Dictionary. (n.d.). Merriam-Webster Dictionary.
Novamente LLC. (n.d.). RelEx Semantic Relationship Extractor. Retrieved from http://opencog.org/wiki/RelEx
Pattarapong, R., Ramdhan, R., & Beharry, X. (2009). Sentence Parser Program.
Phillips, M. (1985). A feature-based time domain pitch tracker. Journal of the Acoustical Society of America, 77, S9-S10(A).
Rojanasthien, P., Ramdhan, R., & Beharry, X. (2009). Audio Parser Program.
Temperley, D., Lafferty, J., & Sleator, D. (n.d.). CMU Link Grammar Parser.
Tudor, K. B. (2007). Triple Scoring of Hidden Markov Models in Wake-Up-Word Speech Recognition.
Veprek, P., & Scordilis, M. (2002). Analysis, Enhancement and Evaluation of Five Pitch Determination Techniques. Elsevier Science Journal of Speech Communication, 37, 249-270.
Zuggy. (n.d.). SubRip.
Pitch Feature Experimental Result
Wake-Up-Word: All

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
APW_AP1SBW                 1415     726     51       0      0     689     49
AP1sSW_AP1SBW              1415     735     52       0      0     680     48
APW_APALL                  2282     947     41       0      0    1335     59
AP1sSW_APALL               2282     996     44       2      0    1284     56
APW_APALLBW                2188     962     44       0      0    1226     56
AP1sSW_APALLBW             2188    1003     46       2      0    1183     54
MaxP_MaxP1SBW              1415     948     67      53      4     414     29
MaxP1sSW_MaxP1SBW          1415     719     51      54      4     642     45
MaxPW_MaxPAll              2282    1020     45     109      5    1153     51
MaxP1sSW_MaxPAll           2282     716     31     213      9    1353     59
MaxP1sSW_MaxPAllBW         2188    1069     49     111      5    1008     46
MaxPW_MaxPAllBW            2188    1003     35       2     10    1183     55

Table A1 Pitch Features Result of All WUW
Figure A1 Distribution and Cumulative plot of pitch feature,
APW_AP1SBW
Figure A2 Distribution and Cumulative plot of pitch feature,
AP1sSW_AP1SBW
Figure A3 Distribution and Cumulative plot of pitch feature,
APW_APALL
Figure A4 Distribution and Cumulative plot of pitch feature,
AP1sSW_APALL
Figure A5 Distribution and Cumulative plot of pitch feature,
APW_APALLBW
Figure A6 Distribution and Cumulative plot of pitch feature,
AP1sSW_APALLBW
Figure A7 Distribution and Cumulative plot of pitch feature,
MaxP_MaxP1SBW
Figure A8 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxP1SBW
Figure A9 Distribution and Cumulative plot of pitch feature,
MaxPW_MaxPAll
Figure A10 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxPAll
Figure A11 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxPAllBW
Figure A12 Distribution and Cumulative plot of pitch feature,
MaxPW_MaxPAllBW
WUW: Operator

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
APW_AP1SBW                  268     122     46       0      0     146     54
AP1sSW_AP1SBW               268     113     42       0      0     155     58
APW_APALL                   461     184     40       0      0     277     60
AP1sSW_APALL                461     182     39       0      0     279     61
APW_APALLBW                 455     187     41       0      0     268     59
AP1sSW_APALLBW              455     179     39       0      0     276     61
MaxP_MaxP1SBW               268     155     58      12      4     101     38
MaxP1sSW_MaxP1SBW           268      94     35       8      3     166     62
MaxPW_MaxPAll               461     192     42      27      6     240     52
MaxP1sSW_MaxPAll            461     144     31      48     10     269     58
MaxP1sSW_MaxPAllBW          455     209     46      27      6     219     48
MaxPW_MaxPAllBW             455     179     33       0     12     276     55

Table A2 Pitch Features Result of WUW Operator
Figure A13 Distribution and Cumulative plot of pitch feature,
APW_AP1SBW
Figure A14 Distribution and Cumulative plot of pitch feature,
AP1sSW_AP1SBW
Figure A15 Distribution and Cumulative plot of pitch feature,
APW_APALL
Figure A16 Distribution and Cumulative plot of pitch feature,
AP1sSW_APALL
Figure A17 Distribution and Cumulative plot of pitch feature,
APW_APALLBW
Figure A18 Distribution and Cumulative plot of pitch feature,
AP1sSW_APALLBW
Figure A19 Distribution and Cumulative plot of pitch feature,
MaxP_MaxP1SBW
Figure A20 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxP1SBW
Figure A21 Distribution and Cumulative plot of pitch feature,
MaxPW_MaxPAll
Figure A22 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxPAll
Figure A23 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxPAllBW
Figure A24 Distribution and Cumulative plot of pitch feature,
MaxPW_MaxPAllBW
WUW: Wildfire

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
APW_AP1SBW                  266     111     42       0      0     155     58
AP1sSW_AP1SBW               266     132     50       0      0     134     50
APW_APALL                   323      70     22       0      0     253     78
AP1sSW_APALL                323      89     28       0      0     234     72
APW_APALLBW                 297      73     25       0      0     224     75
AP1sSW_APALLBW              297      97     33       0      0     200     67
MaxP_MaxP1SBW               266     175     66      12      5      79     30
MaxP1sSW_MaxP1SBW           266     141     53      12      5     113     42
MaxPW_MaxPAll               323      84     26       9      3     230     71
MaxP1sSW_MaxPAll            323      54     17      11      3     258     80
MaxP1sSW_MaxPAllBW          297      79     27       9      3     209     70
MaxPW_MaxPAllBW             297      97     18       0      0     200     79

Table A3 Pitch Features Result of WUW Wildfire
Figure A25 Distribution and Cumulative plot of pitch feature,
APW_AP1SBW, WUW: Wildfire
Figure A26 Distribution and Cumulative plot of pitch feature,
AP1sSW_AP1SBW, WUW: Wildfire
Figure A27 Distribution and Cumulative plot of pitch feature,
APW_APALL, WUW: Wildfire
Figure A28 Distribution and Cumulative plot of pitch feature,
AP1sSW_APALL, WUW: Wildfire
Figure A29 Distribution and Cumulative plot of pitch feature,
APW_APALLBW, WUW: Wildfire
Figure A30 Distribution and Cumulative plot of pitch feature,
AP1sSW_APALLBW, WUW: Wildfire
Figure A31 Distribution and Cumulative plot of pitch feature,
MaxP_MaxP1SBW, WUW: Wildfire
Figure A32 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxP1SBW, WUW: Wildfire
Figure A33 Distribution and Cumulative plot of pitch feature,
MaxPW_MaxPAll, WUW: Wildfire
Figure A34 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxPAll, WUW: Wildfire
Figure A35 Distribution and Cumulative plot of pitch feature,
MaxP1sSW_MaxPAllBW, WUW: Wildfire
Figure A36 Distribution and Cumulative plot of pitch feature,
MaxPW_MaxPAllBW, WUW: Wildfire
Energy Feature Experimental Result
WUW: All WUWs

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
AEW_AE1SBW                 1479    1164     79       0      0     315     21
AE1sSW_AE1SBW              1479    1283     84       1      0     240     16
AEW_AEAll                  2175    1059     49       9      9    1116     51
AE1sSW_AEAll               2175    1155     53       2      0    1018     47
AEW_AEAllBW                1969    1427     72       0      0     542     28
AE1sSW_AEAllBW             1969    1562     79       3      0     404     21
MaxEW_MaxE1SBW             1479    1244     84      20      1     215     15
MaxE1sSW_MaxEAllBW         1479    1221     83      13      1     245     17
MaxEW_MaxEAll              2175    1373     63      13      1     245     17
MaxE1sSW_MaxEAll           2175    1336     61      25      1     814     37
MaxE1sSW_MaxEAllBW         1969    1209     61      16      1     744     38
MaxEW_MaxEAllBW            1969    1562     60       3      1     404     39

Table B1 Energy Features Result of All WUW
Figure B1 Distribution and Cumulative plot of energy feature,
the average energy of WUW
Figure B2 Distribution and Cumulative plot of energy feature,
the average energy of WUW
Figure B3 Distribution and Cumulative plot of energy feature,
the average energy of WUW
Figure B4 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW
Figure B5 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW
Figure B6 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW
Figure B7 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW
Figure B8 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW
Figure B9 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW
Figure B10 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW
Figure B11 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW
Figure B12 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW
WUW: Operator

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
AEW_AE1SBW                  275     228     83       0      0      47     17
AE1sSW_AE1SBW               275     258     94       0      0      17      6
AEW_AEAll                   418     248     59       0      0     170     41
AE1sSW_AEAll                418     290     69       1      0     127     30
AEW_AEAllBW                 394     303     77       0      0      91     23
AE1sSW_AEAllBW              394     359     91       1      0      34      9
MaxEW_MaxE1SBW              275     240     87       1      0      34     12
MaxE1sSW_MaxEAllBW          275     243     88       0      0      32     12
MaxEW_MaxEAll               418     290     69       4      1     124     30
MaxE1sSW_MaxEAll            418     285     68       6      1     127     30
MaxE1sSW_MaxEAllBW          394     272     69       4      1     118     30
MaxEW_MaxEAllBW             394     359     68       1      1      34     30

Table B2 Energy Feature Result of WUW Operator
Figure B13 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Operator
Figure B14 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Operator
Figure B15 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Operator
Figure B16 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Operator
Figure B17 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Operator
Figure B18 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Operator
Figure B19 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Operator
Figure B20 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Operator
Figure B21 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Operator
Figure B22 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Operator
Figure B23 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Operator
Figure B24 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Operator
WUW: ThinkEngine

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
AEW_AE1SBW                  293     182     62       0      0     111     38
AE1sSW_AE1SBW               293     194     66       1      0      98     33
AEW_AEAll                   414     159     38       0      0     255     62
AE1sSW_AEAll                414     178     43       0      0     236     57
AEW_AEAllBW                 388     201     52       0      0     187     48
AE1sSW_AEAllBW              388     229     59       1      0     158     42
MaxEW_MaxE1SBW              293     209     71       3      1      81     28
MaxE1sSW_MaxEAllBW          293     195     67       5      2      93     32
MaxEW_MaxEAll               414     197     48       3      1     214     52
MaxE1sSW_MaxEAll            414     186     45       2      0     226     55
MaxE1sSW_MaxEAllBW          388     180     46       3      1     205     53
MaxEW_MaxEAllBW             388     229     45       1      1     158     54

Table B3 Energy Feature Result of WUW ThinkEngine
Figure B25 Distribution and Cumulative plot of energy feature,
the average energy of WUW, ThinkEngine
Figure B26 Distribution and Cumulative plot of energy feature,
the average energy of WUW, ThinkEngine
Figure B27 Distribution and Cumulative plot of energy feature,
the average energy of WUW, ThinkEngine
Figure B28 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, ThinkEngine
Figure B29 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, ThinkEngine
Figure B30 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, ThinkEngine
Figure B31 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, ThinkEngine
Figure B32 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, ThinkEngine
Figure B33 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, ThinkEngine
Figure B34 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, ThinkEngine
Figure B35 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, ThinkEngine
Figure B36 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, ThinkEngine
WUW: Onword

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
AEW_AE1SBW                  262     207     79       0      0      55     21
AE1sSW_AE1SBW               262     221     84       0      0      41     16
AEW_AEAll                   435     215     49       0      0     220     51
AE1sSW_AEAll                435     226     52       0      0     209     48
AEW_AEAllBW                 389     306     79       0      0      83     21
AE1sSW_AEAllBW              389     327     84       0      0      62     16
MaxEW_MaxE1SBW              262     228     87       5      2      29     11
MaxE1sSW_MaxEAllBW          262     226     86       3      1      33     13
MaxEW_MaxEAll               435     229     69       2      0     134     31
MaxE1sSW_MaxEAll            435     295     68       3      1     137     31
MaxE1sSW_MaxEAllBW          389     261     67       2      1     126     32
MaxEW_MaxEAllBW             389     327     66       0      1      62     33

Table B4 Energy Feature Result of WUW Onword
Figure B37 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Onword
Figure B38 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Onword
Figure B39 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Onword
Figure B40 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Onword
Figure B41 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Onword
Figure B42 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Onword
Figure B43 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Onword
Figure B44 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Onword
Figure B45 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Onword
Figure B46 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Onword
Figure B47 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Onword
Figure B48 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Onword
WUW: Wildfire

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
AEW_AE1SBW                  282     253     90       0      0      29     10
AE1sSW_AE1SBW               282     261     93       0      0      21      7
AEW_AEAll                   340     173     51       0      0     167     49
AE1sSW_AEAll                340     185     54       0      0     155     46
AEW_AEAllBW                 298     252     85       0      0      46     15
AE1sSW_AEAllBW              298     265     89       0      0      33     11
MaxEW_MaxE1SBW              282     258     91       8      3      16      6
MaxE1sSW_MaxEAllBW          282     253     90       2      1      27     10
MaxEW_MaxEAll               340     230     68       4      1     106     31
MaxE1sSW_MaxEAll            340     219     64       4      1     117     34
MaxE1sSW_MaxEAllBW          298     195     65       4      1      99     33
MaxEW_MaxEAllBW             298     265     62       0      1      33     36

Table B5 Energy Feature Result of WUW Wildfire
Figure B49 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Wildfire
Figure B50 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Wildfire
Figure B51 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Wildfire
Figure B52 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Wildfire
Figure B53 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Wildfire
Figure B54 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Wildfire
Figure B55 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Wildfire
Figure B56 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Wildfire
Figure B57 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Wildfire
Figure B58 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Wildfire
Figure B59 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Wildfire
Figure B60 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Wildfire
WUW: Voyager

Feature              Valid Data  Pt > 0  % > 0  Pt = 0  % = 0  Pt < 0  % < 0
AEW_AE1SBW                  281     220     78       0      0      61     22
AE1sSW_AE1SBW               281     229     81       0      0      52     19
AEW_AEAll                   361     149     41       0      0     212     59
AE1sSW_AEAll                361     161     45       1      0     199     55
AEW_AEAllBW                 325     207     64       0      0     118     36
AE1sSW_AEAllBW              325     222     68       1      0     102     31
MaxEW_MaxE1SBW              281     234     83       2      1      45     16
MaxE1sSW_MaxEAllBW          281     231     82       2      1      48     17
MaxEW_MaxEAll               361     172     48       5      1     184     51
MaxE1sSW_MaxEAll            361     167     46       7      2     187     52
MaxE1sSW_MaxEAllBW          325     148     46       3      1     174     54
MaxEW_MaxEAllBW             325     222     44       1      1     102     55

Table B6 Energy Feature Result of WUW Voyager
Figure B61 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Voyager
Figure B62 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Voyager
Figure B63 Distribution and Cumulative plot of energy feature,
the average energy of WUW, Voyager
Figure B64 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Voyager
Figure B65 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Voyager
Figure B66 Distribution and Cumulative plot of energy feature,
the average energy the first section in WUW, Voyager
Figure B67 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Voyager
Figure B68 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Voyager
Figure B69 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Voyager
Figure B70 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Voyager
Figure B71 Distribution and Cumulative plot of energy feature,
the maximum energy of the WUW, Voyager
Figure B72 Distribution and Cumulative plot of energy feature,
the maximum energy of the first section in WUW, Voyager
[Figures: cumulative plots (y-axis: %) of the relative energy features (WUWAE-LSAE)/LSAE, (WUWAE-AllAE)/AllAE, (WUWAE-AllAE before WUW)/AllAE before WUW, (WUW1stAE-LSAE)/LSAE, (WUW1stAE-AllAE)/AllAE, (WUW1stAE-AllAE before WUW)/AllAE before WUW, (WUWMAXE-LSMAXE)/LSMAXE, (WUWMAXE-AllMAXE)/AllMAXE, (WUWMAXE-AllMAXE before WUW)/AllMAXE before WUW, (WUW1stMAXE-LSMAXE)/LSMAXE, and (WUW1stMAXE-AllMAXE before WUW)/AllMAXE before WUW]
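The relative energy features plotted above all share the form (feature of the WUW segment minus feature of a context segment) divided by the feature of the context segment, and each cumulative plot is the empirical cumulative distribution of that ratio over the data set. A minimal sketch, using average energy (AE) as the example feature; the function names and the use of raw sample arrays are assumptions for illustration:

```python
import numpy as np

def average_energy(samples):
    """Mean energy (mean squared amplitude) of a signal segment."""
    return float(np.mean(np.asarray(samples, dtype=float) ** 2))

def relative_energy_feature(wuw_segment, context_segment):
    """(WUWAE - contextAE) / contextAE: relative average energy of
    the WUW segment against a context segment (e.g. the whole
    utterance for the AllAE variants)."""
    wuw_ae = average_energy(wuw_segment)
    ctx_ae = average_energy(context_segment)
    return (wuw_ae - ctx_ae) / ctx_ae

def cumulative_percent(values):
    """Empirical cumulative distribution in percent, i.e. the data
    behind a cumulative plot: sorted feature values vs. the percent
    of observations at or below each value."""
    v = np.sort(np.asarray(values, dtype=float))
    pct = 100.0 * np.arange(1, len(v) + 1) / len(v)
    return v, pct
```

The same helpers apply to the MAXE variants by replacing the mean with a frame-wise maximum, and to the LS and before-WUW variants by changing the context segment.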
The super-resolution decision is based on the normalized cross-correlation between two adjacent speech segments x and y of candidate period length n, evaluated at every L-th sample:

ρxy(n) = Σj x(jL)·y(jL) / √( Σj x²(jL) · Σj y²(jL) ),  n = n_min, ..., n_max

A candidate period is accepted as voiced when its correlation exceeds the threshold T_srfd = 0.88; the extended variant additionally applies the adaptive threshold T'_srfd = max(0.75, 0.85·ρ'(n_p)).
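A minimal sketch of this correlation-based voicing decision, assuming a one-dimensional sample array and unit decimation by default; the function names and the search interface are illustrative, and only the 0.88 threshold is taken from the text:

```python
import numpy as np

T_SRFD = 0.88  # voicing threshold from the text

def srfd_correlation(signal, start, n, L=1):
    """Normalized cross-correlation between two adjacent segments of
    length n starting at `start`, sampled every L-th point.
    Returns a value in [-1, 1]."""
    x = np.asarray(signal[start : start + n], dtype=float)[::L]
    y = np.asarray(signal[start + n : start + 2 * n], dtype=float)[::L]
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y))
    if denom == 0.0:
        return 0.0
    return float(np.sum(x * y) / denom)

def best_period(signal, start, n_min, n_max):
    """Pick the candidate period with the highest correlation; declare
    the frame voiced only if that correlation exceeds T_SRFD."""
    scores = {n: srfd_correlation(signal, start, n)
              for n in range(n_min, n_max + 1)}
    n_best = max(scores, key=scores.get)
    if scores[n_best] >= T_SRFD:
        return n_best, scores[n_best]
    return None, scores[n_best]
```

On a perfectly periodic signal the two adjacent segments coincide at the true period, so the correlation peaks near 1 there and the candidate passes the threshold.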
The same normalized cross-correlation, ρyz(n), is computed between the second and third adjacent segments, y and z.
[Figures: cumulative plots (y-axis: %, No. of Data) of the relative pitch features (WUWAP-LSAP)/LSAP, (WUW1stAP-LSAP)/LSAP, (WUWAP-AllAP)/AllAP, (WUW1stAP-AllAP)/AllAP, (WUWMAXP-LSMAXP)/LSMAXP, (WUW1stMAXP-LSMAXP)/LSMAXP, (WUWMAXP-AllMAXP)/AllMAXP, (WUWMAXP-AllMAXP before WUW)/AllMAXP before WUW, and (WUW1stMAXP-AllMAXP before WUW)/AllMAXP before WUW]
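The pitch-based features mirror the energy ones, with the extra step that average pitch (AP) must be computed over voiced frames only. A minimal sketch, assuming an f0 track in which unvoiced frames are marked with 0 (the names are illustrative, not from the thesis):

```python
import numpy as np

def average_pitch(f0_track):
    """Average pitch over voiced frames only (f0 > 0); returns 0.0
    when no frame is voiced."""
    f0 = np.asarray(f0_track, dtype=float)
    voiced = f0[f0 > 0]
    return float(voiced.mean()) if voiced.size else 0.0

def relative_pitch_feature(wuw_f0, context_f0):
    """(WUWAP - contextAP) / contextAP: relative average pitch of the
    WUW segment against a context segment (e.g. the whole utterance
    for the AllAP variants)."""
    wuw_ap = average_pitch(wuw_f0)
    ctx_ap = average_pitch(context_f0)
    return (wuw_ap - ctx_ap) / ctx_ap
```

For instance, a WUW spoken at 150 Hz inside an utterance averaging 125 Hz over its voiced frames yields a relative feature of 0.2, i.e. a 20% pitch rise on the alerting word.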