Speech Analysis Methodologies towards Unobtrusive
Mental Health Monitoring
Keng-hao Chang
Electrical Engineering and Computer Sciences University of California at Berkeley
Technical Report No. UCB/EECS-2012-55
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-55.html
May 1, 2012
Copyright © 2012, by the author(s). All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring
by
Keng-hao Chang
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Engineering - Electrical Engineering and Computer Science
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor John F. Canny, Chair
Professor Nelson Morgan
Professor Allison Harvey
Spring 2012
Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring
Copyright 2012 by
Keng-hao Chang
Abstract
Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring
by
Keng-hao Chang
Doctor of Philosophy in Engineering - Electrical Engineering and Computer Science
University of California, Berkeley
Professor John F. Canny, Chair
The human voice encodes a wealth of information about emotion, mood, and mental state. With the advent of pervasively available speech collection methods (e.g., mobile phones) and the low computational cost of speech analysis, non-invasive, relatively reliable, and modestly inexpensive platforms are now available for mass, long-term deployment of a mental health monitor. In this thesis, I describe my investigation of speech analysis methods for measuring a variety of mental states, including affect as well as states triggered by psychological stress and sleep deprivation.
This work makes contributions on several fronts and brings together techniques from several areas, including speech processing, psychology, human-computer interaction, and mobile computing systems. First, I revisited emotion recognition methods by building an affective model on a naturalistic emotional speech dataset, which provides a realistic set of emotion labels for real-world applications. Then, drawing on speech production theory, I verified that glottal vibrational cycles, the source of speech production, are physically affected by psychological states such as mental stress. Finally, I built the AMMON (Affective and Mental health MONitor) library, a low-footprint C library designed for widely available phones, as an enabler of applications for richer, more appropriate, and more satisfying human-computer interaction and healthcare technologies.
To my dear family and my lovely friends
I’d like to dedicate this thesis to my family, who have been extremely supportive throughout my journey in the Ph.D. program. As an international student studying abroad, I could always sense the warmth sent overseas by my dear parents in Taiwan and feel the occasional but caring phone calls from my brother studying at UT Austin. I want to apologize for my infrequent calls back home as things got busy, but from the bottom of my heart, I miss the time we spent together.
As the Ph.D. progressed, it became not only an intellectual challenge but a psychological one. It was my lovely friends, staying around me and listening to me, who gave me the strength to move past the sometimes small but occasionally big obstacles. Friends in Taiwan and in the United States, I dedicate this work to you.
Contents
Contents ii
List of Figures iv
List of Tables vi
1 Introduction 1
1.1 Affect and Mental Health Monitor . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Theoretical Foundation and Background 6
2.1 Recognition of Affect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Diagnostic Cues in Vocal Expression . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Theory of Speech Production . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Long-term Monitor and Healthcare Applications . . . . . . . . . . . . . . . . 14
3 Emotion Recognition 15
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 The Naturalistic Belfast Emotional Database . . . . . . . . . . . . . . . . . . 17
3.4 Voice Analysis Library: The Feature Set . . . . . . . . . . . . . . . . . . . . 22
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Phoneme Processing 38
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Estimating Rate of Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Simplified Acoustic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Voice Source Processing 51
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Extracting the Glottal Waveforms . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Application I: Classification of Intelligible vs. Non-intelligible Speech . . . . 55
5.4 Application II: Classification of Speech Under Stress . . . . . . . . . . . . . . 59
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6 Trigger by the Physical Body 63
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Application I: Sleep Deprivation . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4 Application II: Simulated and Actual Stress . . . . . . . . . . . . . . . . . . 70
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7 A Speech Analysis Library on Mobile Phones 78
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.3 Speech Analysis Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.4 Extracting Glottal Timings . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.6 Feature Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.7 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8 Conclusion 94
Bibliography 96
A Application Mockups 104
List of Figures
1.1 Scenarios for Speech Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Arousal-valence theory with discrete emotions. Arousal increases vertically, valence is positive to the right and negative to the left. . . . . . . . . . . . . . . 7
2.2 The source-filter theory of speech production: (a) glottal wave, (b) vocal tract shape, (c) radiated sound wave, (d) glottal spectrum, (e) vocal tract transfer function, (f) acoustic spectrum at mouth opening (adapted from [26]) . . . . . . 13
2.3 The human speech production system . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 A Stylized Pitch Waveform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 A Glottal Vibrational Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 The log likelihood trajectories of a speech utterance given 44 phonemes (Gaussian mixtures) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 The Gaussian-filter smoothed log likelihood trajectories of a speech utterance given 44 phonemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 The correlation between the predicted speech rate (Y-axis) and the ground truth (X-axis). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1 Algorithm for identifying the closed-phase region of a glottal cycle [27] . . . . . 54
5.2 Illustration of a frequency response and its envelope, which can be characterized by the frequency locations and bandwidths of the peaks (formants). . . . . . . . 55
5.3 Algorithm for identifying instances of maximum excitation [27] . . . . . . . . . . 56
5.4 Hypothesis of Stress Detection by Glottal Features . . . . . . . . . . . . . . . . 59
6.1 Accuracies with combination of