  • Speech Analysis Methodologies towards Unobtrusive

    Mental Health Monitoring

    Keng-hao Chang

    Electrical Engineering and Computer Sciences University of California at Berkeley

    Technical Report No. UCB/EECS-2012-55

    http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-55.html

    May 1, 2012

  • Copyright © 2012, by the author(s). All rights reserved.

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

  • Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring

    by

    Keng-hao Chang

    A dissertation submitted in partial satisfaction of the

    requirements for the degree of

    Doctor of Philosophy

    in

    Engineering - Electrical Engineering and Computer Science

    in the

    Graduate Division

    of the

    University of California, Berkeley

    Committee in charge:

    Professor John F. Canny, Chair
    Professor Nelson Morgan
    Professor Allison Harvey

    Spring 2012

  • Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring

    Copyright 2012 by

    Keng-hao Chang


    Abstract

    Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring

    by

    Keng-hao Chang

    Doctor of Philosophy in Engineering - Electrical Engineering and Computer Science

    University of California, Berkeley

    Professor John F. Canny, Chair

    The human voice encodes a wealth of information about emotion, mood, and mental state. With the advent of pervasively available speech collection methods (e.g., mobile phones) and the low computational cost of speech analysis, non-invasive, relatively reliable, and inexpensive platforms are now available for mass, long-term deployment of a mental health monitor. In this thesis, I describe my investigation of speech analysis methods for measuring a variety of mental states, including affect and states triggered by psychological stress and sleep deprivation.

    This work makes contributions in several areas and brings together techniques from speech processing, psychology, human-computer interaction, and mobile computing systems. First, I revisited emotion recognition methods by building an affective model on a naturalistic emotional speech dataset, which consists of a realistic set of emotion labels for real-world applications. Then, leveraging speech production theory, I verified that the glottal vibrational cycles, the source of speech production, are physically affected by psychological states such as mental stress. Finally, I built the AMMON (Affective and Mental health MONitor) library, a low-footprint C library designed for widely available phones, as an enabler of applications for richer, more appropriate, and more satisfying human-computer interaction and healthcare technologies.


    To my dear family and my lovely friends

    I’d like to dedicate this thesis to my family, who have been extremely supportive throughout my journey in the Ph.D. program. As an international student studying abroad, I could always sense the warmth sent overseas from my dear parents in Taiwan and the occasional but caring phone calls from my brother studying at UT Austin. I want to apologize for my infrequent calls home when things got busy, but from the bottom of my heart, I miss the time we spent together.

    As the Ph.D. progressed, it became not only an intellectual but also a psychological challenge. It was my lovely friends staying around me and listening to me who gave me the strength to move past the obstacles, sometimes small but occasionally big. Friends

    in Taiwan and in the United States, I dedicate this work to you.


    Contents

    Contents ii

    List of Figures iv

    List of Tables vi

    1 Introduction 1
      1.1 Affect and Mental Health Monitor 1
      1.2 Thesis Outline 4

    2 Theoretical Foundation and Background 6
      2.1 Recognition of Affect 6
      2.2 Diagnostic Cues in Vocal Expression 9
      2.3 Theory of Speech Production 12
      2.4 Long-term Monitor and Healthcare Applications 14

    3 Emotion Recognition 15
      3.1 Introduction 15
      3.2 Related Work 17
      3.3 The Naturalistic Belfast Emotional Database 17
      3.4 Voice Analysis Library: The Feature Set 22
      3.5 Experimental Results 27
      3.6 Conclusion 34

    4 Phoneme Processing 38
      4.1 Introduction 38
      4.2 Estimating Rate of Speech 38
      4.3 Simplified Acoustic Model 40
      4.4 Experimental Results 43
      4.5 Discussion 49
      4.6 Conclusion 50

    5 Voice Source Processing 51
      5.1 Introduction 51
      5.2 Extracting the Glottal Waveforms 52
      5.3 Application I: Classification of Intelligible vs. Non-intelligible Speech 55
      5.4 Application II: Classification of Speech Under Stress 59
      5.5 Conclusion 61

    6 Trigger by the Physical Body 63
      6.1 Introduction 63
      6.2 Application I: Sleep Deprivation 63
      6.3 Discussion 69
      6.4 Application II: Simulated and Actual Stress 70
      6.5 Conclusion 76

    7 A Speech Analysis Library on Mobile Phones 78
      7.1 Introduction 78
      7.2 Related Work 80
      7.3 Speech Analysis Library 81
      7.4 Extracting Glottal Timings 84
      7.5 Performance Evaluation 87
      7.6 Feature Evaluation 88
      7.7 Future Work 91
      7.8 Conclusion 91

    8 Conclusion 94

    Bibliography 96

    A Application Mockups 104


    List of Figures

    1.1 Scenarios for Speech Monitoring 3

    2.1 Arousal-valence theory with discrete emotions. Arousal increases vertically; valence is positive to the right and negative to the left. 7
    2.2 The source-filter theory of speech production: (a) glottal wave, (b) vocal tract shape, (c) radiated sound wave, (d) glottal spectrum, (e) vocal tract transfer function, (f) acoustic spectrum at mouth opening (adapted from [26]) 13
    2.3 The human speech production system 13

    3.1 A Stylized Pitch Waveform 24
    3.2 A Glottal Vibrational Cycle 26

    4.1 The log likelihood trajectories of a speech utterance given 44 phonemes (Gaussian mixtures) 44
    4.2 The Gaussian-filter smoothed log likelihood trajectories of a speech utterance given 44 phonemes. 46
    4.3 The correlation between the predicted speech rate (Y-axis) and the ground truth (X-axis). 49

    5.1 Algorithm for identifying the closed-phase region of a glottal cycle [27] 54
    5.2 Illustration of a frequency response and its envelope, which can be characterized by the frequency locations and bandwidths of the peaks (formants). 55
    5.3 Algorithm for identifying instances of maximum excitation [27] 56
    5.4 Hypothesis of Stress Detection by Glottal Features 59

    6.1 Accuracies with combination of