WIKT 2006
SPEECH IS MORE THAN ONLY ITS LINGUISTIC CONTENT
Milan Rusko
Institute of Informatics of the Slovak Academy of Sciences
Dubravska cesta 9, 847 05 Bratislava, Slovakia
Milan.R [email protected]
Interactions with others
  Friendly, compassionate, trusting, cooperative
  Competitive
Conscientiousness, C [1,0,-1]
  Organized, persistent in achieving goals
  Efficient, methodical, well organized, dutiful
  Easy-going, careless
Mood and Emotion
Mood (attitude) can be defined as a rather static state of being that is less static than personality and less fluent than emotions. Mood can be defined as one-dimensional (e.g. good or bad mood) or perhaps multi-dimensional (feeling in love, being paranoid, etc.)
(Kshirsagar & Magnenat-Thalmann [5])
Generalized model of emotion
$$ e_t = [\beta_1, \ldots, \beta_m]^T, \quad \forall i \in [1, m]: \beta_i \in [0, 1] \quad \text{if } t > 0 $$
and
$$ e_t = \mathbf{0} \quad \text{if } t = 0 \qquad (2) $$
An emotional state has a structure similar to that of personality, but it changes over time.
It is defined as an m-dimensional vector, where each of the m emotion intensities is represented by a value in the interval [0,1].
The current emotional state depends on the preceding evolution of emotions.
The emotions therefore need to be modeled with respect to their previous trends (history).
An emotional state history ω_t is defined that contains all emotional states up to e_t, thus:
$$ \omega_t = \langle e_0, e_1, \ldots, e_t \rangle \qquad (3) $$
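The definitions in (2) and (3) can be sketched in a few lines of Python. This is only an illustration: the helper `emotional_state`, the class `EmotionHistory` and the three-dimensional example values are assumptions made here, not part of the cited model.

```python
from typing import List, Sequence, Tuple

EmotionalState = Tuple[float, ...]


def emotional_state(intensities: Sequence[float]) -> EmotionalState:
    """Build an emotional state e_t; every intensity must lie in [0, 1]."""
    for value in intensities:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"emotion intensity {value} is outside [0, 1]")
    return tuple(float(v) for v in intensities)


class EmotionHistory:
    """Emotional state history omega_t = <e_0, e_1, ..., e_t>."""

    def __init__(self, m: int) -> None:
        self.m = m
        # e_0 is the zero vector (t = 0).
        self.states: List[EmotionalState] = [tuple([0.0] * m)]

    def append(self, intensities: Sequence[float]) -> None:
        """Append the next emotional state e_t to the history."""
        if len(intensities) != self.m:
            raise ValueError(f"expected {self.m} emotion intensities")
        self.states.append(emotional_state(intensities))

    @property
    def current(self) -> EmotionalState:
        """The most recent emotional state e_t."""
        return self.states[-1]


history = EmotionHistory(m=3)    # hypothetical: three emotion dimensions
history.append([0.2, 0.0, 0.7])  # e_1
history.append([0.1, 0.0, 0.9])  # e_2
print(len(history.states), history.current)
```

Keeping the whole history, rather than only the latest state, is what allows a later update rule to take previous trends into account.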
Generalized model of mood
Egges continues by defining the individual I_T as a triple (p, m_t, e_t), where m_t represents the mood of the individual at time t.
Each mood dimension is defined as a value in the interval [-1,1].
With k mood dimensions, the mood m_t can be described as a vector of k such values.
The mood and emotional values change over time.
Semantic differential scales are often used for measuring emotion dimensions.
A set of dimensions was proposed by Mehrabian & Russell (1974, Appendix B, p. 216) [7].
It is evident that the authors have included moods and personality dimensions in this system too.
Acoustic correlates of emotions
Problem: the speech parameters involved in expressing personality, moods and emotions are shared by all components of expressivity.
Decoding the expressive speech code is very subjective.
Nevertheless, a general set of the speech parameters responsible for the expression of emotion can be constructed. There are three main categories of speech correlates of emotion:
• Pitch contour
• Timing
• Voice quality
It is believed that combinations of values of these speech parameters are used to express vocal emotion (Schröder M. [8]).
Pitch contour
The pitch contour is a representation of the intonation of an utterance, which describes the nature of accents and the overall pitch range of the utterance.
Pitch is expressed as fundamental frequency (F0).
One of the most frequently used methods for F0 measurement uses the autocorrelation function of the LP (linear prediction) residual.
Parameters include average pitch, pitch range, contour slope, and final lowering.
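As a rough illustration of autocorrelation-based F0 measurement, the sketch below picks the lag with the strongest autocorrelation within a plausible pitch range. It is a simplification: the autocorrelation is computed on the signal itself rather than on the LP residual, and the synthetic test frame, the sampling rate and the 60 to 400 Hz search range are values assumed for the example.

```python
import math


def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from the peak of its autocorrelation."""
    lag_min = int(fs / f0_max)  # shortest pitch period considered
    lag_max = int(fs / f0_min)  # longest pitch period considered
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag


fs = 8000  # assumed sampling rate in Hz
# Synthetic voiced frame: 125 Hz fundamental plus a weaker second harmonic.
frame = [
    math.sin(2 * math.pi * 125 * n / fs) + 0.5 * math.sin(2 * math.pi * 250 * n / fs)
    for n in range(512)
]
f0 = estimate_f0(frame, fs)
print(round(f0, 1))
```

Running such an estimator frame by frame gives an F0 contour from which average pitch and pitch range can be computed.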
Parameter     | Anger                              | Happiness                  | Sadness              | Fear
Speech rate   | Faster                             | Slightly faster            | Slightly slower      | Much faster
Pitch average | Very much higher                   | Much higher                | Slightly lower       | Very much higher
Pitch range   | Much wider                         | Much wider                 | Slightly narrower    | Much wider
Intensity     | Higher                             | Higher                     | Lower                | Higher
Pitch changes | Abrupt, downward-directed contours | Smooth, upward inflections | Downward inflections | Downward terminal inflections
Voice quality | Breathy, chesty tone               | Breathy, blaring           | Resonant             | Irregular voicing
Articulation  | Clipped                            | Slightly slurred           | Slurred              | Precise
Intonation contour
Models of intonation - two main categories:
Phonetic
Phonological
The phonetic models (e.g. Fujisaki model, Tilt model, MOMEL and many others) model the intonation curve itself. The phonological models (e.g. ToBI) model the speaker's concept of how accents are distributed in the intonational phrase.
Automatic intonation contour analysis in Fujisaki editor
Pitch contour analysis in PRAAT with ToBI labels
Timing
Speed at which an utterance is spoken
Rhythm
Duration of emphasized syllables
The results of measuring syllable and phoneme lengths are often given in the form of z-scores (the instantaneous value is normalized by the mean value and standard deviation of the same elements in the whole database).
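A minimal sketch of the z-score normalization, assuming durations are grouped per phoneme type across the whole database. The phoneme labels and duration values are made up for illustration, and the population standard deviation is used here; a sample standard deviation would be an equally reasonable choice.

```python
from statistics import mean, pstdev


def z_scores(durations_ms):
    """Normalize the durations of one phoneme (or syllable) type to z-scores."""
    mu = mean(durations_ms)
    sigma = pstdev(durations_ms)
    if sigma == 0:
        # All tokens have the same length: no deviation from the mean.
        return [0.0 for _ in durations_ms]
    return [(d - mu) / sigma for d in durations_ms]


# Hypothetical durations (ms) of all tokens of each phoneme in a database.
database = {"a": [80, 100, 120], "s": [90, 110, 130, 150]}
normalized = {phoneme: z_scores(values) for phoneme, values in database.items()}
print(normalized["a"])
```

A z-score of 0 means the token has the average length for its type; positive values mark lengthened tokens and negative values mark shortened ones.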
Latitudinal axis settings:
Labial: close rounding, lip-spreading
Lingual tip-blade: tip articulation, blade articulation, retroflex articulation
Tongue-body: dentalized, palato-alveolarized, palatalized, velarized, pharyngealized, laryngopharyngealized
Mandibular: close jaw position, open jaw position, protruded jaw position, retracted jaw position
Analysis of the glottal function
The analysis of the glottal function is generally done using the source-filter model of speech production [10].
The glottal function is obtained from the speech signal by inverse filtering. One of the most efficient inverse filtering methods uses Discrete Linear Prediction (DLP; El-Jaroudi A., Makhoul J. [11]) to obtain the inverse filter coefficients and to filter the speech signal.
The resulting DLP residual is regarded as an estimate of the derivative of the glottal volume velocity function.
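The inverse filtering idea can be sketched as follows. Note the substitution: this example uses conventional autocorrelation-method linear prediction solved by the Levinson-Durbin recursion, not the Discrete Linear Prediction of [11]. The synthetic "speech" (an impulse train passed through a second-order all-pole filter) and all function names are assumptions made for the illustration.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Coefficients of the inverse filter A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient of stage i
        previous = a[:]
        for j in range(1, i):
            a[j] = previous[j] + k * previous[i - j]
        a[i] = k
        error *= 1.0 - k * k
    return a


def inverse_filter(x, a):
    """Residual e[n] = sum_j a[j] * x[n - j]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


# Synthetic signal: impulse train (period 80 samples) through an all-pole filter
# y[n] = e[n] + 1.3 y[n-1] - 0.6 y[n-2].
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
speech = []
for n, e in enumerate(excitation):
    y1 = speech[n - 1] if n >= 1 else 0.0
    y2 = speech[n - 2] if n >= 2 else 0.0
    speech.append(e + 1.3 * y1 - 0.6 * y2)

a = levinson_durbin(autocorrelation(speech, 2), order=2)
residual = inverse_filter(speech, a)
print([round(c, 3) for c in a])
```

In this idealized case the residual recovers the impulse-train excitation; with real speech the residual only approximates the glottal excitation and its quality depends on the prediction order and the analysis window.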
Time and spectral domain characteristics of the glottal function
Time characteristics
OQ, Open Quotient – ratio of the open phase of the glottal waveform to the period of the pulse. OQ predicts the amplitudes of the lower harmonics (an increased value of OQ is correlated with an increase in the amplitude of the lower harmonics in the voice spectrum).
CQ, Closing Quotient – ratio of the closing phase of the glottal pulse to the period of the pulse.
These characteristics have recently often been replaced by AQ (Amplitude Quotient) and NAQ (Normalized Amplitude Quotient) (Alku [12]).
EE, Excitation Strength – amplitude of the negative peak, calculated after the positive peak. EE is correlated with the overall intensity of the signal. A decrease in EE is correlated with a breathy voice.
RK, Glottal Symmetry/Skew – ratio of the closing phase to the opening phase of the differentiated glottal pulse. RK affects mainly the lower harmonics; the more symmetrical the pulse, the greater their amplitude.
Spectral characteristics
H1-H2 – the amplitude of the first harmonic (H1) compared to the amplitude of the second harmonic (H2). An indicator of the relative length of the opening phase of the glottal pulse (Hanson 1997).
H1-A1 – the amplitude of the first harmonic (H1) compared to the amplitude of the strongest harmonic in the first formant (A1). Reflects the first formant bandwidth.
Spectral tilt – expected to be large and positive for breathy voices and small and/or negative for creaky voices.
H1-A2 – the amplitude of the first harmonic (H1) compared to the amplitude of the strongest harmonic in the second formant (A2). An indicator of spectral tilt at the mid formant frequencies. Large and positive for breathy voices and small and/or negative for creaky voices.
H1-A3 – the amplitude of the first harmonic (H1) compared to the amplitude of the strongest harmonic in the third formant (A3). An indicator of spectral tilt at the higher formant frequencies. Large and positive for breathy voices and small and/or negative for creaky voices.
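As an illustration of the simplest of these spectral measures, the sketch below computes H1-H2 in dB by evaluating the spectrum at F0 and at 2·F0. It assumes that F0 is already known and that the frame covers a whole number of pitch periods; the synthetic signal, sampling rate and function names are assumptions for this example, and no correction for formant influence is applied.

```python
import math


def harmonic_amplitude(x, freq_hz, fs):
    """Amplitude of the sinusoidal component of x at freq_hz (single-bin DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / fs) for i, s in enumerate(x))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / fs) for i, s in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)


def h1_minus_h2_db(x, f0_hz, fs):
    """Level difference (dB) between the first and the second harmonic."""
    h1 = harmonic_amplitude(x, f0_hz, fs)
    h2 = harmonic_amplitude(x, 2 * f0_hz, fs)
    return 20.0 * math.log10(h1 / h2)


fs = 8000    # assumed sampling rate in Hz
f0 = 125.0   # assumed known fundamental frequency in Hz
# Synthetic frame (8 pitch periods): second harmonic at half the amplitude of the first.
frame = [
    1.0 * math.sin(2 * math.pi * f0 * n / fs)
    + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
    for n in range(512)
]
h1_h2 = h1_minus_h2_db(frame, f0, fs)
print(round(h1_h2, 2))
```

H1-A1, H1-A2 and H1-A3 could be obtained along the same lines by replacing the second harmonic with the strongest harmonic near the respective formant frequency, which additionally requires a formant estimate.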
Glottal pulse analysis in APARAT
Analysis of the vocal tract
Methods of vocal tract shape estimation include X-ray, computed tomography and magnetic resonance methods.
- suitable for stationary sound production only
A cheaper and quicker method is to compute the vocal tract shape from the speech signal, complementary to glottal pulse analysis from the speech signal (e.g. vocal tract shape computation from LPC-derived reflection coefficients).
- allows for analysis of the dynamic behavior of the articulators
Similar information can be obtained by formant analysis using homomorphic deconvolution (cepstrum) or LPC spectrum analysis.
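The step from reflection coefficients to a vocal tract shape can be sketched with the lossless-tube model, in which each reflection coefficient fixes the ratio of the cross-sectional areas of two neighbouring tube sections. The sign convention k_i = (A_i - A_{i+1}) / (A_i + A_{i+1}), the starting area of 1.0 and the example coefficients are assumptions made here; published conventions differ in sign and in whether sections are counted from the lips or from the glottis.

```python
def area_function(reflection_coeffs, first_area=1.0):
    """Relative areas of tube sections under the assumed convention
    k_i = (A_i - A_{i+1}) / (A_i + A_{i+1})."""
    areas = [first_area]
    for k in reflection_coeffs:
        if not -1.0 < k < 1.0:
            raise ValueError("reflection coefficients must lie in (-1, 1)")
        # Solving the convention above for A_{i+1}.
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


# Hypothetical reflection coefficients, e.g. taken from an LPC analysis frame.
coefficients = [0.0, 0.5, -0.2]
areas = area_function(coefficients)
print([round(a, 3) for a in areas])
```

Only area ratios are determined this way, so the result is a relative shape; repeating the computation frame by frame gives the dynamic behaviour mentioned above.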
Static analysis by synthesis using articulatory synthesizer (TRACTSYN)
Dynamic analysis by synthesis (articulatory synth. TRACTSYN)
Acoustic correlates of emotions applied in speech synthesis
Emotion: Joy
Study: Burkhardt & Sendlmeier (2000)
Language: German
Rec. rate: 81% (1/9)
Parameter settings: F0 mean: +50%; F0 range: +100%; Tempo: +30%; Voice Qu.: modal or tense; “lip-spreading feature”: F1 / F2 +10%; Other: “wave pitch contour model”: main stressed syllables are raised (+100%), syllables in between are lowered (-20%)

Emotion: Sadness
Study: Cahn (1990)
Language: American English
Rec. rate: 91% (1/6)
Parameter settings: F0 mean: “0”, reference line “-1”, less final lowering “-5”
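To make the percentage settings concrete, the sketch below applies the three global values listed for joy (F0 mean +50%, F0 range +100%, tempo +30%) to a neutral F0 contour and a list of segment durations. The interpretation is an assumption: the mean is multiplied by 1.5, deviations from the mean are doubled, and durations are divided by 1.3. The function name and the input values are hypothetical, and the voice quality and "wave pitch contour model" settings are not covered.

```python
def apply_prosody_settings(f0_hz, durations_ms,
                           mean_gain=1.5, range_gain=2.0, tempo_gain=1.3):
    """Scale F0 mean, F0 range and tempo of a neutral utterance (assumed reading)."""
    neutral_mean = sum(f0_hz) / len(f0_hz)
    new_f0 = [neutral_mean * mean_gain + (f - neutral_mean) * range_gain
              for f in f0_hz]
    # A faster tempo means proportionally shorter segments.
    new_durations = [d / tempo_gain for d in durations_ms]
    return new_f0, new_durations


# Hypothetical neutral utterance: F0 targets in Hz and segment durations in ms.
neutral_f0 = [100.0, 120.0, 140.0]
neutral_durations = [130.0, 260.0]
joy_f0, joy_durations = apply_prosody_settings(neutral_f0, neutral_durations)
print(joy_f0, [round(d, 1) for d in joy_durations])
```

Different synthesizers define "range" and "tempo" differently, so the gains would have to be mapped onto the control parameters of the actual system.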
Aim: to extract information from supra-segmental and extra-linguistic layers
Where to look for information:
time domain
a) quantity (lengths of segments)
b) rhythm
frequency domain
a) long term characteristics
b) short term characteristics
model based characteristics
a) glottal excitation function
b) articulatory model
Vision: Speech Sound Mining
How to define a set of speech sound objects?
Objective methods of analysis (pattern recognition)
Subjective methods (impression of the listener)
Possible objects:
Speech sound event
Speech sound act
Speech sound gesture
Speech sound characteristic
Speech sound characteristic change
Vision: Speech Sound Mining
First steps to be accomplished:
Speech corpus building
Annotation of SSO (speech sound objects)
Boundary markers
Frequencies of occurrence of SSO
Concordances of SSO
Correlation among different sets of objects (pitch SSO, accent SSO, rhythmic SSO, timbre SSO, etc.)
Semantic representation of SSO
Cross cultural semantic analysis
Vision: Speech Sound Mining
Traditional methods used in NLP and data mining will be applicable:
Bag of words -> Bag of SSO
WordNet -> SSO semantic net
etc.
Research on the relation between linguistic information and paralinguistic and extralinguistic information.
Creation of a complex (holistic) model of the speech signal as an information carrier in communication.
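The analogy with text mining can be illustrated with a toy example: once an utterance is annotated as a sequence of speech sound object labels, a "bag of SSO" and simple concordances follow directly. The label names and the helper functions are hypothetical, since no SSO inventory is defined yet.

```python
from collections import Counter


def bag_of_sso(labels):
    """Frequencies of occurrence of SSO labels, ignoring their order."""
    return Counter(labels)


def concordance(labels, target, width=1):
    """Contexts (width labels on each side) in which the target SSO occurs."""
    hits = []
    for i, label in enumerate(labels):
        if label == target:
            hits.append(tuple(labels[max(0, i - width): i + width + 1]))
    return hits


# Hypothetical SSO annotation of one utterance.
utterance = ["pitch_rise", "accent", "pause", "pitch_fall", "accent", "pause"]
bag = bag_of_sso(utterance)
contexts = concordance(utterance, "accent")
print(bag.most_common(2), contexts)
```

With such representations in place, standard NLP and data mining tools can operate on SSO sequences in the same way as on word sequences.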
Thank you for your attention
Milan Rusko
Institute of Informatics, Slovak Academy of Sciences