The Perception of Two Vocal Qualities in a Synthesized Vocal Utterance: Ring and Pressed Voice

*Christine C. Bergan, *†Ingo R. Titze, and ‡Brad Story

*Iowa City, Iowa, †Denver, Colorado, and ‡Tucson, Arizona

Summary: Two vocal qualities, ring quality and pressed quality, were analyzed perceptually. Listeners were asked to rate (on a scale from 0 to 10) the “amount of ring” in one listening and the “amount of pressedness” in another listening. The stimulus was the synthesized utterance /ya-ya-ya-ya-ya/. In the continuum representation of ring, the skewing quotient and the cross section of the epilaryngeal tube area were systematically varied, independently and by a covariation rule. In the continuum representation of pressed, the flow amplitude and open quotient were similarly varied. Results indicated that the crossover point between ring and no ring occurred with an epilaryngeal area of around 1.0 cm², and the crossover point between pressed and not pressed quality occurred at an open quotient of about 0.4. Fundamental frequency also had an effect on the perceptions, with a higher fundamental frequency receiving higher ratings of ring and pressed for otherwise the same parameters. Listeners demonstrated highly variable perceptions in both continua with poor intersubject, intrasubject, and intergroup reliability.

Key Words: Voice—Ring—Pressed—Voice quality—Timbre—Perception—Musicians.

Accepted for publication September 4, 2003.

From the *Department of Speech Pathology and Audiology, The University of Iowa, Iowa City, IA; †National Center for Voice and Speech, The Denver Center for the Performing Arts, Denver, CO; ‡University of Arizona, Tucson, AZ.

Address correspondence and reprint requests to Christine C. Bergan, ABD, CCC-SLP, Department of Speech Pathology and Audiology, The University of Iowa, Iowa City, IA 52240. E-mail: [email protected]

Journal of Voice, Vol. 18, No. 3, pp. 305–317
0892-1997/$30.00
© 2004 The Voice Foundation
doi:10.1016/j.jvoice.2003.09.004

INTRODUCTION

The perception of vocal quality is an integral part of a voice evaluation. The ability to accurately perceive such vocal qualities as roughness, resonance, and breathiness is a skill on which vocologists depend. If diagnosis and therapy of vocal pathologies are to improve, it is important to determine the reliability and validity of voice quality assessment.

In a previous study by Bergan and Titze,¹ pitch and roughness were rated according to the extent of amplitude and frequency modulation by a subharmonic (Fo/2). The objective was to determine the identification boundaries for pitch and roughness and to discover how subharmonic modulation affected these boundaries. Sustained vowels at three different fundamental frequencies were presented with varying amounts of either amplitude or frequency modulation. It was found that both musicians and nonmusicians demonstrated somewhat poor intrasubject and intersubject agreement in the rating of roughness and in the selection of a single dominant fundamental frequency; however, agreement was significantly higher in musicians than in nonmusicians, both in their ability to “select” a dominant pitch (in a method-of-adjustment task) and in the rating of roughness.

Recent research has demonstrated poor intersubject and intrasubject reliability when rating vocal qualities. This work has documented poor listener agreement in midrange ratings of breathiness and roughness,² and has examined the effect of experience and professional background on perceptual ratings of voice quality,³ the adequacy of current terminology for clinical judgment of voice quality,⁴ and the comparison of internal and external standards in voice quality judgments through the use of standard rating scales versus scales in which a set of anchors was presented prior to rating.⁵

Wuyts et al⁶ studied the reliability of a visual analog versus an ordinal scale for the perceptual evaluation of dysphonia in 14 pathological voices. Two versions of the Grade, Roughness, Breathiness, Asthenia, Strain (GRBAS) scale were presented: the original 4-point scale (0–3) and a visual analog scale broken down into ten possible ratings. Twenty-nine listeners (25 speech pathologists and 4 otolaryngologists with at least 1 year of experience in judging voice disorders) used each of the two types of scales to judge the same voices, with an interval of 2 weeks between sessions. Their results showed that the original 4-point scale resulted in higher inter-rater agreement and that even though a visual analog scale enables finer judgments of voice qualities, the increased freedom of judgment resulted in decreased inter-rater agreement. Additionally, Wuyts et al⁶ found that when more rating levels were added for a given category (eg, “Grade”), listeners tended to score in the middle and therefore did not use the opportunity of having more judgment possibilities. The number of categories and levels within which to make a judgment (or the “degrees of freedom”) may be an important factor to consider in future perceptual judgments of voice qualities. Perhaps the results of our previous and current studies may contribute to some guidelines.

In an investigation of listener disagreement in the perception of vocal qualities by Kreiman and Gerratt,⁷ the comparison was made between the use of interval or ordinal rating scales and between perception of natural pathological voices and synthetically generated voices. Listeners were asked to classify pathological voices as “having or not having different voice qualities.” This was to enable the listener to focus on “the kind of quality a voice had, rather than how much of a quality it possessed.” Listeners were instructed to judge whether these voice samples were breathy, rough, and low-pitched. Listener agreement was above chance, but there was poor agreement with regard to whether individual voices belonged to a particular perceptual class. Listener agreement was significantly higher for the synthetic stimuli than for the natural voices. The results suggested that difficulty isolating a single perceptual dimension of a complex stimulus may be one reason why traditional unidimensional rating protocols are “unsuited to measuring pathologic voice quality.”

In a study of perception of vocal quality in singers, Wapnick and Ekholm⁸ used a scale created by expert singing teachers to evaluate 19 performances, 6 of which were presented twice. Interjudge and intrajudge reliability was assessed. The expert listeners (21 total) showed a marked division into two underlying evaluative dimensions. Thirteen of them were primarily influenced by “execution” of performance, and eight of them were primarily affected by “intrinsic quality” of the voice. Intrajudge reliability for one judge was modest (Pearson r = 0.49), but interjudge reliability increased dramatically as the pool of judges increased to four or more judges (r = 0.80). Interestingly, the variability in both interjudge and intrajudge reliability “was not related to age, teaching experience, or adjudication experience. The ability to evaluate consistently thus appears to be a skill that is not necessarily acquired in the course of a normal teaching or singing career. It is unclear at present what it is related to (p. 429).”

One of our primary long-term interests in conducting these studies is to determine if vocal quality assessment is teachable. Boyle and Radocky⁹ state that “the measurement of musical performance is inherently subjective.” However, it is important to remember that subjectivity does not mean that achievement of a general consensus is impossible. “If there is no consensus, evaluation loses much of its meaning (p. 435).”

Dejonckere et al¹⁰ studied the use of the GRBAS scale for deviant voice quality in five different institutes on 943 different patients. Each voice was evaluated separately by two professionals. The inter-rater and intrarater correlations were satisfactory only for the G, R, and B evaluations. Dejonckere et al¹⁰ also found that experience with the scale significantly improved the inter-rater agreement. Although the auditory system can provide a far more sensitive and accurate measure of vocal qualities than can often be provided by the most advanced acoustic analysis, the point at which perceptual voice analysis typically “breaks down” is often the point of greatest interest (eg, where diplophonia, breathiness, roughness, aperiodicity, etc. occur). Aperiodic sounds, with lots of spectral “fill in” between the harmonics (ie, with a relatively unclear or indiscernible harmonic pattern when compared with periodic sounds), are difficult to distinguish perceptually. The importance of improving our ability to evaluate spectrally complex vocal qualities can therefore not be overstressed if we want to have a more complete and accurate picture of a voice and all of its possible pathologies.

DESCRIPTION OF THE TWO QUALITIES SELECTED

The perception of pressed voice is postulated to be a result of the amount of adduction of the vocal folds. It is quantified by the open quotient Qo, defined as the duration of opening of the glottis divided by the glottal period. Spectrally, pressed voice is characterized by high harmonic content (ie, a small spectral slope). Qo is regulated by glottal width (the space between the vocal processes of the arytenoid cartilages). Typical values of Qo range from 0.4 to 0.7 in normal vocal production.¹¹,¹² Values lower than 0.4 produce a voice that is sometimes perceived to be pressed, whereas values higher than 0.7 produce a voice sometimes perceived to be breathy. When listening to glottal flow waveforms, Scherer et al¹³ found the just noticeable difference in open quotient of glottal flow to be 0.02 and in the skewing quotient of glottal flow to be 0.15. Skewing quotient Qs is the ratio of the duration of increasing flow to the duration of decreasing flow within the glottal flow cycle. Typical values of Qs range from 1.0 to 5.0 in normal vocal production.¹⁵,¹⁶ The just noticeable differences when listening to corresponding output pressure, however, were somewhat higher, with respective values of 0.03 and 0.32. In addition to open quotient and skewing quotient, the peak-to-peak flow could be hypothesized to have a secondary effect on pressed voice because more flow could possibly diminish the perception of pressing.
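The two quotient definitions above can be made concrete with a small sketch. It builds an idealized triangular flow pulse (an assumption made purely for illustration; the study's synthesizer uses its own pulse shape) and measures Qo as open duration over period and Qs as rising duration over falling duration. The function names `triangular_pulse` and `quotients` are hypothetical.

```python
def triangular_pulse(n=100, n_rise=40, n_fall=20):
    """One glottal cycle of n samples: linear rise, linear fall, then closed (zero flow)."""
    flow = [0.0] * n
    for i in range(n_rise + 1):
        flow[i] = i / n_rise                  # increasing-flow phase
    for j in range(n_fall + 1):
        flow[n_rise + j] = 1.0 - j / n_fall   # decreasing-flow phase
    return flow


def quotients(flow):
    """Return (Qo, Qs) for one cycle that starts and ends with the glottis closed."""
    open_idx = [i for i, u in enumerate(flow) if u > 0.0]
    start = open_idx[0] - 1                   # last closed sample before opening
    end = open_idx[-1] + 1                    # first closed sample after closure
    peak = max(range(len(flow)), key=lambda i: flow[i])
    rising = peak - start                     # duration of increasing flow
    falling = end - peak                      # duration of decreasing flow
    qo = (rising + falling) / len(flow)       # open duration / glottal period
    qs = rising / falling                     # skewing quotient
    return qo, qs


qo, qs = quotients(triangular_pulse())
```

With the default pulse (40 rising samples, 20 falling samples, 100-sample period), the sketch gives Qo = 0.6 and Qs = 2.0, both inside the typical ranges quoted above.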

The perception of vocal ring is postulated to be related to epilaryngeal tube area Ae, skewing quotient Qs, and possibly Fo. The epilaryngeal tube extends from the vocal folds to the aryepiglottic folds, and in our model comprises eight sections, 3.96 mm in length each, for a total length of 3.17 cm. (Epilaryngeal tube area refers to the cross-sectional area of this tube.) It is expected that epilarynx tube area, along with open quotient and skewing quotient, can all have the effect of altering the energy in the higher partials of the spectrum. Hence, there could be some confusion between pressed voice and vocal ring. Typical ranges of epilaryngeal tube areas are 0.2–1.0 cm².¹⁴

In an interactive source-vocal tract model, Qs is governed by the inertance of the vocal tract,¹⁷ which is a strong function of the epilarynx tube area Ae.¹⁸ A narrowed epilarynx tube, together with a relatively wide pharynx, produces a boost of energy around 3000 Hz, known as the “singer’s formant” or “vocal ring.”¹⁹ This boost of energy results from a clustering of the third, fourth, and fifth formants through adjustments in the epilaryngeal area and shape of the pharyngeal cavity. Titze and Story¹⁸ found that if the epilarynx tube is narrow, the input impedance of the vocal tract becomes comparable with the glottal impedance and can thus “shape the flow and influence the mode of vibration.” Higher pitches seem to inherently contain more potential for this “ring” quality because lower harmonics are in this “singer’s formant” region. As lower harmonics typically carry more energy than higher harmonics in an idealized source spectrum, when these lower partials reside in the singer’s formant area, the result is that of increased intensity.¹⁹–²¹

PURPOSE AND RESEARCH QUESTIONS

The purpose of this study was to determine how selected acoustic parameters result in the perception of ring vocal quality and pressed vocal quality. Subjects were presented with the sentence-length utterance /ya-ya-ya-ya-ya/ created with a voice simulation model described in the procedures below. The utterance is semantically similar to “yes, I know,” or “I’ve heard this before.” It is continuously voiced and contains a speech-like intonation contour. The mean fundamental frequency was either 200 or 300 Hz. For the 200-Hz case, the utterance had an Fo range of 100–200 Hz; for 300 Hz, a proportionately higher Fo range was obtained by frequency transformation. For future discussion, these Fo intonation contours will be referred to simply in terms of mean Fo values.

The research questions to be answered were as follows: (1) How do the acoustic parameters Qs and Ae individually correlate with the perception of vocal ring? (2) Does the covariation of Qs and Ae result in a greater perception of ring than either one alone? (3) How do the acoustic parameters Qo and Um (the peak-to-peak flow) individually correlate with the perception of pressed vocal quality? (4) Does the covariation of the Qo and the Um result in a greater perception of pressed vocal quality? (5) Does the fundamental frequency Fo affect the perception of ring or pressed vocal qualities? (6) How variable are the subjects’ abilities to rate these qualities? (7) Is there a significant difference in intersubject and intrasubject variability between musicians and nonmusicians?

PROCEDURES

Subjects

Twenty listener volunteers were recruited. All of them reported normal hearing. The ages of the listeners ranged from 18 to 55, with a mean age of 26 years. Amount of musical background or training ranged from none or slight (10 subjects) to well-trained (10 subjects). A well-trained subject was defined as one whose minimum amount of training included a bachelor’s degree in music. Musicians and nonmusicians were also matched by gender, with 5 females and 5 males in each category.

Experimental procedures and instructions

Listeners were presented with the synthetically simulated sentence “ya-ya-ya-ya-ya.” They were asked to rate each sentence on a scale from 1 to 10 in each continuum, with a rating of “1” signifying very little of that particular vocal quality (eg, not very “pressed”) and a rating of “10” signifying a great deal of that vocal quality (eg, very “pressed”). Listeners were encouraged to use the full available range of ratings, according to their perceived differences in these qualities.

Twenty different stimuli were randomly presented three times for a total of 60 presentations in each continuum (ring and pressed). The 20 stimuli comprised the covariations of mean fundamental frequency, open quotient, skewing quotient, epilaryngeal area, and peak flow. No two identical stimuli were sequentially presented.
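One way to generate such a presentation order is sketched below. The text does not say how the randomization was carried out, so the reshuffle-until-valid (rejection sampling) approach, the function name `presentation_order`, and the `seed` parameter are all assumptions for illustration.

```python
import random


def presentation_order(n_stimuli=20, n_repeats=3, seed=None):
    """Random order of n_stimuli * n_repeats trials with no stimulus repeated back to back."""
    rng = random.Random(seed)
    order = [s for s in range(n_stimuli) for _ in range(n_repeats)]
    while True:
        rng.shuffle(order)
        # Accept the shuffle only if no two neighbouring trials use the same stimulus.
        if all(a != b for a, b in zip(order, order[1:])):
            return order


order = presentation_order(seed=1)  # 60 stimulus indices in the range 0..19
```

Rejection sampling is simple and keeps every valid ordering equally likely; for 20 stimuli repeated three times a valid shuffle is usually found within a handful of attempts.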

A preliminary trial set of all 20 stimuli was first presented to the listeners to allow them to hear the full range of subsequently rated stimuli. The purpose was to ensure that the listeners had some frame of reference against which to compare the stimuli to follow; otherwise, the scale would have limited validity because the listener would have no concept of the relative range of possible vocal qualities.

No definition of “ring” or “pressed” voice was given by the experimenter; rather, the listeners were instructed to judge each stimulus according to their own personal interpretation or internal standards of what “ring” and “pressed” vocal quality should sound like. Many of the listeners voiced some anxiety about not being “told” or “shown” precisely what they were supposed to listen for (eg, what does “ring” sound like?). They were assured that one of the purposes of this study was to demonstrate the variability in listeners’ perceptual ratings of these two qualities due to the individualized definitions of these terms. Variability of the listeners’ responses was measured through calculation of the standard deviation of ratings for each specific stimulus (intersubject variability) and through a calculation of the amount of variance each individual listener demonstrated when rating the same stimulus in three different presentations (within-subject variability).
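The two variability measures can be sketched as follows. The data layout (one dictionary per listener, mapping a stimulus label to its three ratings), the example ratings, and the choice to average each listener's three ratings before taking the intersubject standard deviation are assumptions; the text does not specify these details.

```python
from statistics import mean, stdev, variance


def intersubject_sd(ratings, stimulus):
    """SD across listeners of each listener's mean rating for one stimulus."""
    per_listener = [mean(listener[stimulus]) for listener in ratings]
    return stdev(per_listener)


def within_subject_variance(ratings, listener_index, stimulus):
    """Variance of one listener's three ratings of the same stimulus."""
    return variance(ratings[listener_index][stimulus])


# Hypothetical ratings: three listeners, one stimulus, three presentations each.
example = [
    {"s1": [4, 5, 6]},
    {"s1": [6, 7, 8]},
    {"s1": [2, 3, 4]},
]
sd_s1 = intersubject_sd(example, "s1")
var_l0_s1 = within_subject_variance(example, 0, "s1")
```

Both functions use the sample (n − 1) forms from the standard library; a population form could be substituted if that matches the intended analysis.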

INSTRUMENTATION AND STIMULI

Stimuli were presented through a Sony CD player (Sony Corporation, New York, NY) and over loudspeakers, with a constant loudness setting7 and distance from the listeners (8–10 ft). Stimuli were presented free-field in a sound-controlled room, and the same room was used for all listeners and all listening tasks.


The sound stimuli were generated with a computer model,²² which computes a mathematical flow pulse as a sound source and a vocal tract area function as a filter. Several glottal source parameters and the area functions were controlled independently. The glottal parameters are as follows: peak flow, fundamental frequency, open quotient, skewing quotient, and the cross-sectional area of the epilaryngeal tube. Table 1 represents all nominal parameters pertinent to this study with typical default values.

The parameters remained at these nominal values, unless they were deliberately varied as follows: peak flow values were 24.47, 95.49, 345.50, 793.89, and 1000.00 ml/sec; mean fundamental frequency was either 200 or 300 Hz; open quotient values were 0.1, 0.2, 0.4, 0.7, and 1.0; skewing quotient values were 3.16, 1.82, 1.41, 1.19, and 1.0; and epilaryngeal area values were 0.2, 0.6, 1.0, 1.4, and 2.0 cm². With only 20 stimuli, it is obvious that not all combinations were taken. There was considerable covariation of these parameters, as described below.

In order to approximate some source-filter interaction with an otherwise noninteractive model, it was necessary to invent a rule. The skewing quotient Qs was chosen to covary inversely with epilaryngeal area Ae (in cm²) as follows:

$$Q_s = (2/A_e)^{1/2}. \tag{1}$$

With this formula, as the epilaryngeal tube area was increased from 0.2 to 2.0 cm², the skewing quotient simultaneously decreased from 3.16 to 1.0.
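Equation 1 can be checked against the five stimulus values listed earlier with a few lines of Python. This is only an arithmetic sketch; the function name `skewing_quotient` and the list names are illustrative.

```python
import math


def skewing_quotient(ae_cm2):
    """Equation 1: Qs = (2 / Ae)^(1/2), with Ae in cm^2."""
    return math.sqrt(2.0 / ae_cm2)


AREAS_CM2 = [0.2, 0.6, 1.0, 1.4, 2.0]             # epilaryngeal areas used as stimuli
REPORTED_QS = [3.16, 1.82, 1.41, 1.19, 1.0]       # skewing quotients listed in the text

computed_qs = [skewing_quotient(a) for a in AREAS_CM2]
max_gap = max(abs(c - r) for c, r in zip(computed_qs, REPORTED_QS))
```

The computed values agree with the listed skewing quotients to within 0.01, which suggests the listed values were truncated rather than rounded in a couple of cases.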

TABLE 1. Nominal Parameter Values for Simulation Model

| Parameter | Symbol | Default | Assumed Perceptual Effect |
|---|---|---|---|
| Peak flow | Um | 500 cm³/s | Loudness and pressedness |
| Fundamental frequency | Fo | 100 Hz | Pitch |
| Open quotient | Qo | 0.6 | Pressedness |
| Skewing quotient | Qs | 1.7 | Ring |
| Area of epilarynx tube | Ae | 0.5 cm² | Ring |

A further covariation between peak glottal flow Um and open quotient Qo was developed as shown in Figure 1. Assume that vocal fold displacement x is represented by a raised sinusoid of the form

$$x = x_o + A \sin\theta, \tag{2}$$

where $x_o$ is the rest position and $\theta = \omega t$ is radian time. If we now define $-\alpha$ and $\pi + \alpha$ to be the radian times when the vocal folds collide, then

$$0 = x_o + A \sin(-\alpha), \tag{3}$$

which yields

$$x_o = A \sin\alpha. \tag{4}$$

The open quotient $Q_o$ is defined as the ratio of duration of no collision to the radian period ($2\pi$),

$$Q_o = (\pi + 2\alpha)/2\pi = 0.5 + \alpha/\pi. \tag{5}$$

Finally, according to Equation 2, the peak displacement is

$$x_m = x_o + A \tag{6}$$

$$\phantom{x_m} = A(1 + \sin\alpha). \tag{7}$$

Substituting $\alpha$ in terms of $Q_o$ from Equation 5:

$$x_m = A[1.0 + \sin\pi(Q_o - 0.5)]. \tag{8}$$

Note that when $Q_o = 0.5$, the maximum displacement is $A$, and when $Q_o = 1.0$, the maximum displacement is $2A$. Now assuming that maximum flow and maximum displacement are roughly proportional, we estimated that the maximum flow should covary with open quotient in the same way

$$U_m = U_{om}[1 + \sin\pi(Q_o - 0.5)]. \tag{9}$$

With this rule, as $Q_o$ was varied from 0.1 to 1.0, the peak flow $U_m$ varied from $0.049\,U_{om}$ to $2\,U_{om}$, where $U_{om}$ is the default value of 500 cm³/s given in Table 1.
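As a numerical check, Equation 9 can be evaluated at the five open quotient values and compared with the peak flow values listed in the stimulus description. The sketch assumes the default Uom of 500 cm³/s from Table 1; the function name `peak_flow` and the list names are illustrative.

```python
import math

U_OM = 500.0  # default peak flow from Table 1, in cm^3/s


def peak_flow(qo, u_om=U_OM):
    """Equation 9: Um = Uom * [1 + sin(pi * (Qo - 0.5))]."""
    return u_om * (1.0 + math.sin(math.pi * (qo - 0.5)))


OPEN_QUOTIENTS = [0.1, 0.2, 0.4, 0.7, 1.0]
REPORTED_UM = [24.47, 95.49, 345.50, 793.89, 1000.00]   # values listed in the text

computed_um = [peak_flow(q) for q in OPEN_QUOTIENTS]
```

The computed flows match the listed values to within 0.02 cm³/s, and Qo = 0.5 returns Uom itself, consistent with the note that the maximum displacement equals A at that open quotient.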

RESULTS AND DISCUSSION

FIGURE 1. Relating maximum vocal fold displacement xm to amplitude A and mean displacement xo.

Selected stimuli from our study were first Fourier analyzed to compare the relative levels of the formants for ring and pressed conditions. Time-wise identical /ɑ/ segments of the second [ya] syllable in the /ya ya ya ya ya/ utterance were analyzed to make comparisons. As formant frequencies for vowels are sensitive to vocal tract constrictions, it was expected that some alteration in vowel perception may have occurred, especially for vocal ring. Narrowing the epilaryngeal tube area raises the first formant and lowers the second formant, but these changes are inevitable in any study of voice quality. Vowel quality and voice quality are basically inseparable when a vocal tract change is imposed.

For the ring continuum, when the skewing quotient was raised from 1.0 to 3.16 and the epilaryngeal area was simultaneously reduced from 2.0 cm² to 0.2 cm², there was a significant increase in acoustic energy in the F3-F4 formant region [2500–4000 Hz], as seen in Figure 2. This energy increase of about 30 dB relative to the F1 level accounts for the perception of ring.²³ It is also evident that formant clustering (drawing together of formant peaks) has occurred. F4 was drawn closer to F3, from 3600 Hz to 3200 Hz, and F5 was lowered from 4600 Hz to 4300 Hz. In terms of a fundamental frequency effect (not shown), increasing Fo from 200 Hz to 300 Hz further increased the energy in the 2500–4000-Hz range by about 10 dB. For this higher Fo, the levels of F3 and F4 were nearly the same as the levels of F1.

For the pressed continuum (Figure 3), results were similar in that high-frequency energy was raised, but the main effect was an overall change in the spectral tilt rather than formant clustering. Note the difference in spectral tilt between the “most pressed” and “least pressed” cases shown. The frequency spacing between the formant peaks did not change, however; there was no formant clustering. A greater drop of energy in the fundamental (200 Hz) is seen from “least pressed” to “most pressed” than from “least ring” to “most ring” in the previous figure. The level of Fo dropped from about 10 dB below F1 in Figure 2 to as much as 30 dB below F1 in Figure 3. Thus, pressed quality results from an overall “whitening” of the sound, whereas ring quality retains more low-frequency energy while the “singer’s formant” cluster is raised. Raising mean Fo from 200 Hz to 300 Hz had no profound effect on the spectrum.

FIGURE 2. Fast Fourier transform (FFT) of a segment of /ɑ/ in the utterance /ya-ya-ya-ya-ya/ for Qs = 1.0, Ae = 2.0 cm² (least ring), and Qs = 3.16, Ae = 0.2 cm² (most ring). All other parameters are nominal.

FIGURE 3. Fast Fourier transform (FFT) of a segment of /ɑ/ in the utterance /ya-ya-ya-ya-ya/ for Um = 1000 cm³/s and Qo = 1.0 (least pressed) and Um = 24.47 cm³/s, Qo = 0.1 (most pressed). All other parameters are nominal.

Perceptual ratings demonstrated a pitch and magnitude effect (of skewing quotient, open quotient, size of epilaryngeal area, and peak flow) on the ratings of ring and pressed quality by the listeners. Variation of open quotient alone had a dramatic impact on the resultant perception of pressed voice (Figure 4, filled triangles). In particular, the variation of open quotient alone yielded an average overall rating of 5.49, with a range of 2.18–8.13. The variation of peak flow alone (open triangles) yielded an average rating of 4.76 with a very limited range of 4.58–4.89. The simultaneous covariation of open quotient and peak flow (open circles) yielded an average rating of 5.55, with a range of 2.61–8.31. This curve is essentially the same as the one for open quotient alone. Finally, raising mean Fo from 200 Hz to 300 Hz (filled circles) resulted in an overall increase in the pressed rating, with an average rating of 7.31 and a range of 4.82–9.0.

FIGURE 4. Listener’s ratings of pressed quality for values of open quotient (bottom axis) and peak flow (top axis), in isolation and in covariation.

Figure 5 shows the listener’s rating of ring quality with changes in skewing quotient and epilarynx tube area. In particular, the variation of epilaryngeal area alone (open triangles) yielded an average rating of 4.66, with a range of 3.48–6.29, whereas the variation of skewing quotient alone (filled triangles) yielded an average rating of 4.84, with a range of 3.35–7.0. Neither had a significantly stronger impact than the other. When epilaryngeal area and skewing quotient were covaried according to Equation 1, the perception of ring was slightly higher (open circles). The average rating was 4.87, and the range was 2.52–7.35. When Fo was raised from 200 Hz to 300 Hz, a significantly higher rating of vocal ring was obtained (filled circles). The average rating was 6.03, and the range was 3.36–8.96.

Statistical analysis

A mixed model ANOVA was performed for each subject group, parameter type, parameter magnitude, and pitch to determine the correlation to the resulting rating. The purpose was to determine the relative impact each parameter had on the slope of the regression line and, therefore, on the resulting rating. Tables 2 and 3 summarize the results.

For the ring continuum (Table 2), the results may be interpreted as follows: When the epilaryngeal area Ae was increased by 1 cm², the rating decreased by 1.33 points and had a y-intercept of 5.92. When the skewing quotient Qs was increased by 1.0 unit, the rating of ring went up by 1.11 points and had a y-intercept of 2.67. When both the Ae and Qs were simultaneously varied, the slopes changed slightly and the intercept was basically the average of the two. When Fo was raised from 200 Hz to 300 Hz (with simultaneous Ae and Qs variation), the slope of Ae became slightly greater and the slope of Qs slightly smaller. The effect of Fo alone was to raise the perception of ring by 1.38 units per musical fifth increase in Fo (200 Hz to 300 Hz).

FIGURE 5. Listener’s ratings of ring quality for values of skewing quotient (bottom axis) and epilarynx tube area (top axis), in isolation and in covariation.

For the pressed continuum (Table 3), the results may be interpreted in the same way as described above for the ring continuum. The results suggest that the open quotient Qo is inversely related to the rating of pressed quality and that the magnitude of peak flow Um is also inversely related (but weakly) to the rating of pressed quality. In fact, the regression equation for “Um alone” failed to reach significance. If the regression equation was plotted out for the “Um alone” condition, the ratings of 5.12, 5.09, 5.00, 4.84, and 4.77 would appear almost as a straight line relative to other variations.
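The near-flat “Um alone” line can be reproduced by evaluating the Table 3 regression equations at the five stimulus values. In this sketch the coefficient signs are taken from the inverse relations described in the text, and the function and variable names are illustrative.

```python
# Stimulus values listed in the stimulus description.
PEAK_FLOWS = [24.47, 95.49, 345.50, 793.89, 1000.00]   # Um
OPEN_QUOTIENTS = [0.1, 0.2, 0.4, 0.7, 1.0]             # Qo
REPORTED_UM_RATINGS = [5.12, 5.09, 5.00, 4.84, 4.77]   # ratings quoted in the text


def rating_um_alone(um):
    """Table 3, "Peak flow (Um)" row: R = 5.13 - 0.00036 * Um."""
    return 5.13 - 0.00036 * um


def rating_qo_alone(qo):
    """Table 3, "Open Quotient (Qo)" row: R = 8.45 - 6.59 * Qo."""
    return 8.45 - 6.59 * qo


um_ratings = [rating_um_alone(u) for u in PEAK_FLOWS]
qo_ratings = [rating_qo_alone(q) for q in OPEN_QUOTIENTS]
um_span = max(um_ratings) - min(um_ratings)   # about 0.35 rating units
qo_span = max(qo_ratings) - min(qo_ratings)   # about 5.9 rating units
```

The predicted “Um alone” ratings agree with the quoted values to within 0.01 (small differences are presumably due to rounding of the published coefficients), and the predicted rating span for Qo alone is more than ten times larger than for Um alone.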

TABLE 2. Summary Statistics for “Ring” Continuum (R = rating)

| Variation | Regression Equation | t-value (and p-value) | F-value (and p-value) | Confidence Interval (95%) |
|---|---|---|---|---|
| Epilaryngeal Area (Ae) | R = 5.92 − 1.33(Ae) | −9.93 (p < .0001) | 98.68 (p < .0001) | [−1.59, −1.07] |
| Skewing Quotient (Qs) | R = 2.67 + 1.11(Qs) | 9.33 (p < .0001) | 87.03 (p < .0001) | [.88, 1.35] |
| Both Ae and Qs | R = 4.20 − 1.21(Ae) + 1.00(Qs) | Ae: −10.71 (p < .0001) | Ae: 114.74 (p < .0001) | [−1.44, 1.00] |
| | | Qs: 10.86 (p = .0009) | Qs: 118.01 (p < .0001) | [.82, 1.18] |
| Both Ae and Qs with Fo 300 Hz | R = 6.24 − 1.75(Ae) + .85(Qs) | Ae: −5.64 (p < .0001) | Ae: 31.78 (p < .0001) | [−2.37, 1.14] |
| | | Qs: 3.39 (p = .0009) | Qs: 11.49 (p = .0009) | [.36, −1.35] |
| Fo alone | R = 4.62 + 1.38(Fo) | 8.35 (p < .0001) | 69.65 (p < .0001) | [1.05, 1.10] |

TABLE 3. Summary Statistics for “Pressed” Continuum (R = rating)

| Variation | Regression Equation | t-value (and p-value) | F-value (and p-value) | Confidence Interval (95%) |
|---|---|---|---|---|
| Peak flow (Um) | R = 5.13 − 0.00036(Um) | −2.50 (p = .0130) | 6.23 (p = .0130) | [−0.0006, 0.00008] |
| Open Quotient (Qo) | R = 8.45 − 6.59(Qo) | −27.61 (p < .0001) | 762.13 (p < .0001) | [−7.06, −6.12] |
| Both Um and Qo | R = 8.13 − 6.15(Qo) − .0002(Um) | −33.12 (p < .0001) | 1096.65 (p < .0001) | [−6.51, −5.78] |
| | | −1.59 (p = .1115) | 2.54 (p = .1115) | [−0.00054, −0.000056] |
| Both Um and Qo with Fo 300 Hz | R = 9.79 − 4.76(Qo) − .00045(Um) | Qo: −7.30 (p < .0001) | 53.31 (p < .0001) | [−6.05, −3.48] |
| | | Um: −0.85 (p = .3939) | Um: 0.73 (p = .3939) | [−0.0015, 0.0006] |
| Fo alone | R = 5.26 + 2.19(Fo) | 14.44 (p < .0001) | 208.5 (p < .0001) | [1.90, 2.49] |

Standard deviations of perceptual ratings are shown in Tables 4 and 5. In general, the standard deviations are on the order of 1.0 rating units. In a few cases, they are as high as 2.0 rating units, particularly when a parameter such as Qo or Ae reached its maximum values. This is understandable because typical categorical perception functions saturate at the endpoints where the effect of a parameter has been fully exploited.

Intrasubject and intersubject reliability were analyzed through intergroup agreement (musicians vs. nonmusicians) with the correlation procedure and the application of the Cronbach coefficient alpha. The “Type 1” error rate of the correlation procedure gave us the reliability between groups. The Cronbach coefficient measures the consistency of the rank ordering of the responses across the three repeated observations. It reveals the reliability of the intersubject differences and measures how systematic their responses are. In particular, it measures how consistent the rating of each person’s three responses is for a “like” item when compared with other responses from themselves and from others. It therefore gives a measure of both intrasubject and intersubject variability.
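The coefficient can be illustrated with a short calculation. The sketch below implements the standard Cronbach's alpha formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), with each listener as a row and each of the three repetitions as a column. The ratings are hypothetical, and the function name and data layout are assumptions for illustration rather than a description of the software used in the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; rows are listeners, columns are repeated ratings."""
    k = len(scores[0])  # number of repeated observations per listener
    item_variances = [variance(col) for col in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0-10 ratings: four listeners, three repetitions of one stimulus.
consistent = [[2.0, 3.0, 2.0], [5.0, 5.0, 6.0], [8.0, 7.0, 8.0], [4.0, 4.0, 5.0]]
inconsistent = [[2.0, 8.0, 5.0], [5.0, 2.0, 8.0], [8.0, 5.0, 3.0], [5.0, 6.0, 4.0]]

alpha_consistent = cronbach_alpha(consistent)
alpha_inconsistent = cronbach_alpha(inconsistent)

# A negative alpha is interpreted as zero when reported.
reported_inconsistent = max(0.0, alpha_inconsistent)

print(f"consistent listeners:   alpha = {alpha_consistent:.2f}")
print(f"inconsistent listeners: alpha = {alpha_inconsistent:.2f} "
      f"(reported as {reported_inconsistent:.2f})")
```

When the listeners keep the same rank order across repetitions, alpha approaches 1. When the rank order changes from one repetition to the next, the item variances exceed the variance of the summed scores and alpha falls below zero.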

Cronbach alpha values and regression equations differ in that the regression equation relates to the entire data set, whereas Cronbach’s alpha relates to the average proportion of systematic variance in any one subject and in any single condition. Consequently, the power to detect differences is much greater for the entire data set than it would be if one were evaluating only a single condition at a time. If the overall reliability of a relationship is thought of as a joint probability equation in which the reliability of a set of observations would equal [(1 − Cronbach)^(number of subjects tested)], then the reliability of the group score will be considerably higher than the reliability of any one subject’s score. Because of this, it is quite possible to detect group effects even when the alpha values are relatively low.

TABLE 4. Standard Deviations for the Pressed Continuum

| Qo/Um Values | Qo and Um 300 Hz | Qo and Um 200 Hz | Qo alone 200 Hz | Um alone 200 Hz |
|---|---|---|---|---|
| 0.1/24.47 | 0.92 | 1.33 | **1.77** | 0.85 |
| 0.2/95.49 | 0.88 | 1.39 | 1.00 | **1.31** |
| 0.4/345.50 | 0.87 | 1.04 | 1.05 | 1.23 |
| 0.7/793.89 | 1.61 | 1.34 | 1.05 | 1.21 |
| 1.0/1000.00 | **2.21** | **1.48** | 1.52 | 0.94 |

Boldface indicates greatest standard deviation for that condition.

TABLE 5. Standard Deviations for the Ring Continuum

| Qs/Ae Values | Qs and Ae 300 Hz | Qs and Ae 200 Hz | Qs alone 200 Hz | Ae alone 200 Hz |
|---|---|---|---|---|
| 3.16/0.2 | 1.26 | 0.96 | 1.34 | 0.99 |
| 1.82/0.6 | 1.60 | 1.11 | 1.04 | 0.93 |
| 1.41/1.0 | 1.57 | 1.13 | 1.25 | 1.08 |
| 1.19/1.4 | **2.07** | 1.41 | 1.54 | 1.23 |
| 1.00/2.0 | 2.04 | **1.64** | **1.59** | **1.36** |

Boldface indicates greatest standard deviation for that condition.

Table 6 shows the Cronbach coefficient alpha levels (columns 4–6) for musicians and nonmusicians for all 20 parameter combinations (columns 1–3) that were used to simulate ring quality. Theoretically, alpha ranges from 0 to 1 and may be compared with a correlation coefficient. A high alpha level signifies stability in the ratings, and a similar alpha level between groups signifies a shared variance across the observations. If a negative alpha value is obtained, as in row 9, it is interpreted as a zero value because random error components failed to be averaged out. Boldfaced numbers in column 4 of Table 6 are cases for which musicians were more reliable in their ratings than nonmusicians. Note that there are 12 of 20 such cases. Table 7 shows the same analysis for pressed voice. For this quality, the musicians scored higher than the nonmusicians in 9 of 20 cases. Thus, although musicians appeared to be a little better in judging ring quality than pressed quality, there was no overall significant difference between musicians and nonmusicians with regard to intrasubject reliability or consistency of their rating of these voice qualities.
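The two counts can be reproduced directly from the tabulated alphas. The sketch below transcribes the musician and nonmusician columns of Tables 6 and 7, row by row, and counts the rows in which the musicians' alpha is the higher of the two. The variable and function names are illustrative.

```python
# Each pair is (musicians, nonmusicians), in the row order of the tables.
RING_ALPHAS = [  # Table 6
    (0.67, 0.97), (0.81, 0.76), (0.78, 0.61), (0.94, 0.96), (0.92, 0.72),
    (0.61, 0.21), (0.82, 0.72), (0.78, 0.75), (0.65, -0.11), (0.81, 0.94),
    (0.74, 0.69), (0.44, 0.59), (0.50, 0.63), (0.76, 0.88), (0.84, 0.74),
    (0.04, 0.67), (0.52, 0.84), (0.82, 0.59), (0.90, 0.79), (0.92, 0.65),
]
PRESSED_ALPHAS = [  # Table 7
    (0.86, 0.94), (0.66, 0.75), (0.19, 0.11), (0.94, 0.90), (0.88, 0.75),
    (0.78, 0.85), (0.93, 0.94), (0.75, 0.47), (0.53, 0.95), (0.88, 0.96),
    (0.88, 0.92), (0.69, 0.84), (0.77, 0.30), (0.70, 0.68), (0.77, 0.96),
    (0.59, 0.44), (0.44, 0.55), (-0.24, 0.43), (0.91, 0.87), (0.66, 0.32),
]


def count_musicians_higher(pairs):
    """Number of stimuli for which the musicians' alpha exceeds the nonmusicians'."""
    return sum(1 for musicians, nonmusicians in pairs if musicians > nonmusicians)


ring_count = count_musicians_higher(RING_ALPHAS)
pressed_count = count_musicians_higher(PRESSED_ALPHAS)

print(f"Ring:    musicians higher in {ring_count} of {len(RING_ALPHAS)} cases")
print(f"Pressed: musicians higher in {pressed_count} of {len(PRESSED_ALPHAS)} cases")
```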

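TABLE 6. Cronbach Coefficient Alpha Levels and Standard Deviations of Ratings of “Ring”

| Qs | Ae | Fo | Musicians | Nonmusicians | Combined |
|---|---|---|---|---|---|
| 3.16 | 0.2 | 300 | 0.67 | 0.97 | 0.93 |
| 1.82 | 0.6 | 300 | **0.81** | 0.76 | 0.82 |
| 1.41 | 1.0 | 300 | **0.78** | 0.61 | 0.69 |
| 1.19 | 1.4 | 300 | 0.94 | 0.96 | 0.94 |
| 1.00 | 2.0 | 300 | **0.92** | 0.72 | 0.92 |
| 3.16 | 0.2 | 200 | **0.61** | 0.21 | 0.59 |
| 1.82 | 0.6 | 200 | **0.82** | 0.72 | 0.78 |
| 1.41 | 1.0 | 200 | **0.78** | 0.75 | 0.71 |
| 1.19 | 1.4 | 200 | **0.65** | −0.11 | 0.47 |
| 1.00 | 2.0 | 200 | 0.81 | 0.94 | 0.87 |
| 3.16 | 1.0 | 200 | **0.74** | 0.69 | 0.74 |
| 1.82 | 1.0 | 200 | 0.44 | 0.59 | 0.54 |
| 1.41 | 1.0 | 200 | 0.50 | 0.63 | 0.52 |
| 1.19 | 1.0 | 200 | 0.76 | 0.88 | 0.82 |
| 1.00 | 1.0 | 200 | **0.84** | 0.74 | 0.71 |
| 1.70 | 0.2 | 200 | 0.04 | 0.67 | 0.72 |
| 1.70 | 0.6 | 200 | 0.52 | 0.84 | 0.70 |
| 1.70 | 1.0 | 200 | **0.82** | 0.59 | 0.60 |
| 1.70 | 1.4 | 200 | **0.90** | 0.79 | 0.82 |
| 1.70 | 2.0 | 200 | **0.92** | 0.65 | 0.79 |

The first three columns are the stimulus parameters; the last three are Cronbach coefficient alpha values. Boldface indicates higher alpha in musicians than in nonmusicians and hence greater reliability of responses.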

TABLE 7. Cronbach Coefficient Alpha Levels and Standard Deviations of Ratings of “Pressedness”

| Qo | Um | Fo | Musicians | Nonmusicians | Combined |
|---|---|---|---|---|---|
| 0.1 | 24 | 300 | 0.86 | 0.94 | 0.88 |
| 0.2 | 95 | 300 | 0.66 | 0.75 | 0.70 |
| 0.4 | 345 | 300 | **0.19** | 0.11 | 0.20 |
| 0.7 | 794 | 300 | **0.94** | 0.90 | 0.92 |
| 1.0 | 1000 | 300 | **0.88** | 0.75 | 0.85 |
| 0.1 | 24 | 200 | 0.78 | 0.85 | 0.77 |
| 0.2 | 95 | 200 | 0.93 | 0.94 | 0.94 |
| 0.4 | 345 | 200 | **0.75** | 0.47 | 0.69 |
| 0.7 | 794 | 200 | 0.53 | 0.95 | 0.85 |
| 1.0 | 1000 | 200 | 0.88 | 0.96 | 0.95 |
| 0.1 | 345 | 200 | 0.88 | 0.92 | 0.89 |
| 0.2 | 345 | 200 | 0.69 | 0.84 | 0.81 |
| 0.4 | 345 | 200 | **0.77** | 0.30 | 0.65 |
| 0.7 | 345 | 200 | **0.70** | 0.68 | 0.64 |
| 1.0 | 345 | 200 | 0.77 | 0.96 | 0.87 |
| 0.6 | 24 | 200 | **0.59** | 0.44 | 0.44 |
| 0.6 | 95 | 200 | 0.44 | 0.55 | 0.44 |
| 0.6 | 345 | 200 | −0.24 | 0.43 | −0.08 |
| 0.6 | 794 | 200 | **0.91** | 0.87 | 0.87 |
| 0.6 | 1000 | 200 | **0.66** | 0.32 | 0.64 |

The first three columns are the stimulus parameters; the last three are Cronbach coefficient alpha values. Boldface indicates higher alpha in musicians than in nonmusicians and hence greater reliability of responses.

CONCLUSIONS AND IMPLICATIONS FOR FUTURE RESEARCH

The perceptions of ring and pressed voice quality are both related to an increase in high-frequency energy (2000–4000 Hz), as our spectral analyses showed. But in the ring quality, there is a retention of energy in the fundamental frequency and in the first formant region. In fact, ring quality retains the overall spectral balance of normal voice at both extremes of the spectrum. In pressed quality, on the other hand, the overall reduced spectral tilt diminishes the energy in Fo and around F1, while boosting the high-frequency energy around F4–F5. This is responsible for a “whiter” sound. Another observation in our current study was that pressed quality was captured by a single variable, the open quotient in the time domain and spectral tilt in the frequency domain, whereas ring quality was almost equally attributable to the skewing quotient of glottal airflow and epilaryngeal tube area. With a larger skewing quotient, more high-frequency energy is produced, whereas with a narrower epilarynx tube area, more high-frequency energy is resonated. The result is somewhat synergistic; because a narrowed epilarynx tube can assist in vocal fold vibration,18 more overall source energy can be created.

In a previous study,1 we found that musicians had greater intersubject agreement and therefore smaller variance than nonmusicians (as a group), but the judgments concerned pitch and roughness (lack of harmonicity), for which musicians receive specific training. In this study, we found no significant difference in the overall variability of responses between the musicians and the nonmusicians. In the ring continuum, the nonmusicians had greater variability in 8 out of the 20 possible sets of stimuli. In the pressed continuum, the nonmusicians had greater variability than the musicians in 11 out of 20 possible sets of stimuli. It appears that the ability to agree consistently with others in one's own group, whether musicians or nonmusicians, is roughly the same.

One possible explanation for the difference in group results between this study and our former study is that the previous study better tapped the inherent advantage of musicians, which is the ability to match and correctly identify or differentiate pitch. Additionally, the perception of roughness (inharmonicity) may also be a quality better trained in a typical musician’s ear. In contrast, the qualities of ring and pressedness may be less thoroughly “trained” in the ear of the musician, thus in effect leveling the playing field between them and their nonmusician peers.

In the future, it may be possible to create training tapes comprising a variety of anchors that represent the entire perceptual continuum of one vocal quality (with a just noticeable difference being the unit of change). It would then be possible to set up pretraining and posttraining testing of subjects’ perception of vocal qualities. A goal would be to increase intersubject and intrasubject reliability.


Acknowledgments: The authors would like to thank Drs. Eric Hunter, Greg Flamme, and Kate Cowles for their kind assistance with parts of the statistical analysis. Funding for this research was provided through Grant 5R01 DC–04224-02.

REFERENCES

1. Bergan C, Titze I. Perception of pitch and roughness in vocal signals with subharmonics. J Voice. 2001;1:165–175.

2. Kreiman J, Gerratt BR. Validity of rating scale measures of voice quality. J Acoust Soc Am. 1998;104:1598–1608.

3. DeBodt MS, Wuyts FL, Van de Heyning PH, Croux C. Test-retest study of the GRBAS scale: influence of experience and professional background on perceptual rating of voice quality. J Voice. 1997;11(1):74–80.

4. Jensen PJ. Adequacy of terminology for clinical judgment of voice quality deviation. Eye Ear Nose Throat Month. 1965;44:77–82.

5. Gerratt BR, Kreiman J, Antononzas-Barroso N, Berke GS. Comparing internal and external standards in voice quality judgments. J Speech Hear Res. 1993;36:14–20.

6. Wuyts FL, De Bodt MS, Van de Heyning PH. Is the reliability of a visual analog scale higher than an ordinal scale? An experiment with the GRBAS scale for perceptual evaluation of dysphonia. J Voice. 1999;13:508–517.

7. Kreiman J, Gerratt BR. Sources of disagreement in voice quality assessment. J Acoust Soc Am. 2000;108:1867–1876.

8. Wapnick J, Ekholm E. Expert consensus in solo voice performance evaluation. J Voice. 1997;11:429–436.

9. Boyle J, Radocky R. Measurement and Evaluation of Musical Experiences. New York: Schirmer Books; 1987.

10. Dejonckere PH, Remacle M, Fresnel-Elbaz E, Woisard V, Crevier L, Millet B. Reliability and clinical relevance of perceptual evaluation of pathological voices. Revue de Laryngologie Otologie Rhinologie. 1998;119:247–248.

11. Holmberg EB, Hillman RE, Perkell JS, Guiod PC, Goldman SL. Comparisons among aerodynamic, electroglottographic, and acoustic spectral measures of female voice. J Speech Hear Res. 1995;38:1212–1223.

12. Stathopoulos ET, Sapienza CM. Developmental changes in laryngeal and respiratory function with variations in sound pressure level. J Speech Lang Hear Res. 1997;40:595–614.

13. Scherer RC, Arehart KH, Guo CG, Milstein CF, Horii Y. Just noticeable differences for glottal flow waveform characteristics. J Voice. 1998;12(1):21–30.

14. Story BH, Titze IR, Hoffman EA. Vocal tract area functions for an adult female speaker based on volumetric imaging. J Acoust Soc Am. 1998;104(1):471–487.

15. Holmberg EB, Perkell JS, Hillman RE, Gress C. Individual variation in measures of voice. Phonetica. 1994;51:30–37.

16. Sapienza CM, Stathopoulos ET, Dromey C. Approximations of open quotient and speed quotient from glottal airflow and EGG waveforms: effects of measurement criteria and sound pressure level. J Voice. 1998;12(1):31–43.


17. Rothenberg M. Acoustic interaction between the glottal source and the vocal tract. In: Stevens K, ed. Vocal Fold Physiology, vol. 1. Tokyo, Japan: University of Tokyo Press; 1981:305–323.

18. Titze IR, Story BH. Acoustic interactions of the voice source with the lower vocal tract. J Acoust Soc Am. 1997;101:2234–2243.

19. Sundberg J. The voice as a sound generator. In: Sundberg J, ed. Research Aspects in Singing. Stockholm, Sweden: The Royal Swedish Academy of Music; 1981:6–14.

20. Yanagisawa E, Kmucha ST, Estill J. Role of the soft palate in laryngeal functions and selected voice qualities. Simultaneous velolaryngeal videoendoscopy. Ann Otol Rhinol Laryngol. 1990;99(1):18–28.

21. Yanagisawa E, Estill J, Kmucha ST, Leder SB. The contribution of aryepiglottic constriction to ringing voice quality. J Voice. 1989;3:342–350.

22. Titze I, Mapes S, Story B. Acoustics of the tenor high voice. J Acoust Soc Am. 1994;95:1133–1142.

23. Sundberg J. The Science of the Singing Voice. De Kalb, IL: Northern Illinois University Press; 1987.
