Modeling Norms of Turn-Taking in Multi-Party Conversation

Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 999–1008,Uppsala, Sweden, 11-16 July 2010. c©2010 Association for Computational Linguistics

Modeling Norms of Turn-Taking in Multi-Party Conversation

Kornel LaskowskiCarnegie Mellon University

Pittsburgh PA, [email protected]

Abstract

Substantial research effort has been in-vested in recent decades into the com-putational study and automatic process-ing of multi-party conversation. Whilemost aspects of conversational speechhave benefited from a wide availabil-ity of analytic, computationally tractabletechniques, only qualitative assessmentsare available for characterizing multi-partyturn-taking. The current paper attempts toaddress this deficiency by first proposinga framework for computing turn-takingmodel perplexity, and then by evaluat-ing several multi-participant modeling ap-proaches. Experiments show that directmulti-participant models do not general-ize to held out data, and likely never will,for practical reasons. In contrast, theExtended-Degree-of-Overlap model rep-resents a suitable candidate for futurework in this area, and is shown to success-fully predict the distribution of speech intime and across participants in previouslyunseen conversations.

1 Introduction

Substantial research effort has been invested inrecent decades into the computational study andautomatic processing of multi-party conversation.Whereas sociolinguists might argue that multi-party settings provide for the most natural formof conversation, and that dialogue and monologueare merely degenerate cases (Jaffe and Feldstein,1970), computational approaches have found itmost expedient to leverage past successes; theseoften involved at most one speaker. Consequently,even in multi-party settings, automatic systemsgenerally continue to treat participants indepen-dently, fusing information across participants rel-atively late in processing.

This state of affairs has resulted in the near-exclusion from computational consideration andfrom semantic analysis of a phenomenon whichoccurs at the lowest level of speech exchange,namely the relative timing of the deployment ofspeech in arbitrary multi-party groups. This phe-nomenon, the implicit taking of turns at talk(Sacks et al., 1974), is important because unlessparticipants adhere to its general rules, a conver-sation would simply not take place. It is there-fore somewhat surprising that while most otheraspects of speech enjoy a large base of computa-tional methodologies for their study, there are fewquantitative techniques for assessing the flow ofturn-taking in general multi-party conversation.

The current work attempts to address this prob-lem by proposing a simple framework, which, atleast conceptually, borrows quite heavily from thestandard language modeling paradigm. First, it de-fines the perplexity of a vector-valued Markov pro-cess whose multi-participant states are a concate-nation of the binary states of individual speakers.Second, it presents some obvious evidence regard-ing the unsuitability of models defined directlyover this space, under various assumptions of in-dependence, for the inference of conversation-independent norms of turn-taking. Finally, itdemonstrates that the extended-degree-of-overlapmodel of (Laskowski and Schultz, 2007), whichmodels participants in an alternate space, achievesby far the best likelihood estimates for previ-ously unseen conversations. This appears to bebecause the model can learnacross conversa-tions, regardless of the number of their partici-pants. Experimental results show that it yieldsrelative perplexity reductions of approximately75% when compared to the ubiquitous single-participant model which ignores interlocutors, in-dicating that it can learn and generalize aspects ofinteraction which direct multi-participant models,and merely single-participant models, cannot.

999

2 Data

Analysis and experiments are performed using theICSI Meeting Corpus (Janin et al., 2003; Shriberget al., 2004). The corpus consists of 75 meetings,held by various research groups at ICSI, whichwould have occurred even if they had not beenrecorded. This is important for studying naturallyoccurring interaction, since any form of interven-tion (including occurrence staging solely for thepurpose of obtaining a record) may have an un-known but consistent impact on the emergence ofturn-taking behaviors. Each meeting was attendedby 3 to 9 participants, providing a wide variety ofpossible interaction types.

3 Conceptual Framework

3.1 Definitions

Turn-takingis a generally observed phenomenonin conversation (Sacks et al., 1974; Goodwin,1981; Schegloff, 2007); one party talks while theothers listen. Its description and analysis is animportant problem, treated frequently as a sub-domain of linguistic pragmatics (Levinson, 1983).In spite of this, linguists tend to disagree aboutwhat precisely constitutes aturn (Sacks et al.,1974; Edelsky, 1981; Goodwin, 1981; Traum andHeeman, 1997), or even a turn boundary. For ex-ample, a “yeah” produced by a listener to indicateattentiveness, referred to as abackchannel(Yngve,1970), is often considered to not implement a turn(nor to delineate an ongoing turn of an interlocu-tor), as it bears no propositional content and doesnot “take the floor” from the current speaker.

To avoid being tied to any particular sociolin-guistic theory, the current work equates “turn”with any contiguous interval of speech uttered bythe same participant. Such intervals are commonlyreferred to astalk spurts(Norwine and Murphy,1938). Because Norwine and Murphy’s originaldefinition is somewhat ambiguous and non-trivialto operationalize, this work relies on that proposedby (Shriberg et al., 2001), in whichspurtsare “de-fined asspeech regions uninterrupted by pauseslonger than 500 ms” (italics in the original). Here,a threshold of 300 ms is used instead, as recentlyproposed in NIST’s Rich Transcription MeetingRecognition evaluations (NIST, 2002). The re-sulting definition of talk spurt, it is important tonote, is in quite common use but frequently un-der different names. An oft-cited example is the

inter-pausal unitof (Koiso et al., 1998)1, wherethe threshold is 100 ms.

A consequence of this choice is that anymodelof turn-taking behaviorinferred will effectively bea model of the distribution of speech, in time andacross participants. If the parameters of such amodel are maximum likelihood (ML) estimates,then that model will best account for what is mostlikely, or most “normal”; it will constitute anorm.

Finally, an important aspect of this work is thatit analyzes turn-taking behavior as independent ofthe words spoken (and of the ways in which thosewords are spoken). As a result, strictly speaking,what is modeled is not the distribution of speech intime and across participants but of binaryspeechactivity in time and across participants. Despitethis seemingly dramatic simplification, it will beseen that important aspects of turn-taking are suffi-ciently rare to be problematic for modeling. Mod-eling them jointly alongside lexical information,in multi-party scenarios, is likely to remain in-tractable for the foreseeable future.

3.2 The Vocal Interaction RecordQ

The notation used here, as in (Laskowski andSchultz, 2007), is a trivial extension of that pro-posed in (Rabiner, 1989) to vector-valued Markovprocesses.

At any instantt, each ofK participants to a con-versation is in a state drawn fromΨ ≡ {S0, S1} ≡{�, �}, whereS1 ≡ � indicates speech (or, moreprecisely, “intra-talk-spurt instants”) andS0 ≡� indicates non-speech (or “inter-talk-spurt in-stants”). Thejoint state of all participants at timet is described using theK-length column vector

qt ∈ ΨK ≡ Ψ×Ψ× . . .×Ψ

≡{

S0,S1, . . . ,S2K−1

}

. (1)

An entire conversation, from the point of view ofthis work, can be represented as the matrix

Q ≡ [q1,q2, . . . ,qT ] (2)

∈ ΨK×T .

Q is known as the (discrete) vocal interaction(Dabbs and Ruback, 1987) record.T is the totalnumber of frames in the conversation, sampled atTs = 100 ms intervals. This is approximately theduration of the shortest lexical productions in theICSI Meeting Corpus.

1The inter-pausal unit differs from thepause unitof(Seligman et al., 1997) in that the latter is an intra-turn unit,requiring prior turn segmentation

1000

3.3 Time-Independent First-Order MarkovModeling of Q

Given this definition ofQ, a modelΘ is soughtto account for it. Only time-independent models,whose parameters do not change over the courseof the conversation, are considered in this work.

For simplicity, the stateq0 = S0 =[�, �, . . . ,�]∗, in which no participant is speak-ing (∗ indicates matrix transpose, to avoid con-fusion with conversation durationT ) is firstprepended toQ. P0 = P (q0 ) therefore repre-sents the unconditional probability of all partici-pants being silent just prior to the start of any con-versation2. Then

P (Q ) = P0 ·T

∏

t=1

P (qt |q0,q1, · · · ,qt−1 )

.= P0 ·

T∏

t=1

P (qt |qt−1,Θ ) , (3)

where in the second line the history is truncated toyield a standard first-order Markov form.

Each of theT factors in Equation 3 is indepen-dent of the instantt,

P (qt |qt−1,Θ )

= P (qt = Sj |qt−1 = Si,Θ ) (4)

≡ aij , (5)

as per the notation in (Rabiner, 1989). In particu-lar, each factor is a function only of the stateSi inwhich the conversation was at timet − 1 and thestateSj in which the conversation is at timet, andnot of the instantst − 1 or t. It may be expressedas the scalaraij which forms theith row andjthcolumn entry of the matrix{aij} ≡ Θ.

3.4 Perplexity

In language modeling practice, one finds the like-lihoodP (w |Θ ), of a word sequencew of length‖w‖ under a modelΘ, to be an inconvenient mea-sure for comparison. Instead, thenegative log-likelihood (NLL) andperplexity(PPL), defined as

NLL = −1

‖w‖loge P (w |Θ ) (6)

PPL = 10NLL , (7)

2In reality, the instantt = 0 refers to the beginning oftherecordingof a conversation, rather than the beginning of theconversation itself; this detail is without consequence.

are often preferred (Jelinek, 1999). They are ubiq-uitously used to compare the complexity of differ-ent word sequences (or corpora)w andw′ underthe same modelΘ, or the performance on a sin-gle word sequence (or corpus)w under competingmodelsΘ andΘ′.

Here, a similar metric is proposed, to be usedfor the same purposes, for the recordQ.

NLL = −1

KTlog2 P (Q |Θ ) (8)

PPL = 2NLL

= (P (Q |Θ ))−1/KT (9)

are defined as measures ofturn-taking perplex-ity. As can be seen in Equation 8, the negativelog-likelihood is normalized by the numberK ofparticipants and the numberT of frames inQ;the latter renders the measure useful for makingduration-independent comparisons. The normal-ization by K does notper sesuggest that turn-taking in conversations with differentK is nec-essarily similar; it merely provides similar boundson the magnitudes of these metrics.

4 Direct Estimation of Θ

Direct application of bigram modeling techniques,defined over the states{S}, is treated as a baseline.

4.1 The Case ofK = 2 Participants

In contrast to multi-party conversation, dialoguehas been extensively modeled in the ways de-scribed in this paper. Beginning with (Brady,1969), Markov modeling techniques over the jointspeech activity of two interlocutors have beenexplored by both the sociolinguist and the psy-cholinguist community (Jaffe and Feldstein, 1970;Dabbs and Ruback, 1987). The same models havealso appeared in dialogue systems (Raux, 2008).Most recently, they have been augmented with du-ration models in a study of the Switchboard corpus(Grothendieck et al., 2009).

4.2 The Case ofK > 2 Participants

In the general case beyond dialogue, such mod-els have found less traction. This is partly due tothe exponential growth in the number of states asK increases, and partly due to difficulties in in-terpretation. The only model for arbitraryK thatthe author is familiar with is the GroupTalk model(Dabbs and Ruback, 1987), which is unsuitablefor the purposes here as it does not scale (withK,

1001

10 15 20

1.05

1.075

1.1

1.125

oracleA+BB+A

Figure 1: Perplexity (alongy-axis) in time (alongx-axis, in minutes) for meetingBmr024 undera conditionally dependent global oracle model,two “matched-half” models (A+B), and two“mismatched-half” models (B+A).

the number of participants) without losing track ofspeakers when two or more participants speak si-multaneously (known asoverlap).

4.2.1 Conditionally Dependent Participants

In a particular conversation withK participants,the state space of an ergodic process contains2K states, and the number of free parameters ina modelΘ which treats participant behavior asconditionally dependent(CD), henceforthΘCD,scales as2K ·

(

2K − 1)

. It should be immediatelyobvious that many of the2K states are likely to notoccur within a conversation of durationT , leadingto misestimation of the desired probabilities.

To demonstrate this, three perplexity trajecto-ries for a snippet of meetingBmr024 are shownin Figure 1, in the interval beginning 5 minutesinto the meeting and ending 20 minutes later. (Themeeting is actually just over 50 minutes long butonly a snippet is shown to better appreciate smalltime-scale variation.) The depicted perplexitiesare not unweighted averages over the whole meet-ing of durationT as in Equation 8, but over a 60-second Hamming window centered on eacht.

The first trajectory, the dashed black line, is ob-tained when the entire meeting is used to estimateΘCD, and is then scored by that same model (an“oracle” condition). Significant perplexity varia-tion is observed throughout the depicted snippet.

The second trajectory, the continuous blackline, is that obtained when the meeting is split intotwo equal-duration halves, one consisting of all in-stants prior to the midpoint and the other of all

instants following it. These halves are hereafterreferred to as A and B, respectively (the intervalin Figure 1 falls entirely within the A half). Twoseparate modelsΘCD

A andΘCDB are each trained

on only one of the two halves, and then applied tothose same halves. As can be seen at the scale em-ployed, the matched A+B model, demonstratingthe effect of training data ablation, deviates fromthe global oracle model only in the intervals[7, 11]seconds and[15, 18] seconds; otherwise it appearsthat more training data, from later in the conversa-tion, does not affect model performance.

Finally, the third trajectory, the continuous grayline, is obtained when the two halvesA and B

of the meeting are scored using the mismatchedmodelsΘCD

B andΘCDA , respectively (this condi-

tion is henceforth referred to as the B+A condi-tion). It can be seen that even when probabilitiesare estimated from the same participants, in ex-actly the same conversation, a direct conditionallydependent model exposed to over 25 minutes ofa conversation cannot predict the turn-taking pat-terns observed later.

4.2.2 Conditionally Independent Participants

A potential reason for the gross misestimation ofΘCD under mismatched conditions is the size ofthe state space{S}. The number of parameters inthe model can be reduced by assuming that par-ticipants behaveindependentlyat instantt, but areconditioned on theirjoint behavior att − 1. Thelikelihood of Q under the resultingconditionallyindependent modelΘCI has the form

P (Q )

.= P0 ·

T∏

t=1

K∏

k=1

P(

qt [k] |qt−1,ΘCIk

)

, (10)

where each factor is time-independent,

P(

qt [k] |qt−1,ΘCIk

)

= P(

qt [k] = Sn |qt−1 = Si,ΘCIk

)

(11)

≡ aCIk,in , (12)

with 0 ≤ i < 2K and 0 ≤ n < 2. The completemodel{ΘCI

k } ≡ {{aCIk,in}} consists ofK matrices

of size 2K × 2 each. It therefore contains onlyK ·2K free parameters, a significant reduction overthe conditionally dependent modelΘCD.

Panel (a) of Figure 2 shows the performanceof this model on the same conversational snippet

1002

as in Figure 1. The oracle, dashed black line ofthe latter is reproduced as a reference. The con-tinuous black and gray lines show the smoothedperplexity for the matched (A+B) and the mis-matched (B+A) conditions, respectively. In thematched condition, the CI model reproduces theoracle trajectory with relatively high fidelity, sug-gesting that participants’ behavior may in fact beassumed to be conditionally independent in thesense discussed. Furthermore, the failures of theCI model under mismatched conditions are less se-vere in magnitude than those of the CD model.

Panel (b) of Figure 2 demonstrates the trivialfact that a conditionally independent modelΘCI

any,tying the statistics of allK participants into a sin-gle model, is useless. This is of course because itcannot predict the next state of a generic partici-pant for which the indexk in qt−1 has been lost.

4.2.3 Mutually Independent Participants

A further reduction in the complexity ofΘ can beachieved by assuming that participants aremutu-ally independent(MI), leading to the participant-specificΘMI

k model:

P (Q )

.= P0 ·

T∏

t=1

K∏

k=1

P(

qt [k] |qt−1 [k] ,ΘMIk

)

. (13)

The factors are time-independent,

P(

qt [k] |qt−1 [k] ,ΘMIk

)

= P(

qt [k] = Sn |qt−1 [k] = Sm,ΘMIk

)

(14)

≡ aMIk,mn , (15)

where 0 ≤ m < 2 and 0 ≤ n < 2. This model{ΘMI

k } ≡ {{aMIk,mn}} consists ofK matrices of

size2× 2 each, with onlyK · 2 free parameters.Panel (c) of Figure 2 shows that the MI model

yields mismatched performance which is a muchbetter approximation to its performance undermatched conditions. However, its matched perfor-mance is worse than that of CD and CI models.When a single MI modelΘMI

any is trained insteadfor all participants, as shown in panel (d), both ofthese effects are exaggerated. In fact, the perfor-mance ofΘMI

any in matched and mismatched con-ditions is almost identical. The consistently higherperplexity is obtained, as mentioned, by smooth-ing over 60-second windows, and therefore un-derestimates poor performance at specific instants(which occur frequently).

10 15 20

1.05

1.075

1.1

1.125

10 15 20

1.1

1.2

1.3

1.4

(a)Θ ={

ΘCIk

}

(b) Θ = ΘCIany

10 15 20

1.05

1.075

1.1

1.125

10 15 20

1.05

1.075

1.1

1.125

(c) Θ ={

ΘMIk

}

(d) Θ = ΘMIany

Figure 2: Perplexity (alongy-axis) in time (alongx-axis, in minutes) for meetingBmr024 under aconditionally dependent global oracle model, andvarious matched (A+B) and mismatched (B+A)model pairs with relaxed dependence assump-tions. Legend as in Figure 1.

5 Limitations and Desiderata

As the analyses in Section 4 reveal, direct es-timation can be useful under oracle conditions,namely when all of a conversation has been ob-served and the task is to find intervals where multi-participant behavior deviates significantly fromits conversation-specificnorm. The assumptionof conditional independence among participantswas argued to lead to negligible degradation inthe detectability of these intervals. However, theassumption of mutual independence consistentlyleads to higher surprise by the model.

5.1 Predicting the Future WithinConversations

In the more interesting setting in which only a partof a conversation has been seen and the task is tolimit the perplexity of what is still to come, directestimation exhibits relatively large failures underboth conditionally dependent and conditionally in-dependent participant assumptions. This appearsto be due to the size of the state space, whichscales as2K with the numberK of participants.In the case of generalK, more conversational datamay be sought, from exactly the same group ofparticipants, but that approach appears likely to be

1003

insufficient, and, for practical reasons3, impossi-ble. One would instead like to be able to use otherconversations, also exhibiting participant interac-tion, to limit the perplexity of speech occurrencein the conversation under study.

Unfortunately, there are two reasons why directestimation cannot be tractably deployed acrossconversations. The first is that the direct modelsconsidered here, with the exception ofΘMI

any, areK-specific. In particular, the number and the iden-tity of conditioning states are both functions ofK,for ΘCD and{ΘCI

k }; the models may also con-sist of K distinct submodels, as for{ΘCI

k } and{ΘMI

k }. No techniques for computing the turn-taking perplexity in conversations withK partici-pants, using models trained on conversations withK ′ 6= K, are currently available.

The second reason is that these models, againwith the exception ofΘMI

any, are R-specific, in-dependently ofK-specificity. By this it is meantthat the models are sensitive to participant indexpermutation. Had a participant at indexk in Q

been assigned to another indexk′ 6=k, an alter-nate representation of the conversation, namelyQ′ = Rkk′ ·Q, would have been obtained. (Here,Rkk′ is a matrix rotation operator obtained by ex-changing columnsk andk′ of theK ×K identitymatrix I.) Since index assignment is entirely arbi-trary, useful direct models cannot be inferred fromother conversations, even when theirK ′ = K, un-lessK is small. The prospect of naively permutingevery training conversation prior to parameter in-ference has complexityK!.

5.2 Comparing Perplexity AcrossConversations

Until R-specificity is comprehensively addressed,the only model from among those discussed sofar, which exhibits noK-dependence, isΘMI

any,namely that which treats participants identicallyand independently. This model can be used toscore the perplexity ofanyconversation, and facil-itates the comparison of the distribution of speechactivity acrossconversations.

Unfortunately, since the model captures onlydurational aspects ofone-participant speech andnon-speech intervals, it does not in any way en-code a norm of turn-taking, an inherently interac-

3This pertains to the practicalities of re-inviting, instru-menting, recording and transcribing the same groups ofparticipants, with necessarily more conversations for largegroups than for small ones.

tive and hencemulti-participant phenomenon. Ittherefore cannot be said to rank conversations ac-cording to their deviation from turn-taking norms.

5.3 Theoretical Limitations

In addition to the concerns above, a funda-mental limitation of the analyzed direct models,whether for conversation-specific or conversation-independent use, is that they are theoretically cum-bersome if not vacuous. Given a solution to theproblem ofR-specificity, the parameters{aCD

ij }may be robustly inferred, and the models may beapplied to yield useful estimates of turn-takingperplexity. However, they cannot be said to di-rectly validate or dispute the vast qualitative ob-servations of sociolinguistics, and of conversationanalysis in particular.

5.4 Prospects for Smoothing

To produce Figures 1 and 2, a small fraction ofprobability mass was reserved for unseen bigramtransitions (as opposed to backing off to unigramprobabilities). Furthermore, transitions into never-observed states were assigned uniform probabili-ties. This policy is simplistic, and there is signifi-cant scope for more detailed back-off and interpo-lation. However, such techniques infer values forunder-estimated probabilities fromshorter trunca-tions of the conditioning history. As K-specificityand R-specificity suggest, what appears to beneeded here are back-off and interpolationacrossstates. For example, in a conversation ofK = 5participants, estimates of the likelihood of the stateqt = [��]∗, which might have been unob-served in any training material, can be assumedto be related to those ofq′t = [��]∗ andq′′t = [��]∗, as well as those ofRq′t andRq′′t , for arbitraryR.

6 The Extended-Degree-of-OverlapModel

The limitations of direct models appear to be ad-dressable by a form proposed by Laskowski andSchultz in (2006) and (2007). That form, theExtended-Degree-of-Overlap (EDO) model, wasused to provide prior probabilitiesP (Q |Θ ) ofthe speech states of multiple meeting participantssimultaneously, for use in speech activity detec-tion. The model was trained onutterances(ratherthan talk spurts) from a different corpus than that

1004

used here, and the authors did not explore the turn-taking perplexities of their data sets.

Several of the equations in (Laskowski andSchultz, 2007) are reproduced here for compar-ison. The EDO model yields time-independenttransition probabilities which assume conditionalinter-participant dependence (cf. Equation 3),

P (qt+1 = Sj |qt = Si ) = αij · (16)

P ( ‖qt+1‖ = nj , ‖qt+1 · qt‖ = oij | ‖qt‖ = ni) ,

whereni ≡ ‖Si‖ andnj ≡ ‖Sj‖, with ‖S‖ yield-ing the number of participants in� in the multi-participant stateS. In other words,ni andnj arethenumbersof participants simultaneously speak-ing in statesSi andSj , respectively. The elementsof the binary productS = S1 · S2 are given by

S [k] ≡

{

�, if S1 [k] = S2 [k] = �

�, otherwise,(17)

and oij is therefore the number of same partici-pants speaking inSi and Sj . The discussion ofthe role ofαij in Equation 16 is deferred to theend of this section.

The EDO model mitigatesR-specificity be-cause it models each bigram(qt−1,qt) = (Si,Sj)as the modified bigram(ni, [oij , nj ]), involvingthree scalars each of which is asum— a com-mutative (and therefore rotation-invariant) opera-tion. Because it sums across only those partici-pants which are in the� state, completely ignor-ing their�-state interlocutors, it can also mitigateK-specificity if one additionally redefines

ni = min ( ‖Si‖, Kmax ) (18)

nj = min ( ‖Sj‖, Kmax ) (19)

oij = min ( ‖Si · Sj‖, ni, nj) , (20)

as in (Laskowski and Schultz, 2007).Kmax

represents the maximum model-licensed degreeof overlap, or the maximum number of par-ticipants allowed to be simultaneously speak-ing. The EDO model therefore represents aviable conversation-independent,K-independent,andR-independent model of turn-taking for thepurposes in the current work4. The factorαij

4There exists some empirical evidence to suggest thatconversations ofK participants should not be used to trainmodels for predicting turn-taking behavior in conversationsof K

′ participants, forK′ 6= K, because turn-taking is in-herentlyK-dependent. For example, (Fay et al., 2000) foundthat qualitative differences in turn-taking patterns between

in Equation 16 provides a deterministic map-ping from the conversation-independent space(ni, [oij , nj ]) to the conversation-specific space{aij}. The mapping is deterministic because themodel assumes that all participants are identical.This places the EDO model at a disadvantage withrespect to the CD and CI models, as well as to{ΘMI

k }, which allow each participant to be mod-eled differently.

7 Experiments

This section describes the performance of the dis-cussed models on the entire ICSI Meeting Corpus.

7.1 Conversation-Specific Modeling

First to be explored is the prediction of yet-unobserved behavior in conversation-specific set-tings. For each meeting, models are trained onportions of that meeting only, and then used toscore other portions of the same meeting. Thisis repeated over all meetings, and comprises themismatched condition of Section 4; for contrast,the matched condition is also evaluated.

Each meeting is divided into two halves, in twodifferent ways. The first way is the A/B split ofSection 4, representing the first and second halvesof each meeting; as has been shown, turn-takingpatterns may vary substantially from A to B. Thesecond split (C/D) places every even-numberedframe in one set and every odd-numbered framein the other. This yields a much easier setting, oftwo halves which are on average maximally simi-lar but still temporally disjoint.

The perplexities (of Equation 9) in these experi-ments are shown in the second, fourth, sixth andeighth columns of Table 1, under “all”. In thematched A+B and C+D conditions, the condition-ally dependent modelΘCD provides topline MLperformance. Perplexities decrease as model com-plexities fall for direct models, as expected. How-ever, in the more interesting mismatched B+Acondition, the EDO model performs the best. Thisshows that its ability to generalize to unseen datais higher than that of direct models. However, inthe easier mismatched D+C condition, it is out-performed by the CI model due to behavior differ-ences among participants, which the EDO model

small groups and large groups, represented in their study byK = 5 andK = 10, and noted that there is a smooth transi-tion between the two extremes; this provides some scope forinterpolating small- and large- group models, and the EDOframework makes this possible.

1005

Hard split A/B (first/second halves) Easy split C/D (odd/even frames)

Model A+B B+A C+D D+C“all” “sub” “all” “sub” “all” “sub” “all” “sub”

ΘCD 1.0905 1.6444 1.1225 1.8395 1.0915 1.6555 1.0991 1.7403{ΘCI

k } 1.0915 1.6576 1.1156 1.7809 1.0925 1.6695 1.0956 1.7028{ΘMI

k } 1.0978 1.7236 1.1086 1.7950 1.0991 1.7381 1.0992 1.7398ΘMI 1.1046 1.8047 1.1047 1.8059 1.1046 1.8050 1.1046 1.8052ΘEDO 1.0977 1.7257 1.0985 1.7323 1.0977 1.7268 1.0982 1.7313

Table 1: Perplexities for conversation-specific turn-taking models on the entire ICSI Meeting Corpus.Both “all” frames and the subset (“sub”) for whichqt−1 6= qt are shown, for matched (A+B and C+D)and mismatched (B+A and D+C) conditions on splits A/B and C/D.

does not capture.The numbers under the “all” columns in Table 1

were computed using all of each meeting’s frames.For contrast, in the “sub” columns, perplexitiesare computed over only those frames for whichqt−1 6= qt. This is a useful subset because, forthe majority of time in conversations, one personsimply continues to talk while all others remainsilent5. Excludingqt−1 = qt bigrams (leading to0.32M frames from 2.39M frames in “all”) offers aglimpse of expected performance differences wereduration modeling to be included in the models.Perplexities are much higher in these intervals, butthe same general trend as for “all” is observed.

7.2 Conversation-Independent Modeling

The training of conversation-independent models,given a corpus ofK-heterogeneous meetings, isachieved by iterating over all meetings and testingeach using models trained on all of the other meet-ings. As discussed in the preceding section,ΘMI

any

is the only one among the direct models which canbe used for this purpose. It also models exclu-sively single-participant behavior, ignoring the in-teractive setting provided by other participants. Asshown in Table 2, when all time is scored the EDOmodel withKmax = 4 is the best model (in Sec-tion 7.1,Kmax = K since the model was trainedon the same meeting to which it was applied). Itsperplexity gap to the oracle model is only a quarterof the gap exhibited byΘMI

any.The relative performance of EDO models is

even better when only those instantst are consid-ered for whichqt−1 6= qt. There, the perplex-ity gap to the oracle model is smaller than that of

5Retaining onlyqt−1 6=qt also retains instants of transi-tion into and out of intervals of silence.

PPL ∆PPL (%)Model

“all” “sub” “all” “sub”

ΘCD 1.0921 1.6616 — —ΘMI 1.1051 1.8170 14.1 23.5ΘEDO (6) 1.0992 1.7405 7.7 11.9ΘEDO (5) 1.0968 1.7127 5.1 7.7ΘEDO (4) 1.0953 1.6947 3.5 5.0ΘEDO (3) 1.1082 1.8502 17.5 28.5

Table 2: Perplexities for conversation-independentturn-taking models on the entire ICSI MeetingCorpus; the oracleΘCD topline is included in thefirst row. Both “all” frames and the subset (“sub”)for whichqt−1 6= qt are shown; relative increasesover the topline (less unity, representing no per-plexity) are shown in columns 4 and 5. The valueof Kmax (cf. Equations 18, 19, and 20) is shownin parentheses in the first column.

ΘEDO by 78%.

8 Discussion

The model perplexities as reported above maybe somewhat different if the “talk spurt” werereplaced by a more sociolinguistically motivateddefinition of “turn”, but the ranking of models andtheir relative performance differences are likely toremain quite similar. On the one hand, many inter-talk-spurt gaps might find themselves to be within-turn, leading to more� entries in the recordQthan observed in the current work. This wouldincrease the apparent frequency and duration ofintervals of overlap. On the other hand, alterna-tive definitions of turn may exclude some speechactivity, such as that implementing backchannels.Since backchannels are often produced in overlap

1006

with the foreground speaker, their removal mayeliminate some overlap fromQ. (However, asnoted in (Shriberg et al., 2001), overlap rates inmulti-party conversation remain high even afterthe exclusion of backchannels.) Both inter-talk-spurt gap inclusion and backchannel exclusion arelikely to yield systematic differences, and there-fore to be exploitable by the investigated modelsin similar ways.

The results presented may also be perturbedby modifying the way in which a (manuallyproduced) talk spurt segmentation, with high-precision boundary time-stamps, is discretized toyield Q. Two parameters have controlled the dis-cretization in this work: (1) the frame stepTs =100 ms; and (2) the proportionρ of Ts for whicha participant must be speaking within a frame inorder for that frame to be considered� rather than�. ρ = 0.5 was chosen since this posits approx-imately as much more speech (than in the high-precision segmentation) as it eliminates. Highervalues ofρ would lead to more�, leading to moreoverlap than observed in this work. Meanwhile, atconstantρ, choosing aTs value larger than 100 mswould occasionally miss the shortest talk spurts,but it would allow the models, which are all 1st-order Markovian, to learn temporally more dis-tant dependencies. The trade-offs between thesechoices are currently under investigation.

From an operational, modeling perspective, itis important to recognize that the choices of thedefinition for “turn”, and of the way in whichsegmentations are discretized, are essentially ar-bitrary. The investigated modeling alternatives,and the EDO model in particular, require only thatthe multi-participant vocal interaction recordQbe binary-valued. This general applicability hasbeen demonstrated in past work, in which the EDOmodel was trained onutterancesfor use in speechactivity detection (Laskowski and Schultz, 2007),as well as in (Laskowski and Burger, 2007) whereit was trained separately on talk spurts andlaughbouts, in the same data, to highlight the differencesbetween speech and laughter deployment.

Finally, it should be remembered that the EDOmodel is both time-independent and participant-independent. This makes it suitable for compar-ison of conversational genres, in much the sameway as are general language models of words. Ac-cordingly, as for language models, density esti-mation in future turn-taking models may be im-

proved by considering variability across partic-ipants and in time. Participant dependence islikely to be related to speakers’ social character-istics and conversational roles, while time depen-dence may reflect opening and closing functions,topic boundaries, and periodic turn exchange fail-ures. In the meantime, event types such as the lat-ter may be detectable as EDO perplexity depar-tures, potentially recommending the model’s usefor localizing conversational “hot spots” (Wredeand Shriberg, 2003). The EDO model, and turn-taking models in general, may also find use indiagnosing turn-taking naturalness in spoken di-alogue systems.

9 Conclusions

This paper has presented a framework for quan-tifying the turn-taking perplexity in multi-partyconversations. To begin with, it explored the con-sequences of modeling participants jointly by con-catenating their binary speech/non-speech statesinto a single multi-participant vector-valued state.Analysis revealed that such models are particu-larly poor at generalization, even to subsequentportions of the same conversation. This is due tothe size of their state space, which is factorial inthe number of participants. Furthermore, becausesuch models are both specific to the number ofparticipants and to the order in which participantstates are concatenated together, it is generally in-tractable to train them on material from other con-versations. The only such model which may betrained on other conversations is that which com-pletely ignores interlocutor interaction.

In contrast, the Extended-Degree-of-Overlap(EDO) construction of (Laskowski and Schultz,2007) may be trained on other conversations, re-gardless of their number of participants, and use-fully applied to approximate the turn-taking per-plexity of an oracle model. This is achieved be-cause it models entry into and egress out of spe-cific degrees of overlap, and completely ignoresthe number of participants actually present or theirmodeled arrangement. In this sense, the EDOmodel can be said to implement the qualitativefindings of conversation analysis. In predicting thedistribution of speech in time and across partici-pants, it reduces the unseen data perplexity of amodel which ignores interaction by 75% relativeto an oracle model.

1007

References

Paul T. Brady. 1969. A model for generating on-off patterns in two-way conversation.Bell SystemsTechnical Journal, 48(9):2445–2472.

James M. Dabbs and R. Barry Ruback. 1987. Di-mensions of group process: Amount and structureof vocal interaction.Advances in Experimental So-cial Psychology, 20:123–169.

Carole Edelsky. 1981. Who’s got the floor?Langaugein Society, 10:383–421.

Nicolas Fay, Simon Garrod, and Jean Carletta. 2000.Group discussion as interactive dialogue or as serialmonologue: The influence of group size.Psycho-logical Science, 11(6):487–492.

Charles Goodwin. 1981.Conversational Organiza-tion: Interaction Between Speakers and Hearers.Academic Press, New York NY, USA.

John Grothendieck, Allen Gorin, and Nash Borges.2009. Social correlates of turn-taking behavior.Proc. ICASSP, Taipei, Taiwan, pp. 4745–4748.

Joseph Jaffe and Stanley Feldstein. 1970.Rhythms ofDialogue. Academic Press, New York NY, USA.

Adam Janin, Don Baron, Jane Edwards, Dan Ellis,David Gelbart, Nelson Morgan, Barbara Peskin,Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke,and Chuck Wooters. 2003. The ICSI Meeting Cor-pus. Proc. ICASSP, Hong Kong, China, pp. 364–367.

Frederick Jelinek. 1999. Statistical Methods forSpeech Recognition. MIT Press, Cambridge MA,USA.

Hanae Koiso, Yasui Horiuchi, Syun Tutiya, AkiraIchikawa, and Yasuharu Den. 1998. An analysisof turn-taking and backchannels based on prosodicand syntactic features in Japanese Map Task dialogs.Language and Speech, 41(3-4):295–321.

Kornel Laskowski and Tanja Schultz. 2006. Unsu-pervised learning of overlapped speech model pa-rameters for multichannel speech activity detectionin meetings. Proc. ICASSP, Toulouse, France, pp.993–996.

Kornel Laskowski and Susanne Burger. 2007. Analy-sis of the occurrence of laughter in meetings.Proc.INTERSPEECH, Antwerpen, Belgium, pp. 1258–1261.

Kornel Laskowski and Tanja Schultz. 2007. Mod-eling vocal interaction for segmentation in meet-ing recognition.Machine Learning for MultimodalInteraction, A. Popescu-Belis, S. Renals, and H.Bourlard, eds., Lecture Notes in Computer Sci-ence, 4892:259–270, Springer Berlin/Heidelberg,Germany.

Stephen C. Levinson. 1983.Pragmatics. CambridgeUniversity Press.

National Institute of Standards and Technology.2002. Rich Transcription Evaluation Project,www.itl.nist.gov/iad/mig/tests/rt/(last accessed 15 February 2010 1217hrs GMT).

A. C. Norwine and O. J. Murphy. 1938. Character-istic time intervals in telephonic conversation.BellSystem Technical Journal, 17:281-291.

Lawrence Rabiner. 1989. A tutorial on hidden Markovmodels and selected applications in speech recogni-tion. Proc. IEEE, 77(2):257–286.

Antoine Raux. 2008. Flexible turn-taking for spo-ken dialogue systems. PhD Thesis, Carnegie MellonUniversity.

Harvey Sacks, Emanuel A. Schegloff, and Gail Jeffer-son. 1974. A simplest semantics for the organi-zation of turn-taking for conversation.Language,50(4):696–735.

Emanuel A. Schegloff. 2007.Sequence Organizationin Interaction. Cambridge University Press, Cam-bridge, UK.

Mark Seligman, Junko Hosaka, and Harald Singer.1997. “Pause units” and analysis of spontaneousJapanese dialogues: Preliminary studies.DialogueProcessing in Spoken Language SystemsE. Maier,M. Mast, and S. LuperFoy, eds., Lecture Notesin Computer Science, 1236:100–112. SpringerBerlin/Heidelberg, Germany.

Elizabeth Shriberg, Andreas Stolcke, and Don Baron.2001. Observations on overlap: Findings and impli-cations for automatic processing of multi-party con-versation. Proc. EUROSPEECH, Geneve, Switzer-land, pp. 1359–1362.

Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, JeremyAng, and Hannah Carvey. 2004. The ICSI MeetingRecorder Dialog Act (MRDA) Corpus.Proc. SIG-DIAL, Boston MA, USA, pp. 97–100.

David Traum and Peeter Heeman. 1997. Utteranceunits in spoken dialogue.Dialogue Processing inSpoken Language SystemsE. Maier, M. Mast, andS. LuperFoy, eds., Lecture Notes in Computer Sci-ence, 1236:125–140. Springer Berlin/Heidelberg,Germany.

Britta Wrede and Elizabeth Shriberg. 2003. Spot-ting “hot spots” in meetings: Human judgmentsand prosodic cues.Proc. EUROSPEECH, Aalborg,Denmark, pp. 2805–2808.

Victor H. Yngve. 1970. On getting a word in edgewise.Papers from the Sixth Regional Meeting ChicagoLinguistic Society, pp. 567–578. Chicago Linguis-tic Society, Chicago IL, USA.

1008

Modeling Norms of Turn-Taking in Multi-Party Conversation

Documents