
Classifying the form of iconic hand gestures from the linguistic categorization of co-occurring verbs

Magdalena Lis
Centre for Language Technology
University of Copenhagen
Njalsgade 140
2300 Copenhagen
[email protected]

Costanza Navarretta
Centre for Language Technology
University of Copenhagen
Njalsgade 140
2300 Copenhagen
[email protected]

Abstract

This paper deals with the relation between speech and the form of co-occurring iconic hand gestures. It focuses on the multimodal expression of eventualities. We investigate to what extent it is possible to automatically classify gestural features from the categorization of verbs in a wordnet. We do so by applying supervised machine learning to an annotated multimodal corpus. The annotations describe form features of gestures. They also contain information about the type of eventuality and about verb Aktionsart and Aspect, which were extracted from plWordNet 2.0. Our results confirm the hypothesis that Eventuality Type and Aktionsart are related to the form of gestures. They also indicate that it is possible, to some extent, to classify certain form characteristics of gestures from the linguistic categorization of their lexical affiliates. We also identify the gestural form features which are most strongly correlated to the Viewpoint adopted in gesture.

Keywords: multimodal eventuality expression, iconic co-speech gesture, wordnet, machine learning

1 Introduction

In face-to-face interaction humans communicate by means of speech as well as co-verbal gestures, i.e. spontaneous and meaningful hand movements semantically integrated with concurrent spoken utterances (Kendon, 2004; McNeill, 1992). Gestures which depict entities are called iconic gestures. Such gestures are co-expressive with speech, but not redundant. According to, inter alia, McNeill (1992; 2005) and Kendon (2004), they form an integral part of a spoken utterance.

Iconic gestures are especially well-suited to express spatio-motoric information (Alibali et al., 2001; Krauss et al., 2000; Rauscher et al., 1996) and thus often accompany verbal expressions of eventualities, in particular motion eventualities. Eventuality is an umbrella term for entities like events, actions, states, processes, etc. (Ramchand, 2005).[1] On the level of language, such entities are mostly denoted by verbs. Gesturally, they are depicted by means of an iconicity relation (McNeill, 1992; McNeill, 2005; Peirce, 1931). This relation does not, however, on its own fully explain the form that a gesture takes: a referent can be depicted in gestures in multiple ways, for instance from different perspectives, that of the observer or that of the agent. How a speaker chooses to represent a referent gesturally determines which physical form a gesture takes. Knowledge about the factors influencing this choice is still sparse (Kopp et al., 2008). It is, however, crucial not only for our understanding of human communication but also for theoretical models of gesture production and its interaction with speech. Such models can in turn inform the generation of natural communicative behaviors in Embodied Conversational Agents (Kopp et al., 2008).

The present paper contributes to this understanding. It addresses a particular aspect of gesture production and its relationship to speech, with a focus on the multimodal expression of eventualities. Various factors have been suggested to influence eventuality gestures, including referent characteristics (Parrill, 2010; Poggi, 2008), verb Aspect (Duncan, 2002) and Aktionsart (Becker et al., 2011). We present a pilot study investigating the extent to which hand gestures can be automatically classified from information about these factors.

[1] In gesture studies the terms 'action' or 'event' are habitually used in this sense. We adopted the term 'eventuality' to accommodate the terminology for Aktionsart categories reported in Subsection 3.2.2, where 'action' and 'event' are subcategories of what can be termed 'eventualities.'


We extract this information from the categorization of verbs in a lexical-semantic database called a wordnet. The theoretical background and methodological framework are discussed in (Lis, 2012a; Lis, 2012b; Lis, submitted).

In the present paper, differing from preceding studies on the multimodal expression of eventualities, we test the hypotheses by applying supervised learning to the data. Our aim in employing this method is to test the annotation scheme and the potential application of the annotations in automatic systems, and to study the relationship between speech and gesture not only for relations between single variables but also for groups of attributes. In this, we follow the approach adopted by a number of researchers. For example, Jokinen and colleagues (2008) have used classification experiments to test the adequacy of the annotation categories for the studied phenomenon. Louwerse and colleagues (2006a; 2006b) have applied machine learning algorithms to annotated English map-task dialogues to study the relation between facial expressions, gaze and speech. A number of papers (Fujie et al., 2004; Morency et al., 2009; Morency et al., 2005; Morency et al., 2007; Navarretta and Paggio, 2010) describe classification experiments testing the correlation between speech, prosody and head movements in annotated multimodal corpora. Machine learning algorithms have also been applied to annotations of hand gestures and the co-occurring referring expressions in order to identify gestural features relevant for co-reference resolution (Eisenstein and Davis, 2006; Navarretta, 2011).

Moreover, in the present work we extend the annotations reported in (Lis, 2012b) with two more form attributes (Movement and Direction). These attributes were chosen because they belong to the fundamental parameters of gesture form description (Bressem, 2013) and because they are associated with motion, and are therefore expected to be of importance given that we study eventualities, especially motion ones.

The paper is organized as follows. In section 2, we briefly present the background for our study, and in section 3 we describe the multimodal corpus and the annotations used in our analyses. In section 4, we present the machine learning experiments, and in section 5 we discuss our results and their implications and propose directions for future research.

2 Background

The form of co-verbal, iconic gestures is influenced by, among other things, the semantics of the co-occurring speech and by the visually perceivable characteristics of the entity referred to (Kita and Özyürek, 2003; McNeill, 1992). Poggi (2008) has suggested that not only the observable properties of the referent should be taken into consideration but also "the type of semantic entity it constitutes." She has distinguished four such types (Animates, Artifacts, Natural Objects and Eventualities) and proposed that their gestural representation will differ.

Eventualities themselves can still be represented in gesture in various ways, for example from different Viewpoints (McNeill, 1992; McNeill, 2005). In Character Viewpoint gestures (C-vpt), an eventuality is shown from the perspective of the agent: the gesturer mimes the agent's behavior. In Observer Viewpoint (O-vpt), the narrator sees the eventuality as an observer, and in Dual Viewpoint (D-vpt), the gesturer merges the two perspectives. Parrill (2010) has suggested that the choice of Viewpoint is influenced by the eventuality structure. She has proposed that eventualities in which the trajectory is the more salient element elicit O-vpt gestures, while eventualities in which the use of the character's hands in accomplishing a task is more prominent tend to evoke C-vpt gestures.

Other factors suggested to influence eventuality gestures include verb Aspect and Aktionsart. Aspect marks "different ways of viewing the internal temporal constituency of a situation" (Comrie, 1976). The most common distinction is between perfective and imperfective aspect: the former draws focus to the completeness and resultativeness of an eventuality, whereas with the latter the eventuality is viewed as ongoing. Duncan (2002) has analyzed the relationship between the Aspect of verbs and the Handedness of gestures in English and Chinese data. Handedness regards which hand performs the movement and, in the case of bi-handed gestures, whether the hands mirror each other. Duncan has found that symmetric bi-handed gestures more often accompany perfective verbs than imperfective ones; the latter mostly co-occur with two-handed non-symmetric gestures. Parrill and colleagues (2013) have investigated the relationship between verbal Aspect and gesture Iteration (repetition of a movement pattern within a gesture).


They have found that descriptions in progressive Aspect are more often accompanied by iterated gestures. This is, however, only the case if the eventualities are presented to the speakers in that Aspect in the stimuli.

Aktionsart is a notion similar to, but discernible from, Aspect.[2] It concerns Vendler's (1967) distinction between States, Activities, Accomplishments and Achievements, drawn according to differences between the static and the dynamic, the telic and the atelic, the durative and the punctual. Becker and colleagues (2011) have conducted a qualitative study on Aktionsart and the temporal coordination between speech and gesture. They have suggested that gestures affiliated with Achievement and Accomplishment verbs are completed, or repeated, on the goal of the verb, whereas in the case of gestures accompanying Activity verbs, the stroke coincides with the verb itself.

Lis (2012a) has introduced a framework in which the relationship between these factors and gestural expressions of eventualities is investigated using wordnet databases, i.e. electronic linguistic taxonomies. She has employed a wordnet to, among other things, formalize Poggi's (2008) and Parrill's (2010) insights. Based on the plWordNet 1.5 classification, she has distinguished different types of eventualities and shown their correlation with gestural representation (Lis, 2012b). The present study builds further on that work, using updated (plWN 2.0), revised (Lis, submitted) and extended annotations and machine learning experiments.

3 The data

3.1 The corpus

Our study was conducted on the refined annotations (Lis, submitted) from the corpus described in (Lis, 2012a; Lis, 2012b), which is in turn an enriched version of the PCNC corpus created by the DiaGest research group (Karpinski et al., 2008). Data collection followed the well-established methodology of McNeill (1992; 2005): the corpus consists of audio-video recordings of 5 male and 5 female adult native Polish speakers who re-tell a Canary Row cartoon to an addressee. The stimulus contains numerous eventualities and has proved to elicit rich multimodal output.

[2] For a discussion of the differences between Aspect and Aktionsart and between the Germanic and Slavic traditions of viewing these two concepts, cf. (Młynarczyk, 2004).

Figure 1: A snapshot from the ANVIL tool

The monologues were recorded in a studio, as shown in Figure 1, and the whole corpus consists of approximately one hour of recordings.

3.2 The annotation

Speech has been transcribed with word time stamps by the DiaGest group, who also identified communicative hand gestures and annotated their phases, phrases and semiotic types in ELAN (Wittenburg et al., 2006). Lis (2012a; 2012b) exported the annotations to the ANVIL tool (Kipp, 2004) and enriched them with a coding of verbs and of the Viewpoint, Handedness, Handshape and Iteration of gestures. The annotations in the corpus were refined and, for the purpose of the present study, further extended with two more gesture form attributes (Direction and Movement) (Lis, submitted).

3.2.1 The annotation of gestures

Iconic hand gestures were identified based on DiaGest's annotation of semiotic types. Gestures depicting eventualities were manually annotated using six pre-defined features, as reported in detail in (Lis, submitted). Table 1 shows the attributes and values for gesture annotation used in this study. Viewpoint describes the perspective adopted by the speaker and was encoded using the values proposed by McNeill (1992): C-vpt, O-vpt and D-vpt. The attribute Handedness indicates whether one (Right_Hand, Left_Hand) or two hands are gesturing and, in the latter case, whether they are symmetric or not (Symmetric_Hands versus Non-symmetric_Hands).


Table 1: Annotations of gestures

Attribute    Value
Viewpoint    Observer_Viewpoint, Character_Viewpoint, Dual_Viewpoint
Handshape    ASL_C, ASL_G, ASL_5, ASL_O, ASL_S, Complex, Other
Handedness   Right_Hand, Left_Hand, Symmetric_Hands, Non-symmetric_Hands
Iteration    Single, Repeated, Hold
Movement     Straight, Arc, Circle, Complex, None
Direction    Vertical, Horizontal, Multidirectional, None

Handshape refers to the configuration of the palm and fingers of the gesturing hand(s); the values are taken from the American Sign Language Handshape inventory (Tennant and Brown, 2010): ASL_C, ASL_G, ASL_5, ASL_O and ASL_S, supplemented with values for hand shapes changing throughout the stroke (Complex) or not falling under any of the mentioned categories (Handshape_Other). Iteration indicates whether a particular movement pattern within a stroke occurs once (Single) or multiple times (Repeated), or whether the stroke consists of a static Hold. Movement regards the shape of the motion, while Direction regards the plane on which the motion is performed.
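To make the scheme concrete, the sketch below shows how a single annotated gesture could be represented as a record. It is purely illustrative: the field names and the example instance are ours, drawn from the values in Table 1, and do not reflect the actual ANVIL export format.

from dataclasses import dataclass

@dataclass
class GestureAnnotation:
    # Form attributes with the value sets listed in Table 1.
    viewpoint: str    # Observer_Viewpoint, Character_Viewpoint or Dual_Viewpoint
    handshape: str    # ASL_C, ASL_G, ASL_5, ASL_O, ASL_S, Complex or Other
    handedness: str   # Right_Hand, Left_Hand, Symmetric_Hands or Non-symmetric_Hands
    iteration: str    # Single, Repeated or Hold
    movement: str     # Straight, Arc, Circle, Complex or None
    direction: str    # Vertical, Horizontal, Multidirectional or None

# A hypothetical O-vpt gesture tracing a path with one hand.
example = GestureAnnotation("Observer_Viewpoint", "ASL_G", "Right_Hand",
                            "Single", "Straight", "Horizontal")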

3.2.2 The annotation of verbs

Verbs were identified in the time-stamped speech transcript. Information about the verbs was extracted from the Polish wordnet, plWordNet 2.0, following the procedure explained in (Lis, 2012a; Lis, 2012b). In a wordnet, lexical units are classified into sets of synonyms, called synsets, which are linked to each other via a number of conceptual-semantic and lexical relations (Fellbaum, 1998). The most frequently encoded relation is hyponymy, also called the IS_A or TYPE_OF relation, which connects a sub-class to its super-class, the hyperonym. Non-lexical synsets in the upper-level hierarchies of the hyponymy encoding in plWordNet contain information on verb Aspect, Aktionsart and domain (Maziarz, 2012).

A domain denotes a segment of reality, and all lexical units belonging to a particular domain share a common semantic property (Brinton, 2000). Lis (2012a; 2012b) has used wordnet domains to categorize referents of multimodal expressions according to their type. The attribute Eventuality Type was assigned based on the domain of the verb used in speech to denote the eventuality. The choice of the domains in focus has been partially inspired by Parrill's distinction between eventualities with a more prominent trajectory versus eventualities with a more prominent handling element (Parrill, 2010). Based on this, Lis (2012a; 2012b) has distinguished two Eventuality Types:[3] Translocation and Body_Motion. The former refers to eventualities involving the traversal of a path by a moving object or a focus on spatial arrangement, and the latter refers to a movement of the agent's body (part) not entailing displacement of the agent as a whole (cf. (Levin, 1993)). Lis has subsumed plWordNet domains to fit this distinction. The domains relevant to our study are (with examples of verbs from the corpus given in parentheses):

TRANSLOCATION
{location or spatial relations}[4] (spadac 'to fall,' zderzac sie 'to collide');
{change of location or spatial relations change} (biegac 'to run,' skakac 'to jump').

BODY_MOTION
{causing change of location or causing spatial relations change} (rzucac 'to throw,' otwierac 'to open');
{physical contact} (bic 'to beat,' łapac 'to catch');
{possession} (dawac 'to give,' brac 'to take');
{producing} (budowac 'to build,' gotowac 'to cook').

Verbs from the synset {location or spatial relations} and its alterational counterpart were subsumed under the type Translocation. More examples of such verbs from the corpus include: wspinac sie 'to climb,' chodzic 'to walk,' wypadac 'to fall out.' The synsets {causing change of location or causing spatial relations change} and {physical contact}, as well as {possession} and {producing}, were grouped under the type Body_Motion. Further verb examples are: przynosic 'to bring,' trzymac 'to keep,' walic 'to bang,' dawac 'to give,' szyc 'to sew.' Verbs from the remaining domains were collected under the umbrella term 'Eventuality_Other.' These verbs constituted less than 10% of all verb-gesture tokens found in the data. Examples include: {social relationships} grac 'to play,' {mental or emotional state} ogladac 'to watch.' For the purpose of the analyses in the present paper, they were combined with the Body_Motion category.[5]

[3] Note that these categories are orthogonal to Poggi's (2008) ontological types.

[4] In wordnets, {} indicates a synset.


Table 2: Annotations of verbs

Attribute          Value
Eventuality Type   Translocation, Body_Motion, Other
Aspect             Perfective, Imperfective
Aktionsart         State, Act, Activity, Accident, Event, Action, Process

The domains were semi-automatically assigned to the verbs in our data. Verb polysemy was resolved with a refined version (Lis, submitted) of the heuristics proposed in (Lis, 2012b).
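The grouping of domains into Eventuality Types amounts to a lookup from synset label to type. A minimal sketch is given below; the synset labels are the English glosses used above, and the function is an illustrative stand-in, not the semi-automatic assignment script used in the study.

# Assumed domain labels, taken from the glosses listed in the text.
TRANSLOCATION_DOMAINS = {
    "location or spatial relations",
    "change of location or spatial relations change",
}
BODY_MOTION_DOMAINS = {
    "causing change of location or causing spatial relations change",
    "physical contact",
    "possession",
    "producing",
}

def eventuality_type(domain: str) -> str:
    """Map a verb's plWordNet domain to the Eventuality Type attribute."""
    if domain in TRANSLOCATION_DOMAINS:
        return "Translocation"
    if domain in BODY_MOTION_DOMAINS:
        return "Body_Motion"
    # Remaining domains (under 10% of tokens) fall under Eventuality_Other,
    # which was merged with Body_Motion for the analyses in this paper.
    return "Eventuality_Other"

print(eventuality_type("physical contact"))  # -> Body_Motion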

Apart from the domains, the encoding of hyponymy-hyperonymy relations of verbs in plWordNet also provides information about Aktionsart and Aspect. The attribute Aspect has two possible values: Perfective and Imperfective. For Aktionsart, seven categories are distinguished: States, Acts, Activities, Accidents, Events, Actions and Processes. They are Laskowski's (1998) adaptation of Vendler's (1967) Aktionsart classification to the features typical of the Polish language.[6]

Table 2 shows the attributes and values for the verb annotation used in our study.

3.2.3 The annotation process

Gestures and verbs were coded on separate tracks and connected by means of the MultiLink option in ANVIL. Gestures were linked to the semantically affiliated verb. The verbs and gestures were closely related temporally: 80% of the verb onsets fell within the stroke phase or slightly preceded it (Lis, submitted). Figure 1 shows a screenshot of the annotations in the tool. 269 relevant verb-gesture pairs were found in the data. Intercoder agreement was calculated for the majority of the gesture annotation attributes and ranged from 0.67 to 0.96 (Lis, submitted) in terms of κ score (Cohen, 1960), i.e. from substantial to almost perfect agreement (Rietveld and van Hout, 1993).
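For reference, Cohen's κ compares observed agreement p_o with chance agreement p_e, κ = (p_o - p_e) / (1 - p_e). A minimal sketch with invented labels is shown below; it is not the script used to compute the figures reported above.

from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' nominal labels (Cohen, 1960)."""
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n        # observed agreement
    ca, cb = Counter(coder_a), Counter(coder_b)
    p_e = sum(ca[lab] * cb[lab] for lab in set(ca) | set(cb)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example with Viewpoint labels from two hypothetical annotators.
print(cohen_kappa(["C-vpt", "O-vpt", "O-vpt", "C-vpt"],
                  ["C-vpt", "O-vpt", "C-vpt", "C-vpt"]))  # -> 0.5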

4 The classification experiments

In the machine learning experiments we wanted to test to what extent we can predict the form of hand gestures from the characteristics of eventualities and verbs, as reflected in plWordNet's categorization.

[5] The resulting frequency distribution of Type in the verb-gesture pairs: Translocation (150) and Body_Motion+Other (119).

[6] Laskowski's (1998) categories of Vendler's (1967) Aktionsart are called Classes. For the sake of simplicity, we use the term Aktionsart instead of Class to refer to them.

Table 3: Classification of Handshape

Handshape    Precision  Recall  F-score
baseline     0.08       0.28    0.12
Aspect       0.08       0.28    0.12
Aktionsart   0.22       0.28    0.21
Type         0.17       0.32    0.22
all          0.19       0.27    0.21

The relevant data were extracted from the gesture and orthography tracks in ANVIL and combined using the MultiLink annotation. The classification experiments were performed in WEKA (Witten and Frank, 2005), using ten-fold cross-validation to train and test the classifiers. As the baseline in the evaluation, the results obtained by the ZeroR classifier were used; ZeroR always chooses the most frequently occurring nominal value. An implementation of a support vector classifier (WEKA's SMO) was applied in all other cases; various algorithms were tested, with SMO giving the best results. The results of the experiments are reported in terms of Precision, Recall and F-score (Witten and Frank, 2005).
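As a rough illustration of this setup (the experiments themselves were run in WEKA with its ZeroR and SMO implementations), an analogous pipeline could be assembled in scikit-learn as sketched below. The feature rows and labels are invented toy data standing in for the corpus, and the linear-kernel SVC is only an approximation of WEKA's SMO.

from sklearn.dummy import DummyClassifier           # majority-class baseline (ZeroR analogue)
from sklearn.svm import SVC                          # support vector classifier (SMO analogue)
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

# Toy stand-in for the 269 verb-gesture pairs, repeated so that
# ten-fold cross-validation has enough data in each class to run.
rows = [(["Imperfective", "Activity", "Translocation"], "O-vpt"),
        (["Perfective",   "Act",      "Body_Motion"],   "C-vpt"),
        (["Imperfective", "Event",    "Translocation"], "O-vpt"),
        (["Perfective",   "Action",   "Body_Motion"],   "C-vpt")] * 10
X = [features for features, _ in rows]   # Aspect, Aktionsart, Eventuality Type
y = [label for _, label in rows]         # gesture attribute to predict, e.g. Viewpoint

models = {
    "baseline": make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                              DummyClassifier(strategy="most_frequent")),
    "SVM":      make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                              SVC(kernel="linear")),
}
for name, model in models.items():
    predictions = cross_val_predict(model, X, y, cv=10)   # ten-fold cross-validation
    p, r, f, _ = precision_recall_fscore_support(y, predictions, average="weighted")
    print(f"{name}: Precision={p:.2f} Recall={r:.2f} F-score={f:.2f}")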

4.1 Classifying the gesture form features from linguistic information

In these experiments we wanted to test whether it is possible to predict the form of the gesture from the type of the eventuality referred to and from information about Aspect and Aktionsart. The first group of experiments regards the Handshape attribute, with seven possible values. The results of these experiments are shown in Table 3. They indicate that Aspect information does not affect the classification of Handshape at all, and that Eventuality Type and Aktionsart contribute only slightly to the classification (the best result is obtained using the Eventuality Type annotation, an F-score improvement of 0.1 with respect to the baseline, which is not significant).[7] Not surprisingly, the confusion matrix from this experiment shows that the categories which are assigned more correctly are those that occur more often in the data (ASL_5 and ASL_S).

In the following experiment, we wanted to test whether Aktionsart, Aspect and Eventuality Type are related to the employment of the hands in the gestures. Thus, Handedness was predicted using the verb-related annotations.

[7] We indicate significant results with ∗. Significance was calculated with a one-tailed t-test and p<0.05.


Table 4: Classification of Handedness

Handedness   Precision  Recall  F-score
baseline     0.2        0.44    0.27
Aspect       0.2        0.44    0.27
Aktionsart   0.33       0.45    0.37
Type         0.36       0.48    0.41
all          0.35       0.47    0.40

Table 5: Classification of Iteration

Iteration    Precision  Recall  F-score
baseline     0.55       0.74    0.63
Aspect       0.55       0.74    0.63
Aktionsart   0.55       0.74    0.63
Type         0.55       0.74    0.63
all          0.55       0.74    0.63

The results of these experiments are in Table 4. Also in this case, Aspect does not contribute to the prediction of gesture form. However, the results show that information about the Eventuality Type to some extent improves classification with respect to the baseline (F-score improvement: 0.14∗). The most correctly identified gestures were performed with Right_Hand and Symmetric_Hands, which are the most frequently occurring Handedness values in the data.

In the third group of experiments, we wanted to investigate whether the linguistic categorization of verbs improves the prediction of gesture Iteration. The results of these classification experiments are in Table 5. They indicate that no single feature contributes to the classification of hand repetition: in all cases the most frequently occurring value, Single, is chosen, as in the baseline.

In the fourth group of experiments we analyzed whether the linguistic categorization of verbs enhances the prediction of Movement. We present the results of these classification experiments in Table 6. They show that none of the investigated verbal attributes is related to Movement in gesture.

In the fifth group of experiments, the relation between the linguistic categorization of verbs and the Direction of the hand movement was examined.

Table 6: Classification of Movement

Movement     Precision  Recall  F-score
baseline     0.37       0.61    0.46
Aspect       0.37       0.61    0.46
Aktionsart   0.37       0.61    0.46
Type         0.37       0.61    0.46
all          0.37       0.61    0.46

Table 7: Classification of Direction

Direction    Precision  Recall  F-score
baseline     0.26       0.50    0.34
Aspect       0.26       0.50    0.34
Aktionsart   0.47       0.55    0.50
Type         0.26       0.50    0.34
all          0.47       0.55    0.50

Table 8: Predicting the Viewpoint type from linguistic information

Viewpoint    Precision  Recall  F-score
baseline     0.29       0.54    0.38
Aspect       0.29       0.54    0.38
Aktionsart   0.53       0.59    0.53
Type         0.71       0.78    0.74
all          0.71       0.78    0.74

The results of these classification experiments are given in Table 7. They indicate that only Aktionsart contributes to the prediction of Direction (the improvement with respect to the baseline: 0.16∗).

4.2 Classifying the Viewpoint

In the following experiments we investigated to what extent it is possible to predict the Viewpoint in gesture from a) the linguistic categorization of the verb and b) the gesture form.

In the first experiment, we tried to automatically identify the Viewpoint in the gesture from the Eventuality Type annotation. We also investigated to what extent the verb Aspect and Aktionsart contribute to the classification. The results of these experiments are in Table 8. They confirm that there is a strong correlation between Viewpoint and Eventuality Type (F-score improvement with respect to the baseline: 0.36∗). We also found a correlation between Viewpoint and Aktionsart.

In Figure 2 the confusion matrix for the best classification results is given. Not surprisingly, the classifier did not perform well on the very infrequent category, i.e. D-vpt.

   a   b   c    <-- classified as
  89   0  12  |  a = C-VPT
   5   0  18  |  b = D-VPT
  25   0 120  |  c = O-VPT

Figure 2: Confusion matrix for predicting Viewpoint from linguistic information

In the last group of experiments we applied the SMO classifier to the data to predict Viewpoint from Handshape, Handedness, Iteration, Movement and Direction.


Table 9: Predicting the Viewpoint type from form features

Viewpoint    Precision  Recall  F-score
baseline     0.29       0.54    0.38
Handshape    0.64       0.7     0.67
Handedness   0.58       0.64    0.60
Iteration    0.67       0.57    0.44
Movement     0.55       0.55    0.43
Direction    0.67       0.57    0.44
all          0.68       0.72    0.69

Table 9 summarizes the results of these experiments. They demonstrate a strong correlation between the form of a gesture and the gesturer's Viewpoint: the F-score improvement with respect to the baseline is 0.31∗ when all form-related features are used, and all features contribute to the classification. Handshape and Handedness are the features most strongly correlated to Viewpoint. In Figure 3 the confusion matrix for the best classification results is given.

   a   b   c    <-- classified as
  84   0  17  |  a = C-VPT
  20   0   3  |  b = D-VPT
  36   0 109  |  c = O-VPT

Figure 3: Confusion matrix for predicting Viewpoint from form features

5 Discussion and future work

The results of our first group of experiments indicate that it is to some extent possible to automatically predict certain form characteristics of hand gestures from the linguistic categorization of their lexical affiliates. We found that the Eventuality Type extracted from the wordnet categorization of verbs improves the classification of Viewpoint in the co-occurring gesture. Our results are in line with Lis' (2012b) claim that the type of referent influences gestural representation, a claim in turn inspired by Poggi's (2008) and Parrill's (2010) hypotheses.

Lis (submitted) interprets the finding in terms of Gricean Maxims (Grice, 1976), which among other things state that speakers tend to convey as much relevant information in as economical a way as possible. Body_Motion refers to a movement of the agent's body (part) not entailing displacement of the agent as a whole, which can easily be mimed with hand gestures from an internal perspective. The trajectory or spatial arrangement of Translocation eventualities, on the other hand, is less readily reenacted without the risk of hindering the communicative flow between interlocutors.

It can, however, easily be depicted from an external perspective with gestures drawing paths. Moreover, we have identified the form features of gestures which are most tightly related to Viewpoint, namely Handshape and Handedness. In line with the previous interpretation, Lis (submitted) suggests that C-vpt gestures often depict interaction with an object, and the hand shapes reflect grasping and holding. O-vpt gestures, on the contrary, focus on shapes and spatial extents and thus utilize hand shapes convenient for depicting lines, i.e. a hand with extended finger(s). It needs, however, to be further examined to what extent the distribution of Handshape and Handedness in our data is motivated by the specifics of the stimuli.

Our findings also show that the type of eventuality improves the prediction of Handedness. However, Eventuality Type provides a more substantial improvement in the prediction of Viewpoint, i.e. an aspect of gestural representation rather than of the purely physical form of the gesture. This suggests that considering such a representational format as an intermediate step in modeling gesture production may be appropriate. Having found that referent properties are only partially predictive of the form of iconic gestures, Kopp and colleagues (2008) consider direct meaning-form mapping to have weak empirical support. They have instead suggested a two-step micro-planning procedure in which the relationship between referent properties and the physical form of the gesture is mediated by a representational format. The present experiments do not provide an answer as to whether the two-step approach could lead to modeling aspects of eventuality gesture production. More analyses are needed, and they should be addressed in future work.

While our results indicate that Eventuality Type is the strongest predictor of gesture form, we have also found that Handedness and Viewpoint are related to Aktionsart, whereas none of the considered form features showed a correlation with verb Aspect. An explanation might be that both Eventuality Type and Aktionsart regard more inherent characteristics of the eventuality, while Aspect regards the speaker's external perspective on the eventuality. It also needs to be noted that not all Aktionsart categories are equally represented in our data.[8]


The three most frequent Aktionsart categories share the feature 'intentionality,' but belong to different groups in Vendler's classification (Maziarz et al., 2011). It should be investigated to what extent the different Aktionsart types in our data are represented for the different Eventuality Types, as that may provide a further explanation of the obtained results.

Aspect does not improve the classification of any feature. The observation that Aspect is related to Handedness (Duncan, 2002) and Iteration (Parrill et al., 2013) is thus not reflected in this corpus. It needs to be remembered that the relationship between Aspect and Iteration was found by Parrill and colleagues (2013) only when the eventualities were presented to the speakers in the appropriate Aspect in the stimuli. Our results suggest that it may not be generalizable to an overall correlation between Aspect and gesture Iteration. Moreover, Aspect is expressed very differently in the three languages under consideration (Polish in the present study, English in (Parrill et al., 2013), and English and Chinese in (Duncan, 2002)). Cross-linguistic differences have been found to be reflected in gesturing (Kita and Özyürek, 2003). Whether such differences in the encoding of Aspect impact gestures should thus be investigated further.

The results of the experiments also indicate that gestural Iteration and Movement are not related at all to the linguistic characteristics of the co-occurring verb and that the only feature improving the classification of gesture Direction is Aktionsart. For Iteration, however, our data are biased in that single gestures are predominant, which may have affected the results. Regarding Movement and Direction, we suggest that they may be primarily dependent on the visual properties of the referent rather than on the investigated factors. For example, Kita and Özyürek (2003) have found that the direction of gesture in elicited narrations reflects the direction in which an eventuality has been presented in the stimuli. The only improvement identified in our experiments in the classification of Direction (due to Aktionsart) requires further investigation.

Our results suggest the viability of the framework adopted in the paper, i.e. the application of wordnets to the investigation of speech-gesture ensembles.

[8] The frequency distribution of Aktionsart in the verb-gesture pairs: Activities (115), Acts (56), Actions (58), Events (23), Accidents (15), States (2), Processes (0); and of Aspect: Imperfective (179) and Perfective (126).

Wordnet classification of lexical items can be used to shed some light on speech-related gestural behavior. Using a wordnet as an external source of annotation increases coding reliability and, due to the machine-readable format of wordnets, enables automatic assignment of values. Wordnets exist for numerous languages; the approach may thus be applied cross-linguistically and help to uncover universal versus language-specific structures in gesture production. The findings support the viability of a number of categories in the annotation scheme used: they corroborate that the type of referent is a category relevant to studying gestural characteristics, and they validate the importance of introducing distinctions among eventualities for multimodal phenomena. The experiments also identify another attribute, Aktionsart, as relevant in the framework.

It has to be noted, however, that our study is only preliminary, because the results of our machine learning experiments are biased by the fact that for some attributes certain values occur much more frequently than others in the data. Future work should address normalization as a possible solution. Moreover, our findings are based on narrational data and need to be tested on different types of interaction. Most importantly, the dataset we used is small for machine learning purposes. Due to the time load of multimodal annotation, small datasets are a well-known challenge in gesture research. Our results thus await validation on a larger sample. Also, cross-linguistic studies on comparable corpora should be performed.

In the present work only one type of bodily behavior, i.e. hand gestures, was taken into account, but people use their whole body when communicating. Thus, we plan to extend our investigation to gestures of other articulators, such as head movements and posture changes. Furthermore, in the present work only gestures referring to eventualities were considered; Lis (submitted) has recently started extending the wordnet-based framework and investigation to animate and inanimate objects.

References

Polish WordNet. Wrocław University of Technology. http://plwordnet.pwr.wroc.pl/wordnet/.

Tennant, R. and Brown, M. The American Sign Language Handshape Dictionary. Washington, DC: Gallaudet University Press (2010).


Alibali, M. W., Heath, D. C., and Meyers, H. J. Effects of visibility between speakers and listeners on gesture production: Some gestures are meant to be seen. Journal of Memory and Language, 44:159–188 (2001).

Becker, R., Cienki, A., Bennett, A., Cudina, C., Debras, C., Fleischer, Z., Haaheim, M., Mueller, T., Stec, K., and Zarcone, A. Aktionsarten, speech and gesture. In Gesture and Speech in Interaction '11 (2011).

Bressem, J. A linguistic perspective on the notation of form features in gestures. Body – Language – Communication. Handbooks of Linguistics and Communication Science. Berlin, New York: Mouton de Gruyter (2013).

Brinton, L. The structure of modern English: A linguistic introduction. John Benjamins Publishing Company (2000).

Comrie, B. Aspect. Cambridge: Cambridge University Press (1976).

Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46 (1960).

Duncan, S. Gesture, verb Aspect, and the nature of iconic imagery in natural discourse. Gesture, 2(2):183–206 (2002).

Eisenstein, J. and Davis, R. Gesture features for coreference resolution. In Renals, S., Bengio, S., and Fiscus, J., editors, MLMI 06, pages 154–155 (2006).

Eisenstein, J. and Davis, R. Gesture improves coreference resolution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 37–40, New York (2006).

Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA (1998).

Fujie, S., Ejiri, Y., Nakajima, K., Matsusaka, Y., and Kobayashi, T. A conversation robot using head gesture recognition as para-linguistic information. In Proceedings of the 13th IEEE International Workshop on Robot and Human Interactive Communication, 159–154 (2004).

Grice, H. Logic and Conversation. Syntax and Semantics, 3:41–58. Academic Press, New York (1976).

Jokinen, K., Navarretta, C., and Paggio, P. Distinguishing the communicative function of gesture. In Proceedings of MLMI (2008).

Karpinski, M., Jarmołowicz-Nowikow, E., Malisz, Z., Szczyszek, M., and Juszczyk, J. Rejestracja, transkrypcja i tagowanie mowy oraz gestów w narracji dzieci i dorosłych [Recording, transcription and tagging of speech and gestures in the narration of children and adults]. Investigationes Linguisticae, 17 (2008).

Kendon, A. Gesture: Visible Action As Utterance. Cambridge University Press, Cambridge (2004).

Kipp, M. Gesture Generation by Imitation - From Human Behavior to Computer Character Animation. Boca Raton, Florida (2004).

Kita, S. and Özyürek, A. What does cross-linguistic variation in semantic coordination of speech and gesture reveal? Evidence for an interface representation of spatial thinking and speaking. Journal of Memory and Language, 48(1):16–32 (2003).

Kopp, S., Bergmann, K., and Wachsmuth, I. Multimodal communication from multimodal thinking – towards an integrated model of speech and gesture production. Semantic Computing, 2(1):115–136 (2008).

Krauss, R. M., Chen, Y., and Gottesman, R. F. Lexical gestures and lexical access: A process model. In McNeill, D., editor, Language and Gesture, pages 261–283. Cambridge University Press, New York (2000).

Laskowski, L. Kategorie morfologiczne jezyka polskiego - charakterystyka funkcjonalna [Morphological categories of the Polish language - a functional characterization]. PWN, Warszawa (1998).

Levin, B. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago (1993).

Lis, M. Annotation scheme for multimodal communication: Employing plWordNet 1.5. In Proceedings of the Formal and Computational Approaches to Multimodal Communication Workshop, 24th European Summer School in Logic, Language and Information (ESSLLI'12) (2012).

Lis, M. Influencing gestural representation of eventualities: Insights from ontology. In Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI'12), pages 281–288 (2012).

Lis, M. Multimodal representation of entities: A corpus-based investigation of co-speech hand gesture. PhD dissertation, University of Copenhagen (submitted).

Louwerse, M., Jeuniaux, P., Hoque, M., Wu, J., and Lewis, G. Multimodal communication in computer-mediated map task scenarios. In Sun, R. and Miyake, N., editors, Proceedings of the 28th Annual Conference of the Cognitive Science Society, Mahwah, NJ: Erlbaum (2006).

Louwerse, M. M., Benesh, N., Hoque, M., Jeuniaux, P., Lewis, G., Wu, J., and Zirnstein, M. Multimodal communication in face-to-face conversations. In Sun, R. and Miyake, N., editors, Proceedings of the 29th Annual Conference of the Cognitive Science Society, Mahwah, NJ: Erlbaum (2006).

Maziarz, M. Non-lexical verb synsets in upper-hierarchy levels of Polish WordNet 2.0. Technical report, Wrocław University of Technology (2012).


Maziarz, M., Piasecki, M., Szpakowicz, S., Rabiega-Wisniewska, J., and Hojka, B. Semantic relations between verbs in Polish WordNet 2.0. Cognitive Studies, (11):183–200 (2011).

McNeill, D. Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago (1992).

McNeill, D. Gesture and Thought. University of Chicago Press, Chicago (2005).

Melinger, A. and Levelt, W. Gesture and the communicative intention of the speaker. Gesture, 4(2):119–141 (2005).

Młynarczyk, A. Aspectual pairing in Polish. PhD dissertation, University of Utrecht (2004).

Morency, L.-P., de Kok, I., and Gratch, J. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems, 20:70–84 (2009).

Morency, L.-P., Sidner, C., Lee, C., and Darrell, T. Contextual recognition of head gestures. In Proceedings of the International Conference on Multimodal Interfaces (2005).

Morency, L.-P., Sidner, C., Lee, C., and Darrell, T. Head gestures for perceptual interfaces: The role of context in improving recognition. Artificial Intelligence, 171(8–9):568–585 (2007).

Navarretta, C. Anaphora and gestures in multimodal communication. In Proceedings of the 8th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2011), pages 171–181, Faro, Portugal (2011).

Navarretta, C. and Paggio, P. Classification of feedback expressions in multimodal data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL'10), pages 318–324, Uppsala, Sweden (2010).

Parrill, F. Viewpoint in speech-gesture integration: Linguistic structure, discourse structure, and event structure. Language and Cognitive Processes, 25(5):650–668 (2010).

Parrill, F., Bergen, B., and Lichtenstein, P. Grammatical aspect, gesture, and conceptualization: Using co-speech gesture to reveal event representations. Cognitive Linguistics, 24(1):135–158 (2013).

Peirce, C. S. Collected Papers of Charles Sanders Peirce (1931–58). Edited by C. Hartshorne, P. Weiss and A. Burks. Cambridge, MA: Harvard University Press (1931).

Poggi, I. Iconicity in different types of gestures. Gesture, 8(1):45–61 (2008).

Ramchand, G. Post-Davidsonianism. Theoretical Linguistics, 31(3):359–373 (2005).

Rauscher, F. H., Krauss, R. M., and Chen, Y. Gesture, speech, and lexical access: The role of lexical movements in speech production. Psychological Science, 7(4):226–231 (1996).

Rietveld, T. and van Hout, R. Statistical Techniques for the Study of Language and Language Behavior. Mouton de Gruyter, Berlin (1993).

Vendler, Z. Linguistics in Philosophy. Cornell University Press, Ithaca, NY (1967).

Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition (2005).

Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., and Sloetjes, H. ELAN: A professional framework for multimodality research. In LREC'06, Fifth International Conference on Language Resources and Evaluation (2006).
