Multipurpose Design and Creation of GSL Dictionaries

Advances in Phonetics-based Sub-Unit Modeling for Transcription, Alignment and Sign Language Recognition

Vassilis Pitsikalis (1), Stavros Theodorakis (1), Christian Vogler (2) and Petros Maragos (1)
(1) School of Electrical and Computer Engineering, National Technical University of Athens
(2) Institute for Language and Speech Processing / Gallaudet University
Workshop on Gesture Recognition, June 20, 2011

Overview
- Gestures, signs, and goals
- Sign language data and visual processing
- Data-driven sub-units, without phonetic evidence, for recognition
- Phonetic modeling
  - What is it?
  - Annotations vs. phonetics
  - Conversion of annotations to structured phonetic description
  - Training and alignment
- Recognition experiments
- Conclusions

1. Gestures versus Signs
Gestures
- Isolated hand, body, and facial movements
- Can be broken down into primitives (but rarely are in gesture recognition work)
- Few constraints, other than convention

Signs
- Hand, body, and facial movements, both in isolation and as part of sentences
- Can be broken down into primitives (cheremes/phonemes/phones)
- Numerous phonetic, morphological, and syntactic constraints

SL Recognition vs. Gesture Recognition
- Continuous SL recognition is invariably more complex than gesture recognition, but:
  - Isolated sign recognition (i.e., the forms found in a dictionary) is essentially the same task as gesture recognition
  - Methods that work well on isolated sign recognition should work well on gesture recognition
  - We can exploit 30+ years of research into the structure of signs

Subunit Modeling
- Two fundamentally different ways to break down signs into parts: data-driven, and phonetics-based (i.e., linguistic)
- Similar benefits: scalability; robustness; reduced required training data

Goals of this Presentation
- Work with a large vocabulary (1000 signs)
- Compare data-driven and phonetic breakdowns of signs into subunits
- Advance the state of the field in the phonetic breakdown of signs

2. Sign Language Data
- Corpus of 1000 Greek Sign Language lemmata
- 5 repetitions per sign
- Signer-dependent; 2 signers (only 1 used for this paper)
- HD video, 25 fps, interlaced
- Tracking and feature extraction
- Pre-processing, configuration, statistics, skin color training

Interlaced Data and Pre-processing
[Figure: interlaced vs. de-interlaced frames and refined skin color masks. A second version of the data offers full resolution (1440x1088) and full frame rate (50 fps).]
Tracking Video, GSL Lemmas Corpus
[Video: hand and head tracking results on the GSL lemmas corpus.]
Given these, we move on to describe the recognition framework: the recognition setup employed and the variety of experiments conducted.

3. Data-Driven Subunit Modeling
- Extremely popular in SL recognition lately
- Good results
- Different approaches exist; ours is based on distinguishing between dynamic and static subunits

Dynamic (Movement) - Static (Position)
- Segmentation: intuitive; yields segments + labels
- Separate modeling of SUs: clustering with respect to feature type (e.g., static vs. dynamic features), parameters (e.g., model order), and architecture (HMM, GMM); feature normalization
- Training; data-driven lexicon
- Recognition of SUs and signs

Dynamic-Static SU Recognition
- A 2-state HMM segments and labels frames (represented by velocity magnitude) as Dynamic (D, high velocity) vs. Static (S, low velocity).
- Static segments: K-means clustering of the 2D position (x, y) vectors, then a single-component GMM (GMM-1) per cluster for training.
- Dynamic segments: using features such as direction, scale, or normalized position, hierarchical clustering via DTW (with Euclidean distances); DTW yields a distance between segments, and hierarchical clustering is chosen for flexibility. A 6-state Bakis HMM is then built for each cluster.
- The SUs are the final HMM per dynamic cluster and the final GMM per static cluster.
- Training: based on sign-level transcriptions, the extracted SUs form an SU sequence for each sign; this builds a data-driven lexicon.
- Testing: Viterbi decoding fits the HMMs to the test sequence and recognizes it simultaneously; the lexicon is imposed.
- Scenarios covered: easy and difficult (unseen pronunciations); not covered: unseen signs.
- The figures show SUs as points on the x-y plane.
A minimal code sketch of this pipeline is given below.
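To make the pipeline concrete, here is a minimal Python sketch. It is not the authors' code: a fixed velocity threshold stands in for the 2-state segmentation HMM, the cluster counts are hypothetical, and the per-cluster Bakis HMM training is only indicated in a comment.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def segment_dynamic_static(positions, vel_thresh=2.0):
    """Label each frame D(ynamic)/S(tatic) from its velocity magnitude,
    then group consecutive identical labels into segments."""
    vel = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    labels = np.where(vel > vel_thresh, "D", "S")
    segments, start = [], 0
    for i in range(1, len(labels)):
        if labels[i] != labels[start]:
            segments.append((labels[start], positions[start:i + 1]))
            start = i
    segments.append((labels[start], positions[start:]))
    return segments

def dtw_distance(a, b):
    """Plain DTW with Euclidean local distances between segments."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def build_subunits(segments, n_static=8, n_dynamic=6):
    """Cluster static segments by mean position (K-means, then GMM-1 per
    cluster) and dynamic segments by pairwise DTW distances."""
    static = [seg for lab, seg in segments if lab == "S"]
    dynamic = [seg for lab, seg in segments if lab == "D"]
    # Static SUs: K-means on mean 2D positions, one GMM-1 per cluster.
    means = np.array([s.mean(axis=0) for s in static])
    km = KMeans(n_clusters=n_static, n_init=10).fit(means)
    static_sus = [
        GaussianMixture(n_components=1).fit(
            np.vstack([s for s, c in zip(static, km.labels_) if c == k]))
        for k in range(n_static)]
    # Dynamic SUs: condensed DTW distance matrix -> hierarchical clusters.
    n = len(dynamic)
    cond = [dtw_distance(dynamic[i], dynamic[j])
            for i in range(n) for j in range(i + 1, n)]
    dyn_labels = fcluster(linkage(cond, method="average"),
                          t=n_dynamic, criterion="maxclust")
    # In the full system, a 6-state Bakis HMM would be trained on the
    # segments of each dynamic cluster; here we return the assignments.
    return static_sus, dyn_labels

Hierarchical clustering is a natural fit for the dynamic segments because DTW yields only pairwise distances, which condensed-linkage clustering consumes directly.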
Dynamic-Static SU Extraction
Reference: V. Pitsikalis, S. Theodorakis and P. Maragos, "Data-Driven Sub-Units and Modeling Structure for Continuous Sign Language Recognition with Multiple Cues," LREC 2010.
Dynamic-Static SU Extraction
[Figures: example dynamic clusters and static clusters on the x-y plane.]
4. Phonetic Modeling
- Based on modeling signs linguistically
- Little recent work

Phonetics
- Phonetics: the study of the sounds that constitute a word
- Equivalently: the study of the elements that constitute a sign (i.e., its pronunciation)

The Role of Phonetics
- Words consist of smaller parts, e.g., cat = /k/ /æ/ /t/
- So do signs, e.g., CHAIR (handshape, orientation, location, movement)
- The parts are well known: 30+ years of research
- Less clear: a good structured model
- Gestures can borrow from the sign inventory

The Role of Phonetics in Recognition
- The most successful speech recognition systems model words in terms of their constituent phones/phonemes, not in terms of data-driven subunits
- Benefits: easily adding new words to the dictionary; linguistic knowledge and robustness
- Why don't sign language recognition systems do this? Phonetics are complex, and phonetic annotations/lexica are expensive to create

Annotation vs. Phonetic Structure
- There is a difference between the annotation (writing down) of a word and its phonetic structure; the latter is required for recognition
- Annotations cannot be applied directly to recognition, although an expert can infer the full pronunciation and structure from an annotation
- Annotations for signed languages are much less time consuming to produce than the full phonetic structure

Annotation of a Sign
Basic HamNoSys annotation of CHAIR:
[Figure: HamNoSys symbol string for CHAIR, labeled left to right: Symmetry, Handshape, Orientation, Location, Movement, Repetition.]

Phonetic Structure of Signs
- Postures, Detentions, Transitions, Steady Shifts (PDTS): improved over the 1989 Movement-Hold model
- Alternating postures and transitions
- CHAIR:

  Posture  | Trans         | Posture | Trans   | Posture  | Trans         | Det
  shoulder | straight down | chest   | back up | shoulder | straight down | chest

(This decomposition is sketched as a data structure below.)
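As a side illustration, the PDTS decomposition of CHAIR above could be held in a small typed structure; the class and field names here are ours, not the authors' data format.

from dataclasses import dataclass

@dataclass
class Segment:
    kind: str    # "P" (posture), "D" (detention),
                 # "T" (transition), "S" (steady shift)
    detail: str  # e.g. a location or movement description

# CHAIR, following the alternating posture/transition structure above:
CHAIR = [
    Segment("P", "shoulder"),
    Segment("T", "straight down"),
    Segment("P", "chest"),
    Segment("T", "back up"),    # repetition: return to the start
    Segment("P", "shoulder"),
    Segment("T", "straight down"),
    Segment("D", "chest"),      # final detention (hold)
]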
How Expensive is Phonetic Modeling?
- Basic HamNoSys annotation of CHAIR: 8 characters
- The same sign with full phonetic structure: (starting posture) (downward transition & ending posture) (transition back & starting posture, due to the repetition) (downward transition & ending posture)
- Over 70 characters, compared to just 8!

Automatic Extraction of Phonetic Structure
- First contribution: automatically extract the phonetic structure of a sign from its HamNoSys annotation
- Combines the convenience of annotations with the detail required for recognition
- Recovers segmentation, postures, transitions, and the relative timing of the hands
- Based on symbolic analysis of movements, symmetries, etc. (the core idea is sketched in code below)
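The following is our illustration of the idea, not the authors' extraction algorithm (which works symbolically on full HamNoSys, handling symmetries and hand timing): expand an annotation into alternating postures and transitions, inserting a return transition for a repetition and turning the final posture into a detention (hold).

def annotation_to_pdts(start_location, movements, repeat=False):
    """movements: list of (movement_description, end_location)."""
    segments = [("P", start_location)]

    def expand():
        for move, end_loc in movements:
            segments.append(("T", move))
            segments.append(("P", end_loc))

    expand()
    if repeat:
        segments.append(("T", "return"))
        segments.append(("P", start_location))
        expand()
    segments[-1] = ("D", segments[-1][1])  # final hold = detention
    return segments

# CHAIR: start at the shoulder, move straight down to the chest, repeated.
print(annotation_to_pdts("shoulder", [("straight down", "chest")], repeat=True))

Run on CHAIR, this reproduces the alternating sequence from the table above: P(shoulder), T(straight down), P(chest), T(return), P(shoulder), T(straight down), D(chest).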
Training and Alignment of Phonetic SUs
- Second contribution: train classifiers based on the phonetic structure, and align them with the data to recover frame boundaries
- Frame boundaries are not needed for recognition, but can be used for further data-driven analysis
- Classifiers are based on HMMs. Why?
  - Proven track record for this type of task
  - No explicit segmentation required: just concatenate SUs and use Baum-Welch training
  - Trivial to scale the lexicon up to 1000s of signs
A minimal alignment sketch follows below.
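To make the alignment step concrete, here is a minimal sketch (assumed feature shapes and single-Gaussian states per subunit, not the authors' implementation) of Viterbi-aligning a concatenated left-to-right state sequence to the observations to recover frame boundaries. Real subunit models would have several states each (e.g., 5-state HMMs for transitions).

import numpy as np

def log_gauss(x, mean, var):
    """Diagonal-Gaussian log-likelihood of one frame."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi_align(frames, states):
    """states: list of (mean, var), one per subunit, traversed left to
    right. Returns the frame index where each subunit starts.
    Assumes len(frames) >= len(states)."""
    T, N = len(frames), len(states)
    ll = np.array([[log_gauss(f, m, v) for m, v in states] for f in frames])
    D = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    D[0, 0] = ll[0, 0]
    for t in range(1, T):
        for s in range(N):
            stay = D[t - 1, s]
            move = D[t - 1, s - 1] if s > 0 else -np.inf
            D[t, s] = max(stay, move) + ll[t, s]
            back[t, s] = s if stay >= move else s - 1
    # Trace back the best path; the path must end in the last state.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [path.index(s) for s in range(N)]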
Phonetic Models to HMM

  Posture  | Trans         | Posture | Trans   | Posture  | Trans         | Det
  shoulder | straight down | chest   | back up | shoulder | straight down | chest

Phonetic Subunit Training, Alignment
[Figure: the sign PILE aligned into segments labeled E T T E P P P (E = epenthesis, T = transition, P = posture). Transition/epenthesis segments are shown as superimposed initial and end frames with an arrow; posture/detention segments as a single frame.]
Phonetic Sub-Units
[Figure: example transition/epenthesis subunits.]

Phonetic Sub-Units
[Figure: example posture subunits.]
5. Recognition Based on Both Data-Driven SUs and Phonetic Transcriptions

System 1, data-driven subunits (based on dynamic-static):
  Visual front-end -> Segmentation -> Sub-units + Training -> Decoding -> Recognized signs (data-driven SUs)

System 2, data + phonetic subunits (the PDTS system):
  Visual front-end -> Segmentation -> Alignment -> PDTS sequence -> Sub-units + Training -> Decoding -> Recognized signs + PDTS sequence + alignment

Notes:
- P = Posture, D = Detention, T = Transition, S = Steady Shift (epenthesis).
- Our assumptions: dynamic corresponds roughly to (T, S), and static to (D, P).
- The PDTS conversion transforms the HamNoSys annotation into a sequence of parts; the converted PDTS sequence has no timestamps, so it needs alignment.
- In System 2, the phonetic SUs (P, D, T, S plus their IDs) are not partitioned into subclusters; this is done only in System 3.
- System 2 builds one model per P, D, T, S subunit: a single-component GMM (GMM-1) for P and D, and a 5-state HMM for T and S (sketched in code below).
- "PDTS + segmentation" does not use the detailed PDTS labels and works as in the dynamic/static segmentation; "PDTS + SUs" assigns the detailed PDTS labels (e.g., which type of posture).
- Alignment is part of training; decoding = testing (alignment is also done during testing).
- System 3 takes as input the alignment from System 2 and the PDTS labels, and performs editing (merging and splitting) to deal with noisy decisions and/or inconsistencies from the visual processing or the annotations.
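A sketch of System 2's per-subunit model choices, as listed above: a single-component GMM for postures/detentions and a 5-state HMM for transitions/steady shifts. The libraries (scikit-learn, hmmlearn) and the diagonal covariance are our illustrative assumptions; the full system would additionally impose a left-to-right (Bakis) transition topology on the HMMs.

from hmmlearn import hmm
from sklearn.mixture import GaussianMixture

def make_subunit_model(kind):
    if kind in ("P", "D"):   # static subunit: model position with GMM-1
        return GaussianMixture(n_components=1)
    if kind in ("T", "S"):   # dynamic subunit: 5-state HMM over features
        return hmm.GaussianHMM(n_components=5, covariance_type="diag")
    raise ValueError(f"unknown PDTS kind: {kind}")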
Data-Driven vs. Phonetic Subunit Recognition
[Figures: (top right) results when varying the number of dynamic SUs and the method, with static SUs = 300 and 961 signs; (bottom left) results when varying the number of signs and the method.]
- Data-driven = System 1; phonetic = System 2.
- Top-right figure: the data-driven approach always uses the optimal number of dynamic/static clusters (= number of SUs) for each experiment (i.e., for each different number of signs).
- Bottom-left figure: the number of signs is at its maximum, 961.

6. Conclusions and the Big Picture
- Richly annotated SL corpora are rare (in contrast to speech)
- Human annotations of sign language (HamNoSys) are expensive and subjective, and contain errors and inconsistencies
- HamNoSys annotations contain no time structure
- Data-driven approaches are efficient, but construct abstract subunits
- Converting HamNoSys to PDTS gains time structure and sequentiality, and constructs meaningful phonetics-based SUs

Further exploiting the PDTS + phonetic SUs:
- Correct human annotations automatically
- Valuable for SU-based SL recognition, continuous SLR, adaptation, and the integration of multiple streams
Thank you!

Workshop on Gesture Recognition, June 20, 2011

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 231135. Theodore Goulas contributed the HamNoSys annotations for GSL.