Multipurpose Design and Creation of GSL Dictionaries

Advances in Phonetics-based Sub-Unit Modeling for Transcription, Alignment and Sign Language Recognition

Vassilis Pitsikalis (1), Stavros Theodorakis (1), Christian Vogler (2) and Petros Maragos (1)
(1) School of Electrical and Computer Engineering, National Technical University of Athens
(2) Institute for Language and Speech Processing / Gallaudet University
Workshop on Gesture Recognition, June 20, 2011

Overview
- Gestures, signs, and goals
- Sign language data and visual processing
- Data-driven sub-units, without phonetic evidence, for recognition
- Phonetic modeling
  - What is it?
  - Annotations vs. phonetics
  - Conversion of annotations to structured phonetic description
  - Training and alignment
- Recognition experiments
- Conclusions

1. Gestures versus Signs
Gestures
- Isolated hand, body, and facial movements
- Can be broken down into primitives (but rarely are in gesture recognition work)
- Few constraints, other than convention

Signs
- Hand, body, and facial movements, both in isolation and as part of sentences
- Can be broken down into primitives (cheremes/phonemes/phones)
- Numerous phonetic, morphological, and syntactic constraints

SL Recognition vs. Gesture Recognition
- Continuous SL recognition is invariably more complex than gesture recognition, but:
  - Isolated sign recognition (i.e., the forms found in a dictionary) is essentially the same task as gesture recognition
  - Methods that work well on isolated sign recognition should work well on gesture recognition
  - We can exploit 30+ years of research into the structure of signs

Subunit Modeling
- Two fundamentally different ways to break down signs into parts: data-driven, and phonetics-based (i.e., linguistic)
- Similar benefits: scalability; robustness; reduced required training data

Goals of this Presentation
- Work with a large vocabulary (1000 signs)
- Compare data-driven and phonetic breakdowns of signs into subunits
- Advance the state of the field in the phonetic breakdown of signs

2. Sign Language Data
- Corpus of 1000 Greek Sign Language lemmata
- 5 repetitions per sign
- Signer-dependent; 2 signers (only 1 used for this paper)
- HD video, 25 fps, interlaced
- Tracking and feature extraction
- Pre-processing, configuration, statistics, skin color training

Interlaced Data and Pre-processing
[Figure: interlaced vs. de-interlaced frames and refined skin color masks. A second version of the data offers full resolution (1440x1088) and full frame rate (50 fps).]
Tracking Video, GSL Lemmas Corpus
[Video: hand and head tracking results on the GSL lemmas corpus.]
Given these, we move on to describe the recognition framework: the recognition setup employed and the variety of experiments conducted.

3. Data-Driven Subunit Modeling
- Extremely popular in SL recognition lately
- Good results
- Different approaches exist; ours is based on distinguishing between dynamic and static subunits

Dynamic (Movement) - Static (Position)
- Segmentation: intuitive; yields segments + labels
- Separate modeling of SUs: clustering with respect to feature type (e.g., static vs. dynamic features), parameters (e.g., model order), and architecture (HMM, GMM); feature normalization
- Training; data-driven lexicon
- Recognition of SUs and signs

Dynamic-Static SU Recognition
- A 2-state HMM segments and labels frames (represented by velocity magnitude) as Dynamic (D, high velocity) vs. Static (S, low velocity).
- Static segments: K-means clustering of the 2D position (x, y) vectors, then a single-component GMM (GMM-1) per cluster for training.
- Dynamic segments: using features such as direction, scale, or normalized position, hierarchical clustering via DTW (with Euclidean distances); DTW yields a distance between segments, and hierarchical clustering is chosen for flexibility. A 6-state Bakis HMM is then built for each cluster.
- The SUs are the final HMM per dynamic cluster and the final GMM per static cluster.
- Training: based on sign-level transcriptions, the extracted SUs form an SU sequence for each sign; this builds a data-driven lexicon.
- Testing: Viterbi decoding fits the HMMs to the test sequence and recognizes it simultaneously; the lexicon is imposed.
- Scenarios covered: easy and difficult (unseen pronunciations); not covered: unseen signs.
- The figures show SUs as points on the x-y plane.
A minimal code sketch of this pipeline is given below.
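To make the pipeline concrete, here is a minimal Python sketch. It is not the authors' code: a fixed velocity threshold stands in for the 2-state segmentation HMM, the cluster counts are hypothetical, and the per-cluster Bakis HMM training is only indicated in a comment.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def segment_dynamic_static(positions, vel_thresh=2.0):
    """Label each frame D(ynamic)/S(tatic) from its velocity magnitude,
    then group consecutive identical labels into segments."""
    vel = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    labels = np.where(vel > vel_thresh, "D", "S")
    segments, start = [], 0
    for i in range(1, len(labels)):
        if labels[i] != labels[start]:
            segments.append((labels[start], positions[start:i + 1]))
            start = i
    segments.append((labels[start], positions[start:]))
    return segments

def dtw_distance(a, b):
    """Plain DTW with Euclidean local distances between segments."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def build_subunits(segments, n_static=8, n_dynamic=6):
    """Cluster static segments by mean position (K-means, then GMM-1 per
    cluster) and dynamic segments by pairwise DTW distances."""
    static = [seg for lab, seg in segments if lab == "S"]
    dynamic = [seg for lab, seg in segments if lab == "D"]
    # Static SUs: K-means on mean 2D positions, one GMM-1 per cluster.
    means = np.array([s.mean(axis=0) for s in static])
    km = KMeans(n_clusters=n_static, n_init=10).fit(means)
    static_sus = [
        GaussianMixture(n_components=1).fit(
            np.vstack([s for s, c in zip(static, km.labels_) if c == k]))
        for k in range(n_static)]
    # Dynamic SUs: condensed DTW distance matrix -> hierarchical clusters.
    n = len(dynamic)
    cond = [dtw_distance(dynamic[i], dynamic[j])
            for i in range(n) for j in range(i + 1, n)]
    dyn_labels = fcluster(linkage(cond, method="average"),
                          t=n_dynamic, criterion="maxclust")
    # In the full system, a 6-state Bakis HMM would be trained on the
    # segments of each dynamic cluster; here we return the assignments.
    return static_sus, dyn_labels

Hierarchical clustering is a natural fit for the dynamic segments because DTW yields only pairwise distances, which condensed-linkage clustering consumes directly.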
Dynamic-Static SU Extraction
Reference: V. Pitsikalis, S. Theodorakis and P. Maragos, "Data-Driven Sub-Units and Modeling Structure for Continuous Sign Language Recognition with Multiple Cues," LREC 2010.
Dynamic-Static SU Extraction
[Figures: example dynamic clusters and static clusters on the x-y plane.]
4. Phonetic Modeling
- Based on modeling signs linguistically
- Little recent work

Phonetics
- Phonetics: the study of the sounds that constitute a word
- Equivalently: the study of the elements that constitute a sign (i.e., its pronunciation)

The Role of Phonetics
- Words consist of smaller parts, e.g., cat = /k/ /æ/ /t/
- So do signs, e.g., CHAIR (handshape, orientation, location, movement)
- The parts are well known: 30+ years of research
- Less clear: a good structured model
- Gestures can borrow from the sign inventory

The Role of Phonetics in Recognition
- The most successful speech recognition systems model words in terms of their constituent phones/phonemes, not in terms of data-driven subunits
- Benefits: easily adding new words to the dictionary; linguistic knowledge and robustness
- Why don't sign language recognition systems do this? Phonetics are complex, and phonetic annotations/lexica are expensive to create

Annotation vs. Phonetic Structure
- There is a difference between the annotation (writing down) of a word and its phonetic structure; the latter is required for recognition
- Annotations cannot be applied directly to recognition, although an expert can infer the full pronunciation and structure from an annotation
- Annotations for signed languages are much less time consuming to produce than the full phonetic structure

Annotation of a Sign
Basic HamNoSys annotation of CHAIR:
[Figure: HamNoSys symbol string for CHAIR, labeled left to right: Symmetry, Handshape, Orientation, Location, Movement, Repetition.]

Phonetic Structure of Signs
- Postures, Detentions, Transitions, Steady Shifts (PDTS): improved over the 1989 Movement-Hold model
- Alternating postures and transitions
- CHAIR:

  Posture  | Trans         | Posture | Trans   | Posture  | Trans         | Det
  shoulder | straight down | chest   | back up | shoulder | straight down | chest

(This decomposition is sketched as a data structure below.)
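As a side illustration, the PDTS decomposition of CHAIR above could be held in a small typed structure; the class and field names here are ours, not the authors' data format.

from dataclasses import dataclass

@dataclass
class Segment:
    kind: str    # "P" (posture), "D" (detention),
                 # "T" (transition), "S" (steady shift)
    detail: str  # e.g. a location or movement description

# CHAIR, following the alternating posture/transition structure above:
CHAIR = [
    Segment("P", "shoulder"),
    Segment("T", "straight down"),
    Segment("P", "chest"),
    Segment("T", "back up"),    # repetition: return to the start
    Segment("P", "shoulder"),
    Segment("T", "straight down"),
    Segment("D", "chest"),      # final detention (hold)
]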
How Expensive is Phonetic Modeling?
- Basic HamNoSys annotation of CHAIR: 8 characters
- The same sign with full phonetic structure: (starting posture) (downward transition & ending posture) (transition back & starting posture, due to the repetition) (downward transition & ending posture)
- Over 70 characters, compared to just 8!

Automatic Extraction of Phonetic Structure
- First contribution: automatically extract the phonetic structure of a sign from its HamNoSys annotation
- Combines the convenience of annotations with the detail required for recognition
- Recovers segmentation, postures, transitions, and the relative timing of the hands
- Based on symbolic analysis of movements, symmetries, etc. (the core idea is sketched in code below)
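The following is our illustration of the idea, not the authors' extraction algorithm (which works symbolically on full HamNoSys, handling symmetries and hand timing): expand an annotation into alternating postures and transitions, inserting a return transition for a repetition and turning the final posture into a detention (hold).

def annotation_to_pdts(start_location, movements, repeat=False):
    """movements: list of (movement_description, end_location)."""
    segments = [("P", start_location)]

    def expand():
        for move, end_loc in movements:
            segments.append(("T", move))
            segments.append(("P", end_loc))

    expand()
    if repeat:
        segments.append(("T", "return"))
        segments.append(("P", start_location))
        expand()
    segments[-1] = ("D", segments[-1][1])  # final hold = detention
    return segments

# CHAIR: start at the shoulder, move straight down to the chest, repeated.
print(annotation_to_pdts("shoulder", [("straight down", "chest")], repeat=True))

Run on CHAIR, this reproduces the alternating sequence from the table above: P(shoulder), T(straight down), P(chest), T(return), P(shoulder), T(straight down), D(chest).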
Training and Alignment of Phonetic SUs
- Second contribution: train classifiers based on the phonetic structure, and align them with the data to recover frame boundaries
- Frame boundaries are not needed for recognition, but can be used for further data-driven analysis
- Classifiers are based on HMMs. Why?
  - Proven track record for this type of task
  - No explicit segmentation required: just concatenate SUs and use Baum-Welch training
  - Trivial to scale the lexicon up to 1000s of signs
A minimal alignment sketch follows below.
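To make the alignment step concrete, here is a minimal sketch (assumed feature shapes and single-Gaussian states per subunit, not the authors' implementation) of Viterbi-aligning a concatenated left-to-right state sequence to the observations to recover frame boundaries. Real subunit models would have several states each (e.g., 5-state HMMs for transitions).

import numpy as np

def log_gauss(x, mean, var):
    """Diagonal-Gaussian log-likelihood of one frame."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi_align(frames, states):
    """states: list of (mean, var), one per subunit, traversed left to
    right. Returns the frame index where each subunit starts.
    Assumes len(frames) >= len(states)."""
    T, N = len(frames), len(states)
    ll = np.array([[log_gauss(f, m, v) for m, v in states] for f in frames])
    D = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    D[0, 0] = ll[0, 0]
    for t in range(1, T):
        for s in range(N):
            stay = D[t - 1, s]
            move = D[t - 1, s - 1] if s > 0 else -np.inf
            D[t, s] = max(stay, move) + ll[t, s]
            back[t, s] = s if stay >= move else s - 1
    # Trace back the best path; the path must end in the last state.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [path.index(s) for s in range(N)]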
Phonetic Models to HMM

  Posture  | Trans         | Posture | Trans   | Posture  | Trans         | Det
  shoulder | straight down | chest   | back up | shoulder | straight down | chest

Phonetic Subunit Training, Alignment
[Figure: the sign PILE aligned into segments labeled E T T E P P P (E = epenthesis, T = transition, P = posture). Transition/epenthesis segments are shown as superimposed initial and end frames with an arrow; posture/detention segments as a single frame.]
Phonetic Sub-Units
[Figure: example transition/epenthesis subunits.]

Phonetic Sub-Units
[Figure: example posture subunits.]
5. Recognition Based on Both Data-Driven SUs and Phonetic Transcriptions

System 1, data-driven subunits (based on dynamic-static):
  Visual front-end -> Segmentation -> Sub-units + Training -> Decoding -> Recognized signs (data-driven SUs)

System 2, data + phonetic subunits (the PDTS system):
  Visual front-end -> Segmentation -> Alignment -> PDTS sequence -> Sub-units + Training -> Decoding -> Recognized signs + PDTS sequence + alignment

Notes:
- P = Posture, D = Detention, T = Transition, S = Steady Shift (epenthesis).
- Our assumptions: dynamic corresponds roughly to (T, S), and static to (D, P).
- The PDTS conversion transforms the HamNoSys annotation into a sequence of parts; the converted PDTS sequence has no timestamps, so it needs alignment.
- In System 2, the phonetic SUs (P, D, T, S plus their IDs) are not partitioned into subclusters; this is done only in System 3.
- System 2 builds one model per P, D, T, S subunit: a single-component GMM (GMM-1) for P and D, and a 5-state HMM for T and S (sketched in code below).
- "PDTS + segmentation" does not use the detailed PDTS labels and works as in the dynamic/static segmentation; "PDTS + SUs" assigns the detailed PDTS labels (e.g., which type of posture).
- Alignment is part of training; decoding = testing (alignment is also done during testing).
- System 3 takes as input the alignment from System 2 and the PDTS labels, and performs editing (merging and splitting) to deal with noisy decisions and/or inconsistencies from the visual processing or the annotations.
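A sketch of System 2's per-subunit model choices, as listed above: a single-component GMM for postures/detentions and a 5-state HMM for transitions/steady shifts. The libraries (scikit-learn, hmmlearn) and the diagonal covariance are our illustrative assumptions; the full system would additionally impose a left-to-right (Bakis) transition topology on the HMMs.

from hmmlearn import hmm
from sklearn.mixture import GaussianMixture

def make_subunit_model(kind):
    if kind in ("P", "D"):   # static subunit: model position with GMM-1
        return GaussianMixture(n_components=1)
    if kind in ("T", "S"):   # dynamic subunit: 5-state HMM over features
        return hmm.GaussianHMM(n_components=5, covariance_type="diag")
    raise ValueError(f"unknown PDTS kind: {kind}")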
Data-Driven vs. Phonetic Subunit Recognition
[Figures: (top right) results when varying the number of dynamic SUs and the method, with static SUs = 300 and 961 signs; (bottom left) results when varying the number of signs and the method.]
- Data-driven = System 1; phonetic = System 2.
- Top-right figure: the data-driven approach always uses the optimal number of dynamic/static clusters (= number of SUs) for each experiment (i.e., for each different number of signs).
- Bottom-left figure: the number of signs is at its maximum, 961.

6. Conclusions and the Big Picture
- Richly annotated SL corpora are rare (in contrast to speech)
- Human annotations of sign language (HamNoSys) are expensive and subjective, and contain errors and inconsistencies
- HamNoSys annotations contain no time structure
- Data-driven approaches are efficient, but construct abstract subunits
- Converting HamNoSys to PDTS gains time structure and sequentiality, and constructs meaningful phonetics-based SUs

Further exploiting the PDTS + phonetic SUs:
- Correct human annotations automatically
- Valuable for SU-based SL recognition, continuous SLR, adaptation, and the integration of multiple streams
Thank you!

Workshop on Gesture Recognition, June 20, 2011

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 231135. Theodore Goulas contributed the HamNoSys annotations for GSL.