
Modeling Signs using Functional Data Analysis

Sunita Nayak (University of South Florida), Sudeep Sarkar (University of South Florida), Kuntal Sengupta (AuthenTec, Inc.)

[email protected] [email protected] [email protected]

Abstract

We present a functional data analysis (FDA) based method to statistically model continuous signs of American Sign Language (ASL) for use in the recognition of signs in continuous sentences. We build models in the Space of Probability Functions (SoPF), which captures the evolution of the relationships among the low-level features (e.g. edge pixels) in each frame. The distribution (histogram) of the horizontal and vertical displacements between all pairs of edge pixels in an image frame forms the relational distribution. We represent the sequence of relational distributions, corresponding to the sequence of image frames in a sign, as a sequence of points in a multi-dimensional space that captures the salient variations in these relational distributions over time; we call this space the SoPF. Each sign model consists of a mean sign function and covariance functions, capturing the variability of each sign in the training set. We use functional data analysis to arrive at this model. Recognition and sign localization are performed by correlating this statistical model with any given sentence. We also present a method to infer and learn sign models, in an unsupervised manner, from sentence samples containing the sign; there is no need for manual intervention.

1. Introduction

While speech recognition has made rapid advances, sign language recognition lags behind. With the gradual shift to speech-based I/O devices, there is a great danger that persons who rely solely on sign languages for communication will be denied access to state-of-the-art technology unless there are significant advances in the automated recognition of sign languages.

Previous work in sign language recognition has mostly addressed static gestures, e.g. [2, 21, 13], and isolated signs, e.g. [19]. Yeasin and Chaudhuri [20] worked on dynamic hand gestures. Bobick and Wilson [1] proposed a state-based approach to model gestures. Starner and Pentland [11] were the first to seriously consider continuous sign recognition.

(Appeared in Proc. Fourth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP), pp. 64-69, December 2004.)

Using Hidden Markov Model (HMM) based representations, they achieved near-perfect recognition on sentences of fixed structure, i.e. containing a personal pronoun, verb, noun, adjective, and personal pronoun, in that order. Vogler and Metaxas [15, 16, 17] have been instrumental in significantly pushing the state of the art in automated ASL recognition using HMMs. In terms of the basic HMM formalism, they have explored many variations, such as context-dependent HMMs, HMMs coupled with partially segmented sign streams, and parallel HMMs. The wide use of HMMs is also seen in other sign language recognizers.

Most of the work in continuous sign language recognition has avoided the basic problem of segmentation and tracking of the hands by using wearable devices, such as colored gloves or magnetic markers, to directly obtain the location features. For example, Vogler and Metaxas [15, 16, 17] used a 3D magnetic tracking system; Starner and Pentland [11] used colored gloves, while Ma et al. [5, 18] used Cybergloves. In this paper, we restrict ourselves to plain color images, without the use of any augmenting wearable devices.

There are two kinds of information that can be used for recognition, viz. manual and non-manual. The manual information relates to hand motion or shape, while the non-manual information relates to facial expressions, head movement, or torso movement. Here we use the manual information from hand motion. The hand motion is first modeled using relational distributions, which are efficiently represented as points in the Space of Probability Functions (SoPF). The points are then transformed into smooth curves that are registered and trained to form a unique model for a sign using Functional Data Analysis.

2. Data Set

A vital component in ASL recognition research is the data set used in the study. The largest corpus used in ASL recognition contains a vocabulary of around 50 signs, embedded in approximately 500 sentences [15, 16, 17]. Only recently has there been a concerted effort to systematically construct a common ASL corpus for public dissemination. At Boston University, Neidle et al. [6] have created such a dataset using SignStream, which is a system for linguistic annotation, storage, and retrieval of ASL and other forms of gestural communication. This dataset also had no wearable aids, but the video was sampled too coarsely: on average there were only 5.8 frames per sign. So, we had to do our own data collection.

Setting the realistic long-term goal of automated ASL recognition, but in a constrained domain, we selected sentences that would be used while communicating with deaf people at airports. Data was collected and ground-truthed by an ASL interpreter. A color video camera was used, and the background was kept plain. The dataset has 39 distinct signs forming 25 sentences, with 10 to 12 instances of each sentence. The details of this dataset are available in [7].

3. Relational Distributions and Space of Probability Functions

In most of the previous work in continuous ASL, detection and tracking of the hands have been simplified using colored gloves [12] or magnetic markers [15]. Other sign language recognizers have likewise used colored gloves or data gloves. Only recently has there been an effort to extract information and to track directly from color images, without the use of special devices [19], but it has only been applied to isolated sign recognition. As we shall see, our representation does not require tracking of the hands. We would like these representations to be somewhat robust to low-level errors. We use the Canny edge pixels of each video frame as the low-level primitives.

Grounded in the observation that the organization, structure, or relationships among low-level primitives are more important than the primitives themselves, we focus on the statistical distribution of the relational attributes observed in the image, which we refer to as relational distributions. Such a statistical representation also removes the need for primitive-level correspondence or tracking across frames. Such representations have been successfully used for modeling periodic motion in the context of identifying a person from gait [14], and non-periodic motion in the context of sign recognition [7]. Here, we use them to build statistical models for non-periodic motion in ASL signs. Primitive-level statistical distributions, such as orientation histograms, have been used for gesture recognition [3]. However, the only use of relational histograms that we are aware of is by Huet and Hancock [4], who used them to model line distributions in the context of image database indexing. The novelty of relational distributions lies in that they offer a strategy for incorporating dynamic aspects.

We refer the reader to [14] for the details of the representation; here we just sketch the essentials. Let $F = \{f_1, \ldots, f_N\}$ represent the set of $N$ primitives in an image; for us, these are the Canny edge pixels of the image. Let $F^k$ represent a random $k$-tuple of primitives, and let the relationship among the $k$-tuple of primitives be denoted by $R^k$. Let the relationships $R^k$ be characterized by a set of $M$ attributes $A^k = \{A^k_1, \ldots, A^k_M\}$. For ASL, we use the vertical and horizontal displacements between two edge pixels, $(dx, dy)$, as the attributes. We normalize the displacements and represent them on a 32 x 32 grid to reduce the size for further processing. The shape of the pattern can be represented by the joint probability functions $P(A^k = a^k)$, also denoted by $P(a^k_1, \ldots, a^k_M)$ or $P(a^k)$, where $a^k_i$ is the (discretized, in practice) value taken by the relational attribute $A^k_i$. We term these probabilities the relational distributions.

One interpretation of these distributions is:

Given an image, if you randomly pick $k$-tuples of primitives, what is the probability that they will exhibit the relational attribute $a^k$? What is $P(A^k = a^k)$?

Given that these relational distributions exhibit complicated shapes that do not readily afford modeling using a combination of simply shaped distributions, we adopt a non-parametric, histogram-based representation. However, to reduce the size associated with a histogram-based representation, we use the Space of Probability Functions (SoPF).
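
For concreteness, the following is a minimal sketch of how one such 2-ary relational distribution could be computed with NumPy. The function name, the (row, col) edge-pixel array layout, and the exact normalization of the displacement range are our own illustrative choices, not details taken from the authors' implementation.

```python
import numpy as np

def relational_distribution(edge_pixels, image_shape, bins=32):
    """2-ary relational distribution of one frame: a normalized 2D
    histogram of the (dx, dy) displacements between all pairs of
    Canny edge pixels, discretized on a bins x bins grid (32 x 32,
    as in the paper). edge_pixels is an (N, 2) array of (row, col)."""
    h, w = image_shape
    rows = edge_pixels[:, 0].astype(float)
    cols = edge_pixels[:, 1].astype(float)
    # All pairwise vertical and horizontal displacements.
    dy = (rows[:, None] - rows[None, :]).ravel()
    dx = (cols[:, None] - cols[None, :]).ravel()
    # Histogram over the full displacement range, normalized so the
    # bins form a probability distribution P(dx, dy).
    hist, _, _ = np.histogram2d(dx, dy, bins=bins,
                                range=[[-w, w], [-h, h]])
    return hist / hist.sum()
```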

As the hands of the signer move, the relational distribution changes; the motion of the hands introduces non-stationarity in the relational distributions. Figure 1 shows examples of the 2-ary relational distributions for the sign 'CAN'. Notice the change in the distributions as the hands come down. The change in one direction of the relational distributions can be seen clearly as the hands come down, while there is comparatively less change in the other direction.

Let $P(a^k, t)$ represent the relational distribution at time $t$. Let

$\sqrt{P(a^k, t)} = \sum_{i=1}^{n} c_i(t)\,\Phi_i(a^k) + \mu(a^k) + \eta(a^k)$   (1)

describe the square root of each relational distribution as a linear combination of orthogonal basis functions, where the $\Phi_i(a^k)$ are orthonormal functions, $\mu(a^k)$ is a mean function defined over the attribute space, and $\eta(a^k)$ is a function capturing small random noise variations with zero mean and small variance. We refer to this space as the Space of Probability Functions (SoPF).

We use the square root function so that we arrive at a space where the distances are not arbitrary but are related to the Bhattacharyya distance between the relational distributions, which is an appropriate distance measure for probability distributions. Its proof can be found in [14].


Figure 1: Variations in relational distributions with motion. The left column shows the image frames in the sign 'CAN'. The middle column shows the edge pixels, and the right column shows the relational distributions.

Given a set of relational distributions $\{P(a^k, t_i) \mid i = 1, \ldots, T\}$, the SoPF can be arrived at by principal component analysis (PCA). In practice, we can consider the subspace spanned by a few ($N \ll n$) dominant eigenvectors associated with the large eigenvalues. Here, most of the variation is captured by the eigenvectors associated with the top 20 (largest) eigenvalues. Thus, a relational distribution can be represented using these $N$ coordinates (the $c_i(t)$), which is a more compact representation than a normalized histogram-based one. The ASL sentences form sequences of points in this Space of Probability Functions.
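
As an illustration, the SoPF construction could be sketched as below: PCA (via an SVD) on the square roots of the flattened training histograms, keeping the top 20 eigenvectors. All names are ours, and the helpers assume the relational distributions have already been computed, one per frame.

```python
import numpy as np

def build_sopf(rel_dists, n_dims=20):
    """Learn the SoPF from training data. rel_dists is a (T, D)
    array of flattened relational distributions (D = 32 * 32).
    Returns the mean function mu and the top n_dims orthonormal
    basis functions Phi_i of Eq. (1)."""
    X = np.sqrt(rel_dists)               # square roots, per Eq. (1)
    mu = X.mean(axis=0)                  # mean function mu(a^k)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_dims]               # shapes (D,), (n_dims, D)

def sopf_coordinates(rel_dist, mu, basis):
    """SoPF coordinates c_i(t) of one relational distribution."""
    return basis @ (np.sqrt(rel_dist).ravel() - mu)
```

Euclidean distances between such coordinate vectors then relate to the Bhattacharyya distances between the underlying distributions, as discussed above.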

4. Supervised Learning using Functional Data Analysis

In the first learning scenario, we use sign samples that are manually segmented from sentences. Each sign sample consists of a sequence of SoPF coordinates. Each coordinate sequence can be looked upon as samples of a smooth curve, or function, in the SoPF space. We arrive at the underlying smooth functional representation for each sign sample using B-spline interpolation [9, 10]. This converts each training sequence into functional data, which are then smoothed and registered to arrive at a single statistical functional model [9, 10]. The specific steps involved are as follows:

1. Each training sequence of SoPF coordinates is time-normalized by linearly interpolated resampling onto a fixed time period, chosen to be the mean length of all the sequences. For further manipulation, the normalized data is resampled again at a 20 times finer resolution than the original data.

2. All the time-normalized, discretely sampled sequences are then together turned into a functional data object, which represents the underlying sequences of continuous functions in terms of basis functions (B-splines, in our experiments) and the coefficients required to reconstruct the observed data. The functional data of the $i$th sequence at time $t$ is represented by

$x_i(t) = \sum_{k=1}^{N} \alpha_{ik}\,\phi_k(t)$   (2)

where $\phi_1, \phi_2, \ldots, \phi_N$ are the $N$ basis functions. The coefficients $\alpha_{ik}$ determining the above expansion are obtained by minimizing the sum of squared differences between the discrete data $d_{ij}$, where $j = 1, 2, \ldots, n$ indexes the $n$ sampling (observation) points, and the corresponding values of $x_i$; i.e.

$\mathrm{SSE}(d_i, \alpha) = \sum_{j=1}^{n} \big[ d_{ij} - \sum_{k=1}^{N} \alpha_{ik}\,\phi_k(t_j) \big]^2$   (3)

is minimized for the $i$th sequence of the data. The number of basis functions in the B-spline representation, $N$, can be determined by $N = N_R + N_D + 4$, where $N_R$ represents the required resolution, i.e. the minimum number of features or events that need to be present in the observation, and $N_D$ is the highest order of derivative that needs to be retained in the observation. In our experiments, we have used cubic B-splines and taken $N_R$ to be 10 and $N_D$ to be 6.

3. The functional data $x_i$ represented above are further smoothed by minimizing the following penalty criterion:

$\mathrm{PSSE} = \int [x_i(t) - z_i(t)]^2\,dt + \lambda\,\mathrm{PR}(z_i)$   (4)

where $z_i$ is the smoothed form of the data and the last term on the right side of the equation penalizes the roughness of $z_i$. $\mathrm{PR}(z_i)$ can be defined as the integral of the square of the second derivative of $z_i$, i.e.,

$\mathrm{PR}(z_i) = \int [z_i''(t)]^2\,dt$   (5)

The amount of smoothing can be controlled by varying the value of the smoothing parameter $\lambda$.



Figure 2: Supervised learning of word models. (a) shows the plots of just the first dimension of the SoPF representation w.r.t. time, for five instances of the sign 'CAN'. (b) shows the interpolated data. (c) shows the smoothed data, (d) shows the mean of the smoothed data, and (e) shows the registered curves.

4. The mean, $\mu(t)$, of the smoothed sequences is then computed.

5. Each of the smoothed curves is registered to the mean curve, $\mu(t)$, by estimating a warping function, $h_i(t)$, for each of them, so that the registered curves, $r_i(t) = z_i[h_i(t)]$, minimize a global criterion:

$\mathrm{REGSSE} = \sum_{i=1}^{M} \int_{T} [r_i(t) - \mu(t)]^2\,dt$   (6)

where $M$ is the number of curves and $T$ is the interval over which the curves are registered. The registration is performed iteratively until a convergence criterion is reached; we use a convergence criterion of 0.01 and a limit of 5 iterations.

6. The covariance is computed at each of the points on the time axis. The mean and covariance functions together form the model of each sign. Both the mean and the covariance are computed in the same way as for any other statistical observations, from all the replications of the observation at each time instant in the functional data object.

For a more detailed discussion of the above processes, we refer the reader to [9, 10]. The code at [8] was used for our experiments; a simplified sketch of the main steps follows.
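
For readers without the Matlab toolbox, here is a rough Python sketch of steps 1 through 4 for a single SoPF dimension. It assumes a recent SciPy whose make_smoothing_spline fits a cubic smoothing spline under a roughness penalty of the same form as Eqs. (4) and (5) (with a discrete error term), and it omits the curve registration of step 5; all names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

def fit_sign_model_1d(sequences, lam=0.1, oversample=20):
    """Steps 1-4 for one SoPF dimension. sequences is a list of 1D
    arrays, one per training instance of the sign. Returns the
    pointwise mean and variance of the smoothed, time-normalized
    curves (registration, step 5, is omitted here)."""
    mean_len = int(np.mean([len(s) for s in sequences]))
    # Common time axis, resampled 20x finer as in step 1.
    t = np.linspace(0.0, 1.0, mean_len * oversample)
    smoothed = []
    for seq in sequences:
        # Step 1: linear resampling onto the common time axis.
        src = np.linspace(0.0, 1.0, len(seq))
        resampled = np.interp(t, src, seq)
        # Steps 2-3: cubic-spline fit with roughness penalty lam
        # (lam = 0.1, the smoothing parameter used in the paper).
        spline = make_smoothing_spline(t, resampled, lam=lam)
        smoothed.append(spline(t))
    smoothed = np.asarray(smoothed)
    # Step 4 (plus the variance half of step 6), pointwise in time.
    return smoothed.mean(axis=0), smoothed.var(axis=0)
```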

Figure 2 illustrates the above modeling process using just the first dimension of the SoPF representation of each sign. We conduct the actual analysis in a 20-dimensional SoPF space; however, the figure is sufficient to illustrate how the traces are simultaneously registered and the mean representation is extracted. In addition to the mean, we also store the covariances among the 20 dimensions at each time instant, i.e. we also store a multidimensional covariance function.

5. Unsupervised Learning of Sign Models

Is it possible to learn a sign model without supervision or manual segmentation of words in the training dataset? In this section, we outline an approach, again based on functional data analysis, for this task. The data consists of many ASL sentences, each consisting of different signs of similar temporal duration, but with the constraint that all contain one common sign. We can automatically generate the model for this common sign. For this, instead of conducting functional data analysis at the word level, we consider the whole sentence. The steps are as follows:

1. We build the mean and covariance function representations from the functional data object for the set of training sentences, using the steps outlined in Section 4. Of course, the registration will be of poor quality, since the sentences contain different signs. However, the registration should be good over the parts of the sentences containing the common word.

2. The trace of the covariance matrix at each time instant forms a measure of the variability of the registration among the sentences.

3. The portions of the mean and covariance functions over which the variability is low form the model of the common sign. We can additionally use prior knowledge, if available, about the possible location of the common word to prune out residual ambiguities.

The process is illustrated in Figure 3, again using just one of the 20 dimensions of the SoPF representation. The common sign in the 23-sentence training set corresponds to the sign 'IDPAPERS'. We have 12 instances of the sentence 'IDPAPERS WHERE' and 11 instances of the sentence 'IDPAPERS TABLE'. We see that the variance in the registered curves is low towards the first half of the sentence, where the common sign occurs. The variance is also low towards the end because of end-of-sentence coarticulation, i.e. all the sentences end in a common stance. This is easy to filter out based on prior knowledge of the common stance, or by simply ignoring the last few frames.
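
In code, the low-variability extraction of steps 2 and 3 might look like the sketch below. It assumes the sentences have already been smoothed and registered into a common (M, T, D) array of SoPF curves; the variability threshold is left as an input, since automating such thresholds is noted as future work in Section 8.

```python
import numpy as np

def common_sign_segment(registered, threshold):
    """Locate the common sign among registered sentence curves.
    registered: (M, T, D) array of M registered sentences sampled
    at T time points in the D-dimensional SoPF. Returns the
    (start, end) indices of the longest run where the trace of the
    per-time covariance matrix stays below threshold."""
    # Trace of the D x D covariance across sentences at each t
    # equals the sum of the per-dimension variances.
    var_trace = registered.var(axis=0).sum(axis=-1)   # shape (T,)
    low = var_trace < threshold
    runs, start = [], None
    for i, flag in enumerate(np.append(low, False)):
        if flag and start is None:
            start = i                    # a low-variability run begins
        elif not flag and start is not None:
            runs.append((start, i))      # the run ends at index i
            start = None
    return max(runs, key=lambda r: r[1] - r[0]) if runs else None
```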



Figure 3: Unsupervised learning of sign models. (a) shows the plot of the first dimension of the SoPF representation for sentences with one common word, 'IDPAPERS', smoothed with smoothing parameter (λ) = 0.1. (b) shows the registered curves for the same set of sentences. (c) shows the variation of the standard deviation of the registered sentences. (d) shows the relevant time period indicating the common sign. (e) shows the first SoPF dimension of the mean representation of the model formed for the sign 'IDPAPERS'. (f) and (g) respectively show the plots of the first vs. second SoPF dimension, and the first vs. second vs. third SoPF dimension of the mean of the learnt model for the same sign.

6. Recognition

The models created above for each of the signs are used for recognizing the signs in continuous ASL sentences. At present, we use a simple correlation-based recognition process. Any given test sequence is turned into a functional data object, in much the same way as in the model formation process. The relational distribution of each frame of the test sequence is represented as a point in the 20-dimensional SoPF. The test sentence traces a curve in the SoPF space. That curve is interpolated in the same way as the training data, and then converted to functional data using the same B-spline basis functions. The test data is then smoothed to remove irrelevant features. The smoothing parameter is kept the same as in the training set, i.e. 0.1.

Each sign model is then matched to the test sentence by correlation. The distance is calculated by summing the distances of each point of the sign's mean curve from the test sentence curve, and then normalizing the sum by the sign's length. Note that one of the properties of the SoPF is that Euclidean distances in this space correspond to Bhattacharyya distances between the corresponding relational distributions [14]. The sign is said to be located at the point of minimum correlational distance. The value of the minimum correlation is a measure of the distance of the sign model from the sentence.
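
A minimal sketch of this matching step is given below, assuming the sign's mean curve and the test sentence curve are sampled on the same time grid; the brute-force sliding search and the names are our own.

```python
import numpy as np

def locate_sign(sentence_curve, sign_mean):
    """Slide a sign's mean curve along a test sentence curve and
    return (best_start, best_distance). Both arguments are (T, D)
    arrays of SoPF coordinates. The per-point Euclidean distances
    in the SoPF relate to Bhattacharyya distances between the
    underlying relational distributions [14]."""
    L = len(sign_mean)
    best_start, best_dist = 0, np.inf
    for s in range(len(sentence_curve) - L + 1):
        window = sentence_curve[s:s + L]
        # Sum of pointwise distances, normalized by the sign length.
        dist = np.linalg.norm(window - sign_mean, axis=1).sum() / L
        if dist < best_dist:
            best_start, best_dist = s, dist
    return best_start, best_dist
```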

7. Actual Recognition Experiments

The data set used for the experiments consists of 16 signs forming 10 sentences, with two to three signs per sentence. The average length of the test sentences was 90 frames. First, we present the supervised learning (Section 4) results. Each learnt mean functional data model is correlated with the functional data object constructed from the test sentence. The model signs are sorted by minimum correlational distance; the sign with the smaller minimum correlational distance is more likely to be present in the sentence. To compute recognition rates, we consider whether the correct sign occurs within the top $n$ matches for an $n$-sign sentence. By this measure, the recognition rate was found to be 57%. If we consider the top $n + 1$ matches, the recognition rate increases to 69%. We note that these rates are for a very simple recognition strategy; we expect the rates to be higher for better recognition strategies. As the focus of this paper is on modeling, we have not yet explored more complicated strategies.

The correlation-based recognition is good at localizing signs in a sentence. For most of the signs in the sentences, the location is found near the actual position; signs were located with about 92% accuracy. We define the error rate as the difference between the actual starting frame number of the sign and the computed starting frame number, normalized by the total number of frames in the sentence.


For unsupervised modeling, we considered four signs, viz. 'WHERE', 'SUITCASE', 'FINISH' and 'IDPAPERS'. The built models were tested on sentences not used in training. Four out of six possible occurrences of the above words in the test data were located with 82% or higher localization accuracy.

8. Conclusions and Future Work

This paper presents a functional approach for supervised and unsupervised modeling of the signs of American Sign Language as smooth curves, with variance at each point on the curve, in a multidimensional space. The approach takes as input plain video data, without any wearable aids such as data gloves or magnetic trackers; instead, it relies on inter-feature relational distributions in each image frame. We are presently working on automating the thresholds used in the above process of self-learning of signs, and on using sentences with common signs at different locations. The use of dynamic time warping while matching the sign, and the use of the covariance while computing the distance from the sentence, could significantly improve the recognition rate. The approach also remains to be tried on a dataset with a larger number of signs and sentences.

References

[1] A. Bobick and A. Wilson. A state-based approach to the representation and recognition of gesture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(12):1325-1337, December 1997.

[2] Y. Cui and J. Weng. Appearance-based hand sign recognition from intensity image sequences. Computer Vision and Image Understanding, 78(2):157-176, May 2000.

[3] W. Freeman and M. Roth. Orientation histograms for hand and gesture recognition. In International Workshop on Face and Gesture Recognition, pages 296-301, 1995.

[4] A. Huet and E. Hancock. Line pattern retrieval using relational histograms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1363-1370, 1999.

[5] J. Ma, W. Gao, C. Wang, and J. Wu. A continuous Chinese sign language recognition system. In International Conference on Automatic Face and Gesture Recognition, pages 428-433, 2000.

[6] C. Neidle, S. Sclaroff, and V. Athitsos. A tool for linguistic and computer vision research on visual-gestural language data. Behavior Research Methods, Instruments, and Computers, 33(3):311-320, November 2001.

[7] A. S. Parashar. Representation and interpretation of manual and non-manual information for automated American Sign Language recognition. Master's thesis, Department of Computer Science and Engineering, University of South Florida, 2003.

[8] J. Ramsay. Matlab, R and S-PLUS functions for functional data analysis. ftp://ego.psych.mcgill.ca/pub/ramsay/FDAfuns/.

[9] J. Ramsay and B. Silverman. Functional Data Analysis. Springer, 1997.

[10] J. Ramsay and B. Silverman. Applied Functional Data Analysis. Springer, 2002.

[11] T. Starner and A. Pentland. Real-time American Sign Language recognition from video using hidden Markov models. In Symposium on Computer Vision, pages 265-270, 1995.

[12] T. Starner and A. Pentland. Visual recognition of American Sign Language using hidden Markov models. Master's thesis, MIT Media Lab, 1995.

[13] J. Triesch and C. von der Malsburg. Robust classification of hand postures against complex backgrounds. In International Conference on Automatic Face and Gesture Recognition, pages 170-175, 1996.

[14] I. R. Vega. Motion Model Based on Statistics of Feature Relations: Human Identification from Gait. PhD thesis, Department of Computer Science and Engineering, University of South Florida, 2002.

[15] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In International Conference on Computer Vision, pages 363-369, 1998.

[16] C. Vogler and D. Metaxas. Parallel hidden Markov models for American Sign Language recognition. In International Conference on Computer Vision, pages 116-122, 1999.

[17] C. Vogler and D. Metaxas. A framework for recognizing the simultaneous aspects of American Sign Language. Computer Vision and Image Understanding, 81:358-384, 2001.

[18] C. Wang, W. Gao, and S. Shan. An approach based on phonemes to large vocabulary Chinese sign language recognition. In International Conference on Automatic Face and Gesture Recognition, pages 393-398, 2002.

[19] M. H. Yang, N. Ahuja, and M. Tabb. Extraction of 2D motion trajectories and its application to hand gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:168-185, August 2002.

[20] M. Yeasin and S. Chaudhuri. Visual understanding of dynamic hand gestures. Pattern Recognition, 33(11), 2000.

[21] M. Zhao and F. K. H. Quek. RIEVL: Recursive induction learning in hand gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1174-1185, 1998.