Gesture

Introduction

Figure: A child being sensed by a simple gesture recognition algorithm detecting hand location and movement

Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state but commonly originate from the face or hand. Current focuses in the field include emotion recognition from the face and hand gesture recognition. Many approaches have been developed using cameras and computer vision algorithms to interpret sign language. However, the identification and recognition of posture, gait, proxemics, and human behaviors are also the subject of gesture recognition techniques.

Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a richer bridge between machines and humans than primitive text user interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to keyboard and mouse.

Gesture recognition enables humans to interface with the machine (human–machine interaction, HMI) and interact naturally without any mechanical devices. Using the concept of gesture recognition, it is possible to point a finger at the computer screen so that the cursor will move accordingly. This could potentially make conventional input devices such as mice, keyboards and even touch-screens redundant.

Gesture recognition can be conducted with techniques from computer vision and image processing.


Gesture recognition means interfacing with computers using gestures of the human body, typically hand movements. In gesture recognition technology, a camera reads the movements of the human body and communicates the data to a computer that uses the gestures as input to control devices or applications. For example, a person clapping his hands together in front of a camera can produce the sound of cymbals being crashed together when the gesture is fed through a computer.

One way gesture recognition is being used is to help the physically impaired to interact with computers, such as interpreting sign language. The technology also has the potential to change the way users interact with computers by eliminating input devices such as joysticks, mice and keyboards and allowing the unencumbered body to give signals to the computer through gestures such as finger pointing.

Unlike haptic interfaces, gesture recognition does not require the user to wear any special equipment or attach any devices to the body. The gestures of the body are read by a camera instead of sensors attached to a device such as a data glove. In addition to hand and body movement, gesture recognition technology can also be used to read facial expressions, speech (i.e., lip reading), and eye movements.

The literature includes ongoing work in the computer vision field on capturing gestures or more general human pose and movements by cameras connected to a computer.

Gesture recognition and pen computing:

In some literature, the term gesture recognition has been used to refer more narrowly to non-text-input handwriting symbols, such as inking on a graphics tablet, multi-touch gestures, and mouse gesture recognition. This is computer interaction through the drawing of symbols with a pointing device cursor (see discussion at Pen computing).


Gesture Only Interfaces

The gestural equivalents of direct manipulation interfaces are those which use gesture alone. These can range from interfaces that recognize a few symbolic gestures to those that implement fully fledged sign language interpretation. Similarly, interfaces may recognize static hand poses, dynamic hand motion, or a combination of both. In all cases each gesture has an unambiguous semantic meaning associated with it that can be used in the interface. In this section we will first briefly review the technology used to capture gesture input, then describe examples from symbolic and sign language recognition. Finally, we summarize the lessons learned from these interfaces and provide some recommendations for designing gesture-only applications.

Tracking Technologies

Gesture-only interfaces with a syntax of many gestures typically require precise hand pose tracking. A common technique is to instrument the hand with a glove equipped with a number of sensors which provide information about hand position, orientation, and flex of the fingers. The first commercially available hand tracker, the DataGlove, is described in Zimmerman, Lanier, Blanchard, Bryson and Harvill (1987), and illustrated in the video by Zacharey, G. (1987). This uses thin fiber optic cables running down the back of each hand, each with a small crack in it. Light is shone down the cable, so when the fingers are bent, light leaks out through the cracks. Measuring light loss gives an accurate reading of hand pose. The DataGlove could measure each joint bend to an accuracy of 5 to 10 degrees (Wise et al. 1990), but not the sideways movement of the fingers (finger abduction). However, the CyberGlove developed by Kramer (Kramer 89) uses strain gauges placed between the fingers to measure abduction as well as more accurate bend sensing (Figure XX). Since the development of the DataGlove and CyberGlove, many other glove-based input devices have appeared, as described by Sturman and Zeltzer (1994).


Natural Gesture Only Interfaces

At the simplest level, effective gesture interfaces can be developed which respond to natural gestures, especially dynamic hand motion. An early example is the Theremin, an electronic musical instrument from the 1920s. This responds to hand position using two proximity sensors, one vertical, the other horizontal. Proximity to the vertical sensor controls the music pitch; proximity to the horizontal one controls loudness. What is remarkable is that music can be made with orthogonal control of the two prime dimensions, using a control system that provides no fixed reference points, such as frets or mechanical feedback. The hands work in extremely subtle ways to articulate steps in what is actually a continuous control space. The Theremin is successful because there is a direct mapping of hand motion to continuous feedback, enabling the user to quickly build a mental model of how to use the device.
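As a purely illustrative sketch of this kind of continuous mapping (the value ranges, method name and normalization below are invented for the example and are not taken from any Theremin implementation):

// Illustrative only: map two normalized hand-to-antenna distances (0..1)
// to pitch and loudness, in the spirit of the Theremin's control scheme.
static void MapHandsToSound( double verticalDistance, double horizontalDistance,
                             out double frequencyHz, out double volume )
{
    // closer to the vertical antenna -> higher pitch (example range 200..2000 Hz)
    frequencyHz = 200 + ( 1.0 - Clamp01( verticalDistance ) ) * 1800;

    // closer to the horizontal antenna -> quieter note
    volume = Clamp01( horizontalDistance );
}

static double Clamp01( double value )
{
    return Math.Max( 0.0, Math.Min( 1.0, value ) );
}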


Gesture Based Interaction

Figure XX: The CyberGlove

The CyberGlove captures the position and movement of the fingers and wrist. It has up to 22 sensors, including three bend sensors (including the distal joints) on each finger, four abduction sensors, plus sensors measuring thumb crossover, palm arch, wrist flexion and wrist abduction. (Photo: Virtual Technologies, Inc.)

Once hand pose data has been captured by the gloves, gestures can be recognized using a number of different techniques. Neural network approaches or statistical template matching are commonly used to identify static hand poses, often achieving accuracy rates of better than 95% (Väänänen and Böhm 1993); a small illustrative sketch of such template matching appears after the list of limitations below. Time dependent neural networks may also be used for dynamic gesture recognition [REF], although a more common approach is to use Hidden Markov Models. With this technique Kobayashi is able to achieve an accuracy of XX% (Kobayashi et al. 1997); similar results have been reported by XXXX and XXXX. Hidden Markov Models may also be used to interactively segment glove input into individual gestures for recognition and to perform online learning of new gestures (Lee 1996). In these cases gestures are typically recognized using pre-trained templates; however, gloves can also be used to identify natural or untrained gestures. Wexelblat uses a top-down and bottom-up approach to recognize natural gestural features such as finger curvature and hand orientation, and temporal integration to produce frames describing complete gestures (Wexelblat 1995). These frames can then be passed to higher level functions for further interpretation.

Although instrumented gloves provide very accurate results, they are expensive and encumbering. Computer vision techniques can also be used for gesture recognition, overcoming some of these limitations. A good review of vision based gesture recognition is provided by Pavlovic et al. (1995). In general, vision based systems are more natural to use than glove interfaces, and are capable of excellent hand and body tracking, but do not provide the same accuracy in pose determination. However, for many applications this may not be important. Sturman and Zeltzer point out the following limitations for image based visual tracking of the hands (Sturman and Zeltzer 1994):

The resolution of video cameras is too low to both resolve the fingers easily and cover the field of view encompassed by broad hand motions.

The 30- or 60-frame-per-second conventional video technology is insufficient to capture rapid hand motion.

Fingers are difficult to track as they occlude each other and are occluded by the hand.
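The following is a minimal, self-contained sketch of the statistical template matching idea mentioned above. The pose templates, angle values and class names are invented for illustration only; the cited systems use much richer feature sets and more sophisticated matching.

using System;

class PoseTemplateMatcher
{
    // each template is a vector of joint-bend angles (degrees), thumb to little finger
    static readonly double[][] templates =
    {
        new double[] { 10, 10, 10, 10, 10 },   // open hand: all fingers straight
        new double[] { 80, 85, 85, 85, 80 },   // fist: all fingers bent
        new double[] { 15, 10, 85, 85, 85 }    // pointing: index straight, others bent
    };
    static readonly string[] names = { "open hand", "fist", "pointing" };

    // assign a sample to the template with the smallest squared Euclidean distance
    static string Classify( double[] sample )
    {
        int best = 0;
        double bestDistance = double.MaxValue;

        for ( int t = 0; t < templates.Length; t++ )
        {
            double distance = 0;
            for ( int i = 0; i < sample.Length; i++ )
            {
                double d = sample[i] - templates[t][i];
                distance += d * d;
            }
            if ( distance < bestDistance )
            {
                bestDistance = distance;
                best = t;
            }
        }
        return names[best];
    }

    static void Main( )
    {
        // a noisy sample close to the "pointing" template
        Console.WriteLine( Classify( new double[] { 12, 8, 80, 90, 82 } ) );
    }
}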

There are two different approaches to vision based gesture recognition: model based techniques, which try to create a three-dimensional model of the user's hand and use this for recognition, and image based techniques, which calculate recognition features directly from the hand image. Rehg and Kanade (1994) describe a vision-based approach that uses a stereo camera to create a cylindrical model of the hand. They use finger tips and joint links as features to align the cylindrical components of the model. Etoh, Tomono and Kishino (1991) report similar work, while Lee and Kunii use kinematic constraints to improve the model matching and recognize 16 gestures with XX% accuracy (1993). Image based methods typically segment flesh tones from the background images to find hands and then try to extract features such as fingertips, hand edges, or gross hand geometry for use in gesture recognition. Using only a coarse description of hand shape and a hidden Markov model, Starner and Pentland are able to recognize 42 American Sign Language gestures with 99% accuracy (1995). In contrast, Martin and Crowley calculate the principal components of gestural images and use these to search the gesture space to match the target gestures (1997).
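As a small illustration of the flesh-tone segmentation step mentioned above, one simple and widely used RGB rule of thumb for marking likely skin pixels is sketched below. This is a generic heuristic, not the method used by any of the systems cited, and it relies on System.Math.

// Generic rule-of-thumb skin test for one 24 bpp RGB pixel
static bool IsSkin( byte r, byte g, byte b )
{
    int max = Math.Max( r, Math.Max( g, b ) );
    int min = Math.Min( r, Math.Min( g, b ) );

    // bright enough, sufficiently saturated, and red-dominant
    return ( r > 95 ) && ( g > 40 ) && ( b > 20 ) &&
           ( max - min > 15 ) && ( Math.Abs( r - g ) > 15 ) &&
           ( r > g ) && ( r > b );
}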


1.3 CLASSIFICATION OF GESTURES

Gestures can be static (the user assumes a certain pose or configuration) or dynamic (with prestroke, stroke, and poststroke phases).

Some gestures have both static and dynamic elements, as in sign languages. The automatic recognition of natural continuous gestures requires their temporal segmentation. The start and end points of a gesture, in terms of the frames of movement, both in time and in space, have to be specified. Sometimes a gesture is also affected by the context of preceding as well as following gestures. Moreover, gestures are often language- and culture-specific.

Gestures can broadly be of the following types:

Hand and arm gestures: recognition of hand poses, sign languages, and entertainment applications (allowing children to play and interact in virtual environments).

Head and face gestures: some examples are a) nodding or shaking of the head; b) direction of eye gaze; c) raising the eyebrows; d) opening the mouth to speak; e) winking; f) flaring the nostrils; and g) looks of surprise, happiness, disgust, fear, anger, sadness, contempt, etc.

Body gestures: involvement of full body motion, as in a) tracking movements of two people interacting outdoors; b) analyzing movements of a dancer for generating matching music and graphics; and c) recognizing human gaits for medical rehabilitation and athletic training.

There are many classifications of gestures, such as:

Intransitive gestures: "the ones that have a universal language value, especially for the expression of affective and aesthetic ideas. Such gestures can be indicative, exhortative, imperative, rejective, etc."

Transitive gestures: "the ones that are part of an uninterrupted sequence of interconnected structured hand movements that are adapted in time and space, with the aim of completing a program, such as prehension."

The classification can also be based on a gesture's function:

Semiotic – to communicate meaningful information.

Ergotic – to manipulate the environment.

Epistemic – to discover the environment through tactile experience.

The different gestural devices can also be classified as haptic or non-haptic (haptic means relative to contact).

Typically, the meaning of a gesture can be dependent on the following:

spatial information: where it occurs;

pathic information: the path it takes;

symbolic information: the sign it makes;

affective information: its emotional quality.


Uses

Gesture recognition is useful for processing information from humans which is not conveyed through speech or typing. There are various types of gestures which can be identified by computers.

Sign language recognition. Just as speech recognition can transcribe speech to text, certain types of gesture recognition software can transcribe the symbols represented through sign language into text.

Socially assistive robotics. By using proper sensors (accelerometers and gyros) worn on the body of a patient and by reading the values from those sensors, robots can assist in patient rehabilitation. The best example is stroke rehabilitation.

Directional indication through pointing. Pointing has a very specific purpose in our society: to reference an object or location based on its position relative to ourselves. The use of gesture recognition to determine where a person is pointing is useful for identifying the context of statements or instructions. This application is of particular interest in the field of robotics.

Control through facial gestures. Controlling a computer through facial gestures is a useful application of gesture recognition for users who may not physically be able to use a mouse or keyboard. Eye tracking in particular may be of use for controlling cursor motion or focusing on elements of a display.


Alternative computer interfaces. Foregoing the traditional keyboard and mouse setup to interact with a computer, strong gesture recognition could allow users to accomplish frequent or common tasks using hand or face gestures to a camera.

Immersive game technology. Gestures can be used to control interactions within video games to try and make the game player's experience more interactive or immersive.

Virtual controllers. For systems where the act of finding or acquiring a physical controller could require too much time, gestures can be used as an alternative control mechanism. Controlling secondary devices in a car, or controlling a television set, are examples of such usage.

Affective computing. In affective computing, gesture recognition is used in the process of identifying emotional expression through computer systems.

Remote control. Through the use of gesture recognition, "remote control with the wave of a hand" of various devices is possible. The signal must not only indicate the desired response, but also which device is to be controlled.


Input devices

The ability to track a person's movements and determine what gestures they may be performing can be achieved through various tools. Although a large amount of research has been done in image/video based gesture recognition, there is some variation in the tools and environments used between implementations.

Depth-aware cameras. Using specialized cameras such as time-of-flight cameras, one can generate a depth map of what is being seen through the camera at a short range, and use this data to approximate a 3D representation of what is being seen. These can be effective for detection of hand gestures due to their short range capabilities.

Stereo cameras. Using two cameras whose relations to one another are known, a 3D representation can be approximated by the output of the cameras. To get the cameras' relations, one can use a positioning reference such as a lexian-stripe or infrared emitters. In combination with direct motion measurement (6D-Vision), gestures can be detected directly.

Controller-based gestures. These controllers act as an extension of the body so that when gestures are performed, some of their motion can be conveniently captured by software. Mouse gestures are one such example, where the motion of the mouse is correlated to a symbol being drawn by a person's hand, as is the Wii Remote, which can study changes in acceleration over time to represent gestures (a small illustrative sketch of this idea follows this list).

Single camera. A normal camera can be used for gesture recognition where the resources/environment would not be convenient for other forms of image-based recognition. Although not necessarily as effective as stereo or depth-aware cameras, using a single camera allows a greater possibility of accessibility to a wider audience.
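Below is a purely illustrative sketch of the controller-based idea: detecting a "shake" gesture from a buffer of accelerometer samples by counting reversals of strong acceleration. The method name, thresholds and sample layout are assumptions made for this example and are not taken from any particular controller API.

// Illustrative only: count reversals of strong acceleration along one axis;
// several rapid reversals within the sampled window look like a shake.
static bool IsShake( double[] accelerationSamples, double threshold )
{
    int directionChanges = 0;
    double previousStrong = 0;

    foreach ( double a in accelerationSamples )
    {
        if ( Math.Abs( a ) < threshold )
            continue;                    // ignore weak movement

        if ( ( previousStrong != 0 ) && ( Math.Sign( a ) != Math.Sign( previousStrong ) ) )
            directionChanges++;          // acceleration reversed direction

        previousStrong = a;
    }

    return directionChanges >= 3;
}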


Here we can see that the user action is captured by a camera and the image input is fed into the gesture recognition system, in which it is processed and compared efficiently with the help of an algorithm. The virtual object or the 3-D model is then updated accordingly, and the user interfaces with the machine with the help of a user interface display.


DESIGN

The operation of the system proceeds in four basic steps:

1. Image input

2. Background subtraction

3. Image processing and data extraction

4. Hand gesture recognition, with the output displayed at the user interface

Image input

To input image data into the system, an IndyCam or videocam can be used with an image capture program to take the pictures. The camera should first take a background image, and then take subsequent images of a person. A basic assumption of the system is that these images are fairly standard: the image is assumed to be of a person's upper body, facing forward, with only one arm outstretched to a particular side.
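One possible way to feed frames into the system is the AForge.Video library (part of the same AForge.NET framework used later in this document). The report does not specify which capture program was used, so the setup below is only an assumed sketch, not the system's actual code.

using System.Drawing;
using AForge.Video;
using AForge.Video.DirectShow;

// enumerate local capture devices and open the first one
FilterInfoCollection videoDevices = new FilterInfoCollection( FilterCategory.VideoInputDevice );
VideoCaptureDevice videoSource = new VideoCaptureDevice( videoDevices[0].MonikerString );

// each new frame arrives through the NewFrame event
videoSource.NewFrame += delegate( object sender, NewFrameEventArgs eventArgs )
{
    // clone the frame, since the event argument's bitmap is reused by the video source
    Bitmap frame = (Bitmap) eventArgs.Frame.Clone( );
    // ... pass the frame to the background subtraction step described below
};

videoSource.Start( );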


Background Subtraction

Once images are taken, the system performs a background subtraction of the image to isolate the person and create a mask. The background subtraction proceeds in two steps. First, each pixel from the background image is channel-wise subtracted from the corresponding pixel of the foreground image. The resulting channel differences are summed, and if they are above a threshold, the corresponding pixel of the mask is set white; otherwise it is set black.
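The same two-step logic can be written out by hand. The sketch below only illustrates the description above (the frame layout, method name and threshold are assumptions); the actual system uses the AForge.NET filters shown next.

// Illustrative only: build a black-and-white mask by channel-wise subtraction.
// Frames are assumed to be 24 bpp RGB data stored as consecutive B, G, R bytes.
static byte[] BuildMask( byte[] background, byte[] foreground, int pixelCount, int threshold )
{
    byte[] mask = new byte[pixelCount];

    for ( int i = 0; i < pixelCount; i++ )
    {
        int offset = i * 3;

        // sum of absolute per-channel differences for this pixel
        int diff = Math.Abs( foreground[offset]     - background[offset] ) +
                   Math.Abs( foreground[offset + 1] - background[offset + 1] ) +
                   Math.Abs( foreground[offset + 2] - background[offset + 2] );

        // white where the change is significant, black elsewhere
        mask[i] = ( diff > threshold ) ? (byte) 255 : (byte) 0;
    }

    return mask;
}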

CODE FOR CAPTURING THE BACKGROUND IMAGE:

// check background frame
if ( backgroundFrame == null )
{
    // save image dimension
    width = image.Width;
    height = image.Height;
    frameSize = width * height;

    // create initial background image
    backgroundFrame = grayscaleFilter.Apply( image );

    return;
}


When we have two images, the background and the image with an object, we could use the Difference filter to get the difference image:

CODE FOR COMPUTING THE DIFFERENCE IMAGE:

// apply the grayscale filter
Bitmap currentFrame = grayscaleFilter.Apply( image );

// set background frame as an overlay for difference filter
differenceFilter.OverlayImage = backgroundFrame;

// apply difference filter
Bitmap motionObjectsImage = differenceFilter.Apply( currentFrame );

On the difference image, it is possible to see the absolute difference between two images – whiter areas show the areas of higher difference, and black areas show the areas of no difference. The next two steps are:

1. Threshold the difference image using the Threshold filter, so each pixel may be classified as a significant change (most probably caused by a moving object) or as a non-significant change.

2. Remove noise from the thresholded difference image using the Opening filter. After this step, stand-alone pixels, which could be caused by camera noise and other factors, will be removed, so we'll have an image which depicts only the more or less significant areas of changes (motion areas).
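For illustration, these two steps correspond to the same calls used later in this document. The filter objects are assumed to be created elsewhere, for example thresholdFilter = new Threshold( 15 ) and openingFilter = new Opening( ); the threshold value here is only an example.

// classify pixels as significant / non-significant change
thresholdFilter.ApplyInPlace( motionObjectsImage );

// remove stand-alone noisy pixels
openingFilter.ApplyInPlace( motionObjectsImage );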


Image processing and data extraction

The object's image we got as an example represents a quite recognizable human body demonstrating a hands gesture. But before we get such an image in our video stream, we will receive a lot of other frames which may contain many other objects that are far from being human bodies. Such objects could be anything else moving across the scene, or even noise considerably bigger than the noise we filtered out before. To get rid of false objects, let's go through all the objects in the image and check their sizes. To do this, we are going to use the BlobCounter class:

// process blobs
blobCounter.ProcessImage( motionObjectsData );
Blob[] blobs = blobCounter.GetObjectInformation( );

int maxSize = 0;
Blob maxObject = new Blob( 0, new Rectangle( 0, 0, 0, 0 ) );

// find the biggest blob
if ( blobs != null )
{
    foreach ( Blob blob in blobs )
    {
        int blobSize = blob.Rectangle.Width * blob.Rectangle.Height;

        if ( blobSize > maxSize )
        {
            maxSize = blobSize;
            maxObject = blob;
        }
    }
}

To use the information about the biggest object's size, we are going to implement an adaptive background. Suppose that, from time to time, we may have some minor changes in the scene, like minor changes of light conditions, movements of small objects, or even a small object that has appeared and stayed on the scene. To take these changes into account, we are going to have an adaptive background – we are going to change our background frame (which is initialized from the first video frame) in the direction of the changes, using the MoveTowards filter. The MoveTowards filter slightly changes an image in the direction that makes the difference with the second provided image smaller. For example, if we have a background image which contains a scene only, and an object image which contains the same scene plus an object on it, then applying the MoveTowards filter sequentially to the background image will make it the same as the object image after a while – the more we apply the MoveTowards filter to the background image, the more evident the presence of the object on it becomes (the background image becomes "closer" to the object image – the difference becomes smaller).


So, we check the size of the biggest object on the current frame and, if it is not that big, we consider the object as not significant, and we just update our background frame to adapt to the changes:

// if we have only small objects then let's adapt to changes in the scene
if ( ( maxObject.Rectangle.Width < 20 ) || ( maxObject.Rectangle.Height < 20 ) )
{
    // move background towards current frame
    moveTowardsFilter.OverlayImage = currentFrame;
    moveTowardsFilter.ApplyInPlace( backgroundFrame );
}

The second usage of the maximum object's size is to find an object which is significant enough to potentially be a human body. To save CPU time, our hands gesture recognition algorithm is not going to analyze every object that happens to be the biggest on the current frame, but only objects which satisfy certain requirements:

if ( ( maxObject.Rectangle.Width >= minBodyWidth ) &&
     ( maxObject.Rectangle.Height >= minBodyHeight ) &&
     ( previousFrame != null ) )
{
    // do further processing of the frame
}

Now we have an image which contains a moving object, and the object's size is quite reasonable, so it could potentially be a human body. But we are still not ready to pass the image to the hands gesture recognition module for further processing.

Yes, we've detected a quite big object, which may be a human body demonstrating some gesture. But what if the object is still moving? What if it has not stopped yet and is not yet ready to show the real gesture it would like to make? Do we really want to pass all these frames to the hands gesture recognition module while the object is still in motion, loading our CPU with more computations? More than that, since the object is still in motion, we may even detect a gesture which is not the one the object intends to demonstrate. So, let's not hurry with gesture recognition yet.

After we detect an object which is a candidate for further processing, we would like to give it a chance to stop for a while and demonstrate something – a gesture. If the object is constantly moving, it does not want to demonstrate anything, so we can skip its processing. To catch the moment when the object has stopped, we are going to use another motion detector, which is based on the difference between frames. The motion detector checks the amount of change between two consecutive video frames (the current and the previous one) and, depending on this, decides whether motion is detected or not. But in this particular case, we are interested not in motion detection, but in detecting the absence of motion.


// check motion level between frames
differenceFilter.OverlayImage = previousFrame;

// apply difference filter
Bitmap betweenFramesMotion = differenceFilter.Apply( currentFrame );

// lock image with between frames motion for faster further processing
BitmapData betweenFramesMotionData = betweenFramesMotion.LockBits(
    new Rectangle( 0, 0, width, height ),
    ImageLockMode.ReadWrite, PixelFormat.Format8bppIndexed );

// apply threshold filter
thresholdFilter.ApplyInPlace( betweenFramesMotionData );

// apply opening filter to remove noise
openingFilter.ApplyInPlace( betweenFramesMotionData );

// calculate amount of changed pixels
VerticalIntensityStatistics vis = new VerticalIntensityStatistics( betweenFramesMotionData );

int[] histogram = vis.Gray.Values;
int changedPixels = 0;

for ( int i = 0, n = histogram.Length; i < n; i++ )
{
    changedPixels += histogram[i] / 255;
}

// free temporary image
betweenFramesMotion.UnlockBits( betweenFramesMotionData );
betweenFramesMotion.Dispose( );

// check motion level
if ( (double) changedPixels / frameSize <= motionLimit )
{
    framesWithoutMotion++;
}
else
{
    // reset counters
    framesWithoutMotion = 0;
    framesWithoutGestureChange = 0;
    notDetected = true;
}

As can be seen from the code above, the difference between frames is checked by analyzing the changedPixels variable, which is used to calculate the amount of change as a fraction of the frame size; the value is then compared with the configured motion limit to check whether we have motion or not. But, as can also be seen from the code above, we don't call the gesture recognition routine immediately after we detect that there is no motion. Instead, we keep a counter which counts the number of consecutive frames without motion. Only when the number of consecutive frames without motion reaches a certain value do we finally pass the object to the hands gesture recognition module.


// check if we don't have motion for a while
if ( framesWithoutMotion >= minFramesWithoutMotion )
{
    if ( notDetected )
    {
        // extract the biggest blob
        blobCounter.ExtractBlobsImage( motionObjectsData, maxObject );

        // recognize gesture from the image
        Gesture gesture = gestureRecognizer.Recognize( maxObject.Image, true );
        maxObject.Image.Dispose( );
    }
}

One more comment before we move to the hands gesture recognition discussion. To make sure we don't have false gesture recognition, we do an additional check – we check that the same gesture can be recognized on several consecutive frames. This additional check makes sure that the object we've detected really demonstrates one gesture for a while and that the gesture recognition module provides an accurate result.

// check if gesture has changed since the previous frame
if ( ( gesture.LeftHand == previousGesture.LeftHand ) &&
     ( gesture.RightHand == previousGesture.RightHand ) )
{
    framesWithoutGestureChange++;
}
else
{
    framesWithoutGestureChange = 0;
}

// check if gesture was not changing for a while
if ( framesWithoutGestureChange >= minFramesWithoutGestureChange )
{
    if ( GestureDetected != null )
    {
        GestureDetected( this, gesture );
    }
    notDetected = false;
}

previousGesture = gesture;
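As a hypothetical usage sketch: the detector variable, its class name and the delegate signature below are assumptions based only on the GestureDetected call in the code above; they are not definitions taken from the article. An application that wants to react to stable, recognized gestures could simply subscribe to the event:

// subscribe to the event raised by the code above (names assumed)
detector.GestureDetected += OnGestureDetected;

private void OnGestureDetected( object sender, Gesture gesture )
{
    // react to the stable, recognized gesture
    Console.WriteLine( "Left hand: {0}, right hand: {1}",
                       gesture.LeftHand, gesture.RightHand );
}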


Hands Gesture Recognition

Now, when we have detected an object to process, we can analyze it, trying to recognize a hands gesture. The hands gesture recognition algorithm described below assumes that the target object occupies the entire image, rather than only part of it.

The idea of our hands gesture recognition algorithm is quite simple: it is based entirely on histograms and statistics, not on things like pattern recognition or neural networks. This makes the algorithm quite easy to implement and to understand.

The core idea of this algorithm is based on analyzing two kinds of object histograms – horizontal and vertical histograms, which can be calculated using the HorizontalIntensityStatistics and VerticalIntensityStatistics classes.
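For illustration, the two histograms can be obtained as shown below; bodyImageData is assumed to hold the extracted object's image, as in the code later in this section.

// horizontal histogram: one value per image column
HorizontalIntensityStatistics his = new HorizontalIntensityStatistics( bodyImageData );
int[] horizontalHistogram = his.Gray.Values;

// vertical histogram: one value per image row
VerticalIntensityStatistics vis = new VerticalIntensityStatistics( bodyImageData );
int[] verticalHistogram = vis.Gray.Values;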


We are going to start the hands gesture recognition by utilizing the horizontal histogram, since it looks more useful for the first step. The first thing we are going to do is find the areas of the image which are occupied by the hands, and the area which is occupied by the torso.

Let's take a closer look at the horizontal histogram. As can be seen from the histogram, the hands' areas have relatively small values, while the torso area is represented by a peak of high values. Taking into account some simple relative proportions of the human body, we may say that the thickness of a human hand can never exceed 30% of the human body height (30% is quite a big value, but let's take it for safety and as an example). So, applying a simple threshold to the horizontal histogram, we can easily classify the hand areas and the torso area:

// get statistics about horizontal pixels distribution
HorizontalIntensityStatistics his = new HorizontalIntensityStatistics( bodyImageData );
int[] hisValues = (int[]) his.Gray.Values.Clone( );

// build map of hands (0) and torso (1)
double torsoLimit = torsoCoefficient * bodyHeight;
// torsoCoefficient = 0.3

for ( int i = 0; i < bodyWidth; i++ )
{
    hisValues[i] = ( (double) hisValues[i] / 255 > torsoLimit ) ? 1 : 0;
}

From the thresholded horizontal histogram, we can easily calculate the hands' lengths and the torso's width – the length of the right hand is the width of the empty area on the histogram from the right, the length of the left hand is the width of the empty area from the left, and the torso's width is the width of the area between the two empty areas:

// get hands' length (bounds checked first to avoid reading past the array)
int leftHand = 0;
while ( ( leftHand < bodyWidth ) && ( hisValues[leftHand] == 0 ) )
    leftHand++;

int rightHand = bodyWidth - 1;
while ( ( rightHand > 0 ) && ( hisValues[rightHand] == 0 ) )
    rightHand--;
rightHand = bodyWidth - ( rightHand + 1 );

// get torso's width
int torsoWidth = bodyWidth - leftHand - rightHand;

Now, when we have the hand lengths and the torso's width, we can determine whether each hand is raised or not. For each hand, the algorithm tries to detect whether the hand is not raised, raised diagonally down, raised straight, or raised diagonally up. All four possible positions are demonstrated on the image below, in the order they were listed above:

To check if a hand is raised or not, we are going to use some statistical assumptions about body proportions again. If the hand is not raised, its width on the horizontal histogram will not exceed 30% of the torso’s width, for example. Otherwise, it is raised somehow.

// process left hand
if ( ( (double) leftHand / torsoWidth ) >= handsMinProportion )
{
    // hand is raised
}
else
{
    // hand is not raised
}

So far, we are able to recognize one hand position – when the hand is not raised. Now we need to complete the algorithm by recognizing the exact hand position when it is raised. To do this, we'll use the VerticalIntensityStatistics class, which was mentioned before. But now the class will be applied not to the entire object's image, but only to the hand's image:

// extract left hand's image
Crop cropFilter = new Crop( new Rectangle( 0, 0, leftHand, bodyHeight ) );
Bitmap handImage = cropFilter.Apply( bodyImageData );

// get vertical intensity statistics of the hand
VerticalIntensityStatistics stat = new VerticalIntensityStatistics( handImage );


The image above contains good samples, and using the above histograms it is quite easy to recognize the gesture. But in some cases we may not have such clear histograms, only noisy ones, which may be caused by light conditions and shadows. So, before making any final decision about the raised hand, let's perform two small preprocessing steps on the vertical histogram. These two additional steps are quite simple, so their code is not provided here, but it can be retrieved from the files included in the article's source code.

1. First of all, we need to remove low values from the histogram which are lower than, for example, 10% of the histogram's maximum value. The image below demonstrates a hand's image which contains some artifacts caused by shadows. Such artifacts can easily be removed by filtering low values on the histogram, which is also demonstrated on the image below (the histogram is already filtered).

2. Another type of issue we need to take care of is a "twin" hand, which is actually a shadow. This can easily be solved by walking through the histogram and removing all peaks which are not the highest peak.
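Since the report notes that the code for these two steps is omitted, the sketch below shows only one way they could be written; it is not the article's actual implementation. It works on the raw int[] of histogram values (e.g. a clone of stat.Gray.Values, here called values) and uses System.Array, while the article's own checks use the Histogram object's Min and Max properties.

// 1. remove values below 10% of the histogram's maximum
int max = 0;
for ( int i = 0; i < values.Length; i++ )
{
    if ( values[i] > max )
        max = values[i];
}
int cutOff = max / 10;
for ( int i = 0; i < values.Length; i++ )
{
    if ( values[i] < cutOff )
        values[i] = 0;
}

// 2. keep only the peak containing the global maximum,
//    removing "twin" peaks caused by shadows
int peakIndex = Array.IndexOf( values, max );

int start = peakIndex;
while ( ( start > 0 ) && ( values[start - 1] != 0 ) )
    start--;

int end = peakIndex;
while ( ( end < values.Length - 1 ) && ( values[end + 1] != 0 ) )
    end++;

for ( int i = 0; i < values.Length; i++ )
{
    if ( ( i < start ) || ( i > end ) )
        values[i] = 0;
}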


At this point, we should have clear vertical histograms, like the ones we've seen before, so now we are a few steps away from recognizing the hands gesture.

Let’s start by recognizing a straight-raised hand first. If we take a look at the image of a straight hand, then we could make one more assumption about body proportions – the length of the hand is much bigger than its width. In the case of a straight-raised hand, its histogram should have a quite high, but thin peak. So, let’s use these properties to check if the hand is raised straight:

if ( ( (double) handImage.Width / ( histogram.Max - histogram.Min + 1 ) ) >
       minStraightHandProportion )
{
    handPosition = HandPosition.RaisedStraight;
}
else
{
    // processing of diagonally raised hand
}

(Note: The Min and Max properties of the Histogram class return the minimum and maximum values with non-zero probability. In the above sample code, these values are used to calculate the width of the histogram area occupied by the hand. See the documentation for the AForge.Math namespace.)

Now, we need to make the last check to determine if the hand is raised diagonally up or diagonally down. As we can see from histograms of diagonally raised up/down hands, the peak for the diagonally up hand is shifted to the beginning of the histogram (to the top, in the case of a vertical histogram), but the peak of the diagonally down hand is shifted more to the center. Again, we can use this property to check the exact type of the raised hand:

if ( ( (double) histogram.Min / ( histogram.Max - histogram.Min + 1 ) ) <
       maxRaisedUpHandProportion )
{
    handPosition = HandPosition.RaisedDiagonallyUp;
}
else
{
    handPosition = HandPosition.RaisedDiagonallyDown;
}


Conclusion

We now have algorithms which, first of all, allow us to extract moving objects from a video feed and, second, to successfully recognize hands gestures demonstrated by the object. The recognition algorithm is very simple and easy both to implement and to understand. Also, since it is based only on information from histograms, it is quite efficient and does not require a lot of computational resources, which is important in cases where we need to process many frames per second.

To make the algorithms easy to understand, we've used generic image processing routines from the AForge.Imaging library, which is a part of the AForge.NET framework. This means that by going from generic routines to specialized ones (routines which may combine several steps in one), it is possible to improve the performance of these algorithms even further.

FURTHER IMPROVEMENTS

1. More robust recognition in the case of hand shadows on walls;

2. Handling of dynamic scenes, where different kinds of motions may occur behind the main object.