
Using Neural Networks and 3D sensors data to model LIBRAS gestures recognition - KDMile 2014

Dec 20, 2014


Paper entitled "Using Neural Networks and 3D sensors data to model LIBRAS gestures recognition", presented at the II Symposium on Knowledge Discovery, Mining and Learning – KDMILE, USP, São Carlos, SP, Brazil.
Transcript
Page 1

Using Neural Networks and 3D sensors data to model LIBRAS gestures recognition

Gabriel S. P. Moreira - Gustavo R. Matuck - Osamu Saotome - Adilson M. da Cunha

ITA – Brazilian Aeronautics Institute of Technology

Page 2

Introduction

1,799,885 Brazilian citizens have great difficulty hearing (2010 Census)

Brazilian Sign Language (LIBRAS) is the sign language used by the majority of deaf people in Brazil

Recognized as Brazil's second official language by law 10.436 (2002) and decree 5.626 (2005)

This research is an initial investigation of LIBRAS recognition using 3D sensors and neural networks

Page 3

Related Research

Main approaches to hand tracking systems:

Data-glove-based - electromechanical gloves with sensors

Vision-based - continuous image frames from regular RGB or 3D sensors

Main machine learning techniques:

Artificial Neural Networks (ANN)

Hidden Markov Models (HMM)

Use of HMMs to recognize 47 LIBRAS gesture types, captured with regular RGB cameras [Souza et al. 2007].

A 3D hand and gesture recognition system using the Microsoft Kinect sensor for dynamic gestures and an HD color sensor for static gesture recognition [Caputo et al. 2012]

A real-time system to recognize a set of LIBRAS alphabet signs (static gestures only), using the MS Kinect and neural networks [Anjo et al. 2012].

Page 4

LIBRAS alphabet

All alphabet signs are made with just one hand.

20 (twenty) static gestures and 6 (six) dynamic gestures (H, J, K, X, Y, Z) executed with movement.

Page 5

Case Study

Six (6) people volunteered to record the signs of the complete LIBRAS alphabet, divided into three groups: deaf people (2), LIBRAS teachers (2), and students (2)

Figure: LIBRAS recognition process

Page 6

Case Study – Data Acquisition

Gestures recorded using:

COTS 3D sensor Creative Senz3D GestureCam™ [Creative 2013]

Intel® Skeletal Hand Tracking Library (Experimental Release) [Intel 2013]

In this investigation, for each frame, the absolute 3D coordinates (X, Y, and Z) of the 5 fingertips and of the center of the hand palm were recorded, based on the distance from the sensor. That resulted in 18 attributes describing the continuous position of the hand and fingers.
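To make the feature layout concrete, the following is a minimal sketch (not the authors' code) of how one frame's 18 attributes might be assembled from the 6 tracked points; the function name and coordinate values are illustrative assumptions.

```python
# Minimal sketch (assumed layout): one frame's 18 attributes are the absolute
# (X, Y, Z) coordinates of the 5 fingertips plus the palm center:
# 6 points x 3 coordinates = 18 values.
import numpy as np

def frame_attributes(fingertips, palm_center):
    """fingertips: five (x, y, z) tuples; palm_center: one (x, y, z) tuple."""
    points = np.asarray(list(fingertips) + [palm_center], dtype=float)  # (6, 3)
    return points.reshape(-1)  # flat vector of 18 attributes

# Example with made-up coordinates (distances from the sensor):
tips = [(0.01, 0.12, 0.45), (0.03, 0.14, 0.44), (0.05, 0.15, 0.44),
        (0.07, 0.14, 0.45), (0.09, 0.12, 0.46)]
palm = (0.05, 0.08, 0.47)
print(frame_attributes(tips, palm).shape)  # (18,)
```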

Figures: Creative Senz3D GestureCam; hand tracking

Page 7

Case Study – Pre-Processing (1/2)

Normalization to Hand-Relative Coordinates – absolute fingertip coordinates were converted to coordinates relative to the hand's center;

Training Sample Transformations – there were only 11 samples for each alphabet letter, so a strategy was created and implemented to generate new samples from the training samples via rotation and scaling geometric transformations (see the sketch below).
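A minimal sketch of both pre-processing steps, under stated assumptions (the palm center stored as the last of the six points, rotation about the Z axis, and illustrative angle/scale parameters the paper does not specify):

```python
# Minimal sketch, not the paper's implementation.
import numpy as np

def to_hand_relative(frame18):
    """Convert 18 absolute values (5 fingertips + palm center, each x/y/z)
    into coordinates relative to the palm center."""
    points = np.asarray(frame18, dtype=float).reshape(6, 3)
    palm = points[5]                    # assumption: palm center is last
    return (points - palm).reshape(-1)

def augment(frame18, angle_deg=10.0, scale=1.1):
    """Generate a new sample by rotating the hand-relative points around
    the Z axis and rescaling them (parameters are illustrative)."""
    points = np.asarray(frame18, dtype=float).reshape(6, 3)
    a = np.radians(angle_deg)
    rot_z = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [np.sin(a),  np.cos(a), 0.0],
                      [0.0,        0.0,       1.0]])
    return (scale * points @ rot_z.T).reshape(-1)
```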

Page 8

Case Study – Pre-Processing (2/2)

Frame Sampling – Because the number of frames varied with gesture duration and camera FPS rate, this process selected 18 equidistant frames for each gesture, each frame containing all fingertip 3D coordinates at a given point in time.
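A minimal sketch of this sampling step, assuming each gesture arrives as an array of per-frame attribute vectors; picking 18 equidistant indices is one straightforward reading of the description above.

```python
# Minimal sketch: reduce a variable-length gesture to 18 equidistant frames.
import numpy as np

def sample_frames(gesture_frames, n_samples=18):
    """gesture_frames: array of shape (n_frames, 18) -> (n_samples, 18)."""
    frames = np.asarray(gesture_frames)
    idx = np.linspace(0, len(frames) - 1, n_samples).round().astype(int)
    return frames[idx]
```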

Page 9

Case Study – Neural Network Model

Multi-Layer Perceptron (MLP), with back-propagation learning, sigmoidal activation function, and Sum of Squared Errors (SSE) to measure learning performance.

MLP Architecture:

Input Layer: 18 relative fingertip and hand 3D coordinates × 18 sampled frames per gesture = 324 neurons

Hidden Layer: tested from 50 to 400 neurons, in steps of 50

Output Layer: 26 neurons (the alphabet letters)
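For illustration only, a comparable MLP can be sketched with scikit-learn; the paper's actual toolkit is not stated, and scikit-learn minimizes log-loss rather than the paper's SSE:

```python
# Minimal sketch of a comparable network (assumptions noted in comments).
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(200,),  # the best reported configuration: 200 neurons
    activation="logistic",      # sigmoidal activation, as in the paper
    solver="sgd",               # gradient-descent back-propagation training
    max_iter=1000,
)
# X_train: (n_samples, 324) flattened gestures; y_train: letter labels.
# mlp.fit(X_train, y_train)
```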

Page 10

Case Study – Results Analysis

Dataset split: training (64%), validation (18%), and test (18%) sets

The best MLP network configuration correctly classified 61.54% (32/52) of the data unseen during network training (the test set). Its architecture had 200 neurons in the hidden layer.

Figure: alphabet test set classification results
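As a hedged illustration of the 64/18/18 split, the sketch below uses two passes of scikit-learn's train_test_split; the placeholder data and variable names are assumptions, with 11 samples per letter as reported earlier.

```python
# Minimal sketch of a 64% / 18% / 18% train/validation/test split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(26 * 11, 324)  # placeholder: 11 samples per letter
y = np.repeat(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 11)

# First carve out the 18% test set, then take 18% of the whole dataset
# (0.18 / 0.82 of the remainder) as the validation set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.18, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.18 / 0.82, stratify=y_rest, random_state=0)
```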

Page 11

Limitations

It was difficult to find people capable of accurately recording LIBRAS gestures, especially deaf people and LIBRAS teachers

The chosen 3D sensor presented the following limitations:

Instability of finger tracking in the experimental library (Intel® Skeletal Hand Tracking Library)

Variability of the frames-per-second rate, which may have introduced a bias when comparing samples, because of the temporal nature of gestures

Page 12

Conclusion

Initial investigation of LIBRAS recognition using 3D sensor data and a Multi-Layer Perceptron.

Six volunteers (LIBRAS teachers, students, and deaf people) recorded the alphabet gestures.

Strategies were developed for pre-processing gesture data: dealing with the temporality of gestures, normalizing coordinates, and applying geometric transformations (rotation and scaling).

The best model correctly classified 61.54% of the test set patterns.

The MLP network was not as effective as expected when trained with few samples and operating on noisy gesture data

Page 13

Recommendations and Suggestions

Improvements in classification models can be evaluated using more representative samples of gesture data.

As a natural continuation of this research, still in its initial phase, the authors intend to explore other 3D sensors, pre-processing approaches, and learning models, such as Support Vector Machines and Hidden Markov Models.

Page 14

Thank you!

Gabriel S. P. Moreira - Gustavo R. Matuck - Osamu Saotome - Adilson M. da Cunha

ITA – Brazilian Aeronautics Institute of Technology

[email protected]