
CHAPTER 14

Real-Time 3D Body Pose Estimation

Michael Van den Bergh, Esther Koller-Meier, Roland Kehl
ETH Zurich, Computer Vision Laboratory, Zurich, Switzerland

Luc Van Gool
ESAT-PSI/VISICS, Katholieke Universiteit Leuven, Leuven, Belgium

Abstract

This chapter presents a novel approach to markerless real-time 3D pose estimation in a multi-camera setup. We explain how foreground-background segmentation and 3D reconstruction are used to extract a 3D hull of the user. This is done in real time using voxel carving and a fixed lookup table. The body pose is then retrieved using an example-based classifier that uses 3D Haar-like wavelet features to allow for real-time classification. Average neighborhood margin maximization (ANMM) is introduced as a powerful approach to train these Haar-like features.

Keywords: pose estimation, markerless, real time, visual hull, 3D Haar-like features, example-based classification, linear discriminant analysis, average neighborhood margin maximization

14.1 INTRODUCTION

Posture recognition has received a significant amount of attention given its importance for human–computer interfaces, teleconferencing, surveillance, safety control, animation, and several other applications. The context of this work is the CyberWalk Project [1], a virtual reality system in which the user walks on an omnidirectional treadmill, as shown in Figure 14.1, and interacts with the virtual world using body pose commands that the system detects. For this application, a markerless pose detection subsystem has to be fast and robust in detecting a predefined selection of poses.

We present an example-based technique for real-time markerless rotation-invariant pose recognition using average neighborhood margin maximization (ANMM) [2] and 3D Haar wavelet-like features [3]. (The latter will be called Haarlets for brevity.) In example-based approaches, observations are compared and matched against stored examples of human body poses. In our approach, these observations consist of 3D hulls of the user. The system makes use of a multi-camera setup, in which the cameras are placed around the user. First, foreground-background segmentation is used to extract the user from the background. Then the segmentations from the different cameras are combined to make


FIGURE 14.1

User walking on the CyberWalk omnidirectional platform.

a 3D hull reconstruction. This is done in real time using voxel carving and a fixed lookup table [4]. The camera network is distributed, as each camera is connected to a separate PC that runs the foreground–background segmentation. The segmentations are sent to a central PC that runs the hull reconstruction. The body pose is then determined from this 3D hull using an example-based classifier that employs 3D Haarlets to allow for real-time classification.

ANMM, which is based on linear discriminant analysis (LDA), is introduced as a powerful approach to train these Haarlets. Where the classic AdaBoost [5] runs into memory issues when training 3D rather than 2D Haarlets [6], the relaxed memory requirements of ANMM allow for a straightforward implementation of a 3D pose detector based on 3D Haarlets. The benefit of classifying 3D hulls rather than 2D silhouettes is that the orientation of the hulls can be normalized. Finally, we explain how an overhead tracker is used to estimate the orientation of the user, in order to normalize the orientation of the extracted hull, thus making the pose estimation system rotation invariant.

In this chapter we first give an overview of the different real-time pose estimation approaches. We also provide an overview of the different 3D hull reconstruction techniques and explain the one that we have chosen, considering the real-time nature of the system. For pose classification based on these 3D hulls, we present ANMM as a powerful new method and evaluate it against LDA. We extend ANMM to 3D and show how it can be used to train 3D Haarlets for real-time classification. The 3D approach benefits from increased robustness and the possibility of making the system rotation invariant. We show these benefits by comparing the system to the 2D case. The result is a pose estimation system with the same or better performance than the state of the art but at faster, interactive speeds.

14.2 BACKGROUND

This section provides an overview of methods to estimate body pose; they are divided into two categories: model-based and example-based. The model-based methods can also


be called tracking methods, as they track individual body parts in an articulated body model. Example-based methods do not rely on body models but match the input to a set of predefined poses.

14.2.1 Tracking

Our first choice was in favor of example-based rather than model-based (tracking) techniques. Model-based approaches typically rely on articulated 3D body models [7–11]. In order to be effective they need a high number of degrees of freedom in combination with nonlinear anatomical constraints. Consequently, they require time-consuming per-frame optimization, and the resulting trackers are too slow for real time (25 Hz). They are also very sensitive to fast motions and segmentation errors.

Most methods exploit 2D image information for tracking. However, these cues only offer weak support to the tracker, which quickly leads to sophisticated, and therefore often rather slow, optimization schemes. Multiple calibrated cameras allow for the computation of the subject’s 3D shape, which provides a strong cue for tracking because the 3D shape only contains information consistent over all individual views with respect to some hypothesis and thus discards, for example, clutter edges or spikes in the silhouettes. The increased computational power offered by cheap consumer PCs made real-time computation of the 3D shape or hull possible and created several interesting approaches to full-body tracking.

Cheung et al. [12] introduced the SPOT algorithm, a rapid voxel-based method for volumetric human reconstruction. Real-time tracking is achieved by assigning the voxels in the new frame to the closest body part of the previous one. Based on this registration, the positions of the body parts are updated over consecutive frames. However, this simple approach does not guarantee that two adjacent body parts will not drift apart, and it can easily lose track of moderately fast motions. Furthermore, to obtain good segmentation, the subject has to wear a dark suit. Cheung et al. [13] used both color information and a shape-from-silhouette method for full-body tracking, although not in real time. Colored surface points (CSPs) segment the hull into rigidly moving body parts based on the results of the previous frames, and take advantage of the constraint of equal motion of parts at their coupling joints to estimate joint positions. A complex initialization sequence recovers an actor’s joint positions, which are used to track the same actor in new video sequences.

Mikic et al. [14] proposed a similar voxel-based method for full-body tracking. After volumetric reconstruction, the different body parts are located using sequential template growing and fitting. The fitting step uses the placement of the torso computed by template growing to obtain a better starting point for the voxel labeling. Furthermore, an extended Kalman filter estimates the parameters of the model given the measurements. To achieve robust tracking, the method uses prior knowledge of average body part shapes and dimensions.

Kehl et al. [4] also proposed a markerless solution for full-body pose tracking. A model built from super-ellipsoids is fitted to a colored volumetric reconstruction using stochastic meta descent (SMD), taking advantage of the color information to overcome ambiguities caused by limbs touching each other. To increase robustness and accuracy, the tracking is refined by matching model contours against image edges. The results of this tracker


FIGURE 14.2

Full-body pose tracker (Kehl et al. [4]) in action.

are shown in Figure 14.2. Similar to the previously mentioned tracking approaches, this system is capable of tracking one frame in approximately 1.3 seconds. As the input data for a real-time system is generally recorded at 15 to 30 Hz, the tracking is too slow and as a result too sensitive to fast motions.

Tracking-based approaches suffer from a trade-off between complex, accurate tracking at about 1 Hz and faster but more inaccurate tracking. In both cases it is difficult not to lose track of the subject in an interactive system where the user walks or moves a lot. Therefore, we chose to look at example-based methods. In example-based approaches, instead of tracking articulated body models, observations are compared and matched against stored examples of human body poses. These stored examples can be 2D silhouettes or reconstructed 3D hulls.

14.2.2 Example-Based Methods

Example-based methods benefit from the fact that the set of typically interesting poses is far smaller than the set of anatomically possible ones, which is good for robustness. Because the pose is estimated on a frame-by-frame basis, it is not possible to lose track of an object. Also, not needing an explicit parametric body model makes these methods more amenable to real-time implementation and to pose analysis of structures other than human bodies, such as animals. Silhouettes (and the visual hulls derived from them) seem to capture the essence of human body poses well, as illustrated in Figure 14.3.

Compared to tracking, not many example-based pose estimation methods exist in the literature. Rosales and Sclaroff [15] trained a neural network to map example 2D silhouettes to 2D positions of body joints. Shakhnarovich et al. [16] outlined a framework for fast pose recognition using parameter-sensitive hashing. In their framework, image features such as edge maps, vector responses of filters, and edge direction histograms can be used to match silhouettes against examples in a database. Ren et al. [17] applied this parameter-sensitive hashing framework to the use of 2D Haarlets for pose recognition. The Haarlets are trained using AdaBoost.


FIGURE 14.3

(a) Input image. (b) Silhouettes extracted from the input images using foreground–background segmentation. (c) Silhouettes combined to reconstruct a 3D hull.

The primary limitation of silhouette-based approaches is that the stored silhouettes are not invariant to changes in subject orientation. A visual hull can be reconstructed using the silhouettes taken from several camera views and can then be rotated to a standard orientation before being used for training or classification. The result is a rotation-invariant system.

The example-based approach proposed by Cohen and Li [18] matches 3D hulls with an appearance-based 3D shape descriptor and a support vector machine (SVM). This method is rotation invariant but, running at 1 Hz, it is not real time. Weinland et al. [19] and Gond et al. [20] proposed similar hull-based approaches but provide no statistics concerning classification speeds. To build a 3D-hull system capable of real-time performance, we aim to combine the speed of Haarlets with the strength of ANMM.

14.3 SEGMENTATION

The first step in our system is foreground-background segmentation, which considers the difference between the observed image and a model of the background. Regions where the observed image and the background model differ significantly are defined as foreground, as illustrated in Figure 14.4. The background model is typically calculated from a set of images of the empty working volume. Background subtraction works only for static backgrounds since the same model is used for subsequent images. For our setup, static backgrounds can be assumed, except for the audience or slightly flickering light tubes.

It is essential to have a good similarity measurement for two colors. Considering either the difference or the angle between two observed color vectors is not advisable because both vectors require normalization. It makes a significant difference whether an angular difference is found for long or short signal vectors. Therefore, we use the illumination-invariant collinearity criterion proposed by Mester et al. [21]. Let x_f be the vector of all RGB values within a 3×3 neighborhood in the input image, and let x_b be the corresponding vector in the background image. Collinearity can be tested by considering the difference vectors d_f and d_b between x_f, x_b, and a common signal direction, estimating


FIGURE 14.4

Result of our illumination-invariant background subtraction method. (a) Input image. (b) Background. (c) Resulting segmentation. Note that the black loudspeaker in the background batters a hole in the foreground region. (d) Segmentation result with darkness compensation; the region in front of the black loudspeaker is now segmented correctly.

the true signal direction u (i.e., d_f = x_f − (x_f · u)u, and analogously for d_b) and then calculating the sum of the squared differences:

D^2 = |d_f|^2 + |d_b|^2    (14.1)

Minimizing D^2 estimates u and yields zero if the two vectors x_f and x_b are collinear—that is, the difference vectors and hence the sum of their norms are zero. If the two vectors are collinear, no change is judged to be present and the background is still visible. If they are not collinear, the pixels are considered to have different colors and a foreground pixel is found. However, as our observed color vectors are noisy, perfect collinearity is unlikely. Griesser et al. [22] showed that applying a static threshold T_s and an adaptive threshold T_adapt to D^2 makes the segmentation robust against noise:

D^2 ≷ T_s + T_adapt    (14.2)

where D^2 above the threshold indicates foreground and below it background.

The adaptive threshold is used to incorporate spatiotemporal considerations. Spatial compactness is induced by giving a pixel a higher chance to be foreground if several of its neighbors have this status. A sampled Markov random field (MRF) is used to enforce this spatial compactness in an iterative manner. Temporal smoothness can be achieved by using the results of the previous frame for initialization of the MRF.

This collinearity test (which is also called darkness compensation) makes the method intensity invariant and thus provides robustness against lighting changes and shadows. However, dark colors are problematic for this method, as they can be seen as a low-intensity version of any color and consequently as a match with any color. To remedy this, an additional component with a constant value O_dc is added to both vectors. This additional component renders the color similarity measure more sensitive to differences, especially when dark pixels are involved. Objects or backgrounds with dark colors can thus be


segmented as illustrated in Figure 14.4, where the region in front of the black loudspeaker in the top left of the image is now correctly segmented.

Segmentation is controlled by three user-defined parameters: the static threshold T_s, the darkness offset O_dc, and the importance factor B of the spatiotemporal compactness. First, the static threshold T_s is determined with O_dc and B set to zero. The darkness offset O_dc is then increased until a balance between appearing shadows and vanishing holes is reached. Finally, the compactness value B is increased until the foreground regions are smooth and compact.
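To make the collinearity test concrete, the following sketch evaluates D^2 (Equation 14.1) and the threshold decision (Equation 14.2) for a single pixel neighborhood. This is a minimal NumPy illustration under stated assumptions, not the chapter's implementation: the function names are ours, the adaptive MRF threshold is reduced to a scalar argument, and the iterative spatiotemporal compactness step is omitted.

```python
import numpy as np

def collinearity_d2(x_f, x_b, o_dc):
    """D^2 of Eq. 14.1 for two stacked 3x3 RGB neighborhood vectors.

    Appends the constant darkness-compensation component o_dc to both
    vectors, then measures how far they are from being collinear.
    """
    xf = np.append(np.asarray(x_f, dtype=float), o_dc)
    xb = np.append(np.asarray(x_b, dtype=float), o_dc)
    # The common direction u minimizing |d_f|^2 + |d_b|^2 is the leading
    # right singular vector of the 2-row matrix [xf; xb]; the residual
    # energy is the total energy minus the leading singular value squared.
    M = np.stack([xf, xb])
    s1 = np.linalg.svd(M, compute_uv=False)[0]
    return float((M * M).sum() - s1 * s1)

def is_foreground(x_f, x_b, t_s, t_adapt, o_dc=40.0):
    """Threshold decision of Eq. 14.2 for one pixel (the MRF term is
    collapsed into the scalar t_adapt; o_dc is an illustrative default)."""
    return collinearity_d2(x_f, x_b, o_dc) > t_s + t_adapt
```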

14.4 RECONSTRUCTION

Computing the visual hull of an object requires its silhouettes in a number of available images together with the centers of projection of the corresponding cameras. If we want to reconstruct an object, we know that it is included in the generalized cone extruded from the silhouette with its origin at the camera center. The intersection of these cones from multiple calibrated camera views yields a volume that contains the object. This principle is called shape from silhouette and produces a volume that approximates the object reasonably well if a sufficient number of cameras with different lines of sight are used. This approximated volume is called the visual hull of the object and is commonly defined as the largest possible volume that exactly explains a set of consistent silhouette images [23]. Figure 14.5 illustrates the principle for three camera views.

Note that the visual hull is never an exact representation of the object, because concave regions cannot be reconstructed from silhouettes and an infinite number of camera views would be needed to compute the exact visual hull [24]. However, our results show that even a coarse approximation of the subject’s visual hull from four to five

FIGURE 14.5

Visual hull of the object, created by the intersection of the generalized cones extruded from its silhouettes (three views shown).


views is sufficient for body pose estimation. In Shanmukh and Pujari [25], guidelines can be found for choosing an optimal camera setup for object reconstruction. Our definition of the visual hull in this chapter is limited to using a finite number of camera views.

Algorithms for shape from silhouette can be roughly divided into three groups:

Volumetric reconstruction using voxels. This technique divides the working volume into a discrete grid of smaller volumes, so-called voxels, and projects them successively onto the image planes of the available camera views. Voxels lying outside of the silhouette in at least one view do not belong to the intersection of the cones and can be discarded. Because of their simplicity, voxel-based procedures have been used for body tracking [26–30]. Their drawback is their tendency to be computationally expensive, as a high number of voxels must be projected into the image planes.

Polyhedral visual hull. This is a surface-based approach to computing the visual hull from a polygonal representation of the silhouettes, applying constructive solid geometry (CSG) to compute the intersection of the corresponding polyhedra. Real-time algorithms were proposed by Matusik et al. [31] and Franco and Boyer [32]. The polyhedral visual hull method offers better accuracy than voxel-based procedures, as it does not work on a discretized volume. Moreover, the resulting triangle mesh is perfectly suited for rendering on graphics hardware. Still, because of the complexity of the geometric calculations, these algorithms are fragile and rely on perfect silhouettes. Corrupted silhouettes often result in incomplete or corrupted surface models. In the application described in this chapter, silhouettes are often corrupted by reflections and noisy segmentation.

Space carving and photo consistency. Space carving is a volumetric reconstruction technique that uses both color consistency and silhouettes, as proposed by Kutulakos and Seitz [33] and Seitz and Dyer [34]. Voxels that are not photo-consistent across all camera views in which they are visible are carved away. Photo consistency methods often assume constant illumination and Lambertian reflectance. The reconstructed volume contains only the surface voxels and is often referred to as the photo hull. Visibility of the voxels is critical for this method and is usually solved by making multiple plane-sweep passes, each time using only the cameras in front of the plane and iterating until convergence.

Unfortunately, the complexity of this method makes it difficult to achieve real-time computation. Cheung et al. [12, 13] thus proposed a mixed approach between visual hull and photo consistency that uses the property that the bounding edge of the visual hull touches the real object in at least one point. Therefore, photo consistency has to be tested only for bounding edges of the visual hull, which can be done at moderate cost. Also unfortunately, the reconstruction is then very sparse and needs much input data to be practical.

Voxel-based shape-from-silhouette methods are popular but tend to be computationally expensive, as a high number of voxels have to be projected into the camera images. Most implementations speed up this process by using an octree representation to compute the result from coarser to finer resolutions (Szeliski [35]); others exploit hardware acceleration (Hasenfratz et al. [29]). Our method addresses the problem the other way



FIGURE 14.6

Lookup table stored at each pixel in the image with pointers to all voxels that project onto that pixel. Expensive projections of voxels can be avoided, and the algorithm can take advantage of small changes in the images by addressing only voxels whose pixel has changed.

around, as proposed by Kehl et al. [4]. Instead of projecting the voxels into the camera views at each frame, we keep a fixed lookup table (LUT) for each camera and store at each pixel a list of pointers to all voxels that project onto that particular pixel (see Figure 14.6). This way, the image coordinates of the voxels have to be neither computed during runtime nor stored in memory. Instead, the LUTs are computed once at startup. The proposed reversal of the projection allows for a compact representation of the voxels: each is represented by a bit mask where each bit b_i is 1 if its projection lies in the foreground of camera i, and 0 otherwise. Thus, a voxel belongs to the object (i.e., is labeled as active) if its bit mask contains only 1s. This can be evaluated rapidly by byte comparisons.

Another advantage of our method is that the voxel space can be updated instead of computed from scratch for each frame. A voxel only changes its label if one of the pixels it projects to changes from foreground to background or vice versa. Therefore, as we can directly map from image pixels to voxels, we only have to look up the voxels linked to those pixels that have changed their foreground-background status. This leads to far fewer voxel lookups compared to standard methods, where for each frame all voxels have to be visited in order to determine their labels. The reconstruction itself is done pixel by pixel through all segmented (binary) images. If a pixel of the current view i has changed its value compared to the previous frame, the corresponding bit b_i for all voxels contained in the reference list of this pixel is set to the new value and these voxels’ labels are determined again. Results of our reconstruction algorithm can be seen in Figure 14.7. With this approach, the reconstruction of a hull from six cameras takes about 15 ms.
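In code, the incremental update amounts to a few operations per changed pixel. The sketch below is a simplified rendering of the idea, assuming at most eight cameras (one bit each in a uint8 mask) and assuming luts[c][p] was filled at startup by projecting every voxel into camera c; the class and method names are ours, and the real system evaluates the bit masks by byte comparisons rather than per-voxel NumPy operations.

```python
import numpy as np

class VoxelCarver:
    """Incremental visual-hull labeling with per-pixel voxel LUTs.

    luts[c][p] is the precomputed array of voxel indices that project
    onto pixel p of camera c, built once at startup from the camera
    calibration (the projection step itself is not shown here).
    """

    def __init__(self, luts, num_voxels, num_cams):
        self.luts = luts
        self.masks = np.zeros(num_voxels, dtype=np.uint8)  # one bit per camera
        self.full = np.uint8((1 << num_cams) - 1)          # all bits set
        self.prev = [None] * num_cams

    def update(self, cam, silhouette):
        """Apply a new binary silhouette for one camera, touching only
        the voxels linked to pixels whose status changed."""
        sil = silhouette.ravel().astype(bool)
        if self.prev[cam] is None:
            changed = np.arange(sil.size)                  # first frame: all pixels
        else:
            changed = np.nonzero(sil != self.prev[cam])[0]
        bit = np.uint8(1 << cam)
        for p in changed:
            vox = self.luts[cam][p]
            if sil[p]:
                self.masks[vox] |= bit                     # set bit b_cam
            else:
                self.masks[vox] &= ~bit                    # clear bit b_cam
        self.prev[cam] = sil

    def active_voxels(self):
        """A voxel belongs to the hull if every camera sees foreground."""
        return np.nonzero(self.masks == self.full)[0]
```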


FIGURE 14.7

Examples of 3D hull reconstruction.

14.5 CLASSIFIER

Our approach aims to classify poses based on 3D hulls of the subject. In this section we propose an example-based classifier, in which the input samples (hulls) are compared to poses stored in a database. Each frame is classified independently from the others.

14.5.1 Classifier Overview

Figure 14.8 shows the basic classifier structure, where T denotes a transformation found using average neighborhood margin maximization (ANMM). This transformation is based on linear discriminant analysis (LDA) and projects the input samples onto a lower-dimensional space where the different pose classes are maximally separated and easier to classify. Using a nearest neighbors (NN) approach, these projected samples are matched to stored poses in a database, and the closest match is the output of the system. Later, to improve the speed of the system, the transformation T can be approximated using Haarlets, as discussed in Section 14.6.

14.5.2 Linear Discriminant Analysis

The goal of the LDA step is to find a transformation that helps to discriminate between the different pose classes. It provides a linear transformation that projects the input hulls onto a lower-dimensional space where they are maximally separated before they are classified. The training examples (hulls) are divided into different pose classes. The voxel values of these hulls are stored in an n-dimensional vector, where n is the total number of voxels in the input hulls. The idea is to find a linear transformation such that the classes are maximally separable after the transformation [36]. Class separability can be measured



FIGURE 14.8

Basic classifier structure. The input samples (hulls) are projected with transformation T onto a lower-dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).

by the ratio of the determinant of the between-class scatter matrix S_B to that of the within-class scatter matrix S_W. The optimal projection W_opt is chosen as the transformation that maximizes this ratio:

W_opt = arg max_W |W S_B W^T| / |W S_W W^T|    (14.3)

and is determined by calculating the generalized eigenvectors of S_B and S_W. Therefore,

W_opt^T = [w_1 w_2 ... w_m]    (14.4)

where the w_i are the generalized eigenvectors of S_B and S_W corresponding to the m largest generalized eigenvalues λ_i. The eigenvalues represent the weight of each eigenvector and are stored in a diagonal matrix D; the eigenvectors w_i represent characteristic features of the different pose classes.

A solution to the optimization problem in Equation 14.3 is to compute the inverse of S_W and solve an eigenproblem for the matrix S_W^{-1} S_B [36]. Unfortunately, S_W is singular in most cases because the number of training examples is smaller than the number of dimensions in the sample vector. Thus, inverting S_W is impossible. For this reason, it is better to look for an alternative where a different matrix, which does not suffer from this dimensionality problem, is used.

14.5.3 Average Neighborhood Margin Maximization

LDA aims to pull apart the class means while compacting the classes themselves. This introduces the small sample size problem, which renders the within-class scatter matrix singular. Furthermore, LDA can only extract c − 1 features (where c is the number of classes), which is suboptimal for many applications. ANMM, as proposed by Wang and Zhang [2], is a similar approach but one that avoids these limitations. For each data


FIGURE 14.9

How ANMM works. (a) For each sample, within a neighborhood (gray), samples of the same class are pulled toward the class center, while samples of a different class are pushed away. (b) The data distribution in the projected space.

point, ANMM pulls the neighboring points with the same class label toward it, as near as possible, while simultaneously pushing the neighboring points with different labels as far away as possible. This principle is illustrated in Figure 14.9.

Instead of using the between-class scatter matrix S_B and the within-class scatter matrix S_W, ANMM defines a scatterness matrix as

S = ∑_{i,k: x_k ∈ N_i^e} (x_i − x_k)(x_i − x_k)^T / |N_i^e|    (14.5)

and a compactness matrix as

C = ∑_{i,j: x_j ∈ N_i^o} (x_i − x_j)(x_i − x_j)^T / |N_i^o|    (14.6)

where N_i^o is the set of the n most similar data points in the same class as x_i (its n nearest homogeneous neighbors) and N_i^e is the set of the n most similar data points in a different class from x_i (its n nearest heterogeneous neighbors). The ANMM eigenvectors W_opt can then be found by the eigenvalue decomposition of S − C.

ANMM introduces three main benefits compared to traditional LDA: (1) it avoids the small sample size problem, since it does not need to compute any matrix inverse; (2) it can find the discriminant directions without assuming a particular form of class densities (LDA assumes a Gaussian form); and (3) many more than c − 1 feature dimensions are available. Some examples of resulting ANMM eigenvectors are shown in Figure 14.10. Using ANMM rather than LDA, the classifier achieves roughly 10 percent better performance.
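A brute-force sketch of this training step follows, computing the scatterness and compactness matrices of Equations 14.5 and 14.6 and taking the top eigenvectors of S − C. The function name and interface are ours, not the chapter's, and the dense d×d matrices restrict this illustration to modest input dimensionality.

```python
import numpy as np

def anmm_projection(X, y, n_neighbors, n_features):
    """Train an ANMM projection (Section 14.5.3).

    X: (N, d) row vectors (e.g., flattened voxel hulls); y: (N,) labels.
    Returns W (n_features, d) and the corresponding eigenvalue weights.
    """
    N, d = X.shape
    S = np.zeros((d, d))
    C = np.zeros((d, d))
    # Pairwise squared distances, used to find each sample's nearest
    # homogeneous and heterogeneous neighborhoods.
    dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(dist, np.inf)
    for i in range(N):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        hom = same[np.argsort(dist[i, same])[:n_neighbors]]
        het = diff[np.argsort(dist[i, diff])[:n_neighbors]]
        for k in het:   # scatterness: heterogeneous neighbors (Eq. 14.5)
            v = X[i] - X[k]
            S += np.outer(v, v) / len(het)
        for j in hom:   # compactness: homogeneous neighbors (Eq. 14.6)
            v = X[i] - X[j]
            C += np.outer(v, v) / len(hom)
    # Eigen-decomposition of S - C; note that no matrix inverse is needed.
    w, V = np.linalg.eigh(S - C)
    order = np.argsort(w)[::-1][:n_features]
    return V[:, order].T, w[order]
```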


FIGURE 14.10

First 4 eigenvectors for the frontal view only, after training for a 12-pose set using the ANMM algorithm.

14.6 HAARLETS

Computing the transformation T as shown in Figure 14.8 can be computationally demanding, especially if there are many ANMM eigenvectors. To improve the speed of the system, the transformation T in the classifier can be approximated using Haarlets, as shown in Figure 14.11. In this case the transformation T is approximated by a linear combination C of Haarlets. An optimal Haarlet set is selected during the training stage. Computing this set on the input image results in a number of coefficients, which, when transformed with C, approximate the coefficients that would result from the transformation T on the same input data. They can be used for subsequent classification in the same manner as in the pure ANMM case. Because of their speed of computation, Haarlets are very popular for real-time object detection and real-time classification. The ANMM approximation approach provides a new and powerful method for selecting or training them, especially in the 3D case, where existing methods fail because of the large number of candidate Haarlets, as noted by Ke et al. [6]. Our approach makes it possible to train 3D Haarlets by selecting from the full set of candidates.

Papageorgiou et al. [37] proposed a framework for object detection based on 2D Haarlets, which can be computed with a minimum of memory accesses and CPU operations using the integral image. Viola and Jones [5] used AdaBoost to select suitable 2D Haarlets for object detection. The same approach was used for pose recognition by Ren et al. [17]. Our approach uses similar Haarlets, although they are three-dimensional, and it introduces a new selection process based on ANMM.

14.6.1 3D Haarlets

The concepts of an integral image and Haarlets can be extended to three dimensions. The 3D integral image, or integral volume, is defined as

ii(x, y, z) = ∑_{x′≤x, y′≤y, z′≤z} i(x′, y′, z′)    (14.7)

Using the integral volume, any rectangular box sum can be computed in eight array references, as shown in Figure 14.12. Accordingly, the integral volume makes it possible to construct volumetric box features similar to the 2D Haarlets. We introduce the 3D Haarlet set as illustrated in Figure 14.13.
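As a concrete illustration, the integral volume of Equation 14.7 and the eight-reference box sum of Figure 14.12 can be written as follows; this is a small NumPy sketch (the helper names are ours) with a zero border added so the corner lookups need no bounds checks.

```python
import numpy as np

def integral_volume(vox):
    """3D integral image (Eq. 14.7): ii[x, y, z] holds the sum of
    vox[:x, :y, :z] thanks to the zero border added by np.pad."""
    ii = vox.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)))

def box_sum(ii, x0, y0, z0, x1, y1, z1):
    """Sum of vox[x0:x1, y0:y1, z0:z1] from eight array references
    (Figure 14.12), with alternating signs on the cuboid corners."""
    return (ii[x1, y1, z1] - ii[x0, y1, z1] - ii[x1, y0, z1] - ii[x1, y1, z0]
            + ii[x0, y0, z1] + ii[x0, y1, z0] + ii[x1, y0, z0]
            - ii[x0, y0, z0])
```

For a binary 24×24×24 hull, box_sum(ii, 0, 0, 0, 24, 24, 24) returns the number of active voxels; a Haarlet response is then simply a signed combination of a few such box sums.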



FIGURE 14.11

Classifier structure illustrating the Haarlet approximation. The pretrained set of Haarlets is computed on the input sample (silhouette or hull). The approximated coefficients are computed as a linear combination C of the Haarlet coefficients. The contents of the dotted-line box constitute an approximation of T in Figure 14.8.


FIGURE 14.12

Sum of voxels within the gray cuboid computed with eight array references. If A, B, C, D, E, F, G, and H are the integral volume values at the locations shown, the sum can be computed as (B + C + E + H) − (A + D + F + G).

14.6.2 Training

Viola and Jones [5] used AdaBoost to select suitable 2D Haarlets for object detection. The same approach was used for pose recognition by Ren et al. [17]. Considering memory


FIGURE 14.13

Proposed 3D Haarlets. The first 15 features are extruded versions of the original 2D Haarlets in all 3directions; the final 2 are true 3D center-surround features.

and processing time constraints, Ke et al. [6] noted that it is not possible to evaluate the full set of candidate 3D Haarlets using AdaBoost, and therefore only a fraction of the full dictionary can be used, at a very limited resolution. This makes it virtually impossible to train a useful 3D Haarlet set using AdaBoost. In our approach we introduce a new selection process based on ANMM. The Haarlets are selected to approximate W_opt (Section 14.5.3) as a linear combination. The particular Haarlet set used here is shown in Figure 14.13. Along with feature type, Haarlets can vary in width, height, depth, and position inside the voxel space. At a 24×24×24 resolution, this results in hundreds of millions of candidate features.

The best Haarlets are obtained from this set by convolving all candidates with the vectors in W_opt and selecting those with the highest coefficients (i.e., the highest response magnitudes). This score is found for each candidate Haarlet by calculating its dot product with each ANMM vector (each row in W_opt) and calculating the weighted sum using the weights of those ANMM vectors, as stored in the diagonal matrix D (i.e., the eigenvalues serve as weights). Thus, the entire ANMM eigenspace is approximated as a whole, giving higher priority to dimensions with a higher weight when selecting Haarlets. This dot product can be computed very efficiently using the integral volume.

Most selected Haarlets are redundant unless W_opt is adapted after each new Haarlet is selected, before choosing the next one. Let F be a matrix containing the already selected Haarlets in vector form, where each row of F is a Haarlet. F can be regarded as a basis that spans the feature space that can be represented by the Haarlet vectors selected so far. Essentially, we do not want the next selected Haarlet to lie in the space already represented by F. Let N be a basis of the null space of F,

N = null(F)    (14.8)

N forms a basis that spans everything not yet described by F. To obtain the new optimal transformation, we project D · W_opt onto N, where D is the diagonal matrix containing the weights of the eigenvectors w_i in W_opt:

D′ · W′_opt = D · W_opt · N · N^T    (14.9)

or

W′_opt = D′^{−1} · D · W_opt · N · N^T    (14.10)

where D′ is a diagonal matrix containing the new weights λ′_i of the new eigenvectors w_i in W′_opt,

λ′_i = ||λ_i · w_i · N · N^T||    (14.11)

Every time a new Haarlet is selected based on W_opt, F is updated accordingly, and the whole process is iterated until the desired number of Haarlets is obtained. Examples of selected Haarlets are shown in Figure 14.14.


FIGURE 14.14

(a) Three example ANMM eigenvectors. (b) Approximation using 10 Haarlets. The first example shows how a feature is selected to inspect the legs; the last example shows a feature that distinguishes between the left and right arm stretched forward.
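The greedy selection loop of Section 14.6.2 can be sketched as follows. This is a toy version under simplifying assumptions: the candidates are handed in as dense vectors and scored by plain dot products, whereas the real system scores hundreds of millions of candidates via the integral volume; the rank tolerance used for the null-space computation is our choice, and the weighting matrix D is folded directly into the working eigenspace (Equation 14.9).

```python
import numpy as np

def select_haarlets(W_opt, weights, candidates, n_select):
    """Greedy ANMM-guided Haarlet selection (Section 14.6.2).

    W_opt: (m, d) ANMM eigenvectors; weights: (m,) eigenvalue weights;
    candidates: (K, d) vectorized candidate Haarlets (a small dense set
    for illustration). Returns the chosen indices and the matrix F.
    """
    W = np.diag(weights) @ W_opt        # weighted eigenspace D * W_opt
    F = np.empty((0, W_opt.shape[1]))   # selected Haarlets, one per row
    chosen = []
    for _ in range(n_select):
        # Score each candidate: weighted response over all ANMM vectors.
        scores = np.abs(W @ candidates.T).sum(0)
        scores[chosen] = -np.inf        # never pick the same Haarlet twice
        best = int(np.argmax(scores))
        chosen.append(best)
        F = np.vstack([F, candidates[best]])
        # Project the weighted eigenspace onto the null space of F
        # (Eqs. 14.8-14.10) so the next pick adds new information.
        _, s, Vt = np.linalg.svd(F)
        rank = int((s > 1e-10).sum())
        N = Vt[rank:].T                 # basis of null(F)
        W = W @ N @ N.T
    return np.array(chosen), F
```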


14.6.3 Classification

After the ANMM vectors have been computed and the Haarlets have been selected to approximate them, the next step is to actually classify new silhouettes. This process uses the Haarlets to extract coefficients from the normalized silhouette image; it then computes a linear combination of these coefficients to approximate the coefficients that would result from the ANMM transformation. An example of such an approximated ANMM feature vector is shown in Figure 14.14. The resulting coefficients can be used to classify the pose of the silhouette. Given the coefficients h extracted with the Haarlets, the approximated ANMM coefficients l can be computed as

l = L · h    (14.12)

where L is an m×n matrix in which m is the number of ANMM eigenvectors and n is the number of Haarlets used for the approximation. L can be obtained as the least squares solution to the system

W_opt = L · F^T    (14.13)

The least squares solution to this problem yields

L = W_opt · ((F^T F)^{−1} F^T)^T    (14.14)

L provides a linear transformation of the feature coefficients h to a typically smaller number of ANMM coefficients l. This allows the samples to be classified directly based on these ANMM coefficients, whereas an AdaBoost method needs to be complemented with a detector cascade [5] or a hashing function [16, 17]. Finally, using NN search, the new silhouettes can be matched to the stored examples (i.e., the mean coefficients of each class).
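At runtime, classification therefore reduces to a handful of matrix operations. The sketch below assumes the selected Haarlets are available as dense rows of F, so that the coefficients h can be obtained with one dot product; in the real system each coefficient comes from a few integral-volume lookups instead. Names and shapes are ours, and the least-squares fit is done numerically rather than via the explicit pseudoinverse of Equation 14.14.

```python
import numpy as np

def fit_L(W_opt, F):
    """Least-squares fit of Eqs. 14.13-14.14: find L with L @ F ~= W_opt,
    where the rows of F are the selected Haarlets in vector form."""
    return np.linalg.lstsq(F.T, W_opt.T, rcond=None)[0].T

def classify(hull, F, L, class_means):
    """Nearest-neighbor pose classification (Section 14.6.3).

    hull: normalized binary voxel hull; class_means: (n_classes, m) mean
    ANMM coefficients stored for each pose class.
    """
    h = F @ hull.ravel().astype(float)   # Haarlet coefficients
    l = L @ h                            # approximated ANMM coefficients (Eq. 14.12)
    d = np.linalg.norm(class_means - l, axis=1)
    return int(np.argmin(d))             # index of the closest stored pose
```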

14.6.4 Experiments

In this section we evaluate how many Haarlets are needed for a good ANMM approximation, and we measure the speed improvement over using a pure ANMM approach. For this experiment a 50-pose classifier was trained using 2000 training samples of a subject in different positions and orientations. The experiment was set in an office scenario with a cluttered background and thus sometimes noisy segmentations. The samples were recorded from six cameras connected to six computers that ran foreground-background segmentation on the recorded images. From these segmented silhouettes, 3D hulls were reconstructed and normalized for size and orientation to 24×24×24 voxels.

Validation was done using 4000 test samples. The resulting classifier uses 44 ANMM eigenvectors, which can be approximated almost perfectly with 100 Haarlets. The number of Haarlets used determines how well the original ANMM transformation is approximated, as shown in Figure 14.15. There is thus no overfitting: after a certain number of Haarlets, the approximation delivers the same classification performance as the pure ANMM classification. With 3D ANMM, the classifier achieves 97.52 percent correct classification on 50 pose classes.

In Figure 14.15 we also show the performance of a 2D silhouette-based classifier, which will be explained in more detail in Section 14.7. In this 2D case we show the classification performance of a classifier where the Haarlets are trained with ANMM and one



FIGURE 14.15

Correct classification rates for 50 pose classes using up to 100 Haarlets for classification (2D Haarlets, 3D Haarlets, and 2D AdaBoost).

where they are trained with AdaBoost [5]. The ANMM approach has better performance, while the AdaBoost approach suffers from overfitting. Due to its memory constraints, it is not possible to apply AdaBoost to 3D Haarlets [6].

As shown in Figure 14.16, the Haarlet-approximated approach is many times faster than pure ANMM. The computation time increases almost linearly for the ANMM transformation as the number of pose classes increases, because increasing the number of pose classes increases the number of ANMM feature vectors. Using the ANMM approximation, the integral volume of the hull has to be computed once, after which computing additional Haarlet coefficients requires virtually no computation time relative to the time of computing the integral volume. Considering the processing time required for segmentation (5 ms, in parallel) and reconstruction (15 ms), the total processing time is less than 25 ms per frame. (The classification was performed on a standard 3-GHz computer.)

Note that if we decrease the number of cameras used in the system, the correct classification rate decreases linearly down to three cameras, where the correct classification rate is 91.93 percent (it was 97.52 percent using six cameras). With fewer than three cameras it is impossible to reconstruct a reasonable 3D hull, and therefore classification is also impossible. The computation time for the reconstruction also decreases linearly, to about 8 ms for reconstructing a hull from three cameras (it was 15 ms using six cameras).

14.7 ROTATION INVARIANCE

The pose classification problem becomes much more difficult when the subject can freely change not only position but also orientation. A change of position can easily be normalized, but when classifying 2D silhouettes it is impossible to normalize for the



FIGURE 14.16

Classification times in milliseconds for the pure ANMM classifier and for the classifier using 100 3D Haarlets to approximate the ANMM transformation. The ANMM-approximated version only has to compute the integral volume once (3.5 ms), and the computation time for the 100 Haarlets is negligible. In the pure ANMM case, however, the number of feature vectors increases with the number of pose classes and requires 1.115 ms of computation time per vector.

rotation of the subject. In a 3D hull approach, however, it is possible to normalize the rotation of the 3D hulls before classifying them. Normalizing hull rotation consists of measuring the angle of the hull’s orientation and then rotating the hull to a standard orientation. The goal is that, regardless of the orientation of the subject, the resulting normalized hull looks the same, as shown in Figure 14.17.
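Given the orientation angle from the overhead tracker (Section 14.7.1), the normalization itself is a single rotation of the voxel grid about the vertical axis. A minimal sketch, assuming the hull is a binary NumPy array with z as the vertical axis and using SciPy's nearest-neighbor rotation so the result stays binary (the axis convention is an assumption):

```python
import numpy as np
from scipy.ndimage import rotate

def normalize_hull(hull, angle_deg):
    """Rotate a binary voxel hull to the standard orientation.

    hull: (X, Y, Z) binary array with z vertical; angle_deg: subject
    orientation in degrees as measured by the overhead tracker.
    """
    # Rotate by the negative angle in the horizontal (x, y) plane;
    # order=0 (nearest neighbor) keeps the voxel values binary.
    return rotate(hull.astype(np.uint8), -angle_deg,
                  axes=(0, 1), reshape=False, order=0)
```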

14.7.1 Overhead Tracker

An overhead tracker is used to determine the subject’s angle of orientation. Our visual tracker, based on a color-based particle filter [38], uses a set of particles to model the posterior distribution of the likely state of the subject. During each iteration, the tracker generates a set of new hypotheses for the state by propagating the particles using a dynamic model. This generates a prior distribution of the state, which is then tested using the observation of the image. A human is modeled by a circle and an ellipse representing the head and shoulders. The color distributions of these two regions are compared to a stored model histogram to yield the likelihood for the state of each particle.

Particle filtering is a multiple-hypothesis approach: several hypotheses exist at the same time and are kept during tracking. Each hypothesis or sample s represents one hypothetical state of the object, with a corresponding discrete sampling probability π.



FIGURE 14.17

Examples of different user orientations resulting in similar hulls.

FIGURE 14.18

Each sample is modeled by an ellipse and a circle.

Each sample (particle) consists of an ellipse with position, orientation, and scale, and a circle with a position relative to the center of the ellipse, as shown in Figure 14.18. The ellipse describes the boundary of the object being tracked—in this case the shoulder region—while the circle represents the head.

Each sample is given as

s = {x, y, H_x, H_y, θ, c_x, c_y}    (14.15)

where x and y represent the position of the ellipse; H_x and H_y, the size of the ellipse along the x and y axes; θ, the orientation of the ellipse; and c_x and c_y, the position of the head circle relative to the ellipse center. In this tracker the ratio between H_x and H_y is constant.

To test the probability of a sample being a good hypothesis, a color histogram p is computed over the pixels inside the ellipse, and another histogram p′ is computed for the pixels inside the head circle. Each pixel has three color channels (red, green, and blue), and each channel is divided into eight bins, giving a total of 512 bins.


A pixel is assigned to a bin as follows:

u = n^2 ⌊rn/256⌋ + n ⌊gn/256⌋ + ⌊bn/256⌋    (14.16)

where n is the number of bins for each channel, and r, g, b are color values between 0 and 255. With this formula, each pixel is assigned to a bin u, which is then incremented:

p(u) = p(u) + w    (14.17)

where w is the weight of the pixel. To increase the reliability of the color distribution when boundary pixels belong to the background or are occluded, smaller weights are assigned to pixels that are further away from the region center:

w = 1 − r^2    (14.18)

where r is the distance between the pixel and the center of the ellipse. The resulting histogram is compared to a stored histogram or target model q using

the Bhattacharyya coefficient,

ρ[p, q] = ∑_{u=1}^{m} √(p(u) q(u))    (14.19)

The larger ρ is, the more similar the histograms are. We define the distance between two histograms as

d = √(1 − ρ[p, q])    (14.20)

which is called the Bhattacharyya distance [39]. This similarity measure provides the likelihood of each sample and is used to update the sample set.

To speed up the tracker, the number of pixels that must be evaluated to build the histogram is reduced. First, a random sampling is made of the pixels that lie inside the shoulder and head regions. This random sampling is fixed for the tracker’s entire run. When calculating the color histogram, only the sampled pixels are evaluated. This not only benefits speed but also makes the number of evaluated pixels independent of the size of the ellipse; thus, computation time is constant.
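The measurement step of Equations 14.16 through 14.20 is summarized below, assuming the fixed pre-sampled pixels and their normalized distances to the region center are given; the per-channel binning follows our reconstruction of Equation 14.16.

```python
import numpy as np

def color_histogram(pixels, radii, n=8):
    """Weighted RGB histogram of Eqs. 14.16-14.18.

    pixels: (N, 3) uint8 RGB samples inside the region;
    radii: (N,) distances to the region center, normalized to [0, 1].
    """
    q = (pixels.astype(np.int64) * n) // 256        # per-channel bin index
    u = q[:, 0] * n * n + q[:, 1] * n + q[:, 2]     # combined bin (Eq. 14.16)
    w = 1.0 - radii ** 2                            # pixel weights (Eq. 14.18)
    p = np.bincount(u, weights=w, minlength=n ** 3)
    return p / p.sum()                              # normalized histogram

def bhattacharyya_distance(p, q):
    """d = sqrt(1 - rho[p, q]), with rho from Eq. 14.19."""
    rho = np.sqrt(p * q).sum()
    return float(np.sqrt(max(0.0, 1.0 - rho)))
```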

The evolution of the sample set is described by propagating each sample according to a dynamic model:

s_t = A s_{t−1} + w_{t−1}    (14.21)

where A defines the deterministic component of the model and w_{t−1} is a multivariate Gaussian random variable. Each element of the set is then weighted in terms of the observations (color histogram), and N samples are drawn with replacement by choosing a particular sample with probability π(n). The tracker state at any given time is computed as a weighted mean state over all current samples at that time, weighted by their Bhattacharyya distance to the target model. This combination of multiple hypotheses, particle filtering, and random sampling results in a fast, robust overhead tracker, as shown in Figure 14.19.
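One prediction-and-resampling iteration of Equation 14.21 might look as follows; the state layout, the matrix A, and the noise scale are placeholders for whatever dynamic model is chosen.

```python
import numpy as np

def propagate_and_resample(samples, weights, A, noise_std, rng):
    """One particle-filter iteration: resample by weight, then apply
    the dynamic model s_t = A s_{t-1} + w_{t-1} (Eq. 14.21)."""
    n = len(samples)
    idx = rng.choice(n, size=n, p=weights / weights.sum())  # draw with replacement
    resampled = samples[idx]
    noise = rng.normal(0.0, noise_std, resampled.shape)     # Gaussian w_{t-1}
    return resampled @ A.T + noise
```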

Note that a tracker requires initialization; if initialization is not desirable for the application, it is possible to use an alternative such as an orientation sensor, or to determine


FIGURE 14.19

Example of the overhead tracker.

the greatest horizontal direction of variation in the hull. The latter works well, but it limits the number of usable poses, as it cannot distinguish the subject's front from back. Therefore, all poses need to be symmetrical, which is not ideal. This can be avoided by using other cues to determine which side of the subject is in front, such as face detection from the cameras placed sideways.

Another option is a direct approach, as proposed by Cohen and Li [18], Weinland et al. [19], and Gond et al. [20], where a rotation-invariant 3D shape descriptor is used rather than hull normalization. For two reasons, we chose to first normalize the hulls and then classify them:

1. We believe higher performance and lower computation times are possible this way, as both the method of Ren et al. [17] and our method achieve very high classification rates in real time by first determining orientation and then classifying.

2. Disconnecting normalization from classification allows the classification algorithm to be used for other classification problems as well, such as hand gestures, or for classification in a space where the third dimension is time (similar to [6]). In such a case a different normalization step is required, but the classification algorithm remains mostly the same.

14.7.2 Experiments

To test our 3D hull-based rotation-invariant classifier we compared it to a 2D silhouette-based approach. For this experiment we used the same setup, training, and test data as described in Section 14.6.4. A 2D silhouette-based classifier cannot classify the pose of a person with changing orientation, so it is impossible to compare the two approaches directly. It is, however, possible to train a 2D silhouette-based classifier for several possible user orientations. The training samples are divided into 36 individual bins depending on



FIGURE 14.20

Correct classification rates comparing classification based on 2D silhouettes and 3D hulls, both using ANMM approximation and Haarlets.

the angle of orientation. For each bin a separate 2D classifier is trained. In the classification stage, depending on the measured angle of rotation, the appropriate 2D classifier is used.

For this experiment we trained a 2D classifier based on 2D Haarlets using ANMM. This allowed us to quantify how much a 3D hull approach improves performance. Figure 14.20 shows the performance for classification with different numbers of pose classes, up to 50. The pose classes were randomly selected, and the results were averaged over five random samplings. With all 50 pose classes, the 3D system is 97.52 percent correct; the 2D system is 91.34 percent correct.

14.8 RESULTS AND CONCLUSIONS
Using the algorithms described in this chapter, we built a real-time pose detection system using six cameras, hardware-triggered to ensure that the recorded images were synchronized. The six cameras were connected to six computers, each running a foreground-background segmentation. As segmentation is one of the computationally more expensive steps in the system, distributing the load benefits system speed significantly. Additionally, the smaller binary silhouettes are easier to send over the network than full-color images, as illustrated below. The silhouettes are sent to a host computer, which runs the 3D hull reconstruction and the pose classification. The speed of the reconstruction step is significantly improved by voxel carving and a fixed lookup table. 3D Haarlets make the pose classification step about four times faster. The overhead tracker runs on a separate computer in parallel and sends the orientation estimate over the network.
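To illustrate why binary silhouettes are cheap to transmit, the following sketch packs a 640×480 mask to one bit per pixel; the resolution and the use of NumPy's `packbits` are our assumptions for illustration, not details of the chapter's network protocol.

```python
import numpy as np

# One camera's 640x480 binary segmentation mask.
mask = np.zeros((480, 640), dtype=np.uint8)

packed = np.packbits(mask)                        # 1 bit per pixel on the wire
print(packed.nbytes, "bytes vs", 480 * 640 * 3)   # 38400 vs 921600 (24-bit color)

# On the host, unpack back to the binary silhouette.
unpacked = np.unpackbits(packed).reshape(480, 640)
assert np.array_equal(mask, unpacked)
```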


This system is capable of detecting 50 poses with 97.52 percent accuracy in real time. Example reconstruction and classification results are given in Figure 14.21, which shows the input images for one of the six cameras, as well as the 3D hull reconstruction from a top and a side view and the detected pose.

FIGURE 14.21

Example reconstruction and classification results (panels, left to right: input, reconstruction, detected pose).


The system described in this chapter introduces a number of technical contributions. We introduced a new and powerful approach to training Haarlets based on ANMM, and extended it to 3D, which makes it possible to train 3D Haarlets. The 3D approach has new, interesting properties, such as increased robustness and rotation invariance. Furthermore, in the 3D approach the trained classifier becomes independent of the camera setup. The result is a pose classification system with the same or better performance compared to the state of the art, but at much faster, interactive speeds.

The methods described in this chapter can be ported to other classification problems, such as hand gesture recognition, object detection and recognition, face detection and recognition, and even event detection, where the third dimension of the 3D Haarlets is time. The algorithms described in this chapter for training 3D Haarlets can be exported to any system where 2D or 3D Haarlets require training.

There are some limitations to our system. For example, as the system relies on foreground-background segmentation, the background must be static. A busy but static background is not a problem, as the system can deal with noisy segmentations; however, the foreground-background segmentation fails on a moving background. No experiments have been done with multiple subjects in the scene. This should not be a problem as long as the hulls are not touching; when they touch, it becomes difficult to determine which voxels belong to which subject. A possible separation strategy is sketched below.
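As an illustration of this point, non-touching hulls could in principle be separated by connected-component labeling of the occupancy grid. This is a sketch of that idea under toy data, not something validated in our experiments.

```python
import numpy as np
from scipy.ndimage import label

# Two toy, non-touching blobs standing in for two subjects' hulls.
volume = np.zeros((64, 64, 64), dtype=np.uint8)
volume[10:20, 10:20, 0:40] = 1
volume[40:50, 40:50, 0:40] = 1

# Default face connectivity: each disjoint blob gets its own label.
labels, num_subjects = label(volume)
print(num_subjects)  # -> 2
```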

Another limitation is the orientation tracker, which requires initialization. Although it is fast and accurate, in future work it will be important to look for an alternative orientation estimation that does not require initialization and is independent of previous frames. Furthermore, the sparse camera placement limits the quality of the 3D hull reconstruction, and therefore some poses are impossible to detect; at this time the pose classes are limited to visible arm directions. In the future it will be interesting to look at a sequence of poses and have the algorithm detect moving gestures based on a sequence of ANMM coefficients. In such a system the impact of a missed subtle pose will be less apparent, as the sequence as a whole is classified.

Acknowledgments. The work for this chapter was carried out in the context of the Sixth Framework Programme of the European Commission: EU Project FP6–511092 CyberWalk, and the Swiss NCCR project IM2.

REFERENCES
[1] CyberWalk project, http://www.cyberwalk-project.org.
[2] F. Wang, C. Zhang, Feature extraction by maximizing the average neighborhood margin, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[3] M. Van den Bergh, E. Koller-Meier, L. Van Gool, Fast body posture estimation using volumetric features, IEEE Visual Motion Computing, 2008.
[4] R. Kehl, M. Bray, L. Van Gool, Full body tracking from multiple views using stochastic sampling, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[5] P. Viola, M.J. Jones, Robust real-time object detection, IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[6] Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using volumetric features, IEEE International Conference on Computer Vision, 2005.
[7] C. Bregler, J. Malik, Tracking people with twists and exponential maps, IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[8] Q. Delamarre, O. Faugeras, 3D articulated models and multi-view tracking with silhouettes, IEEE International Conference on Computer Vision, 1999.
[9] D.M. Gavrila, L. Davis, 3D model-based tracking of humans in action: a multi-view approach, IEEE Conference on Computer Vision and Pattern Recognition, 1996.
[10] I. Kakadiaris, D. Metaxas, Model-based estimation of 3D human motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (12) (2000) 1453–1459.
[11] R. Plänkers, P. Fua, Articulated soft objects for video-based body modeling, IEEE International Conference on Computer Vision, 2001.
[12] K. Cheung, T. Kanade, J. Bouguet, M. Holler, A real time system for robust 3D voxel reconstruction of human motions, Proceedings of CVPR, 2000.
[13] K. Cheung, S. Baker, T. Kanade, Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture, Proceedings of CVPR, 2003.
[14] I. Mikic, M. Trivedi, E. Hunter, P. Cosman, Articulated body posture estimation from multi-camera voxel data, Proceedings of CVPR, 2001.
[15] R. Rosales, S. Sclaroff, Specialized mappings and the estimation of body pose from a single image, IEEE Human Motion Workshop, 2000.
[16] G. Shakhnarovich, P. Viola, T. Darrell, Estimating articulated human motion with parameter-sensitive hashing, IEEE International Conference on Computer Vision, 2003.
[17] L. Ren, G. Shakhnarovich, J.K. Hodgins, H. Pfister, P. Viola, Learning silhouette features for control of human motion, ACM Transactions on Graphics, 24 (4) (2005) 1303–1331.
[18] I. Cohen, H. Li, Inference of human postures by classification of 3D human body shape, IEEE Workshop on Analysis and Modeling of Faces and Gestures, 2003.
[19] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding, 104 (2006) 249–257.
[20] L. Gond, P. Sayd, T. Chateau, M. Dhome, A 3D shape descriptor for human pose recovery, V Conference on Articulated Motion and Deformable Objects, 2008.
[21] R. Mester, T. Aach, L. Dümbgen, Illumination-invariant change detection using a statistical colinearity criterion, Pattern Recognition: Proceedings 23rd DAGM Symposium, 2001.
[22] A. Griesser, S. De Roeck, A. Neubeck, L. Van Gool, GPU-based foreground-background segmentation using an extended colinearity criterion, Proceedings of VMV, 2005.
[23] A. Laurentini, The visual hull concept for silhouette-based image understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16 (2) (1994) 150–162.
[24] A. Laurentini, How many 2D silhouettes does it take to reconstruct a 3D object?, Computer Vision and Image Understanding, 67 (1) (1997) 81–87.
[25] K. Shanmukh, A. Pujari, Volume intersection with optimal set of directions, Pattern Recognition Letters, 12 (3) (1991) 165–170.
[26] J. Luck, D. Small, C. Little, Real-time tracking of articulated human models using a 3D shape-from-silhouette method, Proceedings of the International Workshop on Robot Vision, 2001.
[27] C. Theobalt, M. Magnor, P. Schüler, H. Seidel, Combining 2D feature tracking and volume reconstruction for online video-based human motion capture, Proceedings of the 10th Pacific Conference on Computer Graphics and Applications, 2002.
[28] I. Mikic, M. Trivedi, E. Hunter, P. Cosman, Human body model acquisition and tracking using voxel data, International Journal of Computer Vision, 53 (3) (2003) 199–223.
[29] J.-M. Hasenfratz, M. Lapierre, J.-D. Gascuel, E. Boyer, Real-time capture, reconstruction and insertion into virtual world of human actors, Proceedings of Vision, Video and Graphics, 2003.
[30] F. Caillette, T. Howard, Real-time markerless human body tracking using colored voxels and 3-D blobs, Proceedings of ISMAR, 2004.
[31] W. Matusik, C. Buehler, L. McMillan, Polyhedral visual hulls for real-time rendering, Proceedings of the 12th Eurographics Workshop on Rendering Techniques, 2001.
[32] J. Franco, E. Boyer, Exact polyhedral visual hulls, Proceedings of BMVC, 2003.
[33] K. Kutulakos, S. Seitz, A theory of shape by space carving, Technical Report TR692, Computer Science Department, University of Rochester, Rochester, NY, 1998.
[34] S. Seitz, C. Dyer, Photorealistic scene reconstruction by voxel coloring, International Journal of Computer Vision, 25 (3) (1999) 1067–1073.
[35] R. Szeliski, Rapid octree construction from image sequences, Computer Vision, Graphics and Image Processing, 58 (1) (1993) 23–32.
[36] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second Edition, Academic Press, 1990.
[37] C. Papageorgiou, M. Oren, T. Poggio, A general framework for object detection, International Conference on Computer Vision, 1998.
[38] K. Nummiaro, E. Koller-Meier, L. Van Gool, An adaptive color-based particle filter, Image and Vision Computing, 21 (1) (2003) 99–110.
[39] F. Aherne, N. Thacker, P. Rockett, The Bhattacharyya metric as an absolute similarity measure for frequency coded data, Kybernetika (1997) 1–7.
