
Tracking of human movements in image space

Fabio Remondino


Table of contents

1. Introduction
2. Human tracking overview
3. Data acquisition
4. Algorithms overview
   4.1 The least square matching tracker
   4.2 Object tracking
   4.3 The Shi-Tomasi-Kanade tracker
   4.4 Detection and tracking of moving objects
5. Feature selection for tracking human body parts
6. Results
   6.1 Least square matching tracking
   6.2 Shi-Tomasi-Kanade tracker
   6.3 Detection and tracking of moving objects
   6.4 Object tracking
7. Conclusions
8. Future works
Bibliography


1. Introduction

Human motion analysis is receiving increasing attention from researchers in different fields of study. The interest is motivated by a wide spectrum of applications, such as athletic performance analysis, surveillance, man-machine interfaces, video-conferencing, human-computer interaction and motion capture (games and animation).
A complete model of a human consists of both the movements and the shape of the body. Many of the available systems consider the two modeling processes as separate, even if they are very close. Depending on the application (animation, visualization, medical imaging), different methods can be used for the measurement of the body shape: laser scanners, infra-red light scanners, photogrammetry, structured light. The modeling of the movement is often obtained by capturing the motion with tracking processes: this can be achieved with photogrammetric methods, electromagnetic or mechanical sensor systems and image-based methods.
In general the tracking process can be described as the establishment of correspondences of the image structure between consecutive frames, based on features related to position, velocity, shape, color and texture. The main problem is to establish the corresponding features in different images automatically. Tracking is required for 2D and 3D object localization and it is also used for object detection, classification and identification.
The main goals of motion studies are to detect moving regions (points, features, areas), estimate the motion, model articulated objects and interpret the motion. It is a very hard task because:
- the appearance of people can vary dramatically from frame to frame;
- people can appear in arbitrary poses;
- the human body can deform in complex ways;
- tracked points can be occluded, resulting in ambiguities and multiple interpretations;
- tracked points (joints) are often not well observable (clothing hides the underlying structure);
- it is a geometrically under-constrained problem (images are 2D entities of a 3D world).

This work focuses on the tracking of movements of humans in monocular sequences of images. Section 2 gives a general overview of tracking techniques, including motion capture, human modelling processes and moving object detection. In Section 3 the techniques used for image acquisition and the contrast enhancement process are presented. In Section 4 an overview of the implemented algorithms is given, while a short description of the features of interest is contained in Section 5. Finally, Section 6 shows all the results for the validation of the algorithms.


2. Human tracking overview

The main problem of tracking humans (and in particular human movements) is how to capture the position and motion in space of the articulated parts of the human body.
Typically the tracking process involves matching between frames using pixels, points, lines and blobs, based on their motion, shape or other visual information. Tracking the movements of persons and modeling the different parts of the human body are two applications very close to each other.
There are two main techniques to capture human motions [2]:

(a) Tracking using body markers
These tracking systems can be divided into [13]:
1. Systems which employ sensors on the body that sense artificial external sources (e.g. an electro-magnetic field) or natural external sources. These systems provide 3D world-based information, but their workspace and accuracy are generally limited due to the use of the external sources, and their form factor restricts their use to medium and larger sized body parts.
2. Systems which employ an external sensor that senses artificial sources or markers on the body (e.g. an electro-optical system that tracks reflective markers) or natural sources on the body (e.g. a video-camera based system that tracks the pupil and cornea). These systems generally suffer from occlusion and a limited workspace.
3. Systems which employ sensors and sources that are both on the body (e.g. a glove with piezo-resistive flex sensors). The sensors generally have small form factors and are therefore especially suitable for tracking small body parts. These systems allow the capture of any body movement and an unlimited workspace, but generally do not provide 3D world-based information.
In figure 2.1 some systems for motion capture are presented.

Fig.2.1: Different systems for motion capture. Left and right: retro-reflective markers. Middle: electro-mechanical system

All these techniques are used especially in ‘motion capture’, where the object’s position and orientation in physical space are recorded as information in a suitable form that animators can use to control elements in a computer generated scene.
The disadvantages of these techniques are:
- displacement of the markers during movement leads to uncertainty in the results;
- difficulty of placing markers on complex articulations (like shoulders and knees);
- rigidity in movement (psychological effects);
- difficult calibration of the system.


The main advantage is the capability of some systems to process the data and produce 3D results in real time.

(b) Tracking without markers (marker free methods)
Marker free methods are based on image sequence processing/analysis. These methods are often model-based; the image sequences can be acquired either from one camera (monocular vision) or from multiple cameras (multi-view).
In the monocular case different approaches can be used to track the human body: matching point features, contour extraction (sensitive to noise), 3-D geometric primitives projected onto the images [13], probabilistic models of the joint positions [22], particle filtering [3], active part decomposition.
In the multi-view approach, multiple cameras simultaneously acquire different views of the person and the 3-D body poses and motions at each time instant can be recovered from the multi-image sequences [7].
The marker free methods offer the subject complete freedom of movement, which is not the case for tracking with markers.
Image understanding and extrapolation of the third dimension are the main problems for these methods, especially in monocular vision. In this case the 3D coordinates can be induced from the 2D image coordinates, e.g. using a Bayesian approach and a set of training data [11] or by fitting the projection of a three-dimensional person model through the sequence [21, 24]. The main problems of these approaches are the models of the different parts of the body (using cylinders, cones, elliptical cylinders), the large number of degrees of freedom of the model (body joints, rotations, orientations) and the modeling of the motion (prediction of the next steps). In the multi-image approach, stereo-vision can be used to extract 3D information from the sequence.

The interest in human motion analysis can also be limited to detecting moving objects in image sequences. In applications such as real-time tracking, monitoring of wide-area sites or surveillance, tracking approaches based on moving object localization and on body shape or body boundary tracking are used (fig.2.3). The moving objects can be identified in the images using background subtraction or optical flow. If a motion of the camera is also present, a rectification of the frames must be performed in order to apply the background knowledge [12]. Moving objects in the scene are often segmented, while occlusion problems can be solved using temporal analysis and trajectory prediction (Kalman filter) [17].

Fig.2.2: Left: geometric primitives projected onto the image [21]. Right: a volumetric human model [1]

Fig.2.3: Moving shapes tracking


3. Data acquisition

Four sequences (fig.3.1, 3.2, 3.3, 3.4) have been acquired with a Sony DCR-VX700E, a Sony digital handycam that records images in digital format on a mini DV tape. The images are stored in DV format with a size of 720x576 pixels and 24 bit color resolution. The DV format is a Sony proprietary compressed digital video and audio recording standard.
As CCD cameras are interlaced, i.e. a full frame is split into two different fields which are recorded and read out consecutively, the odd and even lines of an image are captured at different times and a saw pattern is created during the digitizing process.
For this reason only the odd (or even) lines of an image are used in the algorithms, reducing the resolution in the vertical direction by 50 per cent.
Two other sequences (fig.3.5, 3.6) have been acquired by digitizing an old VHS tape. Also in this case the digitalization process creates a saw pattern in the images; therefore reduced images are used for the validation of the algorithms.
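
The field selection can be sketched in a couple of numpy lines (a minimal illustration only, assuming frames are available as row-major numpy arrays; the function name is hypothetical):

    import numpy as np

    def keep_one_field(frame):
        """Keep every second row of an interlaced frame (here the odd lines),
        halving the vertical resolution as described above."""
        return frame[1::2, ...]

    # Hypothetical usage: a 720x576 DV frame becomes 720x288 after field selection.
    frame = np.zeros((576, 720, 3), dtype=np.uint8)
    field = keep_one_field(frame)        # shape (288, 720, 3)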

Fig.3.1: Sequence of 24 frames of a walking man: the camera is rotating on a tripod

Fig.3.2: Sequence of 60 frames: the camera is still and the guy is just raising his arms

The two sequences acquired from the VHS tape (fig.3.5, 3.6) have very low resolution because of the video-tape and the digitalization process (RAZOR software). No attempt at enhancing the frames was successful: different filters and also motion blur compensation did not achieve good results. Therefore just a local contrast enhancement has been applied.


Fig.3.5: Sequence of 100 frames: two people are walking, one passing the other. Their trajectories are perpendicular to the camera, which is stationary and far away from them

Fig.3.6: Sequence of 50 frames of moving people walking towards the camera

Fig.3.3: Sequence of 9 frames acquired from VHS tape

Fig.3.4: Sequence of 10 frames from VHS tape


4. Algorithms overview

In this section the implemented algorithms are described: the least square matching tracker, object tracking and extraction, the Shi-Tomasi-Kanade tracker, and the detection and tracking of moving objects.

4.1 The least square matching tracker

The basic idea of this algorithm is to track a selected point through a sequence of images using least squares matching (LSM). The process is based on the adaptive least squares method [9] and is similar to [4]. Assume two image regions are given as discrete two-dimensional functions f(x,y) and g(x,y), and that f(x,y) is the template in one image and g(x,y) the patch in the other image; a correspondence is established if

f(x,y) = g(x,y) (4.1)

Because of random effects (noise) in both images, the above equation is not consistent. Therefore, a noise vector e(x,y) is added, resulting in

f(x,y) - e(x,y) = g(x,y) (4.2)

The location of the function values g(x,y) must be determined in order to provide the match point. This is achieved by minimizing a goal function which measures the distances between the grey levels in the template and in the other patch. The goal function to be minimized in this approach is the L2-norm of the residuals of the least squares estimation. Eq.4.2 can be considered as a non-linear observation equation which models the vector of observations f(x,y) with a function g(x,y), whose location in the other image must be estimated. The location is usually described by shift parameters, which are estimated with respect to an initial position of g(x,y). In order to account for a variety of systematic image deformations and to obtain a better match, image shaping parameters (affine image shaping) and radiometric corrections can be introduced besides the shift parameters [9]. An affine transformation is often used and the pixel coordinates of the matched point are computed as

xnew = a0 + a1·x + a2·y    (4.3.1)
ynew = b0 + b1·x + b2·y    (4.3.2)

where the 6 parameters of the affine transformation must be estimated from eq. (4.2) by minimizing the sum of the squares of the differences between the grey values in the image patches. The function g(x,y) in eq. (4.2) is linearized with respect to the unknown parameters and the obtained linear system is iterated using a Gauss-Markov method [9].
The implemented algorithm uses two images, one as template and the other as search image. The patches in the search image are modified by the affine transformation (translations, rotation, shearing and scaling) and the corresponding point is found in the search image after some iterations. Fig.4.1 shows the result of the least squares matching: the red box is the selected patch in the template image and the green box represents the affinely transformed patch in the search image.

Fig.4.1: LSM algorithm: template image (left) and search image (right)
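
As an illustration of eqs. (4.2)-(4.3.2), a single LSM point match can be sketched as follows. This is a minimal numpy sketch under simplifying assumptions (grey value images as 2D arrays, the point far from the image border, no radiometric parameters, a plain Gauss-Newton solution of the normal equations); the function names are illustrative, not those of the implementation described here.

    import numpy as np

    def bilinear(img, x, y):
        """Sample `img` at float (column, row) coordinates with bilinear interpolation."""
        x0 = np.floor(x).astype(int); y0 = np.floor(y).astype(int)
        x1 = np.clip(x0 + 1, 0, img.shape[1] - 1); y1 = np.clip(y0 + 1, 0, img.shape[0] - 1)
        x0 = np.clip(x0, 0, img.shape[1] - 1);     y0 = np.clip(y0, 0, img.shape[0] - 1)
        wx = x - np.floor(x); wy = y - np.floor(y)
        top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
        bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
        return (1 - wy) * top + wy * bot

    def lsm_match(f, g, x0, y0, half=7, iterations=10):
        """Estimate the 6 affine parameters (a0..a2, b0..b2 of eqs. 4.3.1/4.3.2) that map
        the template window of f centred on (x0, y0) onto the search image g, and return
        the matched position in g."""
        ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        template = f[int(y0) - half:int(y0) + half + 1,
                     int(x0) - half:int(x0) + half + 1].astype(float)
        gy_img, gx_img = np.gradient(g.astype(float))       # gradients of the search image
        p = np.array([x0, 1.0, 0.0, y0, 0.0, 1.0])          # a0, a1, a2, b0, b1, b2 (start: pure shift)
        for _ in range(iterations):
            xw = p[0] + p[1] * xs + p[2] * ys                # warped column coordinates
            yw = p[3] + p[4] * xs + p[5] * ys                # warped row coordinates
            gw = bilinear(g.astype(float), xw, yw)           # affinely reshaped patch
            gx = bilinear(gx_img, xw, yw); gy = bilinear(gy_img, xw, yw)
            r = (template - gw).ravel()                      # grey value residuals f - g
            A = np.stack([gx.ravel(), (gx * xs).ravel(), (gx * ys).ravel(),
                          gy.ravel(), (gy * xs).ravel(), (gy * ys).ravel()], axis=1)
            dp, *_ = np.linalg.lstsq(A, r, rcond=None)       # least squares update of the parameters
            p += dp
            if np.hypot(dp[0], dp[3]) < 0.01:                # stop when the shift update is tiny
                break
        return p[0], p[3]                                    # matched (column, row) in the search image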


In [4] three sequences of images from three synchronized cameras are available: spatial correspondences between the three images at the same instant t and also temporal correspondences between subsequent frames of each camera are computed, and the 3D trajectory can be determined.
In our case the algorithm works with monocular sequences of images and only temporal correspondences can be found.
The fundamental operations of the tracking process are three:
1. predict the position in the next frame;
2. search the position with the highest cross-correlation value;
3. establish the point in the next frame using least squares matching.
If the images have been taken at near time instants, they are strongly related to each other and the image positions of two corresponding features are very similar. Therefore, for the frame at time t+1, the predicted position of a point is the same as at time t (fig.4.2). Around this position a search box is defined (blue box) and scanned, searching for the position which has the highest cross-correlation. This position is considered an approximation of the exact position of the point to be tracked. The LSM algorithm is then applied at that position (red cross) and the result of the matching is considered the exact position of the tracked point in the next frame.

Fig.4.2: The cross-correlation process to find the approximation for LSM. Frame at time t: in red the patch for LSM; frame at time t+1: in blue the search area for cross-correlation

For the frame at time t+2 a linear prediction of the position of the point from the two previous frames is computed (fig.4.3). Then a search box is defined around this predicted position and the point with the highest cross-correlation is used for the LSM computation.
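
This predict-and-search step can be sketched as follows (a minimal sketch assuming integer pixel positions away from the image borders; the function and parameter names are illustrative only):

    import numpy as np

    def ncc(a, b):
        """Normalised cross-correlation coefficient between two equally sized patches."""
        a = a.astype(float) - a.mean(); b = b.astype(float) - b.mean()
        d = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / d if d > 0 else 0.0

    def predict_and_search(prev_frame, next_frame, p_prev, p_curr, half=7, search=10):
        """Linear prediction of the point position followed by a cross-correlation scan
        over the search box around the predicted position; the best position is then
        used as the approximation for the LSM refinement."""
        x_pred = 2 * p_curr[0] - p_prev[0]             # linear prediction from the two
        y_pred = 2 * p_curr[1] - p_prev[1]             # previous positions
        tpl = prev_frame[p_curr[1] - half:p_curr[1] + half + 1,
                         p_curr[0] - half:p_curr[0] + half + 1]
        best, best_rho = (x_pred, y_pred), -1.0
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                x, y = x_pred + dx, y_pred + dy
                patch = next_frame[y - half:y + half + 1, x - half:x + half + 1]
                rho = ncc(tpl, patch)
                if rho > best_rho:
                    best, best_rho = (x, y), rho
        return best, best_rho                          # approximate position and its correlation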

For the next frames a linear prediction (based on the previous positions) is always computed, even if a more complicated interpolation could be implemented (splines or Kalman filter, especially after occlusions).
As the algorithm works with monocular sequences, few automatic controls on the corresponding matched points can be performed. In order to verify the reliability of the tracked points, two post-processing verifications have been implemented:
1. cross-correlation computation: it checks whether the matched point is reliable between two frames. If the cross-correlation coefficient of a point in two consecutive images is smaller than a predefined threshold value, the point is rejected;
2. distance between two matched joints: this test can be performed if the camera does not zoom and is stationary, or if its movements are slower than the moving objects; in these cases a distance can be computed, in each frame, between two points on the body that must remain at the same distance (e.g. foot-knee, wrist-shoulder). The difference of this distance in two consecutive frames is then calculated and, if the difference does not belong to a predefined domain, the tracked point is rejected.

Fig.4.3: Linear prediction to find the approximate position of the point

A cross-correlation computation has also been implemented to recover lost points after occlusions. The user must manually select the last image where the point is visible and the image where the point reappears. The process finds the new position after the occlusion using a suitable window; these coordinates are considered an approximation of the point and the LSM is applied to compute the correct position.
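
A hedged sketch of the two verification checks described above (function names and thresholds are only examples):

    import numpy as np

    def joint_distance(p, q):
        """Euclidean image distance (in pixels) between two tracked joints."""
        return float(np.hypot(p[0] - q[0], p[1] - q[1]))

    def accept_point(rho, rho_min, dist_prev, dist_curr, max_diff):
        """Post-processing verification: reject a tracked point if its cross-correlation
        coefficient between consecutive frames drops below the threshold, or if a body
        distance that should stay constant (e.g. knee-ankle) changes too much."""
        return rho >= rho_min and abs(dist_curr - dist_prev) <= max_diff

    # Example values: rho_min = 0.75 (the default quoted in Section 6.1),
    # max_diff of a few pixels.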

If the tracked points have been selected in correspondence of the human joints, a final animation of the tracked points can be done and the 2D trajectories can be drawn.

4.2 Object tracking

A tracking process can also involve the extraction of parts of objects using a few tracked points. Using an image matching process [4] which establishes many correspondences in three consecutive images, it is possible to extract the full body (or part of it) through the sequence. The process is based on the adaptive least squares method [9] and automatically determines a dense set of corresponding points between the images, starting from a few seed points sparsely distributed on the surface to extract. The template image is divided into polygonal regions according to which of the seed points is closest (Voronoi tessellation) (fig.4.4).

Fig.4.4: Search strategy for the establishment of correspondences between images (o: seed points, ·: matched points)

Starting from the seed points and using a user-defined border of the object of interest, the algorithm tries to match corresponding points in three consecutive images. The central image is used as template and the other two as search images. The matcher searches the corresponding points in the two images independently. The process starts from a selected point, shifts horizontally in the template and in the search images and applies the LSM algorithm at the shifted location. If the quality of the matching is good, the matched point is stored and the process continues horizontally until it reaches the region boundaries. The covering of the entire polygonal region of a seed point is achieved by sequential horizontal and vertical shifts, as sketched below.
In monocular sequences the reliability of the matched surfaces depends only on the matching parameters; in multi-view sequences a control can be done using the computed 3D coordinates to check for wrong correspondences [5].
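
A minimal sketch of this growing strategy (region_of, in_object and match_fn are hypothetical callables standing for the Voronoi test, the object border test and the LSM matcher; the parameter re-tuning of the real implementation is omitted):

    from collections import deque

    def grow_from_seeds(seeds, region_of, in_object, match_fn, step=3):
        """Starting from each seed point, visit the template image by horizontal/vertical
        shifts of `step` pixels, stay inside the Voronoi region of the seed and inside the
        user-defined object border, and call the LSM matcher at every visited location,
        using the result of the neighbour as approximation."""
        matched = {}
        for idx, (sx, sy) in enumerate(seeds):
            queue = deque([(sx, sy, (sx, sy))])          # (template x, template y, approximation)
            visited = {(sx, sy)}
            while queue:
                x, y, approx = queue.popleft()
                result = match_fn(x, y, approx)          # LSM with the neighbour as approximation
                if result is None:
                    continue                             # poor match: do not grow from here
                matched[(x, y)] = result
                for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
                    nx, ny = x + dx, y + dy
                    if (nx, ny) in visited:
                        continue
                    if region_of(nx, ny) != idx or not in_object(nx, ny):
                        continue                         # stay inside this seed's Voronoi cell and the object
                    visited.add((nx, ny))
                    queue.append((nx, ny, result))
        return matched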


To evaluate the quality of the matched points the following indicators are used:
- the a posteriori standard deviation of the least squares adjustment;
- the standard deviation of the shifts in the x and y directions.
If the quality of the matching is not satisfactory, the algorithm computes the process again, changing some parameters, like a smaller shift from the neighbour or a bigger patch size. At the end of the process a cloud of 2D points is obtained (fig.4.5, second row), even if some holes due to non-analyzed areas can appear in the results: the algorithm tries to close these gaps by searching from all directions around them. If the holes are in areas with low texture, the matching does not find many correspondences; therefore the results can be improved by increasing the number of seed points in these areas or by using neighborhood information.

Fig.4.5: Triplet of successive frames and found 2D correspondences

4.3 The Shi-Tomasi-Kanade tracker

In this section the Shi-Tomasi-Kanade tracker [14, 19, 23] is briefly described.
In general, any function of three variables I(x,y,t), where the space variables x and y as well as the time variable t are discrete and suitably bounded, can represent the intensity of an image sequence. If the camera moves, the patterns of image intensities change in a complex way; but images taken at near time instants are usually strongly related to each other, because in general they refer to the same scene taken from only slightly different viewpoints.
Consider an image sequence I(x,t), with x = [u, v]^T the coordinates of an image point.
If the time sampling frequency is sufficiently high, we can assume that small image regions are displaced but that their intensities remain unchanged. Therefore I(x,t) is not arbitrary but satisfies:

$I(\mathbf{x}, t) = I(\delta(\mathbf{x}), t + \Delta t)$    (4.4)

where δ(x) is the motion field, specifying the warping that is applied to image points between the time instants t and t+∆t.
The fast-sampling hypothesis allows us to approximate the motion with a translation, that is, δ(x) = x + d, where d is a displacement vector. So, a later image taken at time t+∆t can be obtained by moving every point in the current image, taken at time t, by a suitable amount d.
As the image motion model is not perfect, and because of image noise, equation (4.4) is not exactly satisfied and can be written as:

$I(\mathbf{x}, t) = I(\delta(\mathbf{x}), t + \Delta t) + n(\mathbf{x})$    (4.5)

where n is a noise function.


The tracker's task is to compute the displacement d, for a number of selected points, for each pair of successive frames in the sequence. The displacement is computed by minimizing the SSD (Sum of Squared Differences) residual:

$\varepsilon = \sum_{W} \left[ I(\mathbf{x} + \mathbf{d}, t + \Delta t) - I(\mathbf{x}, t) \right]^2$    (4.6)

where W is a small image window centered on the point for which d is computed.
By plugging the first-order Taylor expansion of I(x+d, t+∆t) into eq. (4.6) and imposing that the derivatives with respect to d are zero, we obtain the linear system

Gd=e (4.7)

where

$G = \sum_{W} \begin{bmatrix} I_u^2 & I_u I_v \\ I_u I_v & I_v^2 \end{bmatrix}$    (4.8.1)

with

$[I_u \; I_v]^T = \nabla I, \qquad I_u = \frac{\partial I}{\partial u}, \quad I_v = \frac{\partial I}{\partial v}$    (4.8.2)

and e, the error vector, is:

$e = \sum_{W} I_t \, [I_u \; I_v]^T$    (4.8.3)

with $I_t = \frac{\partial I}{\partial t}$.

The derivatives of the function I can be computed with finite pixel differences, but there are always problems with image noise and local minima. A better solution can be achieved with a convolution of the function with a special filter (Gaussian kernel).
The tracker is based on eq.(4.7): given a pair of successive frames, d is the solution of (4.7), that is d = G⁻¹e, and it is used to compute the position in the new frame. The procedure is iterated according to a Newton-Raphson scheme, until the displacement estimate converges.
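
A minimal sketch of this translational STK step (whole-pixel resampling only and the point assumed to be away from the image borders; the actual tracker uses sub-pixel bilinear interpolation, a Gaussian kernel for the derivatives and the affine extension of eq. 4.12):

    import numpy as np

    def klt_translation(I, J, x, y, half=7, iterations=10):
        """Estimate the displacement d of the window centred on (x, y) between frame I
        and frame J by iterating d = G^-1 e (eqs. 4.7-4.8.3)."""
        Iv_img, Iu_img = np.gradient(I.astype(float))          # derivatives along rows (v) and columns (u)
        win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
        Iu = Iu_img[win].ravel(); Iv = Iv_img[win].ravel()
        G = np.array([[np.sum(Iu * Iu), np.sum(Iu * Iv)],
                      [np.sum(Iu * Iv), np.sum(Iv * Iv)]])      # eq. (4.8.1)
        d = np.zeros(2)
        for _ in range(iterations):
            xs = int(round(x + d[0])); ys = int(round(y + d[1]))
            It = (J[ys - half:ys + half + 1, xs - half:xs + half + 1].astype(float)
                  - I[win].astype(float)).ravel()               # temporal difference I_t
            e = -np.array([np.sum(It * Iu), np.sum(It * Iv)])   # eq. (4.8.3), sign chosen so that d moves towards J
            dd = np.linalg.solve(G, e)                          # d = G^-1 e, eq. (4.7)
            d += dd
            if np.hypot(dd[0], dd[1]) < 0.01:
                break
        return x + d[0], y + d[1]                               # new position of the tracked point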

The translation model δ(x) = x + d cannot account for certain transformations of the feature window we are tracking, for instance rotation, scaling and shear. An affine motion model is more accurate [19]:

$\begin{bmatrix} x + u \\ y + v \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} a_5 \\ a_6 \end{bmatrix}$    (4.9)

because two rotations, two translations, a scale in x/y and a shear are considered.


It computes δ(x) of eq.(4.4) as

$\delta(\mathbf{x}) = A\mathbf{x} + \mathbf{d}$    (4.10)

where d is a displacement and A is a 2x2 matrix accounting for the affine warping; it can be written as A = I + D, with D = [d_ij] a deformation matrix and I the identity matrix.
As in the translational case, the motion parameters D and d are estimated by minimizing the SSD residual:

$\varepsilon = \sum_{W} \left[ I(A\mathbf{x} + \mathbf{d}, t + \Delta t) - I(\mathbf{x}, t) \right]^2$    (4.11)

Equation (4.11) is differentiated with respect to the unknown entries of the matrix D and the vector d and the results are set to zero. Linearizing the resulting system by Taylor expansion, we obtain the linear system:

$T\mathbf{z} = \mathbf{a}$    (4.12)

where

$\mathbf{z} = [d_{11} \; d_{12} \; d_{21} \; d_{22} \; d_1 \; d_2]^T$    (4.13.1)

contains the unknown entries of the deformation matrix D and the displacement vector d;

$\mathbf{a} = \sum_{W} I_t \, [u I_u \; u I_v \; v I_u \; v I_v \; I_u \; I_v]^T$    (4.13.2)

is the error vector that depends on the differences between the two images;

$T = \sum_{W} \begin{bmatrix} U & V \\ V^T & G \end{bmatrix}$    (4.13.3)

and U is a 4x4 matrix containing the products of the first 4 elements of the vector a for each of these elements; V is a 2x4 matrix containing the products of the elements Iu and Iv with the first 4 elements of a; G is as in equation (4.8.1).
Finally, equation (4.12) can be solved iteratively for the entries of z.

In both cases (translational and affine model) feature selection is very important. In [19] it is recommended that T (or G) be well conditioned, i.e. the ratio between the largest and the smallest eigenvalue of T (or G) should not be too big (corner selection).

Once the displacement has been found and the new position of the point has been determined, a check on the new position must be done.
The check is performed with a cross-correlation process: given a template window around the point in frame n and a slave window around the matched point in frame n+1, a cross-correlation coefficient ρ is computed. The corresponding feature in frame n+1 is accepted if the computed ρ is bigger than a user-defined threshold value ρ0.


Usually the STK tracker is not used for tracking human movements in image sequences; but if the images have been taken at near time instants, they are usually strongly related to each other and this (extended) tracker can give quite good results for not very long sequences of highly textured images.

4.4 Detection and tracking of moving objects

In applications like video-surveillance and the monitoring of human activities, the main idea is to detect and track moving objects (people, vehicles, etc.) as they move through the scene. Considering one image, regions of moving objects should be separated from the static environment. To identify and separate the moving objects, different approaches have been proposed: background subtraction [17], 2D active shape models [18], a combination of motion, skin color and face detection [8]. If the camera is stationary or its movements are very small compared to the objects, a simple subtraction of two consecutive frames can be used (fig.4.6-c). The resulting image has much larger values for the moving components of the frame than for the stationary components.
A moving object produces two regions having large values:
1. a front region of the object, caused by the covering of the background by the object;
2. a rear region of the object, caused by the uncovering of the object from the background.
Therefore, using a threshold on the image it is possible to detect the rear region of the moving object. The threshold value is determined by experiments.
The binary thresholded image can contain some noise, which can easily be removed with an erosion process or with a median filter (fig.4.6-d); a sketch of this step is given after the figure caption below.

Fig.4.6: Example of image subtraction. Two frames of a sequence (a, b). Binary image after absolute image difference, with noise (c): black pixels represent movements. Result after median filter (d)
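
A minimal sketch of the frame-differencing step (grey value frames as numpy arrays; the threshold value here is only an example, since in the report it is determined by experiments):

    import numpy as np
    from scipy.ndimage import median_filter

    def motion_mask(frame_a, frame_b, threshold=25):
        """Simple change detection between two consecutive frames: absolute grey value
        difference, binary threshold and a median filter to remove isolated noise
        pixels, as in fig.4.6."""
        diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
        mask = (diff > threshold).astype(np.uint8)       # 1 = moving pixel
        return median_filter(mask, size=3)               # remove salt-and-pepper noise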

Once the moving objects have been localized, their bounding boxes can be computed.
For this purpose a vertical projection of the binary image is first performed (fig. 4.7). The different objects in the image are often already visible from this projection. The positions of the objects along the horizontal axis are determined by slicing the vertical projection. If the counted number of pixels in a slice is higher than a threshold, the slice is identified as an area of moving activity. This is done for all the slices along the horizontal axis and finally the adjacent slices with moving activity are joined together, obtaining a set of areas where moving activities have been detected (fig. 4.7). The size of the slices can be adapted to the specific conditions of the acquired images. The smaller the slices are, the better the precision of the detected areas will be; but if the slices are too small, then different moving objects could be detected as a single moving object. The threshold for the identification of a slice as a moving area depends on the size of the slices and has to be determined by experiments.

Fig.4.7: Vertical projection (left) with 2 peaks representing the two men. Vertical lines (right) delimiting the moving objects

Then the same process is performed with the horizontal projections of the different areas determined along the horizontal axis. The horizontal projection of a person is sometimes divided into 2 different moving areas: indeed the middle of the body is usually not moving during the walk, therefore it is not detected. Once the moving areas are detected, the square bounding boxes can be obtained, as sketched below.
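
The projection slicing can be sketched as follows (a simplified sketch that takes only the outer extent of the horizontal projection instead of re-slicing it; slice_width and min_count are illustrative values, not those of the implementation):

    import numpy as np

    def boxes_from_projections(mask, slice_width=8, min_count=20):
        """Slice the vertical projection of the binary motion mask along the horizontal
        axis, join adjacent active slices into x-ranges, then use the horizontal
        projection inside each x-range to get the y-range of the bounding box."""
        boxes = []
        n_cols = mask.shape[1]
        col_counts = mask.sum(axis=0)                      # vertical projection
        active = [col_counts[s:s + slice_width].sum() > min_count
                  for s in range(0, n_cols, slice_width)]
        s = 0
        while s < len(active):
            if not active[s]:
                s += 1
                continue
            e = s
            while e + 1 < len(active) and active[e + 1]:   # join adjacent moving slices
                e += 1
            x0, x1 = s * slice_width, min((e + 1) * slice_width, n_cols)
            rows = np.where(mask[:, x0:x1].sum(axis=1) > 0)[0]   # horizontal projection
            if rows.size:
                boxes.append((x0, int(rows.min()), x1, int(rows.max())))
            s = e + 1
        return boxes                                       # (x_min, y_min, x_max, y_max) per moving area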

Fig.4.8: Horizontal projections of the x-axis areas (left) and computed bounding boxes (right)

In case of occlusions (two people walking one towards the other), it can be difficult to divide the vertical projection into its components. To avoid this problem, the center of gravity is computed and the boxes are calculated with respect to this center. Occlusions can also be predicted, detected and handled properly by estimating the positions and velocities of the objects and projecting these estimates onto the image plane [14].
Once the boxes have been computed, it is possible to visualize the moving foreground regions using background subtraction.


5. Feature selection for tracking human body parts

Regardless of the method used for tracking, not all parts of an image contain motion information. Moreover, along an edge we can only determine the motion component orthogonal to the edge, so we must take care in selecting the features to follow through the sequence.
In general, to avoid these difficulties, only regions with enough texture are used. In fact a single pixel cannot be tracked unless it has a very distinctive brightness with respect to all its neighbors. As a consequence, it is often hard or impossible to determine where a single pixel has moved in the subsequent frame based only on local information. Because of these problems, we do not track single points but windows containing good features and sufficient texture.
The point features are usually extracted by local operators, often called ‘interest operators’. The attributes are computed within a rectangular or circular window, in selected or in all directions, and are usually compared to a threshold to decide whether a feature is good or not.
Many feature point extractors have been proposed in recent years [6, 10, 20].
Concerning all these ‘interest operators’, some common characteristics can be found:
1. they work with a predefined or arbitrary idea of what a good window looks like;
2. they assume that a good feature is defined independently of the tracking algorithm;
3. they often find features that are well trackable only in pure translation;
4. they often find features that are good only in the first frames.
So the resulting features are not guaranteed to be the best for the tracking algorithm over the whole sequence. Therefore a feature point must be consistently detectable and should have enough information in its neighborhood over the different frames.
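
For reference, the kind of texture test applied by interest operators of this family can be sketched as a min-eigenvalue test on the gradient matrix of eq. (4.8.1) (a hedged sketch in the spirit of [19], not the operator actually used in this work):

    import numpy as np

    def window_quality(I, x, y, half=7):
        """Texture measure of a candidate window: the eigenvalues of the gradient
        matrix G.  A window is a usable feature when the smaller eigenvalue is large
        and the ratio between the eigenvalues is not too big (corner-like texture)."""
        Iv_img, Iu_img = np.gradient(I.astype(float))
        w = np.s_[y - half:y + half + 1, x - half:x + half + 1]
        gu, gv = Iu_img[w].ravel(), Iv_img[w].ravel()
        G = np.array([[np.sum(gu * gu), np.sum(gu * gv)],
                      [np.sum(gu * gv), np.sum(gv * gv)]])
        lam = np.linalg.eigvalsh(G)                   # ascending: lam[0] <= lam[1]
        return lam[0], lam[1] / max(lam[0], 1e-12)    # min eigenvalue and conditioning ratio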

Concerning tracking operations, researchers have proposed to track features such as corners, windows with high spatial frequency content or regions where some mix of second-order derivatives is sufficiently high [19]. But for human body movement tracking, as we want to extract 2D or 3D information from the tracked points, we cannot take the features ‘randomly’ all over the body, as an interest operator would do, or just in correspondence of edges; we must select precise points (joints).
We are interested in capturing the movement of the human body, therefore we should select points which can define the motion. Usually points in correspondence of the head, shoulders, elbows, wrists, hips, knees and ankles are selected. Once this set of points has been extracted from the image, a human skeleton can be drawn (fig.5.1).

Fig.5.1: Skeleton of the human body (EPFL)


6. Results

After selecting some points of interest, we can apply the different algorithms to track the points.
The first two parts of this chapter present the results obtained with the least square matching tracker and the Shi-Tomasi-Kanade tracker.
The results of the detection of moving objects, tracking and computation of bounding boxes are presented in the third part, while the tracking of a whole object and its visualization is shown in the last part.
All the results are in image space: 3D coordinates will be recovered in future work.

6.1 Least square matching tracking

The least square matching tracking process starts from some points selected on the image. These results consider points selected manually and in particular positions (fig.6.1.1), as we want to extract a skeleton of the human body.
Using this set of coordinates, the algorithm computes the corresponding points in the other frames. The parameter file used in the computation contains:
- used/not used flags for the parameters of the affine transformation;
- max sigma0 of the matching;
- max sigma-x and sigma-y in the computation of the affine parameters a0 and b0;
- max value for the affine parameters a0 and b0 (translation parameters);
- size of the window in the template and search image for LSM;
- size of the window in the search image for cross-correlation between the first and second frame;
- size of the window in the search image for cross-correlation in the next frames;
- step of the cross-correlation computation in the search image;
- size of a bigger window in the search image for cross-correlation when the result of the LSM is not satisfactory.
A result is stored when the computed values of the three sigmas and of the two translation parameters are smaller than the default ones in the parameter file. The default value for sigma0 is 25.0 and for sigma-x and sigma-y it is 0.20; usually all 6 parameters of the affine transformation are used and the max value for a0 and b0 is set to 4.0.
A post-processing computation checks the reliability of the matched points by computing the cross-correlation coefficient between consecutive frames. The default threshold is 0.75, but it can be decreased for low resolution images.
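
Such a parameter set could be expressed, for instance, as the following dictionary (a hypothetical rendering of the list above; the key names and the window/step sizes are illustrative, only the thresholds quoted in the text are taken from the report):

    LSM_PARAMETERS = {
        "use_affine_parameters": [True] * 6,     # which of a0..a2, b0..b2 are estimated
        "max_sigma0": 25.0,                      # a posteriori sigma0 threshold
        "max_sigma_x": 0.20, "max_sigma_y": 0.20,
        "max_shift_a0_b0": 4.0,                  # max allowed translation parameters (pixels)
        "lsm_window": 15,                        # template/search patch size for LSM (illustrative)
        "cc_window_first_pair": 21,              # cross-correlation window, frame 1 -> 2 (illustrative)
        "cc_window_next_frames": 15,             # cross-correlation window, later frames (illustrative)
        "cc_step": 1,                            # step of the cross-correlation scan (illustrative)
        "cc_window_fallback": 31,                # bigger window when the LSM result is poor (illustrative)
        "cc_threshold": 0.75,                    # post-processing acceptance threshold
    }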

In the following, some results of the LSM tracking process are shown.

Fig.6.1.1: Points selected on the image


The first sequence has been acquired from a VHS tape and has very low resolution; the camera is panning, following the walking man, and 10 frames were available. 14 points have been selected in the first frame; at the end of the process, 10 points have been tracked over the whole sequence (fig.6.1.2). The average cross-correlation coefficient of all the points is 0.62. The LSM algorithm worked with a sigma0 of 30, while sigma-x and sigma-y were fixed to 0.30.

Fig.6.1.2: Some frames of the sequence (nr.1, 5 and 9) with the tracked points

In the next sequence, consisting of 60 frames, 14 points have been selected in correspondence of body joints: head, neck, shoulders, elbows, wrists, hips, knees and ankles. After 10 frames the points in correspondence of the elbows were lost, while all the other joints have been tracked over the whole sequence (fig.6.1.3). The sigma-y was fixed to 0.30 because the guy was moving his arms in the vertical direction and the images have half resolution in the vertical direction, as only the odd lines are used.

Fig.6.1.3: Points tracked in a sequence of 60 frames: (a) frame nr.1, (b) frame nr.11, (c) frame nr.30, (d) frame nr.50, (e) frame nr.60


Because of the presence of clothes, when the guy was moving his arms the folds of the sweater changed, so points selected in correspondence of big movements of the folds were not matched (or not well matched).
The cross-correlation coefficient between tracked points in two consecutive frames was calculated and the results are summarized in Table 1. All the 12 points tracked over the sequence had a cross-correlation coefficient bigger than 0.9.
If the camera is still and stays approximately at the same distance from the subject, another control on the tracked points can be done by computing the differences of the distances between two points with fixed distance, namely foot-knee or neck-shoulder or neck-head. Figure 6.1.4 shows the computed differences of the distances in all the frames. There is just one big outlier (with a difference of 4 pixels), while all the other differences are in the interval [-2.4, +2.2] pixels, that is an average error of one pixel for every matched point. The big outlier can be due to the folds of the sweater on the wrist, as said before.

Fig.6.1.4: Differences of the distances between some joints (head-neck, foot-knee left/right, wrist-shoulder left/right) over the whole sequence of 60 frames

Once the 2D coordinates of the joints are computed, it is possible to build (for now only in 2D) a skeleton of the human body and represent the stylized person in the whole sequence. An animation has been created and a visualization is shown in fig.6.1.5 and 6.1.6, with a cylindrical reconstruction of the human body parts.

Table 1: Average cross-correlation coefficient of each tracked point

Pt1   wrist left       0.91
Pt2   shoulder left    0.97
Pt3   neck             0.98
Pt4   head             0.98
Pt5   shoulder right   0.97
Pt6   wrist right      0.92
Pt7   hip left         0.93
Pt8   hip right        0.94
Pt9   knee left        0.94
Pt10  knee right       0.95
Pt11  ankle left       0.92
Pt12  ankle right      0.93


Fig.6.1.5: Visualization of the computed 2D skeleton of the human body

Fig.6.1.6: Cylindrical reconstruction of the human skeleton from the 2D points computed with the LSM tracking


Another sequence is presented in fig.6.1.7.

This sequence is composed of 24 frames. 13 points have been selected in correspondence of joints. A point on the left wrist was lost almost immediately, because of occlusion, after 9 frames; the points on the left leg were also lost due to occlusion.
When occlusions occur, a point can be wrongly matched; from the analysis of the cross-correlation results it is possible to remove the outlier and to track the point again after the occlusion.
The points on the leg have been recovered after the occlusion using a cross-correlation process (fig.6.1.9). A template around the point in the last image where it is visible is used; the search area is taken from the image where the point reappears (the user must select both images). The point is found in correspondence of the center of the window with the highest cross-correlation coefficient. Then the LSM algorithm can track the recovered points in the other frames (fig.6.1.8).

Fig.6.1.7: Tracked frames with occlusion of some points: (a) frame nr.2, (b) frame nr.6, (c) frame nr.12, (d) frame nr.15, (e) frame nr.22

Fig.6.1.8: Some frames of the sequence with recovered points after occlusions


Fig.6.1.9: Cross-correlation procedure to recover a point lost because of occlusion

The mean cross-correlation coefficient of the points over the whole sequence is 0.88 and the differences of the distances between joints are in the interval [-2.5, +2.5] pixels. The graph of the differences of the distances is shown in fig.6.1.10.

Fig.6.1.10: Differences of the distances between some joints


A final visualization of the sequence with the reconstructed human skeleton is shown in fig.6.1.11.

Fig.6.1.11: Visualization (every 3 frames) of the skeleton built with the tracked points

The last sequence has been acquired from a VHS tape; the camera was moving, following the running man, and 9 frames were used. 12 points have been selected in the first frame and 7 points have been tracked over the whole sequence.
The LSM sigma0 was equal to 30, while the cross-correlation coefficient had an average of 0.63.
In fig.6.1.12 some frames of the sequence are presented with the stylized skeleton overlaid.

Fig.6.1.12: A low resolution sequence of 9 frames (frames nr.1, 5 and 9 shown): the camera is moving following the running man. 7 points have been tracked in all the frames.


6.2 Shi-Tomasi-Kanade tracker

The core of the STK algorithm was already available on the web; a GUI to select and visualize the tracked points and a routine to run the process on a whole sequence have been added.
Given two consecutive frames I(x,t) and J(x,t+1), the principal steps of the program are:
- compute the matrix T (or G) and a (or e) of eq. (4.12): the image gradients in both windows are computed with a Gaussian kernel (for fast convergence);
- compute the translation d (in the first few iterations) and the affine parameters (in the last iterations) such that the SSD difference I(Ax+d) - J(x) is minimized (equation 4.11);
- re-warp J with sub-pixel 2D bilinear interpolation using the computed affine motion;
- check the SSD error.
For every point, the algorithm computes n iterations and selects the affine motion parameters with the smallest SSD error. The algorithm is very time consuming.

In the first sequence of 24 frames, all the points were tracked (recovering those lost because of occlusions with the cross-correlation process previously described). The results are shown in fig.6.2.1.

Fig.6.2.1: Four frames of the sequence (nr.1, 9, 15 and 21). In red the points tracked between consecutive frames, in yellow the reconstructed human skeleton.


The cross-correlation coefficient ρ between consecutive frames has an average of 0.81.
The mean SSD (Sum of Squared Differences) error over the whole sequence was equal to 0.0075, while the differences of the distances between selected joints are in the interval [-3, +2.3] pixels (except for one big outlier of 4 pixels) (see fig.6.2.2).

Fig.6.2.2: Graph with the computed differences of the distances between selected joints. Only one big outlier is present, while the other values belong to the interval [-3, +2.3] pixels

In the next sequence, 30 frames have been used to validate the algorithm.
In the first frame (fig.6.2.3-a) 14 points have been selected in correspondence of human joints; in the last frames (fig.6.2.3-d,e) 10 points were still tracked, while the others were lost due to a small cross-correlation coefficient and a big SSD.

Fig.6.2.3: Some frames of the sequence with the points tracked with the STK algorithm ((a) frame nr.1, (b) frame nr.5, (c) frame nr.9, (d) frame nr.20, (e) frame nr.30). In yellow the reconstructed human skeleton


From the results shown in fig.6.2.3 we can see that the point on the lower left border of the sweater seems not to be correct in the last frames; however, a movement of the sweater following the lifting of the arms is clearly visible. Nevertheless the cross-correlation coefficient of that point through the sequence is 0.82.

With the sequences acquired from the VHS tape, the STK algorithm did not give very good results; the selected points were tracked just for 2-3 frames with reliable precision and then were lost or mismatched.
The STK algorithm needs very good features (in particular in case of movements) and very good texture around the point that must be tracked.


6.3 Detection and tracking of moving objects

The detection and tracking of moving objects has been tested on two sequences where two people were walking. The program can work with a sequence of n frames and gives as output the images with the different moving objects in colored boxes.
The first sequence (100 frames) shows motions that are roughly on a linear path. The trajectories are linear and parallel to the camera plane and there are occlusions as the two men pass directly one in front of the other. In the results (fig.6.3.1) there are two color-coded boxes, one for each tracked object.

Fig.6.3.1: Results of moving people detection: tracking before, during and after occlusions (frames nr.9, 45, 47 and 71)


In fig.6.3.1 the first column shows the projections of the pixels along the vertical and horizontal axes; it is easy to divide the vertical projection into its components when there are no occlusions, but when they occur it can be difficult to distinguish the two parts (peaks) of the projection. To avoid this problem, the center of gravity of the projections is computed and is used to assign the bounding box to the correct object. The middle column of fig.6.3.1 shows the computed bounding boxes projected on the image differences, while the last column presents the projections of the boxes on the original images.
Occlusions are visible in the second and third rows: the bounding boxes are not very precise because there is overlap between the vertical projections and the limits of the boxes are based only on these projections. More sophisticated computations, such as temporal analysis or trajectory prediction, could be implemented.
The moving foreground regions can be visualized with background subtraction. This part of the process is not automatic, but it could be if a model of the empty scene were available [17]: once the bounding boxes have been computed, an image where the area inside the boxes contains just background is selected. Then a subtraction between the two windows is performed and the moving foreground can be reconstructed with a few processes of erosion and dilation (fig.6.3.2).

Fig.6.3.2: Foreground moving regions detected by background subtraction (frames nr.9, 45 and 71)

In the second sequence (fig.6.3.3), 50 frames were available; two people were walking towards the stationary camera and their trajectories were not perpendicular to the camera.


Fig.6.3.3: Bounding boxes of two moving people walking towards the camera (frames nr.5, 25 and 45)

The computed bounding boxes depend on the sliced projections, and the size of the slices can be adapted to the specific conditions of the frames. The projections depend on the image differences; therefore in some frames small movements of the humans (e.g. the feet) are not included in the boxes.

Fig.6.3.4: Foreground regions detected by background subtraction (frame nr.45)


6.4 Object tracking

To complete the tracking procedure, once a few points of an object have been tracked over a sequence, it is possible to extract and visualize the whole moving object by establishing many correspondences in some images, starting from the few tracked points. A cloud of 2D points is obtained and visualized by displaying the matched grey values of the image.
Using the sequence of fig.3.1, groups of three frames have been created and the middle frame has been used as template image. The seed points have been tracked with the LSM tracker (fig.6.4.1, upper row) over the whole sequence and then used to establish the correspondences. A cloud of points was computed in every frame and then projected onto the image (fig.6.4.1, second row).
In fig.6.4.2 all the matched grey values of the triplet are displayed.

Fig.6.4.1: Computed correspondences in a triplet of images

Fig.6.4.2: Object extraction from the computed 2D correspondences in a triplet of images.


The central image of fig.6.4.2 is the template image: the number of matched correspondences is bigger than in the search images, where many more holes are present due to not analyzed areas.
The gaps can be due to poor texture, low contrast of the area or wrong matching.
In figure 6.4.3 some b/w and color results of the sequence are shown.

Fig.6.4.3: Central frames of the triplets (frames nr.2, 14 and 23): visualized matched points representing the computed 2D correspondences extracted from the b/w images (upper rows). Tracked object in the color images (bottom row).


A problem encountered in object tracking is the texture: even if the image has high resolution, the matching process does not work well with low texture, leaving big holes in those regions where the texture of the subject is uniform (central part of the trousers or of the sweater). Some indicators that evaluate the quality of the results are shown in Table 2 as an average over the whole sequence:

Table 2: Some indicators to evaluate the quality of the process

mean sigma    5.804    std. dev.  2.028
mean sigma x  0.111    std. dev.  0.039
mean sigma y  0.127    std. dev.  0.034

In the second sequence, consisting of 10 frames, the process worked quite well but many gaps occurred in the results (fig.6.4.4). The seed points used for the measurement have been computed using the LSM tracker; there were 18 points in the first triplet, but in the successive frames the number decreased because of incorrect matching or occlusions; therefore in the next triplets some seed points have been added. The holes in the tracked object are bigger than in the other sequences because of the low resolution of the images (fig.6.4.5).

Fig.6.4.4: A triplet of the sequence: in the middle column the template image with the matched grey values, at the borders the correspondences found in the search images.

In Table 3 the indicators of the process are presented.

Table 3: Some indicators to evaluate the quality of the matching: an average over all the 9 frames

mean sigma    5.118    std. dev.  2.829
mean sigma x  0.128    std. dev.  0.056
mean sigma y  0.140    std. dev.  0.048


Fig.6.4.5: Central templates of the next three triplets of the sequence: big lacks of texture on the tracked object are visible, because of the few seed points and of correspondences that were not found

In fig.6.4.6 and 6.4.7 other triplets are shown. In this sequence the tracked model was moving only his arms: in the first experiments, only 14 points were selected as seed points.

Fig.6.4.6: Triplets of a sequence: template image with 16 seed points (upper row). Central template image and search images at the borders with the computed 2D clouds of correspondences (lower row).

But big holes occurred because of unmatched points (high sigma0) in regions of uniform texture. It was necessary to add two more points on the torso of the man in order to extract the whole body (fig.6.4.6, 6.4.7). Also here the matching algorithm failed in regions with low contrast or homogeneous texture (fig.6.4.7, 6.4.8), as homologous points cannot be assigned reliably or corresponding points cannot be found at all in the images.

Fig.6.4.7: Object extraction from the triplet of images. In order: first search image (frame t-1), template image (frame t), second search image (frame t+1).

Fig.6.4.8: Central template images of the next three triplets with the found correspondences.


7. Conclusions

An overview of some methods for human movement detection and tracking in image space has been presented.

Two algorithms that track points in image sequences have been used: the first is based on the classic photogrammetric 'least square matching', the other on a model of 'affine image changes' proposed by Shi, Tomasi and Kanade and available on the net. Both algorithms have been tested on different sequences and the best results came from the LSM tracking. This algorithm can work with longer sequences, with higher precision, and is more reliable than the other one; moreover, the LSM tracker can also work with low-texture images and, if no occlusions occur, no big outliers are present. On the other hand, the STK algorithm needs very good texture around the points to track, as well as an efficient outlier rejection scheme; indeed this algorithm is a very good tracker for indoor sequences full of features (corners) with high texture, but it is very time consuming.

The 'object tracking' algorithm produced nice results when the images had good, non-uniform texture and the seed points were well spread over the object to measure; in low-resolution images many holes occurred in the results. It can be considered as a process for object extraction based on tracked points and image matching.

The 'detection' algorithm is an automatic process to determine the bounding boxes of moving people in a sequence of frames; it is a very simple implementation, but it can work with long sequences and avoid problems of occlusions. The precision of the boxes depends on the projections of the pixels and their slices; therefore the choice of the threshold value used to compute the image difference was very important.
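For reference, a minimal sketch of the detection idea summarized above is given below: the frame is differenced against a background image, the difference is thresholded, and the first and last rows and columns whose projection exceeds a minimum count give the bounding box. The threshold and minimum-count values (DIFF_THRESHOLD, MIN_PROJECTION) are assumed, and the sketch omits the handling of the projection slices used in the actual implementation.

    import cv2
    import numpy as np

    DIFF_THRESHOLD = 25   # grey-value threshold on the image difference (assumed)
    MIN_PROJECTION = 5    # minimum count of changed pixels per row/column (assumed)

    def bounding_box(background, frame):
        # Rough bounding box of the moving person in one grayscale frame:
        # threshold the difference against the background, then keep the
        # first/last rows and columns whose projection exceeds a minimum count.
        diff = cv2.absdiff(frame, background)
        mask = (diff > DIFF_THRESHOLD).astype(np.uint8)
        cols = mask.sum(axis=0)      # vertical projection (per column)
        rows = mask.sum(axis=1)      # horizontal projection (per row)
        xs = np.where(cols > MIN_PROJECTION)[0]
        ys = np.where(rows > MIN_PROJECTION)[0]
        if xs.size == 0 or ys.size == 0:
            return None              # nothing moved, or the threshold is too high
        return xs[0], ys[0], xs[-1], ys[-1]   # x_min, y_min, x_max, y_max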

8. Future works

1. The LSM tracker must be improved regarding outlier rejection. The cross-correlation check should be integrated in the main algorithm, so that mismatched points are rejected in real time rather than in post-processing (a minimal sketch of such a check is given after this list).

2. A more accurate and refined process to detect and track objects in case of occlusions should be added. Occlusions can be predicted and handled with more sophisticated algorithms, while foreground extraction can be performed with a better background subtraction technique.

3. A camera model could be defined to reconstruct the 3D world from the image coordinates extracted with the tracking process (a minimal triangulation sketch is given after this list).

4. The object tracking algorithm can be improved by adding neighborhood information in the matching process, to close the gaps that occurred in the results.
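A sketch of the on-line rejection mentioned in point 1 could look as follows; the patch size and acceptance threshold (HALF, CC_MIN) are assumed values and the routine is only an illustration, not the planned implementation.

    import cv2

    CC_MIN = 0.75   # acceptance threshold on the correlation value (assumed)
    HALF = 6        # half size of the compared patches (assumed)

    def accept_match(template_img, search_img, pt_t, pt_s):
        # Immediately after matching, compare the patch around the template
        # point with the patch around the matched point; reject the match if
        # the normalized cross-correlation between them is too low.
        xt, yt = int(pt_t[0]), int(pt_t[1])
        xs, ys = int(pt_s[0]), int(pt_s[1])
        a = template_img[yt - HALF:yt + HALF + 1, xt - HALF:xt + HALF + 1]
        b = search_img[ys - HALF:ys + HALF + 1, xs - HALF:xs + HALF + 1]
        if a.size == 0 or a.shape != b.shape:
            return False                     # too close to the image border
        cc = cv2.matchTemplate(b, a, cv2.TM_CCOEFF_NORMED)[0, 0]
        return cc >= CC_MIN

For point 3, a standard building block would be linear triangulation from two (or more) calibrated views; the sketch below assumes that projection matrices P1 and P2 are available, which is not the case for the monocular sequences used in this report.

    import numpy as np

    def triangulate(P1, P2, x1, x2):
        # Linear (DLT) triangulation of one point seen in two calibrated views.
        # P1, P2: 3x4 projection matrices; x1, x2: (x, y) image coordinates of
        # the same tracked point. Returns the 3D point in object space.
        A = np.vstack([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        return X[:3] / X[3]          # back from homogeneous coordinates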


Bibliography

1. Aggarwal J.K., Cai Q., 1999: Human motion analysis: a review. Computer Vision and Image Understanding, vol.73, nr.3, March, pp.428-440.

2. Bodmer S., Mehrabani M., 1999: Body modeling and tracking techniques. http://cui.unige.ch/~bodmer2/TPs/BodyBuilding.html (April 2001).

3. Deutscher J., Blake A., Reid I., 2000: Articulated body motion capture by annealed particle filtering. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.126-133.

4. D’Apuzzo N., 1998: Automatic photogrammetric measurement of human faces. International Archives of Photogrammetry and Remote Sensing, Hakodate, Japan, 32(B5), pp.402-407.

5. D’Apuzzo N., 2000: Motion capture by least square matching. AVATARS’2000, Lausanne, Switzerland.

6. Foerstner W., 1990: A framework for low level feature extraction. ECCV, Lecture Notes in Computer Science, vol.427, pp.383-394.

7. Gavrila D.M., 1999: The visual analysis of human movement: a survey. Computer Vision and Image Understanding, vol.73, nr.1, January, pp.82-98.

8. Gavrila D.M., 1996: Vision-based 3D tracking of humans in action. PhD thesis, Department of Computer Science, University of Maryland.

9. Gruen A., 1985: Adaptive least squares correlation: a powerful image matching technique. South African Journal of Photogrammetry, Remote Sensing and Cartography, 14(3), pp.175-187.

10. Harris C., Stephens M., 1988: A combined corner and edge detector. Fourth Alvey Vision Conference, pp.147-151.

11. Howe N., Leventon M., Freeman W., 2000: Bayesian reconstruction of 3D human motion from single-camera video. Advances in Neural Information Processing Systems, 12.

12. Intille S., Bobick A., 1999: A framework for recognizing multi-agent action from visual evidence. Proceedings of the National Conference on Artificial Intelligence (AAAI).

13. Lerasle F., Rives G., Dhome M., Yassine A., 1996: Human body tracking by monocular vision. In ECCV, Cambridge, England, pp.518-527.

14. Lucas B.D., Kanade T., 1981: An iterative image registration technique with an application to stereo vision. In IJCAI, pp.674-679.

15. McKenna S., 2000: Tracking groups of people. Computer Vision and Image Understanding, vol.80, nr.1, October, pp.42-56.

16. Mulder A., 1994: Human movement tracking technology. Technical Report 94-1, School of Kinesiology, Simon Fraser University.

17. Rosales R., Sclaroff S., 1998: Improved tracking of multiple humans with trajectory prediction and occlusion modeling. Proc. IEEE CVPR Workshop on the Interpretation of Visual Motion, Santa Barbara, CA.


18. Sangi P., Heikkilä J., Silven O., 1999: Experiments with shape-based deformable object tracking. Proc. 11th Scandinavian Conference on Image Analysis, June 7-11, Kangerlussuaq, Greenland, pp.311-317.

19. Shi J., Tomasi C., 1994: Good features to track. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.593-600.

20. Schmid C., Mohr R., Bauckhage C., 1998: Comparing and evaluating interest points. Proceedings ICCV, pp.230-235.

21. Sidenbladh H., Black M.J., Fleet D.J., 2000: Stochastic tracking of 3D human figures using 2D image motion. ECCV, Dublin, Ireland.

22. Song Y., Feng X., Perona P., 2000: Towards detection of human motion. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.810-817.

23. Tomasi C., Kanade T., 1991: Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, Pittsburgh, PA.

24. Wachter S., Nagel H.H., 1999: Tracking persons in monocular image sequences. Computer Vision and Image Understanding, vol.74, nr.3, June, pp.174-192.
