A Diffusion Process on Riemannian Manifold for Visual Tracking

Marcus Chen, Cham Tat Jen, Pang Sze Kim, Alvina Goh

arXiv:1303.5913v1 [cs.CV] 24 Mar 2013

Abstract—Robust visual tracking over long video sequences is a research area with many important applications. The main challenges include how the target image can be modeled and how this model can be updated. In this paper, we model the target using a covariance descriptor, as this descriptor is robust to problems such as pixel-pixel misalignment and pose and illumination changes that commonly occur in visual tracking. We model the changes in the template using a generative process, and introduce a new dynamical model for the template update using a random walk on the Riemannian manifold on which the covariance descriptors lie. This is done in the log-transformed space of the manifold to free the constraints imposed inherently by positive semidefinite matrices. Modeling template variations and pose kinetics together in the state space enables us to jointly quantify the uncertainties relating to the kinematic states and the template in a principled way. Finally, the sequential inference of the posterior distribution of the kinematic states and the template is done using a particle filter. Our results show that this principled approach is robust to changes in illumination, pose, and spatial affine transformation. In the experiments, our method outperformed the current state-of-the-art algorithm, the incremental Principal Component Analysis method [34], particularly when a target underwent fast pose changes, and maintained comparable performance in stable target tracking cases.

Index Terms—Tracking, Particle filtering, Template update, Generative Template Model, Riemannian manifolds, log-transformed space.

1 INTRODUCTION

Visual tracking is an important vision research topic with many applications, ranging from motion-based recognition [7] and surveillance [18] to human-computer interaction [10]. It also touches on many aspects of computer vision, such as target feature representation [46], feature selection [9], and feature learning [15]. Even though it has been actively researched for decades, many challenges remain, especially with changes in target pose, appearance, and illumination over a long video sequence. Figure 1 shows two simple examples of how a target can vary over a short time interval. These challenges are common, and solving them is required for long, stable tracking in many real-life tasks. There are generally three common approaches to deal with target appearance variations. The first is to use robust or invariant target features such as the scale-invariant feature transform and the color histogram [3]. However, as shown in Figure 1, target appearance can change significantly over time and end up totally different from the starting frame due to variations in target pose and image illumination. The second approach is to employ a complete set of possible target models [4], aiming to model all possible target variations.

• Marcus Chen and Cham Tat Jen are with the School of Computer Engineering, Nanyang Technological University, Singapore.
• Pang Sze Kim and Alvina Goh are with DSO National Laboratories, Singapore.

Fig. 1. Target patches sampled from frames #1, 31, ..., 871 of two video sequences. The target changes in illumination, pose, and appearance, even after being affine-warped to a standard size.

However, this requires learning the target model in advance and is hardly scalable. Finally, the last approach is to update the template gradually as it evolves. Note that in this paper, we use the term template loosely for the target representation, and do not strictly limit it to image patches. There are
several choices for a target template found in the literature. For example, [38] uses the histogram of oriented gradients, while [3] uses the color histogram, [45] an L1 sparse representation, [23] an active appearance model, [34] a principal subspace of image patches, and [31] a feature covariance.

The template update problem can be expressed mathematically as Eqn. (1):

T_t = f(T̂_t, T_{t−1}),   (1)

where T̂_t and T_t, t ∈ {1, 2, ...}, are the estimated and updated templates respectively at time t. However, as shown in [23], target template updating is a challenging task. According to [23], if the template is not updated at all, it quickly becomes outdated and cannot be used for matching, as the target appearance will have changed over time. On the other hand, updating at every frame results in the accumulation of small errors, and eventually in template drift and loss of target information.
Recognizing the importance of template update, many methods have been proposed. One common and intuitive approach is to use a linear updating function in the respective feature space, such as [31] on the covariance manifold. This smooths the changes between the estimated template and the updated template. Similarly, a Kalman filter has been used in [25] to track template feature variables, but not the target trajectory. Beyond these, there are three well-known template update algorithms in the literature: template alignment [23], online Expectation-Maximization (EM) [19], and the incremental subspace method [34]. Here, we briefly survey these three algorithms.
In the template alignment method, [23] proposes a heuristic but robust criterion to decide whether to update the template at time t. The basic idea is to keep the starting template to correct the drift of the estimated template. The latest estimated template is first matched to the previous updated template, and is then warped before being checked against the first template. For a small template displacement, this method works very well. However, by imposing alignment between the latest template and the first template, this method inherently limits target pose changes to a warping model.
The online EM method [19] employs a mixture of three template distributions to account for template variations, namely a long-term stable template, an interframe variational template, and an outlier template. These templates model the stable appearance of the target, interframe changes in appearance or pose, and occlusions or outliers respectively. Employing a Gaussian mixture model, the parameters and memberships are estimated on the fly using online EM. In this framework, each pixel in the target patch is assumed to be independent, and consequently more stable pixels gain more weight in the similarity measure. This can gradually drift the template in the presence of more stable background pixels.
The third algorithm, proposed by [34], represents the target in its eigenspace. The posterior estimates of the template are collected over an interval, and these estimates are then analyzed online through an Incremental Principal Component Analysis (IPCA) method. This method can capture changes in template variation in the eigenbases. The mean of the posterior estimates is also kept as a stable template. The authors tested IPCA on various video sequences and demonstrated its robustness to template variations due to pose and illumination changes. Figure 2 illustrates an incremental update of the eigenbases and means. The images in the 3rd row show how the eigenbases evolve over time. It has been shown in the paper that the updated templates can almost reconstruct the original image samples over the sequence, reflecting the ability of the eigenbases to model temporal variations. Although IPCA is often very robust and can track a target very accurately even in noisy, low-contrast image sequences, it falls short when the target undergoes fast pose changes or dramatic illumination changes, as stated in the paper. This may be because PCA inherently assumes that the target templates over time come from a Gaussian distribution. Under abrupt changes in pose and illumination, this assumption does not hold. The unimodal distribution also requires good pixel-wise alignment between the posterior estimate and the eigenbases; otherwise, uncertainties in template alignment contribute to the template variance and may lead to non-informative bases. A good example from the paper is shown in Figure 2. One can see that from frames #600 to #636, the eigenbases are no longer representative and the tracker loses track of the target.
(a) Representative eigenbases. (b) Eigenbases of misaligned target regions. (c) Eigenbases are not representative.

Fig. 2. Results of the incremental subspace method on the Sylvester sequence. Pixel-wise misalignment can render the eigenbases non-representative. The 1st row shows the sample frames. The 2nd row shows the current sample mean, tracked region, reconstructed image, and reconstruction error respectively. The 3rd and 4th rows show the top 10 principal eigenbases.
So far, most of the current state-of-the-art algorithms update templates in an out-of-chain manner, by assuming the posterior estimate is "good enough" for a template update with pixel-wise alignment. If the posterior estimate of the target pose is inaccurate, or there is a misalignment between the estimated and last updated templates, these update methods gradually drift. On the other hand, if the template update is not good, then the posterior estimate of the target pose is unlikely to be accurate. These coupled dual problems often render these methods unable to track well when the target undergoes fast pose changes or non-rigid transformations. However, robustness to fast pose changes matters in many real-life applications such as human tracking and maritime target tracking.

To solve these dual problems faced by the existing state-of-the-art algorithms, [8] introduces a novel approach to simultaneously quantify these two uncertainties by including both of them in the state space of a Bayesian framework, instead of just the target pose as in existing methods. In this manner, no posterior estimate is used for updating; instead, better-matched hypothesized templates are propagated automatically.
Paper contributions. To the best of our knowledge, almost all state-of-the-art algorithms use out-of-chain template updating methods. That is to say, the updating of the template model is done after obtaining the posterior estimate of the target's position. In this paper, we propose a method to update the target model in tandem with the target kinematics. In other words, we model the target template as a part of the state space. We choose the covariance descriptor as the target descriptor, as it is more robust to problems such as pixel-pixel misalignment and changes in pose and illumination. Since positive definite covariance matrices form a Riemannian manifold, we model the target template variation by a random walk on the covariance Riemannian manifold. We propose a novel template propagation mechanism in the log-transformed space of the manifold to free the constraints imposed inherently by positive semidefinite matrices, leading to a greater ability to deal with template variations. Our resulting method outperforms the state-of-the-art Incremental PCA algorithm [34] in dealing with fast-moving and fast-changing targets, as will be clearly shown in the experiments section.

The paper is organized as follows: Section 2 gives a brief introduction to both covariance descriptors and Riemannian manifolds, Section 3 gives a Bayesian formulation of the simultaneous inference of both the target kinetics and the template posterior distribution, and Section 4 analyzes the template generative process. In Section 5, we empirically compare our results with IPCA and give a short discussion. Finally, Section 6 concludes the paper.
2 TARGET COVARIANCE DESCRIPTOR

In this section, we explain the motivation for using the covariance descriptor and its operations on the Riemannian manifold.
2.1 Covariance Descriptor

A covariance descriptor is defined as follows:

C = (1 / (N − 1)) ∑_{i=1}^{N} (f(i) − f̄)(f(i) − f̄)^T,   (2)

where f is a feature vector and f̄ = (1/N) ∑_{i=1}^{N} f(i) is the mean of the feature vectors over the N pixels in the target region. In this paper, we use the following 9-dimensional feature vector:

f(i) = [x_w, y_w, I(x_w, y_w), |I_{x_w}|, |I_{y_w}|, √(I_{x_w}^2 + I_{y_w}^2), arctan(|I_{y_w}| / |I_{x_w}|), |I_{xx_w}|, |I_{yy_w}|].   (3)

These are the x, y coordinates, pixel intensity, x- and y-directional intensity gradients, gradient magnitude and angle, and second-order gradients respectively. The subscript w denotes that these features are extracted after warping image patches to a standard size.
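As a concrete illustration, Eqns. (2)-(3) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the patch is assumed grayscale and already warped to the standard size, gradients are taken with np.gradient, and the function name covariance_descriptor is ours.

```python
import numpy as np

def covariance_descriptor(patch):
    """9-D covariance descriptor of Eqns. (2)-(3) for a grayscale patch
    that has already been warped to a standard size (e.g. 32x32)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    Iy, Ix = np.gradient(patch)               # directional intensity gradients
    Iyy = np.gradient(Iy, axis=0)             # second-order gradients
    Ixx = np.gradient(Ix, axis=1)
    mag = np.sqrt(Ix ** 2 + Iy ** 2)          # gradient magnitude
    ang = np.arctan2(np.abs(Iy), np.abs(Ix))  # gradient angle of Eqn. (3)
    # Stack the 9 per-pixel features of Eqn. (3) as rows of an N x 9 matrix.
    F = np.stack([xs, ys, patch, np.abs(Ix), np.abs(Iy),
                  mag, ang, np.abs(Ixx), np.abs(Iyy)], axis=-1).reshape(-1, 9)
    return np.cov(F, rowvar=False)            # Eqn. (2): 9 x 9 matrix

C = covariance_descriptor(np.random.rand(32, 32))
```

Note that np.cov uses the 1/(N − 1) normalization of Eqn. (2) by default, and the resulting 9 × 9 matrix is symmetric positive semi-definite.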
Since its proposed use in human detection [40], the covariance descriptor has gained popularity in many applications, such as face recognition [26], license plate detection [30], and tracking [31], [45]. The main advantages of choosing the covariance descriptor [42] to model the template include its low dimensionality of (d^2 + d)/2 (45 in this paper, as d = 9) compared to the number of target pixels (32 × 32 = 1024 in this paper), its ability to fuse multiple, possibly correlated features, and its robustness in matching targets across different views and poses.

By definition, a covariance matrix is a positive semi-definite matrix, which lies on a Riemannian manifold. We now briefly explain some basic operations on this manifold.
2.2 Riemannian Manifold

Fig. 3. The geodesic distance is the norm of a vector on the tangent space T_{C_1}M at point C_1 on the manifold M.

TABLE 1. Operations in Euclidean and Riemannian spaces.
A Riemannian manifold M is a differential manifold in which each tangent space T_{C_i}M has a metric function g defining the dot product between any two tangent vectors y_k, y_l. The covariance descriptor is a point on the manifold M, and the following operations can be applied to it. The Riemannian metric:

⟨y_k, y_l⟩_{C_i} = trace(C_i^{−1/2} y_k C_i^{−1} y_l C_i^{−1/2}).   (4)

The exponential map exp_{C_i}: T_{C_i}M → M takes a tangent vector y at point C_i and maps it to another point C_j:

C_j = exp_{C_i}(y) = C_i^{1/2} exp(C_i^{−1/2} y C_i^{−1/2}) C_i^{1/2}.   (5)

The inverse of the exponential map is the logarithm map, which takes a starting point C_i and a destination C_j and maps them to the tangent vector y at point C_i:

y = log_{C_i}(C_j) = C_i^{1/2} log(C_i^{−1/2} C_j C_i^{−1/2}) C_i^{1/2}.   (6)

Finally, the distance between two covariance matrices C_i and C_j is given as:

d(C_i, C_j) = √( ∑_{k=1}^{d} ln^2 λ_k(C_i, C_j) ),   (7)

where λ_k(C_i, C_j) are the generalized eigenvalues of C_i and C_j, i.e., λ_k C_i v_k − C_j v_k = 0, and d is the dimension of the covariance matrices.

Note that exp_{C_i}(·) and log_{C_i}(·) are maps on the Riemannian manifold, whereas exp(·) and log(·) denote the ordinary matrix exponential and logarithm. Both exp_{C_i}(y) and the tangent vector y are d × d matrices in this paper.
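As a sanity check on Eqns. (5)-(7), the exponential map, logarithm map, and geodesic distance can be sketched with SciPy's matrix functions. This is an illustrative sketch under our own naming (exp_map, log_map, geodesic_distance), not the paper's code; note that, because of the squared logarithm in Eqn. (7), the distance is unchanged if the roles of C_i and C_j in the generalized eigenvalue problem are swapped.

```python
import numpy as np
from scipy.linalg import expm, logm, fractional_matrix_power, eigvalsh

def exp_map(Ci, y):
    """Exponential map of Eqn. (5): tangent vector y at Ci -> point Cj."""
    Ch = fractional_matrix_power(Ci, 0.5)
    Chi = fractional_matrix_power(Ci, -0.5)
    return Ch @ expm(Chi @ y @ Chi) @ Ch

def log_map(Ci, Cj):
    """Logarithm map of Eqn. (6): point Cj -> tangent vector y at Ci."""
    Ch = fractional_matrix_power(Ci, 0.5)
    Chi = fractional_matrix_power(Ci, -0.5)
    return Ch @ logm(Chi @ Cj @ Chi) @ Ch

def geodesic_distance(Ci, Cj):
    """Distance of Eqn. (7) via the generalized eigenvalues of (Ci, Cj)."""
    lam = eigvalsh(Ci, Cj)                 # solves Ci v = lambda * Cj v
    return np.sqrt(np.sum(np.log(lam) ** 2))
```

By construction, exp_map(Ci, log_map(Ci, Cj)) recovers Cj, and the distance between a matrix and itself is zero.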
2.3 Motivation for Manifold Modeling

High-dimensional image data often lie on a low-dimensional manifold. For example, the collection of rotated handwritten zeros in Figure 4 lives in a 28 × 28 = 784-dimensional space under a vectorized representation, but has only one rotational parameter. Popular dimensionality reduction methods such as ISOMAP [39], eigenmaps [2], and LLE [35] model data using manifold structures. In visual tracking, the target patches in an image sequence are implicitly bounded by the target's degrees of freedom captured in images, such as rotation, translation, scaling, etc. These implicit parameters, modeled using a low-dimensional manifold, can capture image distance.

Fig. 4. A collection of handwritten zeros, each rotated by an angle of 45°.

(a) Targets (b) Backgrounds (c) SVM results (d) Euclidean distance (e) Manifold distance

Fig. 5. An illustration that distances between images can be better modeled on a manifold, using a sequence of a dancing penguin from YouTube. In (d), the Euclidean distance between targets is larger than the distance between target and background, which is not the case on the manifold (e). Furthermore, in (c), an SVM cannot linearly separate the targets from the backgrounds in the vectorized Euclidean space.

The simplest and yet most popular distance measure between images is the Euclidean distance between the vectorized images.
A simple example, the head of a dancing penguin from YouTube, is adopted in Figure 5 to illustrate that the manifold of covariance descriptors can model image distance better and can separate target patches from the background better. Using the Euclidean distance, the distance between a target patch and the first target patch can be larger than that between a background patch and the first target patch. Furthermore, a test with a support vector machine using a linear kernel showed that some background patches are classified as target patches. On the manifold, in contrast, the target patches and background patches are well separated, as shown in Figure 5e.
3 BAYESIAN FRAMEWORK

In this section, we use a standard Bayesian framework [33] to formulate the tracking of both the template and the kinetics as follows:

P(C_t, s_t | z_{1:t}) ∝ P(z_t | C_t, s_t) ∫ P(s_t, C_t | s_{t−1}, C_{t−1}) P(C_{t−1}, s_{t−1} | z_{1:t−1}) ds_{t−1} dC_{t−1},   (8)
where z_t is the measurement, s_t is the kinetic state variable, C_t is the covariance descriptor, P(C_t, s_t | z_{1:t}) is the posterior probability of the target template and pose given the measurements, P(z_t | C_t, s_t) is the observation model, and P(s_t, C_t | s_{t−1}, C_{t−1}) is the dynamical model. They are elaborated in the following subsections.
3.1 Dynamical Model

The state space in our paper includes both the target kinetic variables s_t and the template covariance descriptor C_t. The state variables are defined in Eqns. (9) and (10), and we estimate them through the Bayesian framework in Eqn. (8). These state variables are propagated from time t−1 to t through the dynamical model P(s_t, C_t | s_{t−1}, C_{t−1}):

s_t = [x_t, y_t, ẋ_t, ẏ_t, h_t, θ_t],   (9)

C_t = cov(x_w, y_w, I(x_w, y_w), |I_{x_w}|, |I_{y_w}|, arctan(|I_{y_w}| / |I_{x_w}|), √(I_{x_w}^2 + I_{y_w}^2), |I_{xx_w}|, |I_{yy_w}|),   (10)

where x_t, y_t are the spatial coordinates of the target center at time t, ẋ_t, ẏ_t are the velocities, h_t is the scaling factor, and θ_t is the orientation. x_w, y_w are the coordinates of a pixel on the standard target patch warped from x_t, y_t, I(x_w, y_w) is the pixel intensity, {I_{x_w}, I_{y_w}} are the patch intensity gradients, and {I_{xx_w}, I_{yy_w}} are the second-order gradients. Assuming independence between the kinetic variables and the covariance, we model the joint dynamics as follows:

P(s_t, C_t | s_{t−1}, C_{t−1}) = P(s_t | s_{t−1}) P(C_t | C_{t−1}),   (11)

s_t = k(s_{t−1}) + u_t,   (12)

C_t = exp_{C_{t−1}}(n_t).   (13)
Here k is the kinetic model, and we use a near-constant-velocity linear model k(s_{t−1}) = A s_{t−1}. u_t is generated by interacting Gaussian models with jumping probabilities [0.9, 0.1] to model sudden changes in target pose. As for the template dynamical model, n_t ∈ T_{C_{t−1}}M is a random process on the tangent plane of the manifold M; an example is the Brownian motion process described in [17]. In this paper, we choose to model the template dynamics in the log-transformed space of the manifold as follows:
C_t = exp(log(C_{t−1}) + w_t),   (14)

where w_t is a random symmetric matrix with entries w_t(i, j) = w_t(j, i) ∼ N(0, σ_{i,j}^2), so that

P(log(C_t)) ∝ exp( −(1/2) ∑_{i≤j, i,j∈[1,d]} w_t(i, j)^2 / σ_{i,j}^2 ).   (15)

According to [1], the matrix exponential maps a symmetric matrix to a corresponding positive semi-definite matrix, exp: Sym(d) → Sym⁺(d), and this mapping is one-to-one. As such, the generated sample C_t is always a positive semi-definite (PSD) matrix. This frees us from the inherent constraint of positive eigenvalues in a PSD matrix. This distribution may be considered a log-normal distribution over PSD matrices, as defined in [36].
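A single step of the random walk in Eqn. (14) can be sketched as follows. This is a simplified illustration, assuming one shared standard deviation σ for every entry of w_t rather than per-entry values σ_{i,j}; the function and variable names are ours.

```python
import numpy as np
from scipy.linalg import expm, logm

def propagate_template(C_prev, sigma, rng):
    """One step of Eqn. (14): C_t = exp(log(C_{t-1}) + w_t), with w_t a
    random symmetric matrix of N(0, sigma^2) entries."""
    d = C_prev.shape[0]
    g = rng.normal(0.0, sigma, size=(d, d))
    w = (g + g.T) / 2.0                    # symmetrize the Gaussian noise
    return expm(logm(C_prev) + w)          # matrix exp keeps the sample PSD

rng = np.random.default_rng(0)
A = rng.random((9, 9))
C0 = A @ A.T + np.eye(9)                   # an SPD template to start from
C1 = propagate_template(C0, 0.05, rng)
```

Because log(C_{t−1}) + w_t is symmetric, its matrix exponential is symmetric positive definite, so every generated sample is a valid covariance descriptor.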
From Eqn. (14),

C_t^{−1} C_{t−1} = exp(−log(C_{t−1}) − w_t) exp(log(C_{t−1})) = exp(−w_t).   (16)

For the generalized eigenvalues,

λ_k C_t v − C_{t−1} v = 0
⟹ C_t^{−1} C_{t−1} v = λ_k v
⟹ exp(−w_t) v = λ_k v,

so that

d(C_{t−1}, C_t) = [ ∑_{k=1}^{d} ln^2 λ_k(exp(−w_t)) ]^{1/2} = [ ∑_{k=1}^{d} λ_k^2(w_t) ]^{1/2},  if ∃ w_t^{−1}.   (17)
In this paper, for d = 9, the eigenvalues λ_1(w_t) ≥ λ_2(w_t) ≥ ... ≥ λ_9(w_t) of w_t can be bounded according to [47], assuming the entries of the noise matrix are bounded by [a, b], i.e., a ≤ w_t(i, j) ≤ b:

λ_9(w_t) ≥ (1/2)(9a − √(a^2 + 80b^2)) if |a| < b, and λ_9(w_t) ≥ 9a otherwise.   (18)

λ_1(w_t) ≤ (1/2)(9b + √(a^2 + 80b^2)) if |a| < b, and λ_1(w_t) ≤ 9b otherwise.   (19)

In other words, the eigenvalues are roughly within an order of magnitude of max(σ_{i,j}) for this random process. In this way, the spread of the template diffusion on the manifold can easily be managed by choosing an appropriate max(σ_{i,j}) in w_t.
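These bounds are easy to probe numerically. The sketch below samples random 9 × 9 symmetric matrices with entries in [a, b] and checks their spectra against Eqns. (18)-(19); the helper name eig_bounds_9 is ours, and the formulas are transcribed from the equations above.

```python
import numpy as np

def eig_bounds_9(a, b):
    """Lower/upper bounds of Eqns. (18)-(19) on the extreme eigenvalues
    of a 9x9 symmetric matrix with entries in [a, b]."""
    if abs(a) < b:
        r = np.sqrt(a ** 2 + 80 * b ** 2)
        return 0.5 * (9 * a - r), 0.5 * (9 * b + r)
    return 9 * a, 9 * b

rng = np.random.default_rng(0)
a, b = -0.05, 0.1
lo, hi = eig_bounds_9(a, b)
for _ in range(100):
    g = rng.uniform(a, b, size=(9, 9))
    w = (g + g.T) / 2.0                  # symmetric, entries still in [a, b]
    lam = np.linalg.eigvalsh(w)
    assert lo <= lam.min() and lam.max() <= hi
```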
Fig. 6. Evolution of the template and generated templates, frames #1, 101, 201, 501, 801. Green crosses: top 10 similar generated templates; red: ground truth; blue crosses: the background.
3.2 Observation Model

The observation model P(z_t | C_t, s_t) measures the likelihood of a target given the target pose and template values. It is modeled as follows:

P(z_t | C_t, s_t) ∼ N(0, σ^2),   (20)

with z_t = d(C_t, C*_t) and C*_t = g(s_t, Image), so that

P(z_t | C_t, s_t) ∝ exp( −d^2 / (2σ^2) ).   (21)

Here, d(C_t, C*_t) is given by Eqn. (7), and g is the covariance computation operator: g takes the kinetic value s_t of each particle at time t and warps the corresponding region to a standard size (in this paper, 32 × 32) before computing the covariance.
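The likelihood of Eqn. (21) reduces to a Gaussian kernel on the squared Riemannian distance of Eqn. (7). A minimal sketch, with our own helper name and scipy.linalg.eigvalsh used for the generalized eigenvalues:

```python
import numpy as np
from scipy.linalg import eigvalsh

def particle_likelihood(C_t, C_star, sigma):
    """Eqn. (21): likelihood of a particle whose template is C_t and whose
    pose yields the extracted descriptor C*_t = C_star."""
    lam = eigvalsh(C_t, C_star)            # generalized eigenvalues, Eqn. (7)
    d2 = np.sum(np.log(lam) ** 2)          # squared Riemannian distance
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

A particle whose template matches the extracted descriptor exactly has likelihood 1; the likelihood decays as the geodesic distance grows.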
3.3 Overall Framework

We use a standard particle filter for sequential inference. The particle filter [5], [33], [16] represents the distribution of the state variables by a collection of samples and their weights. The advantage of using a particle filter is that it can deal with non-linear systems and multi-modal posteriors. The algorithm of the particle filter is as follows:

1) Initialization. The particle filter is initialized with a known realization of the target state variables. This includes the target's initial state values. The covariance of the target C_0, i.e., the initial template, is extracted for later comparison. The parameters of the covariance generative process, i.e., the template dynamical model, are also determined.
2) Propagation. Each particle is propagated according to the propagation models in Eqns. (12) and (14). Both the kinetic variables and the template are generated through these random processes.
3) Likelihood measurement. For each particle i, the extracted covariance descriptor C*_t(i) is compared to its corresponding template C_t(i). The likelihood of the particle is then estimated as given in Eqn. (21).
4) Posterior estimation. The posterior estimate gives the estimate of the current target state, given all its previous information and measurements. This can be the maximum a posteriori (MAP) estimate or the minimum mean square error (MMSE) estimate. In this paper, we use MMSE.
5) Resampling. To avoid degeneracy, resampling is conducted to redistribute the weights of the particles.
6) Loop. Repeat steps 2 to 5 as time progresses.
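The six steps above follow the standard sampling-importance-resampling (SIR) loop. The sketch below shows that loop on a toy scalar state with a synthetic measurement; it is a structural illustration only — in the paper the state is the pair (s_t, C_t), propagated by Eqns. (12) and (14) and weighted by Eqn. (21).

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: initialization -- N particles around a known initial state.
# (A toy scalar state stands in for the full (s_t, C_t) state of the paper.)
N = 500
particles = rng.normal(0.0, 0.1, N)
weights = np.full(N, 1.0 / N)

true_state, sigma_obs = 0.0, 0.2
for t in range(50):
    true_state += 0.1                          # unknown motion to be tracked
    z = true_state + rng.normal(0, sigma_obs)  # measurement

    # Step 2: propagation through the dynamical model (random walk here).
    particles = particles + 0.1 + rng.normal(0, 0.05, N)

    # Step 3: measure the likelihood of each particle (cf. Eqn. (21)).
    weights = np.exp(-(z - particles) ** 2 / (2 * sigma_obs ** 2))
    weights /= weights.sum()

    # Step 4: MMSE posterior estimate (weighted mean of the particles).
    estimate = np.sum(weights * particles)

    # Step 5: resampling to avoid degeneracy.
    idx = rng.choice(N, size=N, p=weights)
    particles, weights = particles[idx], np.full(N, 1.0 / N)
    # Step 6: loop.
```

With the motion model matched to the true dynamics, the MMSE estimate stays close to the true state throughout the run.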
4 ANALYSIS OF THE TEMPLATE GENERATIVE PROCESS

Fig. 7. Visualization of the target in the "soccer" sequence on the covariance manifold. Red: target patches; blue: background patches.
In this section, we show that the covariance descriptor is a good representation of the target, and we give the motivation behind performing a random walk as described in Section 3.1. Two reasonable criteria for a good target representation are as follows:

• the representation evolves gradually as the target undergoes changes in pose, appearance, etc.;
• there is a clear separation between target and background.

To help visualize the distribution of the target covariance matrices on the manifold, we use multidimensional scaling [21] to construct a visualization of the distribution of the covariance matrices. The distance matrix is constructed using the Riemannian distance given in Eqn. (7). The visualization shows the relative positions of the targets (red) and backgrounds (blue). Visualizing the PETS 2003 Soccer sequence and the Dudek Face sequence in Figures 7 and 6, we notice that our representations of the targets tend to cluster together as they evolve gradually. This evolution is smoother and easier to model on the manifold than the evolution of the original feature values at each pixel. This observation motivated us to model the template variations by a random walk on the Riemannian manifold. Based on Eqns. (12) and (14), Figure 6 illustrates a realization of the random walk. It shows that our template dynamical model can model the actual target appearance variations. Changes in facial expression and face pose cause the covariance template (shown as red points) to evolve slowly on the manifold, and they are well modeled by the generated covariances on the manifold (shown as green points).
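Classical multidimensional scaling, as used for Figures 6 and 7, can be sketched directly with NumPy: given a matrix D of pairwise Riemannian distances from Eqn. (7), double-center the squared distances and take the top eigenvectors. A Euclidean distance matrix stands in for the Riemannian one here, and the function name is ours.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical multidimensional scaling: embed points in R^k so that
    Euclidean distances approximate the entries of distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]        # k largest eigenvalues
    L = np.sqrt(np.maximum(vals[order], 0.0))
    return vecs[:, order] * L                 # n x k embedding

# Stand-in for a pairwise Riemannian distance matrix from Eqn. (7).
rng = np.random.default_rng(0)
P = rng.random((20, 3))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
coords = classical_mds(D, k=2)
```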
5 EXPERIMENTS AND RESULTS

5.1 Experimental Data

We tested our algorithm on several popular tracking datasets: David Ross's sequences, including the plush toy (Toy Sylv), toy dog, David, and Car 4 sequences from his website; the Dudek Face sequence; vehicle tracking sequences from PETS 2001; and the soccer sequence from PETS 2003. The test data are tabulated in Table 2.
5.2 Performance Measure

As spelled out in [22], a good measure should capture both overall tracking and goodness of track. This paper uses the ratio between the on-track length and the sequence length to capture overall tracking performance, and the on-track accuracy for the goodness of track. The tracking errors are e_x(t) = ‖g_x(t) − x(t)‖ and e_y(t) = ‖g_y(t) − y(t)‖, where e_x(t), e_y(t), g_x(t), g_y(t) are the errors in x, y and the ground truth in x, y at time t respectively:

γ_ontrack = (1/2)( e_x(t)/H_x(t) + e_y(t)/H_y(t) ) ≤ 1,   (22)

r_ontrack = γ_ontrack / l,   (23)

rms_ontrack = [ (e_x(t)/H_x(t))^2 + (e_y(t)/H_y(t))^2 ]^{1/2},   (24)

where H_x(t), H_y(t) are the ground-truth target sizes at time t and l is the sequence length. In this work, the ground truth on the target center is manually annotated, and the target size is assumed to be that of the first frame (this may not be applicable to frames with a large change in target size).
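Eqns. (22)-(24) can be computed as below. One reading assumption is made explicit here: the extracted Eqn. (23) is terse, and this sketch interprets the on-track ratio r_ontrack as the fraction of the l frames with γ(t) ≤ 1; the helper name is ours.

```python
import numpy as np

def track_metrics(est, gt, H):
    """On-track measures of Eqns. (22)-(24).
    est, gt : (l, 2) estimated / ground-truth target centers
    H       : (l, 2) ground-truth target sizes (H_x(t), H_y(t))."""
    e = np.abs(gt - est)                                   # e_x(t), e_y(t)
    gamma = 0.5 * (e[:, 0] / H[:, 0] + e[:, 1] / H[:, 1])  # Eqn. (22)
    on = gamma <= 1.0
    r_ontrack = on.sum() / len(gt)                         # Eqn. (23)
    rms = np.sqrt((e[on, 0] / H[on, 0]) ** 2 +
                  (e[on, 1] / H[on, 1]) ** 2)              # Eqn. (24)
    return r_ontrack, (rms.mean() if on.any() else float("nan"))
```

A perfect track gives r_ontrack = 1 and zero RMS error; a track that never comes near the ground truth gives r_ontrack = 0.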
5.3 Results and discussion 473
We compared our method with the current state- 474
of-the-art algorithm, the incremental PCA (IPCA) 475
method by David et al [34]. Our results are shown 476
in red and the IPCA in green from Figures 9 to 15. 477
In PLUSH TOY SYLV sequences shown in Figure 478
9,the IPCA failed to recover tracking from frame #609 479
when it locked onto the background, which looks 480
more similar to the upright SYLV. Fast poses changes 481
around frame #609 caused the IPCA eigenbases non- 482
representative as shown in Figure 4. 483
Similarly, in Figure 10, the IPCA failed to follow 484
through when target underwent a fast motion towards 485
the frame #1351. This shortcoming of the IPCA is 486
better reflected in Soccer Sequences of PETS2003. the 487
IPCA started to drift off from frame #628 shown in 488
Figure 11 when the player moved his legs fast, and 489
lost track shortly. In the same sequence in Figure 12, 490
the IPCA found it hard to track the opposite team 491
players who wore dark clothes after a short occlusion 492
at frame #285. 493
In Figure 13, the Dudek Face sequence, both methods performed well despite the rich facial expressions, which have more effect on our covariance descriptor. In the more stable vehicle sequence from PETS2001 in Figure 14, again both methods could track well. Figure 15 shows an example of a car sequence in which our method did not perform satisfactorily. Our method locked onto the background, whereas the IPCA showed robustness to the illumination changes. A possible explanation is that our template dynamics was unable to account for the dramatic and non-smooth transition of the template when the car went into a shadowed region. Also, a closer look showed that the IPCA eigenbasis looked similar to the target template in shadows.
The overall tracking performance on the test cases is summarized in Figure 8. Note that the image sequences of Sylv, PETS2001 and soccer player 4 have targets moving out of the images, which explains the small track-duration performance. Nevertheless, our method, shown in red, generally had longer track length. On the other hand, given frames that were on
TABLE 2: Test Sequences

Test sequence          | Source       | No. of frames | Characteristics
Plush Toy (Toy Sylv)   | David Ross   | 1344          | Fast changing, 3D rotation, scaling, clutter, large movement
Toy dog                | David Ross   | 1390          | Fast changing, 3D rotation, scaling, clutter, large movement
Soccer player 1        | PETS 2003    | 1000          | Fast changing, white team, good contrast with background, occlusion
Soccer players 2, 3, 4 | PETS 2003    |               | Fast changing, gray (red) team, poor contrast with background, occlusion
Dudek Face Sequence    | A.D. Jepson  | 1145          | Relatively stable, occlusion, 3D rotation
Truck                  | PETS 2001    | 200           | Relatively stable, scaling
David                  | David Ross   | 503           | Relatively stable, 2D rotation
Car 4                  | David Ross   | 640           | Relatively stable, scaling, shadow, specular effects
Fig. 9. Tracking results on PLUSH TOY SYLV sequences, frames #133, 594, 609, 613, 957, 1338. Green: IPCA, Red: our results. The IPCA failed to recover track from frame #609.
tions, otherwise the covariance descriptor may be ill-conditioned, consequently affecting both eigenvalue estimation and distance measurements in Equation (7).
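A common way to keep covariance descriptors well-conditioned before computing manifold distances is to add a small multiple of the identity. The sketch below assumes a log-Euclidean distance of the kind introduced in [1] (the paper's exact Equation (7) is not reproduced in this section); the regularization strength `eps` is a tuning choice of ours, not a value from the paper.

```python
import numpy as np

def _spd_logm(C):
    # Matrix logarithm of a symmetric positive-definite matrix
    # via its eigendecomposition: V diag(log w) V^T.
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def logeuclid_dist(C1, C2, eps=1e-6):
    """Log-Euclidean distance between two covariance descriptors.

    eps * I is added to both matrices so that near-singular covariances
    (e.g. from low-variation image patches) stay well-conditioned
    before the eigendecomposition.
    """
    d = C1.shape[0]
    R = eps * np.eye(d)
    return np.linalg.norm(_spd_logm(C1 + R) - _spd_logm(C2 + R), ord="fro")
```

Without the `eps * I` term, a rank-deficient descriptor has a zero eigenvalue and its matrix logarithm diverges, which is exactly the ill-conditioning noted above.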
6 CONCLUSION
In this paper, we have proposed a new method to update the target model in tandem with the target kinematics. More precisely, we have developed a generative template model in a principled way within a Bayesian framework. A novel template propagation mechanism operates in the log-transformed space of the covariance manifold to free the constraints inherently imposed by positive definite matrices. We have shown that this simple generative process allows the template to evolve naturally with target appearance variation. It is hoped that by jointly quantifying the uncertainties of the target kinematics and the template, we are able to achieve more robust visual tracking. We have chosen the covariance descriptor as the target representation, and have modeled the target template dynamics using a random walk on the covariance Riemannian manifold. Our template dynamic model is an example of a diffusion process on the covariance Riemannian manifold. In the experiments, our algorithm outperformed the current state-of-the-art algorithm IPCA, particularly when the target underwent fast and non-rigid pose changes, and also maintained a comparable performance when the target was more stable. Some future work includes automatic selection of covariance features that are more robust to a sudden dramatic change in illumination.
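The random-walk template dynamic summarized above can be sketched as follows: map the covariance template to the log-transformed space, perturb it with symmetric Gaussian noise, and map back. The helper names and the noise scale `sigma` (the diffusion speed) are illustrative, not the paper's notation.

```python
import numpy as np

def spd_logm(C):
    # Matrix logarithm of a symmetric positive-definite matrix.
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def spd_expm(S):
    # Matrix exponential of a symmetric matrix.
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def template_step(C, sigma, rng):
    """One random-walk step of the template covariance C in log space.

    Perturbing the matrix logarithm with symmetric Gaussian noise and
    mapping back guarantees the result is again positive definite,
    which is the point of working in the log-transformed space.
    """
    S = spd_logm(C)
    N = sigma * rng.standard_normal(S.shape)
    S = S + 0.5 * (N + N.T)      # keep the perturbation symmetric
    return spd_expm(S)
```

Because the exponential of any symmetric matrix is positive definite, no projection step is needed to stay on the manifold.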
Future work includes addressing a number of questions, such as how the diffusion speed should be adjusted and whether the diffusion process can be better constrained. Another area of work is to deal with illumination changes in the manifold generative process. In order to improve the goodness of track, a more discriminative target descriptor is to be explored.
Fig. 10. Tracking results on toy dog sequences, frames #1, 450, 715, 1014, 1271, 1351. Green: IPCA, Red: our results. The IPCA was slightly more localized in the stable case, but failed to follow through when the target underwent a fast motion towards frame #1351.
Fig. 11. Tracking results on soccer sequences, frames #246, 628, 630, 661, 686, 996. Green: IPCA, Red: our results. The IPCA started to drift off from frame #628 when the player's legs moved fast, and lost track shortly after.
Fig. 12. Tracking results on soccer sequences, frames #10, 15, 122, 248, 285, 360. Green: IPCA, Red: our results. The IPCA started to drift off from frame #15 due to low contrast between the target and the background.
ACKNOWLEDGMENTS

We would like to thank DSO National Laboratories, Singapore for partially sponsoring this work, and David Ross for sharing the test sequences.
REFERENCES

[1] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 29(1):328, 2008.
Fig. 13. Tracking results on DUDEK FACE sequences, frames #1, 361, 459, 605, 795, 1095. Green: IPCA, Red: our results. Both results were comparable despite the rich facial expressions, which had more effect on our covariance descriptor.
Fig. 14. Tracking results on PETS2001 vehicle sequences, frames #1, 25, 50, 75, 100, 125. Green: IPCA, Red: our results. Both results were comparable.
Fig. 15. Tracking results on car sequences, frames #1, 132, 150, 168, 184, 227. Green: IPCA, Red: our results. The IPCA performed better and was robust to illumination changes, but our method mainly used template gradients, which changed dramatically due to shadow and lack of reflection of the car plate. At frame #227, the arrow sign might look too similar to the target in gradients.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[3] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. In CVPR, pages 232–, 1998.
[4] M. Black and A. Jepson. Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. IJCV, 26(1):63–84, 1998.
[5] O. Cappe, S. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, May 2007.
[6] P. C. Cargill, C. U. Rius, D. M. Quiroz, and A. Soto. Performance evaluation of the covariance descriptor for target detection. In Chilean Computer Science Society (SCCC), 2009 International Conference of the, pages 133–141, Nov 2009.
[7] C. Cedras and M. Shah. Motion-based recognition: a survey. Image and Vision Computing, 13(2):129–155, 1995.
[8] M. Chen, S. Pang, T. Cham, and A. Goh. Visual tracking with generative template model based on Riemannian manifold of covariances. In Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on, pages 1–8. IEEE, 2011.
[9] R. Collins, Y. Liu, and M. Leordeanu. Online selection of