Maximum Margin Projection Subspace Learning for Visual Data Analysis
Symeon Nikitidis, Anastasios Tefas and Ioannis Pitas
Abstract—Visual pattern recognition from images often involves dimensionality reduction as a key step to discover the latent image features and obtain a more manageable problem. Contrary to what is commonly practiced today in various recognition applications, where dimensionality reduction and classification are independently treated, we propose a novel dimensionality reduction method appropriately combined with a classification algorithm. The proposed method, called Maximum Margin Projection Pursuit, aims to identify a low dimensional projection subspace, where samples form classes that are better discriminated, i.e., are separated with maximum margin. The proposed method is an iterative alternate optimization algorithm that computes the maximum margin projections exploiting the separating hyperplanes obtained from training a Support Vector Machine classifier in the identified low dimensional space. Experimental results on both artificial data, as well as on popular databases for facial expression, face and object recognition, verified the superiority of the proposed method against various state-of-the-art dimensionality reduction algorithms.
I. INTRODUCTION
ONE of the most crucial problems that every image analysis algorithm encounters is the high dimensionality of the image data, which can range from several hundreds to thousands of extracted image features. Directly dealing with such high dimensional data is not only computationally inefficient, but also yields several problems in subsequently performed statistical learning algorithms, due to the so-called "curse of dimensionality". Thus, various techniques have been proposed in the literature for efficient data embedding (or dimensionality reduction) that obtain a more manageable problem and alleviate computational complexity. Such a popular category of methods is the subspace image representation algorithms, which aim to discover the latent image features by projecting linearly or non-linearly the high-dimensional input samples to a low-dimensional subspace, where an appropriately formed criterion is optimized.
The most popular dimensionality reduction algorithms can be roughly categorized, according to their underlying optimization criteria, into two main categories: those that form their optimization criterion based on geometrical arguments and those that attempt to enhance data discrimination in the projection subspace. The goal of the first category of methods is to embed data into a low-dimensional space, where the intrinsic data geometry is preserved. Principal Component Analysis (PCA) [1] is such a representative method that exploits the global data structure, in order to identify a subspace where the sample variance is maximized. While PCA exploits the global data characteristics in the Euclidean space, the local data manifold structure is ignored. To overcome this deficiency, manifold-based embedding algorithms assume that the data reside on a submanifold of the ambient space and attempt to discover and preserve its structure. Representative such methods include e.g. ISOMAP [2], Locally Linear Embedding (LLE) [3], Locality Preserving Projections (LPP) [4], Orthogonal Locality Preserving Projections (OLPP) [5] and Neighborhood Preserving Embedding (NPE) [6].
Discrimination enhancing embedding algorithms aim to identify a discriminative subspace, in which the data samples from different classes are far apart from each other. Linear Discriminant Analysis (LDA) [7] and its variants are such representative methods that extract discriminant information by finding projection directions that maximize the ratio of the between-class and the within-class scatter. Margin maximizing embedding algorithms [8], [9], [10], inspired by the great success of Support Vector Machines (SVMs) [11], also fall in this category, since their goal is to enhance class discrimination in the low dimensional space.
The Maximum Margin Projection (MMP) algorithm [9] is an unsupervised embedding method that attempts to find different subspace directions that separate data points in different clusters with maximum margin. To do so, MMP seeks such a data labelling that, if an SVM classifier is trained on it, the resulting separating hyperplanes can separate different clusters with the maximum margin. Thus, the projection direction of the corresponding SVM, trained on such an optimal data labelling, is considered as one of the directions of the sought subspace, while, by considering different possible data labellings and enforcing the constraint that the next SVM projection direction should be orthogonal to the previous ones, several projections are derived and added to the subspace. He et al. [8] proposed a semisupervised dimensionality reduction method for image retrieval that aims to discover both geometrical and discriminant structures of the data manifold. To do so, the algorithm constructs a within-class and a between-class graph by exploiting both class and neighborhood information and finds a linear transformation matrix that maps image data to a subspace, where, at each local neighborhood, the margin between relevant and irrelevant images is maximized.
Recently, significant attention has been attracted by Compressed Sensing (CS) [12], which combines data acquisition with data dimensionality reduction performed by Random Projections (RP). RP are a desirable alternative to traditional embedding techniques, since they offer certain advantages. Firstly, they are data independent and do not require a training phase, thus being computationally efficient. Secondly, as it
has been shown in the literature [13], [14], [15], a Gaussian random projection matrix preserves the pairwise distances between data points in the projection subspace and, thus, can be effectively combined with distance-based classifiers, such as SVMs. Another important aspect for real life applications using sensitive biometric data is the provision of security and user privacy protection mechanisms, since the use of random features, instead of the actual biometric data for e.g. person identification, protects the original data [16] from malicious attacks.
In this paper, we integrate optimal data embedding and SVM classification in a single framework, to be called Maximum Margin Projection Pursuit (MMPP). MMPP initializes the projection matrix as a semiorthogonal Gaussian RP matrix, in order to exploit the aforementioned merits, and, based on the decision hyperplanes obtained from training an SVM classifier, it derives an optimal projection matrix, such that the separating margin between the projected samples of different classes is maximized. The MMPP approach brings certain advantages, both to data embedding and classification. Contrary to what is commonly practiced, where dimensionality reduction and classification are treated independently, MMPP combines these into a single framework. Furthermore, in contrast to the conventional classification approaches, which consider that the training data points are fixed in the input space, the SVM classifier is trained over the projected data samples in the projection subspace determined by MMPP. Thus, working on low dimensional data reduces the required computational effort. Moreover, since the decision hyperplane identified by SVM training is explicitly determined by the support vectors, data outliers and the overall data samples distribution inside classes do not affect MMPP performance, in contrast to other discriminant subspace learning algorithms, such as LDA, which assumes a Gaussian data distribution for optimal class discrimination.
In summary, the novel contributions of this paper are the following:
• A discrimination enhancing subspace learning method, called MMPP, that works directly on the compressed samples is proposed.
• The MMPP algorithm integrates data embedding and classification into a single framework, thus possessing certain desired advantages (good classification performance, computational speed and robustness to data outliers).
• MMPP is derived both for two class and multiclass data embedding problems.
• The MMPP non-linear extension, which seeks to identify a projection matrix that separates different classes in the feature space with maximum margin, is also demonstrated.
• The superiority of the proposed method against various state-of-the-art embedding algorithms for facial image characterization problems and object recognition is verified by several simulation experiments on popular datasets.
The rest of the paper is organized as follows. Section II presents the proposed MMPP dimensionality reduction algorithm for a two-class linear classification problem and discusses its initialization using a semiorthogonal Gaussian random projection matrix, in order to form the basis of the projection subspace. The MMPP extension to a multiclass problem is presented in Section III, while its non-linear extension, considering either a Gaussian Radial Basis or an arbitrary degree polynomial kernel function, is derived in Section IV. Section V describes the conducted experiments and presents experimental evidence regarding the superiority of the proposed algorithm against various state-of-the-art data embedding methods. Finally, concluding remarks are drawn in Section VI.
II. MAXIMUM MARGIN PROJECTION PURSUIT
The MMPP algorithm aims to identify a low-dimensional projection subspace, where samples form classes that are better discriminated, i.e., are separated with maximum margin. To do so, MMPP involves three main steps. The first step, performed during the initialization of the MMPP algorithm, extracts the random features from the initial data and forms the basis of the low-dimensional projection subspace using RP, while the second and the third steps involve two optimization problems that are combined in a single iterative optimization framework. More precisely, the second step identifies the optimal decision hyperplane that separates different classes with maximum margin, in the respective subspace determined by the projection matrix, while the third step updates the projection matrix, so that the identified separating margin between the projected samples of different classes is increased. Next, we first formulate the optimization problems considered by MMPP, discuss algorithm initialization and demonstrate the iterative optimization framework, considering both a two class and a multiclass separation problem. Subsequently, we derive the non-linear MMPP algorithm's extension and propose update rules considering polynomial and Gaussian kernel functions to project data into a Hilbert space, using the so-called kernel trick.
A. MMPP Algorithm for the Binary Classification Problem
Given a set $X = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$ of $N$ training data pairs, where $\mathbf{x}_i \in \mathbb{R}^m$, $i = 1, \ldots, N$, are the $m$-dimensional input feature vectors and $y_i \in \{-1, 1\}$ is the class label associated with each sample $\mathbf{x}_i$, a binary SVM classifier attempts to find the separating hyperplane that separates training data points of the two classes with maximum margin, while minimizing the classification error, defined according to which side of the decision hyperplane training samples of each class fall in. Considering that each training sample of $X$ is first projected from the initial $m$-dimensional input space to a low-dimensional subspace using a projection matrix $\mathbf{R} \in \mathbb{R}^{r \times m}$, where $r \ll m$, and performing the linear transformation $\acute{\mathbf{x}}_i = \mathbf{R}\mathbf{x}_i$, the binary SVM optimization problem is formulated as follows:
$$\min_{\mathbf{w}, \xi_i} \left\{ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i \right\} \quad (1)$$
subject to the constraints:
$$y_i\left(\mathbf{w}^T\mathbf{R}\mathbf{x}_i + b\right) \geq 1 - \xi_i \quad (2)$$
$$\xi_i \geq 0, \quad i = 1, \ldots, N, \quad (3)$$
where $\mathbf{w} \in \mathbb{R}^r$ is the normal vector of the separating hyperplane, which is $r$-dimensional, since training is performed in the projection subspace, $b \in \mathbb{R}$ is its bias term, $\boldsymbol{\xi} = [\xi_1, \ldots, \xi_N]^T$ are the slack variables, each one associated with a training sample, and $C$ is the term that penalizes the training error.
The MMPP algorithm attempts to learn a projection matrix $\mathbf{R}$, such that the low-dimensional data sample projection is performed efficiently, thus enhancing the discrimination between the two classes. To quantify the discrimination power of the projection matrix $\mathbf{R}$, we formulate our MMPP algorithm based on geometrical arguments. To do so, we employ a combined iterative optimization framework, involving the simultaneous optimization of the separating hyperplane normal vector $\mathbf{w}$ and the projection matrix $\mathbf{R}$, performed by successively updating the one variable, while keeping the other fixed. Next, we first discuss the derivation of the optimal separating hyperplane normal vector $\mathbf{w}_o$ in the projection subspace determined by $\mathbf{R}$ and, subsequently, we demonstrate the projection matrix update with respect to the fixed $\mathbf{w}_o$.
1) Finding the optimal $\mathbf{w}_o$ in the projection subspace determined by $\mathbf{R}$: The optimization with respect to $\mathbf{w}$ is essentially the conventional binary SVM training problem, performed in the projection subspace determined by $\mathbf{R}$, rather than in the input space. To solve the constrained optimization problem in (1) with respect to $\mathbf{w}$, we introduce positive Lagrange multipliers $\alpha_i$ and $\beta_i$, each associated with one of the constraints in (2) and (3), respectively, and formulate the Lagrangian function $L(\mathbf{w}, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha}, \boldsymbol{\beta})$:
$$L(\mathbf{w}, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i\left(\mathbf{w}^T\mathbf{R}\mathbf{x}_i + b\right) - 1 + \xi_i\right] - \sum_{i=1}^{N}\beta_i\xi_i. \quad (4)$$
The solution can be found from the saddle point of the Lagrangian function, which has to be maximized with respect to the dual variables $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ and minimized with respect to the primal ones $\mathbf{w}$, $\boldsymbol{\xi}$ and $b$. According to the Karush-Kuhn-Tucker (KKT) conditions [17], the partial derivatives of $L(\mathbf{w}, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha}, \boldsymbol{\beta})$ with respect to the primal variables $\mathbf{w}$, $\boldsymbol{\xi}$ and $b$ vanish, deriving the following equalities:
$$\frac{\partial L(\mathbf{w}, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{R}\mathbf{x}_i, \quad (5)$$
$$\frac{\partial L(\mathbf{w}, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N}\alpha_i y_i = 0, \quad (6)$$
$$\frac{\partial L(\mathbf{w}, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \xi_i} = 0 \Rightarrow \beta_i = C - \alpha_i. \quad (7)$$
By substituting the terms from the above equalities into (4), we switch to the dual formulation, where the optimization problem in (1) is reformulated to the maximization of the following Wolfe dual problem:
$$\max_{\boldsymbol{\alpha}} \left\{ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j}^{N}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{R}^T\mathbf{R}\mathbf{x}_j \right\} \quad (8)$$
subject to the constraints:
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \quad \alpha_i \geq 0, \ \forall i = 1, \ldots, N. \quad (9)$$
Consequently, solving (8) for $\boldsymbol{\alpha}$, the optimal separating hyperplane normal vector $\mathbf{w}_o$ in the reduced dimensional space determined by $\mathbf{R}$ is subsequently derived from (5).
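For concreteness, a minimal sketch of this step in Python, assuming scikit-learn's SVC as the off-the-shelf binary SVM solver (the paper uses LIBLINEAR; any solver exposing the dual coefficients $\alpha_i y_i$ works the same way; all names are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_in_subspace(R, X, y, C=1.0):
    """Solve (8)-(9) in the subspace given by R and recover w_o via (5).

    R: (r, m) projection matrix; X: (N, m) samples; y: (N,) labels in {-1, +1}.
    Returns the products alpha_i * y_i (non-zero only for support vectors)
    and the normal vector w_o of the separating hyperplane.
    """
    X_proj = X @ R.T                       # x'_i = R x_i, shape (N, r)
    svm = SVC(kernel="linear", C=C).fit(X_proj, y)
    alpha_y = np.zeros(len(y))             # alpha_i * y_i for all samples
    alpha_y[svm.support_] = svm.dual_coef_.ravel()
    w_o = X_proj.T @ alpha_y               # eq. (5): w = sum_i alpha_i y_i R x_i
    return alpha_y, w_o
```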
2) Maximum margin projection matrix update for fixed $\mathbf{w}_o$: At each optimization round $t$, we seek to update the projection matrix $\mathbf{R}^{(t-1)}$, so that its new estimate $\mathbf{R}^{(t)}$ improves the objective function in (8) by maximizing the margin between the two classes. To do so, we first project the high dimensional training samples $\mathbf{x}_i$ from the input space to a low dimensional subspace, using the projection matrix $\mathbf{R}^{(t-1)}$ derived during the previous step, and, subsequently, train the binary SVM classifier in order to obtain the optimal Lagrange multipliers $\boldsymbol{\alpha}_o$ specifying the normal vector of the separating hyperplane $\mathbf{w}_o^{(t)}$.

To formulate the optimization problem for the projection matrix $\mathbf{R}$, we exploit the dual form of the binary SVM cost function defined in (8). However, since the term $\sum_{i=1}^{N}\alpha_i$ is constant with respect to $\mathbf{R}$, we can remove it from the cost function. Moreover, in order to retain the geometrical correlation between samples in the projection subspace, we constrain the derived updated projection matrix $\mathbf{R}^{(t)}$ to be semiorthogonal. Consequently, the constrained optimization problem for the projection matrix $\mathbf{R}$ update can be summarized by the objective function $O(\mathbf{R})$ as follows:
$$\min_{\mathbf{R}} O(\mathbf{R}) = \frac{1}{2}\sum_{i,j}^{N}\alpha_{i,o}\alpha_{j,o} y_i y_j \mathbf{x}_i^T\mathbf{R}^T\mathbf{R}\mathbf{x}_j, \quad (10)$$
subject to the constraints:
$$\mathbf{R}\mathbf{R}^T = \mathbf{I}, \quad (11)$$
where $\mathbf{I}$ is an $r \times r$ identity matrix. The orthogonality constraint introduces an optimization problem on the Stiefel manifold, solved to find the dimensionality reduction matrix $\mathbf{R}$.
In the literature, optimization of problems with orthogonality constraints is performed using a gradient descent algorithm along the Stiefel manifold geodesics [18], [19]. However, the simplest approach to take the constraint $\mathbf{R}\mathbf{R}^T = \mathbf{I}$ into account is to update $\mathbf{R}$ using any appropriate unconstrained optimization algorithm and then to project $\mathbf{R}$ back to the constraint set [20]. This is the direction we have followed in this paper, where we first solve (10) without the orthonormality constraints on the rows of the projection matrix and obtain $\acute{\mathbf{R}}$. Consequently, the projection matrix update is accomplished by orthonormalizing the rows of $\acute{\mathbf{R}}$ through a Gram-Schmidt procedure. Thus, we solve (10) for $\mathbf{R}$, keeping $\mathbf{w}_o^{(t)}$
fixed, by applying a steepest descent optimization algorithm, which, at a given iteration $t$, invokes the following update rule:
$$\acute{\mathbf{R}}^{(t)} = \mathbf{R}^{(t-1)} - \lambda_t \nabla O(\mathbf{R}^{(t-1)}), \quad (12)$$
where $\lambda_t$ is the learning step parameter for the $t$-th iteration and $\nabla O(\mathbf{R}^{(t-1)})$ is the partial derivative of the objective function $O(\mathbf{R})$ in (10) with respect to $\mathbf{R}^{(t-1)}$, evaluated as:
$$\nabla O(\mathbf{R}^{(t-1)}) = \sum_{i,j}^{N}\alpha_{i,o}\alpha_{j,o} y_i y_j \mathbf{R}^{(t-1)}\mathbf{x}_i\mathbf{x}_j^T = \sum_{i=1}^{N}\alpha_{i,o} y_i \mathbf{w}_o^{(t)}\mathbf{x}_i^T. \quad (13)$$
Thus, $\acute{\mathbf{R}}^{(t)}$ is derived as:
$$\acute{\mathbf{R}}^{(t)} = \mathbf{R}^{(t-1)} - \lambda_t\left(\sum_{i=1}^{N}\alpha_{i,o} y_i \mathbf{w}_o^{(t)}\mathbf{x}_i^T\right). \quad (14)$$
Having obtained the projection matrix $\acute{\mathbf{R}}^{(t)}$ that increases the separating margin between the two classes in the projection subspace, we subsequently orthonormalize its rows to derive $\mathbf{R}^{(t)}$.
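A sketch of this update in Python (numpy), using the closed form (13) and a QR factorization as the row orthonormalization (equivalent to Gram-Schmidt up to signs); function and variable names are illustrative:

```python
import numpy as np

def update_projection_matrix(R_prev, X, alpha_y, w_o, lr):
    """One steepest-descent step (14) followed by row orthonormalization.

    R_prev: (r, m) current projection matrix; X: (N, m) samples;
    alpha_y: (N,) products alpha_i * y_i from the trained SVM;
    w_o: (r,) hyperplane normal in the subspace; lr: learning step lambda_t.
    """
    # eq. (13): grad = sum_i alpha_i y_i w_o x_i^T  (an r x m matrix)
    grad = np.outer(w_o, alpha_y @ X)
    R_acute = R_prev - lr * grad                    # eq. (14)
    # Orthonormalize the rows: reduced QR of the transpose gives
    # orthonormal columns spanning the same row space
    Q, _ = np.linalg.qr(R_acute.T)                  # Q: (m, r)
    return Q.T                                       # rows are orthonormal
```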
An efficient approach for setting an appropriate value to the learning step parameter $\lambda_t$, based on the Armijo rule [21], is presented in [22] and is also adopted in this work. According to this strategy, the learning step takes the form $\lambda_t = \beta^{g_t}$, where $g_t$ is the first non-negative integer value found satisfying:
$$O(\mathbf{R}^{(t)}) - O(\mathbf{R}^{(t-1)}) \leq \sigma\left\langle \nabla O(\mathbf{R}^{(t-1)}), \mathbf{R}^{(t)} - \mathbf{R}^{(t-1)} \right\rangle, \quad (15)$$
where the operator $\langle \cdot, \cdot \rangle$ is the Frobenius inner product. Parameters $\beta$ and $\sigma$ in our experiments have been set to $\beta = 0.1$ and $\sigma = 0.01$, which is an efficient parameter selection, as has been verified in other studies [22], [23].
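This backtracking scheme can be sketched as follows, assuming the objective $O(\cdot)$ in (10) is available as a callable; this is a simplified reading of the rule, not the authors' implementation:

```python
import numpy as np

def armijo_step(R_prev, grad, objective, beta=0.1, sigma=0.01, max_tries=20):
    """Return the first lambda_t = beta**g_t, g_t = 0, 1, ..., satisfying (15)."""
    O_prev = objective(R_prev)
    for g in range(max_tries):
        lr = beta ** g
        R_new = R_prev - lr * grad
        # Frobenius inner product <grad, R_new - R_prev> on the right of (15)
        if objective(R_new) - O_prev <= sigma * np.sum(grad * (R_new - R_prev)):
            return lr
    return beta ** max_tries    # fall back to a very small step
```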
After deriving the new projection matrix $\mathbf{R}^{(t)}$, the previously identified separating hyperplane is no longer optimal, since it has been evaluated in the projection subspace determined by $\mathbf{R}^{(t-1)}$. Consequently, it is required to re-project the training samples using $\mathbf{R}^{(t)}$ and retrain the SVM classifier, to obtain the current optimal separating hyperplane and its normal vector. Thus, the MMPP algorithm iteratively updates the projection matrix and evaluates the normal vector of the optimal separating hyperplane $\mathbf{w}_o$ in the projection subspace determined by $\mathbf{R}$, until the algorithm converges.
In order to verify whether the learned projection matrix $\mathbf{R}^{(t)}$ at each iteration round $t$ is optimal or not, we track the partial derivative value in (13) to identify stationarity. The following stationarity check step is performed, which examines whether the following termination condition is satisfied:
$$\|\nabla O(\mathbf{R}^{(t)})\|_F \leq e_R \|\nabla O(\mathbf{R}^{(0)})\|_F, \quad (16)$$
where $e_R$ is a predefined stopping tolerance satisfying $0 < e_R < 1$. In our conducted experiments, we considered $e_R = 10^{-3}$. The combined iterative optimization process of the MMPP algorithm for the binary classification problem is summarized in Algorithm 1.
Algorithm 1 Outline of the Maximum Margin Projection Pursuit Algorithm Considering a Binary Classification Problem.
1: Input: The set $X = \{(\mathbf{x}_i, y_i), i = 1, \ldots, N\}$ of $N$ $m$-dimensional two class train data samples.
2: Output: The optimal maximum margin projection matrix $\mathbf{R}_o$ and the optimal separating hyperplane normal vector $\mathbf{w}_o$.
3: Initialize: $t = 1$ and $\mathbf{R}^{(0)} \in \mathbb{R}^{r \times m}$ as a semiorthogonal Gaussian random projection matrix.
4: repeat
5: Project $\mathbf{x}_i$ to a low dimensional subspace performing the linear transformation $\acute{\mathbf{x}}_i = \mathbf{R}^{(t-1)}\mathbf{x}_i$, $\forall i = 1, \ldots, N$.
6: Train the binary SVM classifier in the projection subspace by solving the optimization problem in (8), subject to the constraints in (9), to obtain the optimal Lagrange multipliers $\boldsymbol{\alpha}_o$.
7: Obtain the normal vector of the optimal separating hyperplane as $\mathbf{w}_o^{(t)} = \sum_{i=1}^{N}\alpha_{i,o} y_i \mathbf{R}^{(t-1)}\mathbf{x}_i$.
8: Evaluate the gradient $\nabla O(\mathbf{R}^{(t-1)}) = \sum_{i=1}^{N}\alpha_{i,o} y_i \mathbf{w}_o^{(t)}\mathbf{x}_i^T$.
9: Determine the learning rate $\lambda_t$.
10: Update the projection matrix $\mathbf{R}^{(t-1)}$ given $\mathbf{w}_o^{(t)}$ as $\mathbf{R}^{(t)} = \mathrm{Orthogonalize}\left(\mathbf{R}^{(t-1)} - \lambda_t\sum_{i=1}^{N}\alpha_{i,o} y_i \mathbf{w}_o^{(t)}\mathbf{x}_i^T\right)$.
11: $t = t + 1$, $\mathbf{R}_o = \mathbf{R}^{(t)}$ and $\mathbf{w}_o = \mathbf{w}_o^{(t)}$.
12: until $\|\nabla O(\mathbf{R}^{(t)})\|_F \leq 10^{-3}\|\nabla O(\mathbf{R}^{(0)})\|_F$.
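Putting the pieces together, a compact end-to-end sketch of Algorithm 1 in Python follows. It assumes scikit-learn's SVC for step 6 and, for brevity, a fixed learning step in place of the Armijo search; it is a sketch under these assumptions, not the authors' code:

```python
import numpy as np
from sklearn.svm import SVC

def mmpp_binary(X, y, r, C=1.0, lr=0.1, e_R=1e-3, max_iter=500, seed=0):
    """Maximum Margin Projection Pursuit for a binary problem (Algorithm 1).

    X: (N, m) samples; y: (N,) labels in {-1, +1}; r: subspace dimension.
    Returns the learned projection matrix R (r, m) and normal vector w.
    """
    rng = np.random.default_rng(seed)
    # Step 3: semiorthogonal Gaussian random initialization (orthonormal rows)
    Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], r)))
    R = Q.T
    grad_norm_0 = None
    w = None
    for _ in range(max_iter):
        # Steps 5-7: train the SVM in the current subspace, recover alpha and w
        X_proj = X @ R.T
        svm = SVC(kernel="linear", C=C).fit(X_proj, y)
        alpha_y = np.zeros(len(y))
        alpha_y[svm.support_] = svm.dual_coef_.ravel()
        w = X_proj.T @ alpha_y                       # eq. (5)
        # Step 8: gradient (13)
        grad = np.outer(w, alpha_y @ X)
        if grad_norm_0 is None:
            grad_norm_0 = np.linalg.norm(grad)       # ||grad O(R^(0))||_F
        if np.linalg.norm(grad) <= e_R * grad_norm_0:  # condition (16)
            break
        # Step 10: descent step (14) followed by row orthonormalization
        Q, _ = np.linalg.qr((R - lr * grad).T)
        R = Q.T
    return R, w
```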
B. MMPP Algorithm Initialization
In order to initialize the iterative optimization framework, it is first required to train the binary SVM classifier and obtain the optimal $\mathbf{w}_o$ in a low dimensional subspace determined by an initial projection matrix $\mathbf{R}^{(0)}$, used in order to perform dimensionality reduction and form the basis of the projection subspace. To do so, we construct $\mathbf{R}^{(0)}$ as a semiorthogonal Gaussian random projection matrix. To derive $\mathbf{R}^{(0)}$, we create the $r \times m$ matrix $\mathbf{R}$ of i.i.d., zero-mean, unit variance Gaussian random variables, normalize its first row and orthogonalize the remaining rows with respect to the first, via a Gram-Schmidt procedure. This procedure results in the Gaussian random projection matrix $\mathbf{R}^{(0)}$ having orthonormal rows that can be used for the initialization of the iterative optimization framework.
III. MMPP ALGORITHM FOR MULTICLASS CLASSIFICATION PROBLEMS
The dominant approach for solving multiclass classification problems using SVMs has been based on reducing the multiclass task into multiple binary ones and building a set of binary classifiers, where each one distinguishes samples between one pair of classes [24]. However, adopting such a one-against-one multiclass SVM classification scheme to our
MMPP algorithm requires learning one projection matrix for each of the $k(k-1)/2$ binary SVM classifiers that handle a $k$-class classification problem. Clearly, this approach becomes impractical for classification tasks involving a large number of classes, as, for instance, in face recognition.
A different approach to generalize SVMs to multiclass problems is to handle all available training data together, forming a single optimization problem by adding appropriate constraints for every class [25], [26]. However, the size of the generated quadratic optimization problem may become extremely large, since it is proportional to the product of the number of training samples multiplied by the number of classes in the classification task at hand. Crammer and Singer [27] proposed an elegant approach for multiclass classification, by solving a single optimization problem, where the number of added constraints is reduced and remains proportional to the number of the available training samples. More importantly, such a one-against-all multiclass SVM formulation enables us to learn a single maximum margin projection matrix, common for all training samples, independently of the class they belong to. Therefore, we adopt this multiclass SVM formulation [27] in this work.
In the multiclass classification context, the training samples $\mathbf{x}_i$ are assigned a class label $y_i \in \{1, \ldots, k\}$, where $k$ is the number of classes. We extend the multiclass SVM formulation proposed in [27], by considering that all training samples are first projected on a low-dimensional subspace determined by the projection matrix $\mathbf{R}$. Our goal is to solve the MMPP optimization problem and to learn a common projection matrix $\mathbf{R}$ for all classes, such that the training samples of different classes are projected in a subspace where they are separated with maximum margin, and also to derive $k$ separating hyperplanes, where the $p$-th hyperplane, $p = 1, \ldots, k$, determined by its normal vector $\mathbf{w}_p \in \mathbb{R}^r$, separates the training vectors of the $p$-th class from all the others with maximum margin.
The multiclass SVM optimization problem in the projection subspace is formulated as follows:
$$\min_{\mathbf{w}_p, \xi_i} \left\{ \frac{1}{2}\sum_{p=1}^{k}\mathbf{w}_p^T\mathbf{w}_p + C\sum_{i=1}^{N}\xi_i \right\}, \quad (17)$$
subject to the constraints:
$$\mathbf{w}_{y_i}^T\mathbf{R}\mathbf{x}_i - \mathbf{w}_p^T\mathbf{R}\mathbf{x}_i \geq b_i^p - \xi_i, \quad i = 1, \ldots, N, \ p = 1, \ldots, k. \quad (18)$$
Here, the bias vector $\mathbf{b}$ is defined as:
$$b_i^p = 1 - \delta_{p,y_i} = \begin{cases} 1, & \text{if } y_i \neq p \\ 0, & \text{if } y_i = p, \end{cases} \quad (19)$$
where $\delta_{p,y_i}$ is the Kronecker delta function, which is $1$ for $y_i = p$ and $0$ otherwise.
Similar to the binary classification case, we employ a combined iterative optimization framework that successively either optimizes the variables $\mathbf{w}_p$, $p = 1, \ldots, k$, keeping matrix $\mathbf{R}$ fixed (thus training the multiclass SVM classifier in the projection subspace determined by $\mathbf{R}$), or updates the projection matrix $\mathbf{R}$, so that it improves the objective function, i.e., it projects the training samples in a subspace where the margin that separates the training samples of each class from all the others is maximized. Next, we first demonstrate the optimization process with respect to the normal vectors of the separating hyperplanes in the projection subspace of $\mathbf{R}$ and, subsequently, we discuss the projection matrix $\mathbf{R}$ update, while keeping the optimal normal vectors $\mathbf{w}_{p,o}$ fixed.
1) Finding the optimal $\mathbf{w}_{p,o}$ in the projection subspace determined by $\mathbf{R}$: Since the derivation of the following dual optimization problem is rather technical, we will briefly demonstrate it and refer the interested reader to [28], [29] for its complete exposition. To solve the constrained optimization problem in (17), we introduce positive Lagrange multipliers $\alpha_i^p$, each associated with one of the constraints in (18). Note that it is not required to introduce additional Lagrange multipliers regarding the non-negativity constraint applied on the slack variables $\xi_i$. This is already included in (18), since, for $y_i = p$, $b_i^p = 0$ and the inequalities in (18) become $\xi_i \geq 0$. The Lagrangian function $L(\mathbf{w}_p, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha})$ takes the form:
$$L(\mathbf{w}_p, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha}) = \frac{1}{2}\sum_{p=1}^{k}\mathbf{w}_p^T\mathbf{w}_p + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\sum_{p=1}^{k}\alpha_i^p\left[\left(\mathbf{w}_{y_i}^T - \mathbf{w}_p^T\right)\mathbf{R}\mathbf{x}_i + \xi_i - b_i^p\right]. \quad (20)$$
Switching to the dual formulation, the solution of the constrained optimization problem in (17) can be found from the saddle point of the Lagrangian function in (20), which has to be maximized with respect to the dual variables $\boldsymbol{\alpha}$ and minimized with respect to the primal ones $\mathbf{w}_p$ and $\boldsymbol{\xi}$. To find the minimum over the primal variables, we require that the partial derivatives of $L(\mathbf{w}_p, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha})$ with respect to $\boldsymbol{\xi}$ and $\mathbf{w}_p$ vanish, which gives the following equalities:
$$\frac{\partial L(\mathbf{w}_p, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha})}{\partial \xi_i} = 0 \Rightarrow \sum_{p=1}^{k}\alpha_i^p = C, \quad (21)$$
$$\frac{\partial L(\mathbf{w}_p, \boldsymbol{\xi}, \mathbf{R}, \boldsymbol{\alpha})}{\partial \mathbf{w}_p} = 0 \Rightarrow \mathbf{w}_p = \sum_{i=1}^{N}\left(\alpha_i^p - C\delta_{p,y_i}\right)\mathbf{R}\mathbf{x}_i \Leftrightarrow \mathbf{w}_p = \sum_{i=1}^{N}\left(\alpha_i^p - \sum_{p=1}^{k}\alpha_i^p\delta_{p,y_i}\right)\mathbf{R}\mathbf{x}_i. \quad (22)$$
By substituting terms from (21) and (22) into (20), expressing the bias terms and Lagrange multipliers corresponding to the $i$-th training sample in vector form as $\mathbf{b}_i = [b_i^1, \ldots, b_i^k]^T$ and $\boldsymbol{\alpha}_i = [\alpha_i^1, \ldots, \alpha_i^k]^T$, respectively, and performing the substitution $\mathbf{n}_i = C\mathbf{1}_{y_i} - \boldsymbol{\alpha}_i$ (where $\mathbf{1}_{y_i}$ is a $k$-dimensional vector with all its components equal to zero, except the $y_i$-th, which is equal to one), the saddle point of the Lagrangian is reformulated to the minimization of the following Wolfe dual problem:
$$\min_{\mathbf{n}} \left\{ \frac{1}{2}\sum_{i,j}^{N}\mathbf{x}_i^T\mathbf{R}^T\mathbf{R}\mathbf{x}_j\,\mathbf{n}_i^T\mathbf{n}_j + \sum_{i=1}^{N}\mathbf{n}_i^T\mathbf{b}_i \right\}, \quad (23)$$
subject to the constraints:
$$\sum_{p=1}^{k} n_i^p = 0, \quad n_i^p \leq \begin{cases} 0, & \text{if } y_i \neq p \\ C, & \text{if } y_i = p \end{cases} \quad \forall i = 1, \ldots, N, \ p = 1, \ldots, k. \quad (24)$$
By solving (23) for $\mathbf{n}$ and, consequently, deriving the optimal value of the actual Lagrange multipliers $\boldsymbol{\alpha}_o$, the normal vector $\mathbf{w}_{p,o}$ is derived from (22), corresponding to the optimal decision hyperplane that separates the training samples of the $p$-th class from all the others with maximum margin in the projection subspace of $\mathbf{R}$.
2) Maximum margin projection matrix update for fixed $\mathbf{w}_{p,o}$: Similar to the binary classification case, we formulate the optimization problem by exploiting the dual form of the multiclass SVM cost function in (23). To do so, we remove the term $\sum_{i=1}^{N}\mathbf{n}_i^T\mathbf{b}_i$ from (23), since it is independent of the optimized variable $\mathbf{R}$, and derive the objective function $O(\mathbf{R})$. In addition, we impose orthogonality constraints on the rows of the derived projection matrix $\mathbf{R}^{(t)}$, thus leading to the following optimization problem:
$$\min_{\mathbf{R}} O(\mathbf{R}) = \frac{1}{2}\sum_{i,j}^{N}\mathbf{x}_i^T\mathbf{R}^T\mathbf{R}\mathbf{x}_j\,\mathbf{n}_i^T\mathbf{n}_j, \quad (25)$$
subject to the constraints:
$$\mathbf{R}\mathbf{R}^T = \mathbf{I}. \quad (26)$$
To derive a new estimate of $\mathbf{R}_o$ at a given iteration $t$, the steepest descent update rule in (12) is invoked, where $\nabla O(\mathbf{R}^{(t-1)})$ is the partial derivative of (25) with respect to $\mathbf{R}^{(t-1)}$:
$$\nabla O(\mathbf{R}^{(t-1)}) = \sum_{i,j}^{N}\mathbf{R}^{(t-1)}\mathbf{x}_i\mathbf{x}_j^T\,\mathbf{n}_{i,o}^T\mathbf{n}_{j,o} = -\sum_{i=1}^{N}\sum_{p=1}^{k}\alpha_{i,o}^p\left(\mathbf{w}_{y_i,o}^{(t)} - \mathbf{w}_{p,o}^{(t)}\right)\mathbf{x}_i^T. \quad (27)$$
Thus, $\acute{\mathbf{R}}^{(t)}$ is updated as:
$$\acute{\mathbf{R}}^{(t)} = \mathbf{R}^{(t-1)} + \lambda_t\left(\sum_{i=1}^{N}\sum_{p=1}^{k}\alpha_{i,o}^p\left(\mathbf{w}_{y_i,o}^{(t)} - \mathbf{w}_{p,o}^{(t)}\right)\mathbf{x}_i^T\right). \quad (28)$$
The projection matrix update is followed by the orthonormalization of the rows of $\acute{\mathbf{R}}^{(t)}$, in order to satisfy the imposed constraints. Similar to the binary classification task, the MMPP algorithm for multiclass classification problems successively updates the maximum margin projection matrix $\mathbf{R}$ and evaluates the normal vectors $\mathbf{w}_{p,o}$, $p = 1, \ldots, k$, of the $k$ optimal separating hyperplanes in the projection subspace determined by $\mathbf{R}$. The involved learning rate parameter $\lambda_t$ is set using the previously presented methodology for the binary classification case, while the iterative optimization process is terminated by tracking the partial derivative value in (27) and examining the termination condition in (16).
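A numpy sketch of the multiclass gradient, using the second form in (27) and assuming the per-sample dual variables $\alpha_i^p$ and the current normal vectors $\mathbf{w}_p$ are available from the Crammer-Singer SVM solver; names are illustrative:

```python
import numpy as np

def multiclass_mmpp_gradient(alpha, W, X, y):
    """Evaluate (27): grad = -sum_i sum_p alpha_i^p (w_{y_i} - w_p) x_i^T.

    alpha: (N, k) optimal dual variables alpha_i^p;
    W: (k, r) matrix whose rows are the normal vectors w_p;
    X: (N, m) samples; y: (N,) labels in {0, ..., k-1}.
    Returns an (r, m) gradient matrix.
    """
    # For each sample i: d_i = sum_p alpha_i^p (w_{y_i} - w_p), an r-vector
    D = alpha.sum(axis=1)[:, None] * W[y] - alpha @ W    # (N, r)
    return -D.T @ X                                      # (r, m)

# The descent step then follows (12)/(28): R_acute = R_prev - lr * gradient,
# followed by row orthonormalization, exactly as in the binary case.
```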
IV. NON-LINEAR MAXIMUM MARGIN PROJECTIONS
When data cannot be linearly separated in the initial input space, a common approach is to perform the so-called kernel trick, using a mapping function $\phi(\cdot)$ that maps (usually non-linearly) the input feature vectors $\mathbf{x}_i$ to a possibly high dimensional space $\mathcal{F}$, called feature space, which usually has the structure of a Hilbert space [30], [31], where the data are supposed to be linearly or near linearly separable. The exact form of the mapping function is not required to be known, since all required subsequent operations of the learning algorithm are expressed in terms of dot products between the input vectors in the Hilbert space, performed by the kernel trick [32].
To derive the non-linear extension of the MMPP algorithm, we assume that the low dimensional training sample representations are non-linearly mapped in a Hilbert space using a kernel function and seek to identify such a projection matrix that separates different classes in the feature space with maximum margin. Next, we will only demonstrate the derivation of the update rules for the maximum margin projection matrix $\mathbf{R}$, both for the two class and the multiclass classification problems, considering two popular kernel functions: an arbitrary degree polynomial kernel function and the Radial Basis Function (RBF). However, it is straightforward to extend the non-linear MMPP algorithm to exploit other popular kernel functions, using the presented methodology.
A $d$-degree polynomial kernel function is defined as $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j + 1)^d$. Considering that training samples are first projected into the low dimensional subspace determined by $\mathbf{R}$, the $d$-degree polynomial kernel function over the projected samples takes the form:
$$K(\mathbf{R}\mathbf{x}_i, \mathbf{R}\mathbf{x}_j) = \left((\mathbf{R}\mathbf{x}_i)^T\mathbf{R}\mathbf{x}_j + 1\right)^d. \quad (29)$$
Consequently, the partial derivative $\nabla O(\mathbf{R}^{(t-1)})$ of the objective function for the binary classification case in (10) is evaluated as below:
$$\nabla O(\mathbf{R}^{(t-1)}) = \frac{1}{2}\frac{\partial\sum_{i,j}^{N}\alpha_{i,o}\alpha_{j,o}y_iy_jK(\mathbf{R}^{(t-1)}\mathbf{x}_i, \mathbf{R}^{(t-1)}\mathbf{x}_j)}{\partial \mathbf{R}^{(t-1)}} = d\sum_{i,j}^{N}\alpha_{i,o}\alpha_{j,o}y_iy_j\left(\left(\mathbf{R}^{(t-1)}\mathbf{x}_i\right)^T\mathbf{R}^{(t-1)}\mathbf{x}_j + 1\right)^{d-1}\mathbf{R}^{(t-1)}\mathbf{x}_i\mathbf{x}_j^T, \quad (30)$$
while for the multiclass formulation it is evaluated using the cost function in (25) as:
$$\nabla O(\mathbf{R}^{(t-1)}) = \frac{1}{2}\frac{\partial\sum_{i,j}^{N}K(\mathbf{R}^{(t-1)}\mathbf{x}_i, \mathbf{R}^{(t-1)}\mathbf{x}_j)\,\mathbf{n}_i^T\mathbf{n}_j}{\partial \mathbf{R}^{(t-1)}} = d\sum_{i,j}^{N}\left(\left(\mathbf{R}^{(t-1)}\mathbf{x}_i\right)^T\mathbf{R}^{(t-1)}\mathbf{x}_j + 1\right)^{d-1}\mathbf{R}^{(t-1)}\mathbf{x}_i\mathbf{x}_j^T\,\mathbf{n}_{i,o}^T\mathbf{n}_{j,o}. \quad (31)$$
On the other hand, the RBF kernel function is defined using the projected samples as $K(\mathbf{R}\mathbf{x}_i, \mathbf{R}\mathbf{x}_j) = e^{-\gamma\|\mathbf{R}\mathbf{x}_i - \mathbf{R}\mathbf{x}_j\|^2}$,
where $\gamma$ is the Gaussian spread. Similarly, the partial derivative of (10) with respect to $\mathbf{R}^{(t-1)}$ is evaluated as follows:
$$\nabla O(\mathbf{R}^{(t-1)}) = 2\gamma\mathbf{R}^{(t-1)}\sum_{i,j}^{N}\alpha_{i,o}\alpha_{j,o}y_iy_j\left(\mathbf{x}_i\mathbf{x}_i^T + \mathbf{x}_j\mathbf{x}_j^T\right)K(\mathbf{R}^{(t-1)}\mathbf{x}_i, \mathbf{R}^{(t-1)}\mathbf{x}_j), \quad (32)$$
while, for the multiclass separation problem, it is evaluated from (25):
$$\nabla O(\mathbf{R}^{(t-1)}) = 2\gamma\mathbf{R}^{(t-1)}\sum_{i,j}^{N}\left(\mathbf{x}_i\mathbf{x}_i^T + \mathbf{x}_j\mathbf{x}_j^T\right)K(\mathbf{R}^{(t-1)}\mathbf{x}_i, \mathbf{R}^{(t-1)}\mathbf{x}_j)\,\mathbf{n}_{i,o}^T\mathbf{n}_{j,o}. \quad (33)$$
The update rules for the maximum margin projection matrix are subsequently derived by substituting the respective partial derivatives in (12). Moreover, similar extensions can be derived for other popular non-linear kernel functions, by simply evaluating their partial derivatives with respect to the projection matrix $\mathbf{R}$ and by modifying accordingly the respective update rules.
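As an illustration, the binary-case polynomial gradient (30) can be vectorized as in the following numpy sketch (not the authors' implementation; names are illustrative):

```python
import numpy as np

def poly_kernel_gradient_binary(R, X, alpha_y, d):
    """Evaluate (30) for the d-degree polynomial kernel.

    R: (r, m) current projection matrix; X: (N, m) samples;
    alpha_y: (N,) products alpha_i * y_i from the SVM trained on the
    projected data. Returns an (r, m) gradient matrix.
    """
    Z = X @ R.T                               # projected samples (N, r)
    G = (Z @ Z.T + 1.0) ** (d - 1)            # ((Rx_i)^T Rx_j + 1)^{d-1}
    C = np.outer(alpha_y, alpha_y) * G        # alpha_i alpha_j y_i y_j weights
    # sum_{i,j} C_ij (R x_i) x_j^T, computed as the contraction Z.T @ C @ X
    return d * (Z.T @ C @ X)
```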
V. EXPERIMENTAL RESULTS
We compare the performance of the proposed method with that of several state-of-the-art dimensionality reduction techniques, such as PCA, LDA, Subclass Discriminant Analysis (SDA) [33], LPP, Orthogonal LPP (OLPP) and the linear approximation of the LLE algorithm, called Neighborhood Preserving Embedding (NPE). Moreover, in our comparison we include RP [34], and we also directly feed the initial high dimensional samples, without performing dimensionality reduction, to a multiclass SVM classifier, to serve as our baseline testing methods. Experiments have been performed for facial expression recognition on the Cohn-Kanade database [35], for face recognition on the Extended Yale B [36] and AR datasets [37] and for object recognition on the ETH-80 image dataset [38].
In the experiments for facial expression and face recognition, as our classification features we either considered only the facial image intensity information or its augmented Gabor wavelet representation, which provides robustness to illumination and facial expression variations [39]. To create the augmented Gabor feature vectors, we convolved each facial image with Gabor kernels considering 5 different scales and 8 directions. Thus, for each facial image and for each Gabor kernel, a complex vector containing a real and an imaginary part was generated. Based on these parts, we computed the Gabor magnitude information, creating in total 40 feature vectors for each facial image. Each such feature vector was subsequently downsampled, in order to reduce its dimension using interpolation, and normalized to zero mean and unit variance. Finally, for each facial image, we derived its augmented Gabor wavelet representation by concatenating the 40 feature vectors into a single vector. In the experiments for object recognition, we used the cropped and scaled, to a fixed size of 128 × 128 pixels, binary images of ETH-80 containing the contour of each object.
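A sketch of the Gabor representation described above, using OpenCV's Gabor kernels; the kernel size, wavelengths and downsampling target below are illustrative choices, not the paper's exact settings:

```python
import cv2
import numpy as np

def augmented_gabor_features(img, scales=5, orientations=8, down=(32, 40)):
    """Convolve img with 5 x 8 Gabor kernels, keep the magnitudes,
    downsample, z-normalize and concatenate into a single vector."""
    img = img.astype(np.float32)
    feats = []
    for s in range(scales):
        lambd = 4.0 * (2 ** (s / 2.0))       # illustrative wavelengths per scale
        for o in range(orientations):
            theta = o * np.pi / orientations
            # psi = 0 and psi = pi/2 give the real and imaginary responses
            k_re = cv2.getGaborKernel((31, 31), 4.0, theta, lambd, 0.5, 0)
            k_im = cv2.getGaborKernel((31, 31), 4.0, theta, lambd, 0.5, np.pi / 2)
            real = cv2.filter2D(img, -1, k_re)
            imag = cv2.filter2D(img, -1, k_im)
            mag = cv2.resize(np.hypot(real, imag), down)   # magnitude, downsampled
            mag = (mag - mag.mean()) / (mag.std() + 1e-8)  # zero mean, unit variance
            feats.append(mag.ravel())
    return np.concatenate(feats)             # 40 vectors concatenated
```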
To determine the optimal projection subspace dimensionality for MMPP, PCA, RP and SDA, a validation step was performed exploiting the training set. Moreover, for SDA, the optimal number of subclasses that each class is partitioned to has also been specified using the same validation step and exploiting the stability criterion introduced in [40]. For validation, we randomly divided the training set into training (80% of the samples) and validation (20% of the samples) sets, and the parameters that resulted in the best recognition rate on the validation set were subsequently adopted for the test set.
In order to train the proposed MMPP algorithm and derive the maximum margin projection matrix, we have combined our optimization algorithm with LIBLINEAR [41], which provides an efficient implementation of the considered multiclass linear kernel SVM formulation. Moreover, for the fairness of the experimental comparison, the discriminant low-dimensional facial representations derived from each examined algorithm were also fed to the same multiclass SVM classifier implemented in LIBLINEAR for classification. We should note that, by adopting LIBLINEAR, we explicitly exploit a linear kernel. However, as has been shown in the literature [34], linear SVMs are already appropriate for separating different classes, and this also makes it possible to directly compare between different algorithms and draw trustworthy conclusions regarding their efficacy. Nevertheless, better performance could be achieved by the MMPP algorithm by projecting the input high dimensional samples non-linearly and using non-linear kernel SVMs for their classification.
A. Facial Expression Recognition in the Cohn-Kanade Database
The Cohn-Kanade AU-Coded facial expression database is among the most popular databases for benchmarking methods that perform facial expression recognition. Each subject in each video sequence of the database poses a facial expression, starting from the neutral emotional state and finishing at the expression apex. To form our data collection, we discarded the intermediate video frames, depicting subjects performing each facial expression in increasing intensity level, and considered only the last video frame, depicting each formed facial expression at its highest intensity. Face detection was performed on these images and the resulting facial Regions Of Interest (ROIs) were manually aligned with respect to the eyes position. Subsequently, they were anisotropically scaled to a fixed size of 150 × 200 pixels and converted to grayscale. Thus, in our experiments, we used in total 407 images depicting 100 subjects, posing 7 different expressions (anger, disgust, fear, happiness, sadness, surprise and the neutral emotional state). Figure 1 shows example images from the Cohn-Kanade dataset, depicting the recognized facial expressions arranged in the following order: anger, disgust, fear, happiness, sadness, surprise and the neutral emotional state.
To measure the facial expression recognition accuracy, we randomly partitioned the available samples into 5 approximately equal sized subsets (folds) and a 5-fold cross-validation has been performed, by feeding the projected discriminant facial expression representations to the linear SVM classifier.
Fig. 1. Sample images depicting facial expressions in the Cohn-Kanade database.
This resulted in such a test set formation, where some expressive samples of an individual were left for testing, while his remaining expressive images (depicting other facial expressions) were included in the training set. This fact significantly increased the difficulty of the expression recognition problem, since person identity related issues arose.
Table I summarizes the best average facial expression recognition rates achieved by each examined embedding method, both for the considered facial image intensity and the augmented Gabor features. The mean facial expression recognition rates attained by directly feeding the initial high dimensional data to the linear SVM classifier are also provided in Table I. Considering the facial image intensity as the chosen classification features, MMPP outperforms, in terms of recognition accuracy percentage points, all other competing embedding algorithms. The best average expression recognition rate attained by MMPP is 80.1%, using 120-dimensional discriminant representations of the initial 30,000-dimensional input samples. Exploiting the augmented Gabor features significantly improved the recognition performance of all examined methods, verifying the appropriateness of these descriptors in the task, compared against the image intensity features. The MMPP algorithm performance increased by more than 9%, reaching an average recognition rate of 89.2%. Again, MMPP attained the highest average expression recognition rate, outperforming the second best method (LDA) by 2.7%.
It is significant to highlight the difference in expression recognition performance between PCA, RP and the proposed algorithm in low dimensional projection spaces. Figure 2 shows the average facial expression recognition accuracy attained by each method, when the projection subspace dimensionality varies from 3 to 325. As can be observed, MMPP not only performs robustly, independently of the size of the projection space, but also the gain in recognition accuracy goes up to 50% in very low dimensional projection spaces (i.e., considering a 3D projection space), where the other methods appear to attain a significantly degraded performance. Figure 3 demonstrates the average facial expression recognition rate attained by MMPP, when the initial 48,000 dimensional augmented Gabor wavelet representations are projected on a 3 dimensional space, versus the number of MMPP algorithm iterations. It should be noted that, since a semiorthogonal Gaussian RP matrix is exploited for the initialization of MMPP, and thus dimensionality reduction is practically performed using RP, the reported recognition accuracy at the first iteration, which is 40.2%, is identical to that achieved by RP. As can be observed in the provided graph of Figure 3, the iterative optimization process of MMPP enhances class discrimination in the projection subspace, since the attained mean expression recognition rate is increased, reaching a highest recognition rate of 88.3%.
Fig. 2. Average facial expression recognition rates (%) in the Cohn-Kanade database for low dimensional projection spaces, for MMPP, PCA and RP, using the intensity and Gabor features.
Fig. 3. Average facial expression recognition rates attained by MMPP versus the number of iterations. The initial 48,000 dimensional Gabor wavelet representations derived from the Cohn-Kanade database are projected on a 3D space.
B. MMPP Algorithm Convergence and Computational Complexity
To investigate the MMPP optimization performance, we examined its ability to minimize the cost function in (23) in every optimization round, thus maximizing the separating margin between the projected samples of different classes. We also investigated whether MMPP is able to reach a stationary point, by monitoring the gradient magnitude in (27).
In the conducted experiment, we used approximately half (200) of the available expressive images from Cohn-Kanade, in order to train MMPP and learn a 100-dimensional discriminant projection subspace. From the expressive facial images, we extracted the augmented Gabor features by convolving each with Gabor kernels of 5 different scales and 8 orientations, and downsampled, normalized and concatenated the derived filter responses, following the same procedure as previously described. The resulting 48,000-dimensional feature vector from each facial image was used in order to train MMPP and obtain the projection matrix $\mathbf{R}$ of dimensions $100 \times 48{,}000$. In Figure 4a, the objective function value in (23) is plotted versus the number of iterations.
TABLE I
BEST AVERAGE EXPRESSION RECOGNITION ACCURACY RATES (%) IN THE COHN-KANADE DATABASE. IN PARENTHESES IT IS SHOWN THE DIMENSION THAT RESULTS IN THE BEST PERFORMANCE FOR EACH METHOD.

           SVM            PCA         LDA       SDA        LPP       OLPP      NPE       RP          MMPP
Intensity  73.4 (30,000)  74.5 (260)  74.2 (6)  76.4 (55)  76.6 (6)  75.2 (6)  76.4 (6)  75.2 (500)  80.1 (120)
Gabor      77.8 (48,000)  84.6 (150)  86.5 (6)  86.1 (69)  85.5 (6)  83.3 (6)  84.8 (6)  79.8 (500)  89.2 (80)
Fig. 4. MMPP convergence results using augmented Gabor features derived from half of the Cohn-Kanade images: (a) objective function value versus the number of iterations; (b) gradient Frobenius norm versus the number of algorithm iterations.
As shown in Figure 4a, the objective function value is monotonically decreasing, verifying that the separating margin increases per iteration. Moreover, the gradient Frobenius norm value after each update is demonstrated in Figure 4b. In this experiment, MMPP required 304 iterations in order to sufficiently decrease the gradient norm and reach the convergence point (i.e., to satisfy the termination condition in (16)). In Table II, we show the recorded CPU training time, measured in seconds, required by the PCA, LDA, SDA, LPP and MMPP algorithms on this dataset. All algorithms have been implemented in Matlab R2012b [42] and the CPU time required by each method during training has been recorded on a 2.66 GHz and 8 GB RAM computer. As can be observed, since PCA, LDA and LPP all solve a Generalized Eigenvalue Problem (GEP), they have the shortest training times. SDA, although it also solves a GEP, has a significantly higher training time, since it requires determining the optimal subclasses partition using the stability criterion [40], which is costly. MMPP required the highest training time in the comparison, which is attributed to the considered iterative optimization framework.
TABLE II
TRAINING TIME IN SECONDS REQUIRED BY PCA, LDA, SDA, LPP AND MMPP ON THE COHN-KANADE DATASET.

Input dim.  Projection dim.  PCA   LDA   SDA   LPP  MMPP
48,000      100              0.55  0.81  46.3  0.7  48.7
To visualize the ability of the MMPP algorithm to estimate useful subspaces that enhance data discrimination, we ran the proposed algorithm, aiming to learn a 2D projection space, in a two class toy classification problem using artificial data. To generate our toy dataset, we collected 600 300-dimensional samples for each class, with the first class features drawn randomly from a standard normal distribution $\mathcal{N}(0, 1)$ and the second class samples drawn from a $\mathcal{N}(0.2, 1)$ normal distribution, and used 100 samples of each class for training, while the rest were used to compose the toy test set. Figure 5 shows the 2D projection of the two classes' training data samples after different iterations of the MMPP algorithm, where circled samples denote the identified support vectors. As can be observed, the proposed algorithm was able, after a few iterations, to perfectly separate the two classes linearly, by continuously maximizing the separating margin. Moreover, as a side effect of the MMPP algorithm, we observed that the SVM training process converges faster and into a more sparse solution after each iteration of the MMPP algorithm, since the number of identified support vectors decreases as class discrimination increases.
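The toy data can be reproduced with a few lines of numpy (a sketch under the stated distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
# 600 samples per class, 300 dimensions; class 1 ~ N(0, 1), class 2 ~ N(0.2, 1)
X1 = rng.normal(0.0, 1.0, size=(600, 300))
X2 = rng.normal(0.2, 1.0, size=(600, 300))
X_train = np.vstack([X1[:100], X2[:100]])     # 100 training samples per class
y_train = np.hstack([-np.ones(100), np.ones(100)])
X_test = np.vstack([X1[100:], X2[100:]])      # remaining 500 per class for testing
y_test = np.hstack([-np.ones(500), np.ones(500)])
# X_train, y_train can now be fed to the binary MMPP sketch with r = 2
```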
C. Face Recognition in the Extended Yale B Database
The Extended Yale B database consists of 2,414 frontal facial images of 38 individuals, captured under 64 different laboratory controlled lighting conditions. The database version used in this experimental evaluation has been manually aligned, cropped and then resized to 168 × 192 pixels by the database creators. For our experimental comparison, we have considered three different experimental settings, by randomly selecting 10%, 30% and 50% of the available images of each subject for training, while the rest of the images were used for testing. In this experiment, we did not exploit the augmented Gabor features, since the recognition accuracy rates attained using the facial image intensity values as our classification features were already sufficiently high. Table III presents the highest face recognition rate achieved by each method.
Fig. 5. Training data 2D projections at iterations 1, 7, 30, 40, 100 and 150 of the MMPP algorithm. Circled data samples denote the identified support vectors, whose number reduces during the MMPP algorithm's convergence.
TABLE III
FACE RECOGNITION ACCURACY RATES (%) IN THE EXTENDED YALE B DATABASE. IN PARENTHESES IT IS SHOWN THE DIMENSION THAT RESULTS IN THE BEST PERFORMANCE FOR EACH METHOD.

           SVM (32,256)  PCA         LDA (37)  SDA          LPP (37)  OLPP (37)  NPE (37)  RP          MMPP
Train 10%  90.6          90.8 (255)  97.0      96.8 (150)   97.0      97.2       97.0      90.9 (400)  97.2 (150)
Train 30%  94.5          95.5 (300)  99.7      99.8 (271)   99.7      99.7       99.7      96.0 (350)  99.8 (500)
Train 50%  94.7          96.2 (500)  100.0     100.0 (300)  100.0     99.8       99.9      96.8 (300)  100.0 (150)
As can be observed, the proposed MMPP method achieves the best performance across all considered experiments.
D. Face Recognition in the AR Database
The AR database is much more challenging than the Extended Yale B dataset and exhibits significant variations among its facial image samples. It contains color images corresponding to 126 different subjects, depicting their frontal facial view under different facial expressions, illumination conditions and occlusions (sunglasses and scarf). For this experiment, we used the pre-aligned and cropped version of the AR database, containing in total 2,600 facial images of size 120 × 165 pixels, corresponding to 100 different subjects captured during two sessions, separated by two weeks time. Thus, 13 images are available for each subject per session.
In order to investigate the MMPP algorithm's robustness, we have conducted three different experiments with increasing degree of difficulty. For the first experiment (Exp 1), we formed our training set by considering only those facial images with illumination variations captured during the first session, while for testing we considered the respective images captured during the second recording session. For the second experiment (Exp 2), we used facial images with both varying illumination conditions and facial expressions from the first session for training and the respective images from the second session for testing. Finally, for the third experiment (Exp 3), we used all the first session images for training and the rest for testing.
Table IV summarizes the highest attained recognition rate and the respective subspace dimensionality for each method in each performed experiment. The proposed method achieved recognition rates equal to 93.3%, 91% and 87% in each experiment, respectively, using the facial image intensities, while for the augmented Gabor features it attained recognition rates equal to 97.2%, 96.2% and 93.4%, which are the best or the second best among all examined methods.
E. Object Recognition in the ETH-80 Image Dataset
The ETH-80 image dataset [38] depicts 80 objects divided into 8 different classes, where for each object 41 images have been captured from different viewpoints, spaced equally over the upper viewing hemisphere.
TABLE IV
FACE RECOGNITION ACCURACY RATES (%) IN THE AR DATABASE. IN PARENTHESES IT IS SHOWN THE DIMENSION THAT RESULTS IN THE BEST PERFORMANCE FOR EACH METHOD.

                  SVM            PCA         LDA (99)  SDA         LPP (99)  OLPP (99)  NPE (99)  RP          MMPP
Exp 1  Intensity  82.0 (19,800)  91.0 (300)  93.3      91.5 (300)  93.5      93.5       93.5      88.3 (500)  93.3 (100)
       Gabor      85.6 (88,000)  93.3 (399)  96.5      95.8 (321)  96.7      96.7       96.5      89.4 (500)  97.2 (350)
Exp 2  Intensity  79.7 (19,800)  85.4 (500)  88.7      89.4 (400)  90.1      93.0       90.4      85.6 (500)  91.0 (100)
       Gabor      81.9 (88,000)  91.7 (500)  92.4      93.7 (500)  92.7      93.9       93.1      86.9 (500)  96.2 (250)
Exp 3  Intensity  76.4 (19,800)  82.4 (500)  85.2      86.0 (250)  85.1      87.9       84.7      81.3 (500)  87.0 (200)
       Gabor      80.2 (88,000)  89.2 (500)  89.9      91.7 (500)  90.1      91.0       89.3      81.5 (500)  93.4 (300)
Thus, the database contains 3,280 images in total. For this experiment, we used the cropped and scaled, to a fixed size of 128 × 128 pixels, binary images containing the contour of each object. In order to form our training set, we randomly picked 25 binary images of each object, while the rest were used for testing. Table V shows the highest attained object recognition accuracy rate by each method and the respective subspace dimensionality. Again, MMPP outperformed in this experiment, attaining the highest object recognition rate of 84.6%. It is significant to note that all discriminant dimensionality reduction algorithms in our comparison based on the Fisher discriminant ratio (i.e., LDA, LPP, OLPP and NPE) attained a reduced performance, compared against the baseline approach, which is feeding directly the initial high dimensional feature vectors to the linear SVM for classification. This can be attributed to the fact that, since each category in the ETH-80 dataset includes images depicting 10 different objects captured from various view angles, data samples inside classes span large in-class variations. As a result, all the aforementioned methods, which have the Gaussian data distribution optimality assumption [33], [43], fail to identify appropriate discriminant projection directions. In contrast, the proposed MMPP method depends only on the support vectors, and thus the overall data sample distribution inside classes does not affect its performance.

TABLE V
OBJECT RECOGNITION ACCURACY RATES (%) IN THE ETH-80 DATABASE. IN PARENTHESES IT IS SHOWN THE DIMENSION THAT RESULTS IN THE BEST PERFORMANCE FOR EACH METHOD.

        SVM            PCA        LDA       SDA         LPP       OLPP      NPE       RP          MMPP
ETH-80  80.3 (16,384)  81.9 (20)  74.4 (7)  79.8 (300)  74.2 (7)  74.4 (7)  74.8 (7)  79.4 (200)  84.6 (80)
VI. CONCLUSION
We proposed a discrimination enhancing subspace learning method, called the Maximum Margin Projection Pursuit algorithm, that aims to identify a low dimensional projection subspace, where samples form classes that are separated with maximum margin. The proposed method is an iterative alternate optimization algorithm that computes the maximum margin projections exploiting the separating hyperplanes obtained from training an SVM classifier in the identified low dimensional space. We also demonstrated the non-linear extension of our algorithm, which identifies a projection matrix that separates different classes in the feature space with maximum margin. Finally, we showed that it outperforms current state-of-the-art linear data embedding methods on challenging computer vision recognition tasks, such as face, expression and object recognition, on several popular datasets.
REFERENCES
[1] I. Jolliffe, Principal Component Analysis. New York: Springer-Verlag, 1986.
[2] J. Tenenbaum, V. De Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[3] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[4] X. He and P. Niyogi, “Locality preserving projections,” in Advances in Neural Information Processing Systems, vol. 16, Vancouver, British Columbia, Canada, 2003.
[5] D. Cai, X. He, J. Han, and H. Zhang, “Orthogonal laplacianfaces for face recognition,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3608–3614, 2006.
[6] X. He, D. Cai, S. Yan, and H. Zhang, “Neighborhood preserving embedding,” in International Conference on Computer Vision (ICCV), 2005.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, 1990.
[8] X. He, D. Cai, and J. Han, “Learning a maximum margin subspace for image retrieval,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 189–201, February 2008.
[9] F. Wang, B. Zhao, and C. Zhang, “Unsupervised large margin discriminative projection,” IEEE Transactions on Neural Networks, vol. 22, no. 9, pp. 1446–1456, September 2011.
[10] A. Zien and J. Candela, “Large margin non-linear embedding,” in Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005, pp. 1060–1067.
[11] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[12] D. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[13] A. Majumdar and R. Ward, “Robust classifiers for data reduced via random projections,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 40, no. 5, pp. 1359–1371, 2010.
[14] Q. Shi, C. Shen, R. Hill, and A. Hengel, “Is margin preserved after random projection?” in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[15] S. Paul, C. Boutsidis, M. Magdon-Ismail, and P. Drineas, “Random projections for support vector machines,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
[16] J. Pillai, V. Patel, R. Chellappa, and N. Ratha, “Secure and robust iris recognition using random projections and sparse representations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1877–1893, 2011.
[17] R. Fletcher, Practical Methods of Optimization, 2nd ed. New York, NY, USA: Wiley-Interscience, 1987.
[18] K. R. Varshney and A. S. Willsky, “Learning dimensionality-reduced classifiers for information fusion,” in International Conference on Information Fusion. IEEE, 2009, pp. 1881–1888.
[19] D.-S. Pham and S. Venkatesh, “Robust learning of discriminative projection for multicategory classification on the Stiefel manifold,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–7.
[20] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” The Journal of Machine Learning Research, vol. 3, pp. 1415–1438, 2003.
[21] D. Bertsekas, Nonlinear Programming, 2nd ed. Athena Scientific, 1999.
[22] C.-J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
[23] C. Lin and J. Moré, “Newton’s method for large bound-constrained optimization problems,” SIAM Journal on Optimization, vol. 9, no. 4, pp. 1100–1127, 1999.
[24] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAG’s for multiclass classification,” in Advances in Neural Information Processing Systems, vol. 12. Cambridge, MA: MIT Press, 2000, pp. 547–553.
[25] J. Weston and C. Watkins, “Support vector machines for multi-class pattern recognition,” in European Symposium on Artificial Neural Networks, April 1999.
[26] V. Vapnik, Statistical Learning Theory. New York: J. Wiley, 1998.
[27] K. Crammer and Y. Singer, “On the learnability and design of output codes for multiclass problems,” Machine Learning, vol. 47, no. 2-3, pp. 201–233, May 2002.
[28] ——, “On the algorithmic implementation of multiclass kernel-based vector machines,” Journal of Machine Learning Research, vol. 2, pp. 265–292, 2001.
[29] S. Nikitidis, N. Nikolaidis, and I. Pitas, “Multiplicative update rules for incremental training of multiclass support vector machines,” Pattern Recognition, vol. 45, no. 5, pp. 1838–1852, May 2012.
[30] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. R. Müller, G. Rätsch, and A. J. Smola, “Input space versus feature space in kernel-based methods,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1000–1017, September 1999.
[31] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181–201, March 2001.
[32] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.
[33] M. Zhu and A. Martinez, “Subclass discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1274–1286, August 2006.
[34] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[35] T. Kanade, J. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in IEEE International Conference on Automatic Face and Gesture Recognition, March 2000, pp. 46–53.
[36] A. Georghiades, P. Belhumeur, and D. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[37] A. Martinez and R. Benavente, “The AR face database,” CVC Technical Report, 1998.
[38] B. Leibe and B. Schiele, “Analyzing appearance and contour based methods for object categorization,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2003.
[39] C. Liu and H. Wechsler, “Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition,” IEEE Transactions on Image Processing, vol. 11, no. 4, pp. 467–476, 2002.
[40] A. M. Martinez and M. Zhu, “Where are linear feature extraction methods applicable?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1934–1944, 2005.
[41] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library for large linear classification,” The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[42] MATLAB, version R2012b. Natick, Massachusetts: The MathWorks Inc., 2012.
[43] F. De la Torre and T. Kanade, “Multimodal oriented discriminant analysis,” in International Conference on Machine Learning (ICML). ACM, 2005, pp. 177–184.